UW CSE Speaker Abstract
Shaojun Wang (University of Alberta)
Host: Etzioni
Exploiting Syntactic, Semantic and Lexical Regularities in Language Modeling
Monday, December 12, 2005
11:00 am, CSE-403

Abstract

Language modeling, the task of accurately estimating the probability of naturally occurring word sequences in human language, lies at the heart of some of the most exciting developments in computer science, such as speech recognition, machine translation, information retrieval and bioinformatics. I will present two pieces of my research on statistical language modeling that simultaneously incorporate various aspects of natural language, such as local word interaction, syntactic structure and document-level semantic information.

The first piece of work is based on a new machine learning technique we have proposed, the latent maximum entropy principle, which allows relationships over hidden features to be effectively captured in a unified model. Our work extends previous research on maximum entropy methods for language modeling, which only allow observed features to be modeled. The ability to conveniently incorporate hidden variables lets us extend the expressiveness of language models while alleviating the need to pre-process the data to obtain explicitly observed features. We then use these techniques to combine two standard forms of language models: local lexical models (trigram models) and global document-level semantic models (probabilistic latent semantic analysis, PLSA).
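
As a concrete illustration, here is a minimal Python sketch of combining a local trigram model with a PLSA-style document-topic mixture. It uses simple linear interpolation rather than the latent maximum entropy estimation described in the talk, and every name in it (trigram_prob, doc_topic_probs, topic_word_probs, lam) is a hypothetical placeholder.

    import math

    def plsa_word_prob(word, doc_topic_probs, topic_word_probs):
        # PLSA document-level probability: p(w | d) = sum_z p(w | z) p(z | d)
        return sum(doc_topic_probs[z] * topic_word_probs[z].get(word, 1e-10)
                   for z in doc_topic_probs)

    def combined_log_prob(sentence, trigram_prob, doc_topic_probs,
                          topic_word_probs, lam=0.8):
        # Interpolate the local trigram prediction with the global PLSA
        # prediction for each word; lam is an assumed mixture weight,
        # not a value from the talk.
        logp = 0.0
        padded = ["<s>", "<s>"] + list(sentence)
        for i in range(2, len(padded)):
            history, word = tuple(padded[i - 2:i]), padded[i]
            p = (lam * trigram_prob(history, word)
                 + (1 - lam) * plsa_word_prob(word, doc_topic_probs,
                                              topic_word_probs))
            logp += math.log(max(p, 1e-12))
        return logp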

The second piece of work is aimed at encoding syntactic structure into a semantic n-gram language model with a tractable parameter-estimation algorithm. We propose a directed Markov random field (MRF) model that combines n-gram models, probabilistic context-free grammars (PCFGs) and PLSA. The composite directed MRF model has an exponential number of loops and becomes a context-sensitive grammar; nevertheless, we are able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models.
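
The cubic-time claim is easiest to see in the standard inside pass for a PCFG in Chomsky normal form, which the generalized inside-outside algorithm extends. The sketch below is the plain inside dynamic program only, without the n-gram and PLSA effects the talk folds in; the rule encodings are assumed for illustration.

    from collections import defaultdict

    def inside(words, lexical_rules, binary_rules):
        # lexical_rules: {word: [(A, p), ...]} for rules A -> word
        # binary_rules:  [(A, B, C, p), ...]   for rules A -> B C
        # Returns chart[(i, j)][A], the inside probability that A
        # derives words[i:j]; the three nested position loops give
        # the O(n^3) dependence on sentence length n.
        n = len(words)
        chart = defaultdict(lambda: defaultdict(float))
        for i, w in enumerate(words):
            for A, p in lexical_rules.get(w, []):
                chart[(i, i + 1)][A] += p
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                for k in range(i + 1, j):
                    for A, B, C, p in binary_rules:
                        b = chart[(i, k)].get(B, 0.0)
                        c = chart[(k, j)].get(C, 0.0)
                        if b and c:
                            chart[(i, j)][A] += p * b * c
        return chart

The sentence probability under a start symbol S is then chart[(0, len(words))].get("S", 0.0), and outside probabilities come from a matching top-down pass of the same cubic cost.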

Our experimental results on the Wall Street Journal corpus show that both approaches yield significant reductions in perplexity over the current state-of-the-art technique.
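
For reference, perplexity is the exponentiated average negative log-probability per token, so lower is better. A minimal computation, assuming log_prob returns a natural-log sentence probability (for example, a function like the combined_log_prob sketch above):

    import math

    def perplexity(sentences, log_prob):
        # perplexity = exp(-(1/N) * sum_s log p(s)), where N is the
        # total number of scored tokens; conventions for counting
        # end-of-sentence tokens vary.
        total_logp = 0.0
        total_tokens = 0
        for sent in sentences:
            total_logp += log_prob(sent)
            total_tokens += len(sent)
        return math.exp(-total_logp / total_tokens)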

