Shaojun Wang (University of Alberta)
Host: Etzioni

Exploiting Syntactic, Semantic and Lexical Regularities in Language Modeling

Monday, December 12, 2005
11:00 am, CSE-403

Abstract

Language modeling -- accurately calculating the probability of naturally occurring word sequences in natural language -- lies at the heart of some of the most exciting developments in computer science, such as speech recognition, machine translation, information retrieval and bioinformatics. I will present two pieces of my research on statistical language modeling that simultaneously incorporate various aspects of natural language, such as local word interaction, syntactic structure and semantic document information.
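As a concrete point of reference, the following is a minimal sketch (my own illustration, not material from the talk) of the simplest kind of language model in this family: a maximum-likelihood bigram model with add-one smoothing, together with the perplexity measure commonly used to evaluate such models.

import math
from collections import Counter

def train_bigram(tokens, vocab):
    """Add-one-smoothed maximum-likelihood estimate of P(w | prev)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    V = len(vocab)
    return lambda prev, w: (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def perplexity(prob, tokens):
    """Exponential of the average negative log-likelihood; lower is better."""
    logp = sum(math.log(prob(p, w)) for p, w in zip(tokens, tokens[1:]))
    return math.exp(-logp / (len(tokens) - 1))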
The first piece of work is based on a new machine learning technique we have proposed --- the latent
maximum entropy principle --- which allows relationships over hidden features to be effectively
captured in a unified model. Our work extends previous research on maximum entropy methods for language
modeling, which only allow observed features to be modeled. The ability to conveniently incorporate
hidden variables allows us to extend the expressiveness of language models while alleviating the
necessity of pre-processing the data to obtain explicitly observed features. We then use these techniques to combine two standard forms of language models.
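To make the estimation idea concrete, here is a minimal sketch (my own illustration under simplifying assumptions, not the speaker's code) of the EM-style iteration that latent maximum entropy training suggests: a log-linear joint model over an observed variable x and a hidden variable z, alternating between computing posterior feature expectations and fitting a maximum entropy model to match them.

import numpy as np

def fit_lme(f, data, em_iters=50, grad_steps=100, lr=0.1):
    """f: array of shape (n_x, n_z, k) holding feature vectors f(x, z);
    data: list of observed x indices.
    Model: p(x, z) proportional to exp(theta . f(x, z))."""
    theta = np.zeros(f.shape[-1])
    for _ in range(em_iters):
        logp = f @ theta                              # shape (n_x, n_z)
        # E-step: posterior p(z | x) under the current parameters
        post = np.exp(logp - logp.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        # Empirical target: average posterior feature expectation on the data
        target = np.mean([post[x] @ f[x] for x in data], axis=0)
        # M-step: fit a maxent model whose expectations match the target
        for _ in range(grad_steps):
            joint = np.exp(logp - logp.max())
            joint /= joint.sum()
            model_exp = np.einsum('xz,xzk->k', joint, f)
            theta += lr * (target - model_exp)
            logp = f @ theta
    return theta

With no hidden variable (n_z = 1) the E-step is trivial and the loop reduces to ordinary maximum entropy training, which is the sense in which the latent formulation extends the classical observed-feature one.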
The second piece of work aims to encode syntactic structure into a semantic n-gram language model with a tractable parameter-estimation algorithm. We propose a directed Markov random field (MRF) model that combines n-gram models, probabilistic context-free grammars (PCFGs) and probabilistic latent semantic analysis (PLSA). The composite directed MRF model has an exponential number of loops and becomes a context-sensitive grammar; nevertheless, we are able to estimate its parameters in cubic time using an efficient modified EM method, the generalized inside-outside algorithm, which extends the inside-outside algorithm to incorporate the effects of the n-gram and PLSA language models. Our experimental results on the Wall Street Journal corpus show that both approaches yield significant reductions in perplexity over the current state-of-the-art techniques.
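The cubic-time claim comes from the chart-based dynamic program underlying inside-outside estimation. The sketch below (my illustration, assuming a PCFG in Chomsky normal form; the generalized algorithm in the talk additionally threads n-gram and PLSA factors through the same chart) shows the standard inside pass, which computes the probability of a sentence in O(n^3) time.

from collections import defaultdict

def inside(words, lexical, binary, start='S'):
    """lexical[(A, w)]: P(A -> w); binary[(A, B, C)]: P(A -> B C)."""
    n = len(words)
    beta = defaultdict(float)            # beta[(i, j, A)]: inside probability
    for i, w in enumerate(words):        # width-1 spans from lexical rules
        for (A, word), p in lexical.items():
            if word == w:
                beta[(i, i + 1, A)] += p
    for width in range(2, n + 1):        # wider spans, narrowest first
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):    # split points: O(n^3) in total
                for (A, B, C), p in binary.items():
                    beta[(i, j, A)] += p * beta[(i, k, B)] * beta[(k, j, C)]
    return beta[(0, n, start)]           # P(words | grammar)

The outside pass runs the analogous recursion top-down, and the EM update re-estimates rule probabilities from the resulting expected rule counts; the generalized version described in the talk extends these expectations to incorporate the n-gram and PLSA terms.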