Turing Center at University of Washington

Investigating problems at the crossroads of natural language processing, data mining, Web search, and the Semantic Web.


Previous Events

2007

Symposium

Eleventh UW/Microsoft Quarterly Symposium in Computational Linguistics
February 16 (Friday), 3:30 pm - 5:30 pm, Mary Gates 241

You are invited to take advantage of this opportunity to connect with the computational linguistics community at Microsoft and the University of Washington. Sponsored by the UW Departments of Linguistics, Electrical Engineering, and Computer Science and Engineering; the MSR NLP Group; the Microsoft Natural Language (NLG) and Speech Components groups; and UW alumni at Microsoft. The symposium consists of two invited talks, followed by an informal reception.

Presentations: Hisami Suzuki and Kristina Toutanova, "Generating Morphologically Rich Languages in MT"; Efthimis N. Efthimiadis, David G. Hendry, and Chong-Ki Tsang, "Experiments in Query Expansion". See the full symposium announcement for details.

Turing Talk

Chris Brew (Linguistics and Cognitive Science, Ohio State)
Using Unlabeled and Lightly Labeled Data for NLP and Lexical Acquisition
February 20 (Tuesday), 2:30 pm - 3:20 pm, Paul G. Allen Center for Computer Science and Engineering 403

Abstract

Among the major obstacles to automatic processing of text is the difficulty of obtaining reliable information about the meaning and behavior of words. This problem is especially acute in technical or rapidly changing fields, because existing dictionaries are unlikely to suffice. In addition, since new and technical vocabulary marks out commercially interesting sub-populations of the general web audience, the ability to handle this vocabulary can translate into the ability to target the interests, concerns and desires of the corresponding sub-populations.

One of the standard techniques for lexical acquisition is to prepare, then exploit, large collections of labeled text. An example is the Penn Treebank, which contains roughly 50,000 sentences. Detailed linguistic labeling of corpora of this size, even with good tool support, is a substantial multi-year effort. Corpora created more recently are even bigger: samples of English, Arabic, and Mandarin of up to gigaword size are now available. Exhaustive hand-labeling is completely infeasible at this scale. Therefore, my research focus is to develop techniques that can learn interesting and useful information from unlabeled or very lightly labeled corpora. I will illustrate with examples from a line of work on clustering and classification of verbs. This will include both models that use linguistic information directly and methods that borrow techniques from computer vision to reduce the impact of noise and sparse data. In addition, should time permit, I hope to sketch, and perhaps demonstrate, reasons to be optimistic about their scaling behavior.
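As a rough illustration of the clustering approach mentioned above, the sketch below represents each verb by the relative frequency of a few syntactic frames, as might be gathered from an unlabeled corpus, and clusters the resulting vectors. The frame inventory, the counts, and the use of k-means are assumptions made purely for illustration, not details from the talk.

# Minimal, illustrative sketch of verb clustering from unlabeled data:
# each verb is represented by how often it occurs with a few syntactic
# frames (counts are invented for illustration, not taken from the talk).
import numpy as np
from sklearn.cluster import KMeans

frames = ["NP", "NP_NP", "NP_PP", "that_S"]   # hypothetical subcategorization frames
verb_counts = {
    "give":  [120, 310,  95,   2],
    "hand":  [ 40,  90,  30,   1],
    "say":   [210,   3,   5, 340],
    "claim": [ 60,   1,   2, 150],
}

verbs = list(verb_counts)
X = np.array([verb_counts[v] for v in verbs], dtype=float)
X = X / X.sum(axis=1, keepdims=True)          # relative frame frequencies

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for verb, label in zip(verbs, labels):
    print(verb, "-> cluster", label)

With these toy counts, the ditransitive-looking verbs (give, hand) and the sentential-complement verbs (say, claim) fall into separate clusters.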

Turing Talk

Timothy Baldwin (Computer Science and Software Engineering, Melbourne)
Deep Lexical Acquisition in the Wild: A Little Language Goes a Long Way
March 5 (Monday), 3:30 pm - 4:30 pm, Paul G. Allen Center for Computer Science and Engineering 303

Abstract

Since the dawn of computational linguistic time, research has focused on a select handful of languages, and largely ignored the remainder of the world's abundance of languages. This is due to the double-edged sword of there being fewer funding opportunities to work with languages deemed to be economically and politically "uninteresting", and there being scant language resources for those languages on which to base any CL research. In this talk, I will describe various strands of research on deep lexical acquisition (DLA), i.e. the (semi-)automatic creation of linguistically rich lexical language resources, focusing particularly on DLA for precision grammars. I will describe various bootstrapping methods, and investigate the correlation between the type of lexical language resource we wish to create and the quantity and type of seed data available to bootstrap from.

Speaker

Timothy Baldwin is a Senior Lecturer in the Department of Computer Science and Software Engineering, University of Melbourne. His research, funded by NSF, NTT, ARC, NICTA, and others, has spanned multiword expressions, deep lexical acquisition, information extraction, and web mining. He has given invited talks at various conferences, summer schools, and universities worldwide and is the author of nearly 100 technical papers. He is currently on the editorial board of Computational Linguistics (2006-2008), a series editor for CSLI Publications, and a member of the Deep Linguistic Processing with HPSG Initiative (DELPH-IN). He has reviewed for journals such as Natural Language Engineering, Journal of Computational Intelligence, and Computer Speech and Language, as well as for various book publishers and for institutions such as the National Science Foundation and the Australian Research Council.

Information School Research Conversation

Jimmy Lin (Information Studies and Computational Linguistics and Information Processing Laboratory, Maryland)
Beyond "Bag of Words": Towards a Framework for Conceptual Retrieval
March 29 (Thursday), 3:30 pm - 4:30 pm, Mary Gates 420

Turing Talk

Stephan Oepen (University of Oslo; NTNU Trondheim; CSLI, Stanford)
Grammar-Based Processing for Precision Machine Translation
April 5 (Thursday), 12:30 pm - 1:30 pm, Paul G. Allen Center for Computer Science and Engineering 203

Abstract

I will review recent advances in grammar-based sentence processing, specifically realization from logical-form meaning representations. The LOGON Machine Translation prototype aims at the fully automated, high-quality translation of Norwegian instructional texts (on hikes in the Norwegian backcountry) into English. The generator operates off underspecified meaning representations derived from grammatical analysis (in the LFG framework) and subsequent semantic transfer. I will provide an overview of all three processing layers, emphasizing the re-usable nature of linguistic resources and the need for tight integration of linguistic processing and probabilistic models to rank alternate hypotheses (within each component, as well as end-to-end). Besides empirical results for the realization task when evaluated in isolation, I will present a summary of quantitative measures on the current development status (and promise) of the LOGON MT pipeline as a whole.

Turing Talk

Luis von Ahn (Computer Science, Carnegie Mellon)
An Informal Discussion on the Use of Human Computation to Collect Common-Sense Facts
April 16 (Monday), 2:00 pm - 3:30 pm, Paul G. Allen Center for Computer Science and Engineering 403

Related Publication

Luis von Ahn, Mihir Kedia, and Manuel Blum, "Verbosity: A Game for Collecting Common-Sense Facts", CHI 2006.

Turing Talk

Rada Mihalcea (Computer Science, North Texas)
A Picture is Worth Seven Thousand Words: Toward Communicating Simple Sentences Using Pictorial Representations
April 17 (Tuesday), 2:30 pm - 3:20 pm, Paul G. Allen Center for Computer Science and Engineering 303

Abstract

Universal communication represents one of the long-standing goals of humanity: borderless communication among people, regardless of the language they speak. According to recent studies, there are about 7,000 languages spoken worldwide. Of these, only about 15-20 languages can currently take advantage of the benefits provided by machine translation, and even for these languages the automatically produced translations are not error-free and their quality lags behind human expectations.

In this talk, I will discuss a new paradigm for translation: translation through pictures, as opposed to translation through words, as a means for producing universal representations of information that can be effectively conveyed across language barriers. I will describe several experiments that evaluate the hypothesis that pictorial representations can be used to effectively convey simple sentences across language barriers. Using comparative evaluations, I will show that a considerable amount of understanding can be achieved using visual descriptions of information, with evaluation figures in a range comparable to those obtained with linguistic representations produced by an automatic machine translation system.

Speaker

Rada Mihalcea is an Assistant Professor of Computer Science at the University of North Texas. Her research interests are in lexical semantics, graph-based algorithms for natural language processing, minimally supervised natural language learning, and multilingual natural language processing. She is currently involved in a number of research projects, including knowledge-based word sense disambiguation, (non-traditional) methods for building annotated corpora with volunteer contributions over the Web, graph-based algorithms for text processing, text-to-picture synthesis, and computational humour. She has published a large number of articles in books, journals, and proceedings in these and related areas. She is the president of the ACL Special Interest Group on the Lexicon (SIGLEX) and a board member of the ACL Special Interest Group on Natural Language Learning (SIGNLL). She serves on the editorial boards of Computational Linguistics, Language Resources and Evaluation, Natural Language Engineering, Research on Language and Computation, and the recently established Journal of Interesting Negative Results in Natural Language Processing and Machine Learning.

Turing Talk

Cynthia Matuszek (Cyc)
Cyc and ResearchCyc: An Overview
April 20 (Friday), 1:30 pm - 2:50 pm, Electrical Engineering 042

Abstract

The Cyc project is a long-running knowledge representation effort, begun in 1985, with the ambitious goal of formally representing human-level common sense to support learning and reasoning. This talk will describe the history and background of Cyc and some of the uses it has been put to over the last two decades, with particular attention to the readily available ResearchCyc spinoff. It will include an overview of how to actually use Cyc and ResearchCyc: what knowledge the system contains, what tools exist for tasks such as querying and entering knowledge, and what is involved in using it within an application.

Turing Talk

Mark Greaves (Vulcan)
Knowledge Representation in Practice: Project Halo and the Semantic Web
April 24 (Tuesday), 1:30 pm - 2:30 pm, Paul G. Allen Center for Computer Science and Engineering 303

Abstract

Vulcan's Project Halo is an ambitious, multiyear research program to develop a detailed scientific knowledge base that can answer AP-level questions and provide explanations in a user-appropriate manner. It is one of the larger AI research programs in the US today. Halo's current focus is building AI tools that allow graduate students in chemistry, biology, and physics to author scientific knowledge adequate to answer sophisticated natural-language questions without relying on trained knowledge engineers. Halo's contractors have been working to link Semantic Web technology with the other knowledge representations in the system. This talk will lay out Halo's technologies and results to date, and describe some of the technical and UI issues we have faced in getting users to author scientific conceptual knowledge.

Speaker

Dr. Mark Greaves is currently Program Manager for Knowledge Systems at Vulcan Inc., the private asset management company for Paul Allen. At Vulcan, he is sponsoring advanced R&D in large knowledge bases and semantic web technologies, including Project Halo. Formerly, Mark served as Director of DARPA's Joint Logistics Technology Office, and as Program Manager in DARPA's Information Exploitation Office. He managed a variety of DARPA projects in semantics and distributed computing technology, including the DAML project that funded the development of the OWL, OWL/S, and SWRL languages. Prior to coming to DARPA, Mark worked on natural language semantics and software agent technology at the Mathematics and Computing Technology group of Boeing Phantom Works.

Turing Talk

Carolyn Penstein Rosé (Language Technologies and Human-Computer Interaction Institute, Carnegie Mellon)
Language Technologies for Supporting Productive Collaborative Learning Interactions for Science and Engineering Education
April 26 (Thursday), 11:00 am - 12:00 noon, Paul G. Allen Center for Computer Science and Engineering 303

Abstract

In this talk I will present recent work in text classification and conversation summarization in the application area of computer-supported collaborative learning. I will begin by describing our recent text classification research, which has produced technology capable of real-time analysis of collaborative learning discussions. Our work demonstrates that successfully identifying some key types of conversational behavior that characterize successful learning interactions requires extracting features from the running conversation that reflect its threaded structure. This technology opens up the possibility of context-sensitive collaborative learning support in the midst of free-form on-line communication. Two studies in the past year evaluating a fully automatic form of context-sensitive collaborative learning support both demonstrate significant learning benefits for students in comparison with a no-support control condition. In more recent work, we have developed a form of indicative conversation summarization that draws upon similar technology to generate intermittent summaries for group learning facilitators, to support them in the task of identifying groups that need more help than others.
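As a loose illustration of the thread-aware features the abstract alludes to, the sketch below computes simple per-turn features (reply depth, whether a turn answers a question) from a threaded discussion. The data layout, feature names, and example turns are invented for illustration; this is not the system described in the talk.

# Illustrative sketch (not the system from the talk): features that reflect
# a conversation's threaded structure, one feature dict per turn.
# Each turn is (turn_id, parent_id or None, speaker, text); data is invented.
turns = [
    (1, None, "A", "How does the circuit behave when we double the resistance?"),
    (2, 1,    "B", "The current should drop by half, I think."),
    (3, 2,    "A", "Right, because V = IR."),
    (4, None, "C", "Can someone restate the goal of the lab?"),
]

by_id = {tid: (parent, speaker, text) for tid, parent, speaker, text in turns}

def depth(tid):
    """Number of reply links between this turn and the root of its thread."""
    d = 0
    parent = by_id[tid][0]
    while parent is not None:
        d += 1
        parent = by_id[parent][0]
    return d

features = []
for tid, parent, speaker, text in turns:
    features.append({
        "turn": tid,
        "reply_depth": depth(tid),
        "replies_to_question": parent is not None and by_id[parent][2].rstrip().endswith("?"),
        "starts_thread": parent is None,
        "word_count": len(text.split()),
    })
print(features)

Feature dictionaries like these could then be fed to any standard classifier alongside word features from the turn text.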

Speaker

Carolyn Penstein Rosé joined the faculty at the Language Technologies Institute and the Human-Computer Interaction Institute at Carnegie Mellon University in the fall of 2003. A particular focus of her research is the role of explanation and language communication in learning and in supporting productive learning interactions with language technologies.

Symposium

Twelfth UW/Microsoft Quarterly Symposium in Computational Linguistics
April 27 (Friday), 3:30 pm - 5:30 pm, Mary Gates 241

You are invited to take advantage of this opportunity to connect with the computational linguistics community at Microsoft and the University of Washington. Sponsored by the UW Departments of Linguistics, Electrical Engineering, and Computer Science and Engineering; the MSR NLP Group; the Microsoft Natural Language (NLG) and Speech Components groups; and UW alumni at Microsoft. The symposium consists of two invited talks, followed by an informal reception.

Jianfeng Gao (Microsoft Research)
A Comparative Study of Parameter Estimation Methods for Statistical Natural Language Processing

In this talk we present a comparative study of five parameter estimation algorithms on four NLP tasks. Three of the five algorithms are well known in the computational linguistics community: Maximum Entropy (ME) estimation with L2 regularization, the Averaged Perceptron (AP), and Boosting. We also investigate ME estimation with the increasingly popular L1 regularization, using a novel optimization algorithm, and BLasso, which is a version of Boosting with Lasso (L1) regularization. We first investigate all of our estimators on two reranking tasks: a parse selection task and a language model adaptation task. Then we apply the best of these estimators to two additional tasks involving conditional sequence models: a Conditional Markov Model (CMM) for part-of-speech tagging and a Conditional Random Field (CRF) for Chinese word segmentation. Our experiments show that across tasks, three of the estimators (ME estimation with L1 or L2 regularization, and the Averaged Perceptron) are in a near statistical tie for first place.
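Of the estimators compared above, the Averaged Perceptron is the simplest to state. The following sketch shows the standard mistake-driven update with weight averaging on toy binary data; the data and feature dimensionality are invented, and this is not the experimental setup from the talk.

# Minimal sketch of the Averaged Perceptron (binary case): mistake-driven
# updates, with the final weights averaged over all steps to reduce variance.
# Toy data; not the experiments described in the abstract.
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])              # labels in {-1, +1}

w = np.zeros(X.shape[1])                  # current weights
w_sum = np.zeros(X.shape[1])              # running sum for the average
n_seen = 0

for epoch in range(5):
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:     # mistake (or on the boundary)
            w = w + y_i * x_i             # perceptron update
        w_sum += w
        n_seen += 1

w_avg = w_sum / n_seen                    # averaged weights used at test time
print("averaged weights:", w_avg)
print("train predictions:", np.sign(X @ w_avg))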

Marcus Sammer (Turing Center, UW)
Panlingual Lexical Translation

Lexical translation is the task of translating individual words or phrases. It is useful in applications such as cross-lingual search, the translation of metadata, and knowledge-based translation. The Turing Center has two lexical translation projects that aim to scale lexical translation to a very large number of language pairs. The PanImages project has built a cross-lingual image search engine for the Web. Lexical translation in PanImages happens via the translation graph, a massive lexical resource in which each node denotes a word in some language and each edge denotes a word sense shared by a pair of words. The graph is automatically constructed from machine-readable dictionaries and Wiktionaries.
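A minimal sketch of what such a translation graph might look like appears below: nodes are (language, word) pairs and each edge carries a shared word sense. The specific representation and the dictionary entries are invented for illustration and are not the Turing Center's actual resource.

# Illustrative sketch of a translation graph: nodes are (language, word)
# pairs, and each edge is labeled with a shared word sense. Entries are
# invented; the real graph is built from dictionaries and Wiktionaries.
from collections import defaultdict

edges = [
    (("en", "spring"), ("es", "primavera"), "season"),
    (("en", "spring"), ("es", "resorte"),   "coil"),
    (("es", "primavera"), ("fr", "printemps"), "season"),
]

graph = defaultdict(set)
for a, b, sense in edges:
    graph[a].add((b, sense))
    graph[b].add((a, sense))              # translation edges are symmetric

def translations(word_node, target_lang):
    """Direct translations of word_node into target_lang, with their senses."""
    return {(word, sense)
            for (lang, word), sense in graph[word_node]
            if lang == target_lang}

print(translations(("en", "spring"), "es"))
# prints the sense-disambiguated pairs ('primavera', 'season') and ('resorte', 'coil')

Keeping the sense label on every edge is what lets a search application distinguish "spring" the season from "spring" the coil when it fans out across languages.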

Colloquium

Barney Pell (Powerset)
Powerset and Natural Language Search
May 1 (Tuesday), 3:30 pm - 4:30 pm, Electrical Engineering 105

Lecture

Joseph T. Tennis (Library, Archival and Information Studies, British Columbia)
Semantic Carnival: Tagging and the New Descriptive Acts of the Next Generation Web
May 3 (Thursday), 5:00 pm - 5:30 pm (reception), 5:30 pm - 6:30 pm (lecture), Suzzallo Library 324 (Smith Room)

Colloquium

Mark Johnson (Microsoft Research and Brown)
Bayesian Learning of Grammars
May 7 (Monday), 11:00 am - 12:00 noon, Paul G. Allen Center for Computer Science and Engineering 303

Abstract

While the most famous applications of statistical learning are perhaps word associations and neural networks, in the past decade we (i.e., the computational linguistics community) discovered how to extend these learning algorithms to grammars that generate linguistically-realistic structures. These techniques currently learn phrase-structure and similar grammars, but there is no principled reason why they can't learn other kinds of grammars as well. Bayesian approaches are particularly attractive because they exploit "prior" (e.g., innate) knowledge as well as statistical generalizations from the input. Structured statistical learners have two major advantages over other approaches. First, because the generalizations they learn and the prior knowledge they utilize are both expressed in terms of explicit linguistic representations, it is clear what was learnt and what information was exploited during learning. Second, because of the "curse of dimensionality", learners that identify and exploit structural properties of their input seem to be the only ones that have a chance of "scaling up" to learn real languages.

Of course, developing explicit computational models that actually learn language is more difficult--and more interesting--than constructing the kind of "in principle" arguments given above. Time and audience permitting, I will describe the Markov Chain Monte Carlo techniques we have developed for sampling from Bayesian posterior distributions over syntactic analyses and grammars, and our use of Dirichlet Process models to address over-dispersion in lexical and morphological acquisition.
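For readers unfamiliar with the Bayesian framing, the core idea can be written down compactly: the posterior over grammars G given observed sentences D combines a prior P(G), which can encode innate or linguistic knowledge, with the likelihood P(D | G) of the data. The notation below is a generic statement of that idea, not the specific models from the talk.

\[
  P(G \mid D) \;=\; \frac{P(D \mid G)\, P(G)}{\sum_{G'} P(D \mid G')\, P(G')}
  \;\propto\; P(D \mid G)\, P(G)
\]

The MCMC techniques mentioned above are one way to draw samples from this posterior when the normalizing sum over grammars cannot be computed directly.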

Turing Talk

Abraham Bernstein (Informatics, Zürich)
Making the Semantic Web Accessible to the Casual User: Empirical Evidence on the Usefulness of Semistructured Query Languages
October 9 (Tuesday), 1:30 pm - 2:20 pm, Paul G. Allen Center for Computer Science and Engineering 203

Abstract

The Semantic Web presents the vision of a distributed, dynamically growing knowledge base founded on formal logic. Common users, however, seem to have problems even with the simplest Boolean expression. So how can we help users to query a web of logic that they do not seem to understand? One frequently proposed solution to address this problem is the use of natural language (NL) for knowledge specification and querying. We propose to regard formal query languages and NL as two extremes of a continuum, where semistructured languages lie somewhere in the middle.

To evaluate what degree of structuredness casual users prefer, we introduce four query interfaces, each at a different point in the continuum, and evaluate the users' preference and their query performance in a study with 48 subjects. The results of the study reveal that while the users dislike the constraints of a fully structured formal query language, they also seem at a loss with the freedom of a full NLP approach. This suggests that restricted query languages will be preferred by casual users because of their guidance effect, mirroring findings from social science theory on human activity in general.

Lecture

Brewster Kahle (Internet Archive; Library and Information Science, University of North Carolina)
Universal Access to All Human Knowledge
October 9 (Tuesday), 4:00 pm - 5:00 pm (lecture), 5:00 pm - 6:00 pm (reception), Henry Art Gallery 301 (Auditorium)
Preregistration required

Symposium

UW/Microsoft Quarterly Symposium in Computational Linguistics
November 2 (Friday), 3:30 pm - 5:00 pm, Microsoft Building 113, 1st floor, room 1021

You are invited to take advantage of this opportunity to connect with the computational linguistics community at Microsoft and the University of Washington. Sponsored by the UW Departments of Linguistics, Germanics, Electrical Engineering, and Computer Science; the MSR NLP and Speech and Natural Language (SNL) groups; and UW alumni at Microsoft. The symposium consists of two invited talks, followed by an informal reception.

To facilitate the creation of badges, the symposium is compiling a list of attendees in advance. If you plan to attend, please email jparvi at u.washington.edu, with "UW/MS Symposium" in the subject line.

Hoifung Poon and Pedro Domingos (Computer Science & Engineering, UW)
Joint Inference in Information Extraction

Andreas Bode and Anthony Aue (Machine Translation Incubation Team, Natural-Language-Processing Group, Microsoft)
From Research to Production--Issues & Lessons Learned

Transportation by bus: From the UW area, you can reach the site in 12 minutes by Route 545 from the Montlake/SR 520 freeway stop, leaving at 2:52 pm and arriving at Overlake Transit Center at 3:04 pm. You can return in 23 minutes from 148th Avenue NE and NE 34th Street by Route 242, leaving at 5:22 pm and arriving at Montlake at 5:45 pm. See the map for details on the bus stops.

Transportation by automobile: Take the 148th Ave NE northbound exit from SR 520. Limited visitor parking can be found in front of both building 112 (from NE 36th St) and building 114 (from NE 31st Circle). Please carpool if possible; the visitor parking is very limited.
