Twentieth UW/Microsoft Quarterly Symposium in Computational Linguistics
January 22 (Friday), 2:30 pm - 5:00 pm, Electrical Engineering 105
You are invited to take advantage of this opportunity to connect with the computational linguistics community at Microsoft and the University of Washington. Sponsored by the UW Departments of Linguistics, Electrical Engineering, and Computer Science; Microsoft Research; and UW alumni at Microsoft.
Ellen Lidin (Microsoft)
Machine Translation and Localization
Software localization is of great importance because success in the software industry today depends heavily on international scope, a strong presence in international markets, compliance with governmental requirements, and more. The largest software companies invest significant engineering resources to make sure their products are ready for a global market, and the more languages they support, the better. This presentation introduces a new localization testing technology based on machine translation. As a proof of concept, we present a localization tool, InstantLocGen, to demonstrate how machine translation can drastically reduce the hidden costs of the localization process.
Chris Quirk (Microsoft)
Extracting Parallel Sentences from Wikipedia
Wikipedia is a compelling resource for statistical machine translation due to its sheer volume of data and broad coverage of both concepts and languages. Unfortunately, these comparable articles are not easily digestible by standard statistical machine translation engines. We present improved supervised models for extracting parallel sentences from noisy data; the models improve on the state of the art and can leverage structure from Wikipedia. The resulting sentence pairs can significantly improve translation quality.
Anand Chakravarty (Microsoft)
Stress Testing an AI-Based Web Service—A Case Study
The stress testing of AI-based systems differs from the approach taken for more traditional web-services, both in terms of the design of test cases and the metrics used to measure quality. The expected variability in responses of an AI-based system to the same request adds a level of complexity to stress testing, when compared to more standard systems where the system response is deterministic and any deviations may easily be characterized as product defects. Generating test cases for AI-based systems requires balancing breadth of test cases with depth of response quality: most AI-based systems may not return a 'perfect' answer. An example of a machine learning translation system is considered, and the approach used to stress test it is presented, alongside comparisons with a more traditional approach. The challenges of shipping such a system to support a growing set of features and language pairs necessitate a mature prioritization of test cases. This approach has been successful in shipping a web-service that currently serves millions of users per day.
Eduardo Alvarez-Godinez (Microsoft)
Test Corpora Development for Machine Translation Evaluation
We present our experience with building test corpora using actual usage data from a general domain MT system. We compare this to the more traditional approach of using held out data from the same domain as the training data, and discuss some of the challenges and results.
Michael Gamon (Microsoft)
Correcting Non-native English Using Mostly Native Data
The world's population of non-native speakers of English is at least twice the size of the native English population. Despite the huge potential for tools to help with non-native writing, tools for automatic error detection and correction are still in their infancy. We show one particular approach that uses a large set of native data and a small set of annotated error data to detect and correct typical errors in non-native English writing.
Efthimis N. Efthimiadis (iSchool)
Search Interaction in Context Research Group
I will present an overview of the research conducted by the Search Interaction in Context (SIC) group at the iSchool, UW. The SIC research group focuses on user-centered design and evaluation of search systems. We study user searching behavior, query formulation and refining, query expansion, ranking algorithms, and design and evaluation. This contributes to the development and testing of theories and models which are then applied to the design of IR systems. Current SIC research projects include: studying query formulation, refining, and search abandonment by analyzing search engine query logs and conducting user studies; conducting a comparative search engine evaluation study of GYB; conducting Visual Search Engine evaluation studies; and studying different aspects of search interaction of health professionals (doctors, nurses, administrators) with the medical record (VA funded project).
William Lewis (Microsoft) and Fei Xia (Linguistics)
Applying NLP Technologies to the Collection and Analysis of Language Data to Aid Linguistic Research
As a vast amount of language data has become available electronically, linguistics is gradually transforming itself into a discipline where science is often conducted using corpora. In this talk, we review the process of building ODIN, the Online Database of Interlinear Text, a multilingual repository of linguistically analyzed language data. ODIN is built from interlinear text that has been harvested from scholarly linguistic documents posted to the Web, and it currently holds nearly 190,000 instances of interlinear text representing annotated language data for more than 1,000 languages (representing data from more than 10% of the world's languages). Further, we have sought to enrich the collected data and extract "knowledge" from the enriched content. This work demonstrates the benefits of using natural language processing technology to create resources and tools for linguistic research, allowing linguists to have easy access not only to language data embedded in existing linguistic papers, but also to automatically generated language profiles for hundreds of languages.
Emily M. Bender (Linguistics)
Grammar Customization with the LinGO Grammar Matrix
Precision grammars can be very rich resources for natural language processing, but they are also expensive to create. At the same time, much has been learned from existing large scale grammar development projects which is applicable cross-linguistically. The LinGO Grammar Matrix aims to reduce the cost of creating precision grammars by abstracting out a cross-linguistic core grammar and then providing a series of reusable analyses of recurrent, non-universal phenomena through a web-based grammar customization system. In addition to practical applications, this approach provides a new angle on the linguistic study of variation and typology.
Mausam (Computer Science & Engineering)
From Extracted Information to Knowledge: Current Research at Turing Center
To exploit the practically infinite data available on the Web, Turing Center has focused on efficient and open-domain techniques to extract information at scale. This vast amount of extracted information can potentially lead to information overload. Our current research focuses on organizing this information along several types of abstractions, as well as deducing meta-properties of various relations. I will briefly summarize the key ongoing projects in this talk.
Hoifung Poon (Computer Science & Engineering)
Markov Logic in Natural Language Processing
Natural languages are characterized by rich relational structures and tight integration with world knowledge. In recent years, there has been increasing interest in applying joint inference to leverage such relations and prior knowledge in both machine learning and NLP communities. Recent work in statistical relational learning and structured prediction has shown that joint inference can not only substantially improve prediction accuracy, but also enable effective learning with little or no labeled information. Markov logic is a unifying framework for joint inference, and has enabled a series of successful NLP applications, ranging from information extraction to unsupervised semantic parsing. In this talk, I will review recent work in Markov logic and its NLP applications at the University of Washington, and outline exciting directions for future work.