NSF Supported Summer Internships at Johns Hopkins for Undergraduates

Laura Graham lgraham at jhu.edu
Thu Jan 31 15:44:09 EST 2002


Dear Colleague:
The Center for Language and Speech Processing at Johns Hopkins University
is offering a unique summer internship opportunity, which we would like
you to bring to the attention of your best students in the current junior
class. Only two weeks remain for students to apply for these internships.
This internship is unique in that the selected students will
participate in cutting-edge research as full team members alongside
leading scientists from industry, academia, and government. What makes
the internship especially exciting is the exposure it gives
undergraduate students to the emerging fields of language engineering,
such as automatic speech recognition (ASR), natural language processing
(NLP), machine translation (MT), and speech synthesis (TTS).
We are specifically looking to attract new talent into the field and,
as such, do not require the students to have prior knowledge of language
engineering technology. Please take a few moments to nominate suitable
bright students who may be interested in this internship. On-line
applications for the program can be found at http://www.clsp.jhu.edu/
along with additional information about plans for the 2002 Workshop and
about past workshops. The application deadline is
February 15, 2002.
If you have questions, please contact us by phone (410-516-4237),
by e-mail (sec at clsp.jhu.edu), or on the web at http://www.clsp.jhu.edu.

Sincerely,
Frederick Jelinek
J.S. Smith Professor and Director


Project Descriptions for this Summer
1. Weakly Supervised Learning for Wide-Coverage Parsing
Before a computer can try to understand or translate a human sentence,
it must identify the phrases and diagram the grammatical relationships
among them. This is called parsing.
State-of-the-art parsers correctly guess over 90% of the phrases and
relationships, but make some errors on nearly half the sentences
analyzed. Many of these errors distort any subsequent automatic
interpretation of the sentence.
Much of the problem is that these parsers, which are statistical,
are not "trained" on enough example parses to know about many of the
millions of potentially related word pairs. Human labor can produce
more examples, but still too few by orders of magnitude.
In this project, we seek to achieve a quantum advance by automatically
generating large volumes of novel training examples. We plan to
bootstrap from up to 350 million words of raw newswire stories,
using existing parsers to generate the new parses together with
confidence measures.
We will use a method called co-training, in which several reasonably
good parsing algorithms collaborate to automatically identify one
another's weaknesses (errors) and to correct them by supplying new
example parses to one another. This accuracy-boosting technique has
widespread application in other areas of machine learning, natural
language processing and artificial intelligence.
Numerous challenges must be faced: how do we parse 350 million words
of text in less than a year (we have 6 weeks)? How do we use partly
incompatible parsers to train one another? Which machine learning
techniques scale up best? What kinds of grammars, probability models,
and confidence measures work best? The project will involve a
significant amount of programming, but the rewards should be high.
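
As a rough illustration only, the co-training loop might be sketched as
follows in Python. The parser objects and their train/parse/confidence
methods are placeholders invented for this sketch, not the actual
parsers that will be used in the workshop.

  # Toy co-training loop: each parser labels raw sentences, and its most
  # confident parses become extra training data for the other parser.
  def cotrain(parser_a, parser_b, labeled, unlabeled, rounds=5, top_k=100):
      train_a, train_b = list(labeled), list(labeled)
      pool = list(unlabeled)
      for _ in range(rounds):
          parser_a.train(train_a)
          parser_b.train(train_b)
          # Parse the raw pool and keep each parser's most confident output.
          best_a = sorted(pool, key=parser_a.confidence, reverse=True)[:top_k]
          best_b = sorted(pool, key=parser_b.confidence, reverse=True)[:top_k]
          # Cross-supply: A's confident parses train B, and vice versa.
          train_b += [(s, parser_a.parse(s)) for s in best_a]
          train_a += [(s, parser_b.parse(s)) for s in best_b]
          pool = [s for s in pool if s not in set(best_a) | set(best_b)]
      return parser_a, parser_b

The hard research questions listed above (confidence measures, partly
incompatible grammars, scaling) all live inside those placeholder calls.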

2. Novel Speech Recognition Models for Arabic
Previous research on large-vocabulary automatic speech recognition
(ASR) has mainly concentrated on European and Asian languages.
Other language groups, for instance Semitic languages like Hebrew and
Arabic, have been explored to a lesser extent. These languages possess
certain characteristics that present problems for standard ASR systems.
For example, their written representation omits most of the vowels
present in the spoken form, which makes it difficult to utilize textual
training data.
Furthermore, they have a complex morphological structure, which is
characterized not only by a high degree of affixation but also by
the interleaving of vowel and consonant patterns (so-called
"non-concatenative morphology"). This leads to a large number of
possible word forms, which complicates the robust estimation of
statistical language models.
In this workshop group we aim to develop new modeling approaches
to address these and related problems, and to apply them to the
task of conversational Arabic speech recognition. We will develop
and evaluate a multi-linear language model, which decomposes the
task of predicting a given word form into predicting more basic
morphological patterns and roots. Such a language model can be
combined with a similarly decomposed acoustic model, which
necessitates new decoding techniques based on modeling statistical
dependencies between loosely coupled information streams. Since
one pervasive issue in language processing is the tradeoff between
language-specific and language-independent methods, we will also
pursue an alternative control approach which relies on the
capabilities of existing, language-independent recognition technology.
Under this approach no morphological analysis will be performed and
all word forms will be treated as basic vocabulary units. Furthermore,
acoustic model topologies will be used which specify short vowels as
optional rather than obligatory elements, in order to facilitate the
use of text documents as language model training data. Finally, we
will investigate the possibility of using large, generally available
text and audio sources to improve the accuracy of conversational Arabic
speech recognition.
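
To give a flavor of the decomposition behind such a multi-linear
language model, here is a toy Python sketch. The roots, patterns, and
probability tables are made up for the illustration and are not the
workshop's actual model.

  # Predict an Arabic word form from a consonantal root and a vowel
  # pattern, instead of treating the full form as an atomic unit.
  p_root = {"ktb": 0.4, "drs": 0.6}                 # consonantal roots
  p_pattern_given_root = {                          # vowel/affix patterns
      "ktb": {"CaCaCa": 0.5, "CuCiCa": 0.5},
      "drs": {"CaCaCa": 0.7, "CuCiCa": 0.3},
  }

  def realize(root, pattern):
      """Interleave the root consonants with the slots of a CV pattern."""
      consonants = iter(root)
      return "".join(next(consonants) if c == "C" else c for c in pattern)

  def p_word(root, pattern):
      """P(word form) = P(root) * P(pattern | root) in this toy factoring."""
      return p_root[root] * p_pattern_given_root[root][pattern]

  print(realize("ktb", "CaCaCa"), p_word("ktb", "CaCaCa"))  # kataba 0.2

Estimating root and pattern distributions separately requires far less
data than estimating a distribution over every fully inflected word form.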

3. Generation from Deep Syntactic Representation in Machine Translation
Let's imagine a system for translating a sentence from a foreign
language (say Arabic) into your native language (say English). Such a
system works as follows. It analyzes the foreign-language sentence to
obtain a structural representation that captures its essence, i.e.
"who did what to whom where," It then translates (or transfers) the
actors, actions, etc. into words in your language while "copying over"
the deeper relationship between them. Finally it synthesizes a
syntactically well-formed sentence that conveys the essence of the
original sentence. Each step in this process is a hard technical
problem, to which the best-known solutions are either not adequate
for applications, or good enough only in narrow application domains,
failing when applied to other domains. This summer, we will concentrate
on improving one of these three steps, namely the synthesis (or
generation).
The target language for generation will be English, and the source
languages to the MT system will be of a completely different type
(Arabic and Czech). We will further assume that the transfer
produces a fairly deeply analyzed sentence structure. The
incorporation of the deep analysis makes the whole approach very novel -
so far no large-coverage translation system has tried to operate with
such a structure, and the application to very diverse languages makes
it an even more exciting enterprise!
Within the generation process, we will focus on the structural
(syntactic) part, assuming that a morphological generation module
exists to complete the generation process; this module will be added
to the suite so that the final result, namely the quality of the plain
English text coming out of the system, can be evaluated.
Statistical methods will be used throughout. A significant part of
the workshop preparation will be devoted to assembling and running a
simplified MT system from Arabic/Czech to English (up to the
syntactic structure level), in order to have realistic training data
for the workshop project. As a consequence, we will not only
understand and solve the generation problem, but also learn the
mechanics of an end-to-end MT system, intellectually preparing team
members to work on other parts of the MT system in the future.
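
Purely to illustrate the three-step decomposition described above, here
is a toy Python pipeline. The intermediate "deep" structure, the tiny
transfer lexicon, and the single generation template are invented for
the sketch and stand in for much richer workshop components.

  # Toy analysis -> transfer -> generation pipeline for one Czech sentence.
  def analyze(source_sentence):
      # Placeholder analyzer: returns a who-did-what-to-whom structure.
      return {"pred": "napsal", "agent": "Jan", "patient": "dopis"}

  transfer_lexicon = {"napsal": "wrote", "Jan": "Jan", "dopis": "a letter"}

  def transfer(deep):
      # Translate the lexical items, copying the relations over unchanged.
      return {role: transfer_lexicon.get(w, w) for role, w in deep.items()}

  def generate(deep):
      # The workshop focuses on this step: producing well-formed English
      # from the deep structure; a single template stands in for it here.
      return "{} {} {}.".format(deep["agent"], deep["pred"], deep["patient"])

  print(generate(transfer(analyze("Jan napsal dopis."))))  # Jan wrote a letter.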


4. SuperSID: Exploiting High-level Information for High-performance
Speaker Recognition
Identifying individuals based on their speech is an important component
technology in many applications, be it automatically tagging speakers
in the transcription of a board-room meeting (to track who said what),
verifying users for computer security, or picking out a known
terrorist or narcotics trader among millions of ongoing satellite
telephone calls.
How do we recognize the voices of the people we know? Generally, we
use multiple levels of speaker information conveyed in the speech signal.
At the lowest level, we recognize a person based on the sound of his/her
voice (e.g., low/high pitch, bass, nasality, etc.). But we also use
other types of information in the speech signal to recognize a speaker,
such as a unique laugh, particular phrase usage, or speed of speech.
Most current state-of-the-art automatic speaker recognition systems,
however, use only the low level sound information (specifically, very
short-term features based on purely acoustic signals computed on 10-20
ms intervals of speech) and ignore higher-level information. While
these systems have shown reasonably good performance, there is much
more information in speech that could be exploited, potentially
improving accuracy and robustness greatly.
In this workshop we will look at how to augment the traditional
signal-processing based speaker recognition systems with such
higher-level knowledge sources. We will be exploring ways to define
speaker-distinctive markers and create new classifiers that make use
of these multi-layered knowledge sources. The team will be working
on a corpus of recorded telephone conversations (Switchboard I and II
corpora) that have been transcribed both by humans and by machine and
have been augmented with a rich database of phonetic and prosodic
features. A well-defined performance evaluation procedure will be
used to measure the progress and utility of newly developed techniques.
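
To make the idea of fusing low-level and high-level evidence concrete,
here is a hypothetical score-combination sketch in Python. The knowledge
sources, scores, weights, and threshold are all invented numbers, not
the systems or fusion method the workshop will actually build.

  # Combine per-knowledge-source speaker-verification scores by a
  # simple weighted sum and compare against an acceptance threshold.
  def fused_score(scores, weights):
      return sum(weights[k] * scores[k] for k in scores)

  claimed_speaker_scores = {
      "acoustic": 1.8,        # short-term cepstral system
      "word_usage": 0.6,      # characteristic phrases / idiolect
      "speaking_rate": -0.2,  # prosodic cue
  }
  weights = {"acoustic": 0.7, "word_usage": 0.2, "speaking_rate": 0.1}

  threshold = 1.0
  score = fused_score(claimed_speaker_scores, weights)
  print("accept" if score > threshold else "reject", round(score, 2))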




