Tech Report - Connectionist Speech Recognition

Wed Nov 23 15:53:30 EST 1988

The following technical report is available from the Department of Computer
and Information Science, University of Pennsylvania:

	    Speech Recognition Using Connectionist Networks

			Raymond L. Watrous

			MS-CIS-88-96
			LINC LAB 138

			Abstract

The use of connectionist networks for speech recognition is assessed
using a set of representative phonetic discrimination problems. The problems
are chosen with respect to the physiological theory of phonetics in
order to give broad coverage of the space of articulatory phonetics.
A separate network solution is sought for each phonetic discrimination problem.

A connectionist network model called the Temporal Flow Model is defined,
consisting of simple processing units with single-valued outputs
interconnected by links of variable weight. The model represents temporal
relationships using delay links and permits general patterns of connectivity,
including feedback. It is argued that the model has properties appropriate
for time-varying signals such as speech.
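
As an illustration only, and not the implementation described in the report,
a network of this general kind can be sketched as follows. The sigmoid
activation, the restriction to delays of at least one time step, and all
names (run, links, external) are assumptions made for this sketch.

```python
import math

def sigmoid(x):
    # Assumed unit nonlinearity; the abstract does not specify one.
    return 1.0 / (1.0 + math.exp(-x))

def run(links, external, activation=sigmoid):
    """Simulate simple units joined by weighted delay links.

    links:    list of (src, dst, weight, delay) tuples; delay is a whole
              number of time steps and must be >= 1 in this sketch.
    external: external[t][u] is the outside input driving unit u at step t.
    Returns history, where history[t][u] is the single-valued output of
    unit u at step t.
    """
    history = []
    for t, drive in enumerate(external):
        net = list(drive)
        for src, dst, weight, delay in links:
            if delay < 1:
                raise ValueError("this sketch assumes delay >= 1")
            if t - delay >= 0:
                # A delay link carries the source output from an earlier step.
                net[dst] += weight * history[t - delay][src]
        history.append([activation(x) for x in net])
    return history

# Unit 0 feeds unit 1 through a one-step delay link; unit 1 also feeds
# back onto itself, so its output depends on its own past.
links = [(0, 1, 2.0, 1), (1, 1, 0.5, 1)]
external = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
outputs = run(links, external)
```

Because every link here has a delay of at least one step, each output depends
only on earlier outputs, so feedback loops need no special treatment in the
update.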

Methods for selecting network architectures for different recognition problems
are presented. The architectures discussed include random networks, minimally
structured networks, hand-crafted networks and networks generated
automatically from samples of speech data.

Networks are trained by modifying their weight parameters so as to minimize
the mean squared error between the actual and the desired response of the
output units. The desired output unit response is specified by a target
function. Training is accomplished by a second-order method of iterative
nonlinear optimization by gradient descent, which incorporates a method for
computing the complete gradient of recurrent networks.
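
The error measure named above can be sketched as follows. This shows only
the quantity being minimized, not the second-order optimization or the
recurrent gradient computation. The function name, the argument layout and
the normalization (averaging over all time steps and output units) are
assumptions made for this sketch; the report may scale the error differently.

```python
def mean_squared_error(actual, target):
    """Mean squared error between actual and desired output responses.

    actual[t][k]: output of output unit k at time step t.
    target[t][k]: desired response of that unit, i.e. the target
                  function sampled at step t.
    """
    total = 0.0
    count = 0
    for actual_row, target_row in zip(actual, target):
        for a, d in zip(actual_row, target_row):
            total += (a - d) ** 2
            count += 1
    return total / count

# One time step, two output units: the first is off by 1.0, the second exact.
error = mean_squared_error([[0.0, 1.0]], [[1.0, 1.0]])
```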

Network solutions are demonstrated for all eight phonetic discrimination
problems for one male speaker. The network solutions are analyzed carefully
and are shown in every case to make use of known acoustic-phonetic cues.
The network solutions vary in the degree to which they make use of
context-dependent cues to achieve phoneme recognition.

The network solutions were tested on data not used for training and achieved an
average accuracy of 99.5%.

Methods for extending these results to a single network for recognizing the
complete phoneme set from continuous speech obtained from different speakers
are outlined.

It is concluded that acoustic-phonetic speech recognition can be accomplished
using connectionist networks.

+++++++++++++++++++++++++++++++++++++++++++++++++++++

This report is available from:

James Lotkowski
Technical Report Facility
Room 269/Moore Building
Computer Science Department
University of Pennsylvania
200 South 33rd Street
Philadelphia, PA 19104-6389

or james at central.cis.upenn.edu

Please do not request copies of this report from me. Copies of the report
cost approximately $19.00, which covers duplication (300 pages) and postage.
I will bring a 'desk copy' to NIPS.

As of December 1, I will be affiliated with the University of Toronto.
My address will be:

Department of Computer Science
University of Toronto
10 King's College Road
Toronto, Canada M5S 1A4

watrous at ai.toronto.edu