Processing of auditory sequences

Fri Aug 30 17:22:24 EDT 1991

Thanks to Robert Port for a fine discussion of acoustic/auditory pattern
processing and hysteresis.  I do not take exception to his observations,
but would like to extend the discussion a bit.  It seems that humans
process human utterances differently than they do other sounds.  I
cannot cite hard data on this, but I am fairly confident that
playing back recorded speech at twice the recording speed has serious
effects upon intelligibility.  (Surely some of you find Alvin and the
Chipmunks difficult to understand.)  What accounts for the difference
between speech and synthetic "Watson patterns"?

Perhaps one of the Connectionists could briefly describe the techniques
used to preserve intelligibility in time-compression of speech recordings
for the blind.  This might be an important clue.  I believe the
techniques are more sophisticated than, say, clipping 5 msec of speech
from each 10 msec and smoothing the transitions between the remaining
speech segments.
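
For concreteness, the naive scheme just mentioned might be sketched in
Python as follows.  This illustrates only the strawman, not the
techniques actually used for the blind, which I have not seen described.
The function name, the 8 kHz sample rate, and the 1 msec linear
crossfade used for "smoothing" are all assumptions of mine.

```python
def naive_compress(samples, rate, keep_ms=5.0, drop_ms=5.0, fade_ms=1.0):
    """Keep keep_ms of signal out of every keep_ms + drop_ms, crossfading joins.

    Hypothetical helper: `samples` is a sequence of floats, `rate` is in Hz.
    """
    keep = int(rate * keep_ms / 1000)
    drop = int(rate * drop_ms / 1000)
    fade = int(rate * fade_ms / 1000)
    if keep <= 0 or drop < 0 or not 0 <= fade <= drop:
        raise ValueError("need keep > 0 and 0 <= fade <= drop (in samples)")

    out = []
    for start in range(0, len(samples), keep + drop):
        # Take `fade` extra samples so the overlap does not shorten the output.
        seg = list(samples[start:start + keep + fade])
        n = min(fade, len(seg), len(out))
        # Linear crossfade: blend the tail of the output so far with the
        # head of the new segment (assumed smoothing method).
        for i in range(n):
            w = (i + 1) / (n + 1)
            out[-n + i] = (1 - w) * out[-n + i] + w * seg[i]
        out.extend(seg[n:])
    return out


if __name__ == "__main__":
    one_second = [1.0] * 8000            # placeholder signal at 8 kHz
    shortened = naive_compress(one_second, 8000)
    print(len(one_second), len(shortened))
```

Because the retained samples are played back at the original rate, a
scheme of this kind shortens the utterance without the pitch shift of
double-speed playback.  It still makes no attempt to respect
coarticulation or segment boundaries.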

I submit that if evolution has not provided us with special apparatus
for processing the calls of members of our own species, it should have.
That is, "knowing" the characteristics of the physical system that
generates human utterances should be of great utility in extracting
information from utterances (especially when they are noisy).  How
might such innate knowledge be represented and utilized in human
processing of speech signals?

Of course, some would argue that an internal model of the articulators
need not be innate.  I vaguely recall that Grossberg and associates
have placed speech generation and recognition in a single "loop."  The
model of articulation might be learned by generating motor "commands"
and hearing the results.  Does anyone know whether people born with
impaired control of the articulators show some deficit in processing
speech signals?

To tie these comments together, allow me to hypothesize that artificially
slowed and speeded speech, even when it is in the range of natural
speaking rates, does not exhibit the coarticulatory phenomena appropriate
to the speaking rate.  Thus the internal model of articulation does
not account well for the altered speech, and intelligibility suffers.
This hypothesis is half-baked, if only because it ignores the fact
that "unnatural sounding" synthetic speech may be highly intelligible.
I would, however, like to see relevant evidence and/or discussion.

Thomas English
english at sun1.cs.ttu.edu
Dept. of Computer Science
Texas Tech University