Processing of auditory sequences

Sat Aug 31 10:24:21 EDT 1991

    Perhaps one of the Connectionists could briefly describe the techniques
    used to preserve intelligibility in time-compression of speech recordings
    for the blind.  This might be an important clue.  I believe the
    techniques are more sophisticated than, say, clipping 5 msec of speech
    from each 10 msec and smoothing the transitions between the remaining
    speech segments.

I'm not an expert on this, but I believe that the basic idea is to
speed up the speech by 2x or so, while keeping the frequencies where they
should be.  Apparently the human speech-understanding system is quite
flexible about rate, but really doesn't like to deal with formants moving
too far from where they are expected to be in frequency space.  Perhaps
that is because the basic analysis into frequency bands is done in the
cochlea, with fixed neuro-mechanical filters, and the rest of the
processing is done by the brain using neural machinery that is more
flexible and trainable.
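
To make the distinction concrete, here is a minimal Python sketch (not
taken from any actual reading machine) of what goes wrong if one simply
plays the recording faster.  The helper names sine, naive_speedup and
estimate_freq, the 8 kHz sample rate, and the 500 Hz test tone are all
assumptions made for illustration.  Dropping every other sample halves the
duration, but it also doubles every frequency, which is the kind of shift
in formant position that listeners seem to tolerate poorly.

```python
import math

def sine(freq_hz, duration_s, rate_hz=8000):
    """Generate a pure tone as a list of float samples."""
    n = int(duration_s * rate_hz)
    return [math.sin(2 * math.pi * freq_hz * i / rate_hz) for i in range(n)]

def naive_speedup(samples, factor=2):
    """Keep every `factor`-th sample. Played back at the original
    sample rate, the result is `factor` times shorter AND higher."""
    return samples[::factor]

def estimate_freq(samples, rate_hz=8000):
    """Rough frequency estimate: count upward zero crossings per second."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * rate_hz / len(samples)

tone = sine(500, 1.0)          # one second of a 500 Hz tone
fast = naive_speedup(tone)     # half a second long

print(estimate_freq(tone))     # roughly 500
print(estimate_freq(fast))     # roughly 1000: the pitch has doubled
```

A time-compression scheme for speech has to avoid that second result: the
output should be shorter, but the estimated frequency should stay put.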

I believe the crude chopping you describe above is one technique that has
been used to accomplish this, and that it works surprisingly well.  Even
though some critical events get dropped on the floor this way -- the pop in
a "P", for example -- listeners quickly learn to compensate for this.  One
can do a better job if more attention is paid to the smoothing: chopping at
zero-crossings, etc.  Some pitch-shifter/harmonizer boxes used in music
processing do this sort of thing.  The best approach would probably be to
move everything into the Fourier domain, slide and stretch everything
around smoothly, and then convert back, but I doubt that any practical
reading machines actually do this.  Only in the last couple of years has
the necessary signal-processing power been available on a single chip.
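
The chop-and-smooth idea can be sketched in a few lines of Python.  This is
only an illustration under stated assumptions, not a description of how any
real device works: the function name compress_by_chopping, the 5 msec
keep/drop lengths (borrowed from the question), the 1 msec linear
crossfade, and the 8 kHz sample rate are all hypothetical choices.  Each
kept segment is blended into the previous one over a short overlap, so the
output is about half as long while the samples inside each segment are
left at their original spacing, and hence at their original frequencies.

```python
import math

def compress_by_chopping(samples, rate_hz=8000, keep_ms=5, drop_ms=5, fade_ms=1):
    """Keep `keep_ms` of every `keep_ms + drop_ms` of signal and
    crossfade adjacent kept segments over `fade_ms`."""
    keep = rate_hz * keep_ms // 1000
    drop = rate_hz * drop_ms // 1000
    fade = rate_hz * fade_ms // 1000
    hop = keep + drop
    out = []
    for start in range(0, len(samples) - (keep + fade) + 1, hop):
        # Take a little extra (`fade` samples) to overlap with the next segment.
        seg = samples[start:start + keep + fade]
        if not out:
            out.extend(seg)
            continue
        # Linear crossfade: tail of the output fades out, head of `seg` fades in.
        for i in range(fade):
            w = (i + 1) / (fade + 1)
            out[-fade + i] = (1 - w) * out[-fade + i] + w * seg[i]
        out.extend(seg[fade:])
    return out

rate = 8000
tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(rate)]
short = compress_by_chopping(tone, rate_hz=rate)   # about half the length
```

The 1000 Hz test tone is chosen deliberately so that every kept segment
starts at the same phase, which makes the splices seamless and the output
simply a shorter 1000 Hz tone.  Real speech offers no such luck: the
segments generally do not line up, which is where the crossfade, or cutting
at zero-crossings, earns its keep.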

-- Scott Fahlman