Connectionists: Comparing speech recognition Word Error Rates is deceitful, please stop

Navdeep Jaitly ndjaitly at gmail.com
Wed Dec 16 00:14:50 EST 2015


After skimming the paper, I want to point out that most of these results
are reported on short utterances, not on long conversations such as
lectures. In the domain of voice search queries, language models play a
very strong role in improving WER, and it's not much of a stretch to say
that current systems are performing close to the accuracy of human
transcription (although I would stop short of saying that they are actually
better).

On longer conversations such as lectures, human beings are obviously much
better than speech recognition systems, thanks to a variety of mechanisms
that our speech recognition systems do not have. So it's not clear that we
can infer from our failings in this domain that we are proportionally just
as bad in the domain of short queries.

One of these mechanisms, I think, has to do with the human ability to adapt
language models on the fly (for example if we are in a lecture on abstract
algebra, we are able to adapt our language model to expect to hear
Homeomorphisms over and over again, and having heard it once, we can use it
to inform ourselves later). Our techniques for doing this just haven't
evolved to the point that we can do so well in this domain.
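
To make the idea of on-the-fly adaptation concrete, here is a toy sketch of a
cache-style language model: a fixed background unigram model interpolated
with counts of words heard so far in the current session. Everything in it
(the class name CacheUnigramLM, the interpolation weight, the floor
probability, the made-up background probabilities) is an assumption chosen
for illustration, not something taken from the paper or from any production
system, and the result is not a properly normalized distribution.

```python
from collections import Counter


class CacheUnigramLM:
    """Toy unigram LM: background model interpolated with a session cache.

    Hypothetical illustration only; the weights and floor are arbitrary.
    """

    def __init__(self, background, cache_weight=0.2, floor=1e-6):
        self.background = background      # word -> prior probability
        self.cache_weight = cache_weight  # weight given to the session cache
        self.floor = floor                # fallback for words missing from background
        self.cache = Counter()            # counts of words heard this session

    def observe(self, word):
        """Record a word heard in the current lecture or conversation."""
        self.cache[word] += 1

    def prob(self, word):
        """Interpolated probability; pure background if nothing has been heard."""
        p_bg = self.background.get(word, self.floor)
        total = sum(self.cache.values())
        if total == 0:
            return p_bg
        p_cache = self.cache[word] / total
        return (1 - self.cache_weight) * p_bg + self.cache_weight * p_cache


# Made-up background probabilities, for illustration only.
background = {"the": 0.05, "group": 1e-4, "homeomorphism": 1e-7}
lm = CacheUnigramLM(background)

before = lm.prob("homeomorphism")  # rare word under the background model
lm.observe("homeomorphism")        # heard once in the lecture
after = lm.prob("homeomorphism")   # now boosted by the cache
print(before, after)
```

Having heard the word once, the model expects it far more strongly on the
next occurrence, which is roughly the behavior I am describing in human
listeners.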

On Tue, Dec 15, 2015 at 2:49 PM, Richard Loosemore <rloosemore at susaro.com>
wrote:

>
> So, can I take it that no-one disagrees with this?  :-)
>
> I have received private emails from people who say they agree with this
> analysis, but no one speaks out publicly (and that, on a mailing list with
> some ferociously opinionated correspondents, too!).
>
> If so, is this not a little .... shocking?  That no one bats an eye when
> leading researchers give the strong impression that their systems are "near
> or exceeding human performance" when in fact the truth is that they are ONE
> THOUSAND times worse than human performance?
>
>
> Richard Loosemore
>
>
>
>
>
> On 12/14/15, 11:20 AM, Richard Loosemore wrote:
>
>>
>> I just read "Deep Speech 2: End-to-End Speech Recognition in English and
>> Mandarin" by Amodei et al. (http://arxiv.org/abs/1512.02595v1), and I
>> have finally reached the end of my tether over the reporting of Word Error
>> Rates (WER).
>>
>> These rates are being used to make a comparison with human performance on
>> TRANSCRIPTION of speech.  But transcription involves recognition plus a
>> complex pile of work like memory storage, time pressure, and semantic
>> paraphrasing.  And I would be willing to bet that almost all the errors are
>> in the non-recognition parts.
>>
>> But by using transcription error, the reported error rate for humans is
>> supposedly about 5%, and on that basis Amodei et al declare that their
>> system is now better than human.
>>
>> That is ludicrous.  If I give an hour-long lecture I can cram in about
>> 20,000 words, and I would be willing to bet that not one of those words
>> would be misrecognized by any of the students in my audience who were
>> actually awake.  That would be an error rate that is three orders of
>> magnitude smaller than the one for transcription.
>>
>> Amodei et al (and all the other deep learning speech recognition folks
>> who overinflate claims on a regular basis):  your system is NOT
>> outperforming humans, because your system should be compared with the
>> primal recognition rate in humans, and since humans are probably about 1000
>> times better, you have a long way to go.
>>
>>
>> Richard Loosemore
>>
>>
>>
>>
>