Connectionists: Comparing speech recognition Word Error Rates is deceitful, please stop
Richard Loosemore
rloosemore at susaro.com
Mon Dec 14 11:20:57 EST 2015
I just read "Deep Speech 2: End-to-End Speech Recognition in English and
Mandarin" by Amodei et al. (http://arxiv.org/abs/1512.02595v1), and I
have finally reached the end of my tether over the reporting of Word
Error Rates (WER).
These rates are being used to make a comparison with human performance
on TRANSCRIPTION of speech. But transcription involves recognition plus
a complex pile of other work: memory storage, time pressure, and semantic
paraphrasing. And I would be willing to bet that almost all of the human
errors are in the non-recognition parts.
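For what it is worth, the WER these papers report is just a word-level
edit distance between a transcript and a reference, so every slip a human
transcriber makes under time pressure (a dropped or paraphrased word)
counts against "recognition" exactly as a genuinely misheard word would.
A minimal sketch of that standard calculation (word-level Levenshtein
distance; the function name and the toy sentences are mine for
illustration, not the authors' scoring script):

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One paraphrased word in this eight-word reference already costs 12.5% WER,
# even though nothing was misheard.
print(word_error_rate("the model performs very well on clean speech",
                      "the model performs really well on clean speech"))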
But because transcription error is what gets measured, the reported error
rate for humans is supposedly about 5%, and on that basis Amodei et al.
declare that their system is now better than human.
That is ludicrous. If I give an hour-long lecture, I can cram in about
20,000 words, and I would be willing to bet that not one of those words
would be misrecognized by any of the students in my audience who were
actually awake. That would be an error rate that is three orders of
magnitude smaller than the one for transcription.
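Just to make the arithmetic explicit, using only the figures already
quoted above (the 20,000-word lecture and the ~5% transcription number;
nothing here is a measurement):

lecture_words = 20_000        # the hour-long-lecture figure claimed above
recognition_errors = 1        # be generous: allow one misrecognized word
transcription_wer = 0.05      # the ~5% human transcription figure cited above

listening_error_rate = recognition_errors / lecture_words  # 0.00005 = 0.005%
print(transcription_wer / listening_error_rate)            # -> 1000.0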
Amodei et al. (and all the other deep learning speech recognition folks
who overinflate claims on a regular basis): your system is NOT
outperforming humans, because your system should be compared with the
raw recognition error rate in humans, and since humans are probably about
1000 times better, you have a long way to go.
Richard Loosemore