Connectionists: Comparing speech recognition Word Error Rates is deceitful, please stop
Richard Loosemore
rloosemore at susaro.com
Mon Dec 14 11:20:57 EST 2015
I just read "Deep Speech 2: End-to-End Speech Recognition in English and
Mandarin" by Amodei et al. (http://arxiv.org/abs/1512.02595v1), and I
have finally reached the end of my tether over the reporting of Word
Error Rates (WER).
These rates are being used to make a comparison with human performance
on TRANSCRIPTION of speech. But transcription involves recognition plus
a complex pile of other work: memory storage, time pressure, and semantic
paraphrasing. And I would be willing to bet that almost all of the human
errors are in the non-recognition parts.
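For what it is worth, the WER these papers report is just a word-level
edit distance between a transcript and a reference, so every slip a human
transcriber makes under time pressure (a dropped or paraphrased word)
counts against "recognition" exactly as a genuinely misheard word would.
A minimal sketch of that standard calculation (word-level Levenshtein
distance; the function name and the toy sentences are mine for
illustration, not the authors' scoring script):

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One paraphrased word in this eight-word reference already costs 12.5% WER,
# even though nothing was misheard.
print(word_error_rate("the model performs very well on clean speech",
                      "the model performs really well on clean speech"))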
But because transcription error is what gets measured, the reported error
rate for humans is supposedly about 5%, and on that basis Amodei et al.
declare that their system is now better than human.
That is ludicrous. If I give an hour-long lecture, I can cram in about
20,000 words, and I would be willing to bet that not one of those words
would be misrecognized by any of the students in my audience who were
actually awake. That would be an error rate that is three orders of
magnitude smaller than the one for transcription.
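Just to make the arithmetic explicit, using only the figures already
quoted above (the 20,000-word lecture and the ~5% transcription number;
nothing here is a measurement):

lecture_words = 20_000        # the hour-long-lecture figure claimed above
recognition_errors = 1        # be generous: allow one misrecognized word
transcription_wer = 0.05      # the ~5% human transcription figure cited above

listening_error_rate = recognition_errors / lecture_words  # 0.00005 = 0.005%
print(transcription_wer / listening_error_rate)            # -> 1000.0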
Amodei et al. (and all the other deep learning speech recognition folks
who overinflate claims on a regular basis): your system is NOT
outperforming humans, because your system should be compared with the
raw recognition error rate in humans, and since humans are probably about
1000 times better, you have a long way to go.
Richard Loosemore