Connectionists: Comparing speech recognition Word Error Rates is deceitful, please stop

Wed Dec 16 11:27:50 EST 2015

The comments in reply to my original post (including some of the 
ones I have received off-list) are getting surreal.

My main point was:  in the paper, humans were reported to have an error 
rate for speech recognition of one word in twenty.  If what we are 
talking about is ordinary, in-the-wild recognition of speech, that rate 
is transparently ridiculous.  Do you make a mistake recognizing every 
20th word you hear?

Clearly not.  The rate reported is for a human who is recognizing AND 
transcribing.

Yes, I am appealing to common sense to make that point:  but am I really 
supposed to do a factor analysis to demonstrate that it is 
"transparently ridiculous" to suggest that humans have a 1-in-20 error 
rate?   I suggested (that is all: suggested) that the real number for a 
pure recognition task was probably closer to 1 in 20,000 or less.  Do we 
really have to have a debate about how accurate that suggestion was, or 
how irresponsible I am to make the suggestion?  It is clearly not 1 in 
20, so I made a first stab at a better number.
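
To make the gap between the two rates concrete, here is a rough sketch of 
the arithmetic.  The speaking rate (150 words per minute) and the lecture 
length (50 minutes) are assumptions chosen purely for illustration; 
neither figure comes from the paper.

```python
# Illustrative arithmetic only. WORDS_PER_MINUTE and LECTURE_MINUTES are
# assumed values, not figures taken from the Amodei et al. paper.
WORDS_PER_MINUTE = 150  # assumed lecture speaking rate
LECTURE_MINUTES = 50    # assumed lecture length


def expected_errors(words: int, one_in: int) -> float:
    """Expected number of misrecognized words at a rate of 1 error per `one_in` words."""
    return words / one_in


total_words = WORDS_PER_MINUTE * LECTURE_MINUTES  # 7500 words heard

# The 1-in-20 rate reported for humans in the paper.
print(expected_errors(total_words, 20))      # 375.0

# The rough 1-in-20,000 suggestion for pure recognition.
print(expected_errors(total_words, 20_000))  # 0.375
```

Under those assumptions, a 1-in-20 rate would mean several hundred 
misheard words in a single lecture, while 1 in 20,000 would mean less 
than one.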

Secondly:  in BOTH the case of people naturally listening to speech (the 
lecture that I mentioned) and a transcriber trying to write down speech 
in an online task, there will be all kinds of high-level processing that 
makes a top-down contribution to the recognition task, so it makes no 
sense (Stefano Rovetta) to discount what I said about possible error 
rates of less than 1 in 20,000 when listening to a lecture.  Yes, you 
can help the recognition process by understanding the content of a 
lecture ... but so can the person doing transcription.

Finally, my comments were not a specific accusation of fraud directed 
against Amodei et al., because I extended my target to "all the other 
deep learning speech recognition folks who overinflate claims on a 
regular basis".

Here is the last paragraph of the conclusion to the Amodei et al. paper:

"Overall, we believe our results confirm and exemplify the value of 
end-to-end Deep Learning methods for speech recognition in several 
settings. In those cases where our system is not already comparable to 
humans, the difference has fallen rapidly, largely because of 
application-agnostic Deep Learning techniques. We believe these 
techniques will continue to scale, and thus conclude that the vision of 
a single speech system that outperforms humans in most scenarios is 
imminently achievable."

Everything about this paragraph shouts "speech system that outperforms 
humans".

What it does not say is "speech system that outperforms humans ... but 
only if we are talking about humans who are being overloaded by the need 
to simultaneously perform the task of transcribing the results of 
recognition".

The computer speech recognition system finds the transcription part 
utterly trivial; the human finds it crippling.  It doesn't take a rocket 
surgeon to figure that out.

Richard Loosemore
