Connectionists: Comparing speech recognition Word Error Rates is deceitful, please stop
Richard Loosemore
rloosemore at susaro.com
Wed Dec 16 11:27:50 EST 2015
The comments in reply to my original post (including some of the
ones I have received off-list) are getting surreal.
My main point was: in the paper, humans were reported to have an error
rate for speech recognition of one word in twenty. If what we are
talking about is ordinary, in-the-wild recognition of speech, that rate
is transparently ridiculous. Do you make a mistake recognizing every
20th word you hear?
Clearly not. The rate reported is for a human who is recognizing AND
transcribing.
Yes, I am appealing to common sense to make that point: but am I really
supposed to do a factor analysis to demonstrate that it is
"transparently ridiculous" to suggest that humans have a 1-in-20 error
rate? I suggested (that is all: suggested) that the real number for a
pure recognition task was probably closer to 1 in 20,000 or less. Do we
really have to have a debate about how accurate that suggestion was, or
how irresponsible I am to make the suggestion? It is clearly not 1 in
20, so I made a first stab at a better number.
Secondly: in BOTH the case of people naturally listening to speech (the
lecture that I mentioned) and a transcriber trying to write down speech
in an online task, there will be all kinds of high-level processing that
makes a top-down contribution to the recognition task, so it makes no
sense (Stefano Rovetta) to discount what I said about possible error
rates of less than 1 in 20,000 when listening to a lecture. Yes, you
can help the recognition process by understanding the content of a
lecture ... but so can the person doing transcription.
Finally, my comments were not a specific accusation of fraud directed
against Amodei et al., because I extended my target to "all the other
deep learning speech recognition folks who overinflate claims on a
regular basis".
Here is the last paragraph of the conclusion to the Amodei et al. paper:
"Overall, we believe our results confirm and exemplify the value of
end-to-end Deep Learning methods for speech recognition in several
settings. In those cases where our system is not already comparable to
humans, the difference has fallen rapidly, largely because of
application-agnostic Deep Learning techniques. We believe these
techniques will continue to scale, and thus conclude that the vision of
a single speech system that outperforms humans in most scenarios is
imminently achievable."
Everything about this paragraph shouts "speech system that outperforms
humans".
What it does not say is "speech system that outperforms humans ... but
only if we are talking about humans who are being overloaded by the need
to simultaneously perform the task of transcribing the results of
recognition".
The computer speech recognition system finds the transcription part
utterly trivial; the human finds it crippling. It doesn't take a rocket
surgeon to figure that out.
Richard Loosemore