<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<br>
The comments in reply to my original post (including some that I
have received off-list) are getting surreal.<br>
<br>
My main point was: in the paper, humans were reported to have an
error rate for speech recognition of one word in twenty. If what we
are talking about is ordinary, in-the-wild recognition of speech,
that rate is transparently ridiculous. Do you make a mistake
recognizing every 20th word you hear?<br>
<br>
Clearly not. The rate reported is for a human who is recognizing
AND transcribing.<br>
<br>
Yes, I am appealing to common sense to make that point, but am I
really supposed to do a factor analysis to demonstrate that it is
"transparently ridiculous" to suggest that humans have a 1-in-20
error rate? I suggested (that is all: suggested) that the real
number for a pure recognition task was probably closer to 1 in
20,000, or lower still. Do we really need a debate about how
accurate that suggestion was, or how irresponsible I am to make the
suggestion? It is clearly not 1 in 20, so I made a first stab at a
better number.<br>
<br>
Secondly: in BOTH the case of people naturally listening to speech
(the lecture that I mentioned) and a transcriber trying to write
down speech in an online task, there will be all kinds of high-level
processing that makes a top-down contribution to the recognition
task, so it makes no sense (Stefano Rovetta) to discount what I said
about possible error rates of less than 1 in 20,000 when listening
to a lecture. Yes, you can help the recognition process by
understanding the content of a lecture ... but so can the person
doing transcription.<br>
<br>
Finally, my comments were not a specific accusation of fraud
directed against Amodei et al., because I extended my target to "all
the other deep learning speech recognition folks who overinflate
claims on a regular basis".<br>
<br>
Here is the last paragraph of the conclusion to the Amodei et al.
paper:<br>
<br>
"Overall, we believe our results confirm and exemplify the value of
end-to-end Deep Learning methods for speech recognition in several
settings. In those cases where our system is not already comparable
to humans, the difference has fallen rapidly, largely because of
application-agnostic Deep Learning techniques. We believe these
techniques will continue to scale, and thus conclude that the vision
of a single speech system that outperforms humans in most scenarios
is imminently achievable."<br>
<br>
Everything about this paragraph shouts "speech system that
outperforms humans".<br>
<br>
What it does not say is "speech system that outperforms humans ...
but only if we are talking about humans who are being overloaded by
the need to simultaneously perform the task of transcribing the
results of recognition".<br>
<br>
The computer speech recognition system finds the transcription part
utterly trivial; the human finds it crippling. It doesn't take a
rocket surgeon to figure that out.<br>
<br>
<br>
Richard Loosemore<br>
<br>
<br>
<br>
</body>
</html>