Robinson's vowel dataset

Scott_Fahlman@SEF-PMAX.SLISP.CS.CMU.EDU
Tue Oct 29 15:23:46 EST 1991


    Does anyone have any NEW results on Robinson's vowel dataset?  I am aware of
    the original results given in his thesis:

    A. Robinson, "Dynamic Error Propagation Networks", PhD thesis, Cambridge
    University, 1989.
    
I don't know of any more recent publications on this problem.  I got some
rather good results using Cascade-Correlation:

(train 300 300 25)
SigOff 0.10, WtRng 1.00, WtMul 1.00
OMu 2.00, OEps 1.00, ODcy 0.0300, OPat 12, OChange 0.010
IMu 2.00, IEps 10.00, IDcy 0.0300, IPat 8, IChange 0.030
Utype :SIGMOID, Otype :SIGMOID, RawErr NIL, Pool 32

Trial 0:  181 of 462 cases wrong, 281 right, 60.82% @ 23 hidden
Trial 1:  174 of 462 cases wrong, 288 right, 62.34% @ 11 hidden
Trial 2:  193 of 462 cases wrong, 269 right, 58.23% @ 24 hidden
Trial 3:  174 of 462 cases wrong, 279 right, 60.39% @ 15 hidden
Trial 4:  180 of 462 cases wrong, 282 right, 61.04% @ 24 hidden
Trial 5:  186 of 462 cases wrong, 276 right, 59.74% @ 17 hidden
Trial 6:  188 of 462 cases wrong, 274 right, 59.31% @ 11 hidden
Trial 7:  174 of 462 cases wrong, 288 right, 62.34% @ 15 hidden
Trial 8:  173 of 462 cases wrong, 289 right, 62.55% @ 13 hidden
Trial 9:  170 of 462 cases wrong, 292 right, 63.20% @ 18 hidden
Avg:      180 of 462 cases wrong, 282 right, 61.03% @ 17 hidden

The test set was evaluated after each output-training phase, and the best value
obtained is the one reported above.
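
For concreteness, here is a minimal Python sketch of that reporting convention
(the function and the trajectory numbers below are made-up illustrations, not
the simulator's actual code or output): score the test set after every output
phase and keep the best value seen anywhere along the run.

def best_test_score(per_phase_right, n_test=462):
    """per_phase_right[k] = number of test cases classified correctly after
    the output-training phase at k hidden units.  Return the best count, the
    hidden-unit count where it occurred, and the percentage."""
    best_k = max(range(len(per_phase_right)), key=lambda k: per_phase_right[k])
    best = per_phase_right[best_k]
    return best, best_k, 100.0 * best / n_test

if __name__ == "__main__":
    # Made-up trajectory: accuracy climbs at first, then wanders up and down.
    trajectory = [210, 235, 250, 262, 270, 281, 268, 275, 279, 265]
    right, k, pct = best_test_score(trajectory)
    print(f"{right} of 462 cases right, {pct:.2f}% @ {k} hidden")

Reporting the maximum of such a trajectory is what makes the numbers above
optimistic, as discussed below.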

The best results obtained by Robinson were 260 right (56%) for nearest
neighbor, and 253 right (55%) for 528 Gaussian nodes or 88 square nodes.
Backprop with 88 sigmoids never got better than 234 right (51%).

I've never published these results, because I think they are a bit of a
cheat.  The problem is that I played around with the decay factor and other
parameters until I got good results on the test set.  It's not clear that
the same setting would give equally good performance on a new test set that
I had never seen.  Also, in all cases the algorithm obtained a solid level
of 59% or so, but then wandered up and down, in no particular pattern, as
new units were added.  I can get a good number -- up to 63% -- by grabbing
the best point on this random walk, but I don't honestly believe that the
network at that point would give equally good results on new test data
drawn from the same distribution.

What we really need is a much larger data set for this problem.  Then we
could split the set into training data (a larger set, offering much better
generalization), cross-validation data (used to determine when training
should stop), and final test data, never used in training.  The current
set is so small that it's not possible to split things up this way.
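
As a minimal sketch of that protocol, assuming a larger sample of the same
kind of data were available (the array names, split fractions, and use of
numpy here are illustrative assumptions, not part of the original setup):

import numpy as np

def three_way_split(features, labels, frac_train=0.6, frac_val=0.2, seed=0):
    """Shuffle the data and split it into training, cross-validation,
    and final test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_train = int(frac_train * len(features))
    n_val = int(frac_val * len(features))
    train_idx, val_idx, test_idx = np.split(order, [n_train, n_train + n_val])
    return ((features[train_idx], labels[train_idx]),
            (features[val_idx], labels[val_idx]),
            (features[test_idx], labels[test_idx]))

With such a split, the decay factor and the stopping point would be chosen by
watching the cross-validation set only, and the final test set would be scored
exactly once, after every choice had been frozen.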

-- Scott Fahlman

