Send us your data

Geoffrey Hinton hinton at cs.toronto.edu
Mon Sep 19 11:32:11 EDT 1994



We are planning to create a database of tasks for evaluating supervised neural
network learning procedures (both classification and regression).  The main
aim of the enterprise is to make it as easy as possible for neural net
researchers to compare the performance of their latest algorithm with the
performance of many other techniques on a wide variety of tasks.  A subsidiary
aim is to encourage neural net researchers to use systematic ways of setting
"free parameters" in their algorithms (such as the number of hidden units, the
weight-decay, etc.).  It's easy to fudge these parameters on a single
dataset, but such fudges become far more evident when the same algorithm is
applied to many different tasks.

If you have a real-world dataset with 500 or more input-output cases that fits
the criteria below we would really like to get it from you.  You will be
helping the research community, and by including it in this database you will
ensure that lots of different methods get tried on your data.


WHAT'S THE POINT OF YET ANOTHER DATABASE?

1. Since some neural network learning procedures are quite slow, we want a
database in which there is a designated training and test set for each task.
We don't want to train many different times on different subsets of the data,
testing on the remainder.  To avoid excessive sampling error we want the
designated test set to be quite large, so even though the aim is to evaluate
performance on smallish training sets, we will avoid tasks where there is only
a small amount of data available for testing.  The justification for only using
a single way of splitting the data into training and test sets is this: For a
given amount of computer time, it's better to evaluate performance once on
many different tasks than to evaluate performance many times on one task,
since this cuts down on the noise caused by the random choice of task.
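
To make this concrete, here is a rough Python sketch of what we mean by a
single designated split (the sizes, the random seed and the placeholder
arrays are our own inventions for illustration, not part of the database
design):

    import numpy as np

    # One designated split, fixed once with a saved seed so that every
    # method sees exactly the same training and test cases.
    rng = np.random.default_rng(seed=1994)
    n_cases, n_train = 5000, 500            # smallish training set, large test set

    X = rng.normal(size=(n_cases, 8))       # stand-in for the real input vectors
    y = rng.normal(size=n_cases)            # stand-in for the real targets

    order = rng.permutation(n_cases)        # shuffle exactly once
    train_idx, test_idx = order[:n_train], order[n_train:]

    X_train, y_train = X[train_idx], y[train_idx]   # used for all learning and tuning
    X_test,  y_test  = X[test_idx],  y[test_idx]    # touched only for final scoring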

2. To make life easy we want to focus on tasks in which there are no missing
values and all of the inputs are numerical.  This could be viewed as 
tailoring the database to make life easy for algorithms that are limited in
certain ways. That is precisely our intention.

3. We want all the tasks to be in the same format (which they are not if a
researcher gets different tasks from different databases).

4. We want the database to include results from many different algorithms with
an email guarantee that NONE of the free parameters of the algorithms were
tuned by looking at the results on the test data.  So for a result to be
entered the researcher will have to specify how all the free parameters were
set and the same recipe should preferably be used for all the tasks in the
database.  It's fine to say "I always use 80 hidden units because that worked
nicely on the first example I ever tried".  Just so long as there is a reason.
We plan to run quite a few of the standard methods ourselves, so other
researchers will be able to just run their favorite method and get a fair
comparison with other methods.
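
As an illustration of the sort of recipe we have in mind, the Python sketch
below sets a free parameter using only a held-out slice of the training data;
the stand-in learner (a linear model with weight decay) and the candidate
values are invented for the example and are not a prescription:

    import numpy as np

    def ridge_fit(X, y, weight_decay):
        # Stand-in learner: a linear model with weight decay.  A real entry
        # would substitute whatever network the researcher is proposing.
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + weight_decay * np.eye(d), X.T @ y)

    def choose_weight_decay(X_train, y_train, rng, candidates=(1e-4, 1e-2, 1.0)):
        # The recipe sees ONLY the training data: part of it is held out for
        # picking the weight decay; the designated test set is never consulted.
        n = len(X_train)
        order = rng.permutation(n)
        fit, val = order[: int(0.8 * n)], order[int(0.8 * n):]
        errors = []
        for wd in candidates:
            w = ridge_fit(X_train[fit], y_train[fit], wd)
            errors.append(np.mean((X_train[val] @ w - y_train[val]) ** 2))
        return candidates[int(np.argmin(errors))]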


WHAT KINDS OF TASKS WILL WE INCLUDE?

In addition to excluding missing values and nominal attributes we will
initially exclude time series tasks, so the order of the examples will be
unimportant. 

Each task will have a description that includes known limits on input and
output variables.  

We will include both real and synthetic tasks. For the synthetic tasks the
description will specify the correct generative model (i.e. exactly how the
data was generated), but researchers will only be allowed to use the training
data for learning.  They will have to pretend that they do not know the
correct generative model when they are setting the free parameters of their
algorithm.
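
For example, a synthetic regression task might be produced roughly as follows
(the particular generative model, noise level and sizes here are invented
purely for illustration; the real ones would be spelled out in the task
description):

    import numpy as np

    rng = np.random.default_rng(seed=7)

    # Invented generative model, which the task description would state in full:
    #   y = sin(3*x1) + 0.5*x2**2 + Gaussian noise; x3 is an irrelevant input.
    n_cases = 3000
    X = rng.uniform(-1.0, 1.0, size=(n_cases, 3))
    y = np.sin(3.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(0.0, 0.1, size=n_cases)

    # Designated split: smallish training set, large test set.
    X_train, y_train = X[:500], y[:500]
    X_test,  y_test  = X[500:], y[500:]
    # A researcher may read the description of the model above, but must set
    # all free parameters using (X_train, y_train) alone.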

Tasks will vary in the following ways:

Dimensionality of input.
Dimensionality of output.
Degree of non-linearity.
Noise level and type of noise in both input and output.
Number of irrelevant variables.
The existence of topology on the input space and known invariances in the
    input-output mapping.


WHERE WILL THE TASKS COME FROM?

Many of them will come from existing databases (e.g. the UC Irvine machine
learning database).

Hopefully other connectionists will provide us with data or with pointers to
data or databases.  To be useful, the data should have few missing values
(we'll simply leave out those cases), no nominal attributes, and at least 500 
cases (preferably many more) so that the test set can be large.
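
As a rough sketch of the screening we would apply (assuming, for
illustration, that missing values are coded as NaN):

    import numpy as np

    def screen_dataset(X, y, min_cases=500):
        # Cases with any missing value are simply left out, as described
        # above; nominal attributes should already be absent.
        complete = ~(np.isnan(X).any(axis=1) | np.isnan(y))
        X, y = X[complete], y[complete]
        if len(X) < min_cases:
            raise ValueError("only %d complete cases; need at least %d"
                             % (len(X), min_cases))
        return X, y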

In addition to sending pointers to suitable datasets, now would be a good
time to comment on the design of this database, since it will soon be too
late to change it; we would like this effort to be of use to as many fellow
researchers as possible.



