Topic and purpose of the workshop
=================================
Proper benchmarking of neural networks on non-toy
examples is needed from an application perspective in
order to evaluate the relative strengths and weaknesses of
proposed algorithms, and from a theoretical perspective in
order to validate theoretical predictions and see how they
relate to realistic learning tasks. Despite this important
role, NN benchmarking is rarely done well enough today:
o Learning tasks: Most researchers use only toy
problems and, perhaps, one at least somewhat
realistic problem. While this shows that an
algorithm works at all, it cannot explore its
strengths and weaknesses.
o Design: Often the setup is flawed and cannot
produce statistically valid results (see the
sketch after this list).
o Reproducibility: In many cases, the setup is not
described precisely enough to reproduce the
experiments. This violates scientific principles.
o Comparability: Hardly ever are the setups of two
different researchers similar enough that their
experimental results can be compared directly.
As a result, even after a large number of
experiments with certain algorithms, the
differences in their learning results may remain
unclear.
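To make the design point concrete, here is a minimal sketch
(in Python; the resampling scheme, the stand-in training
functions, and all numbers are illustrative assumptions, not a
prescribed procedure) of a comparison that supports
statistically valid conclusions: both methods are run on
identical resampled splits and the mean difference in test
error is checked with a paired significance test.

# Sketch: statistically defensible comparison of two learning methods.
# Assumption: each method can be trained and tested for a given split seed
# and returns one test error; the stand-in functions below are fake.
import numpy as np
from scipy import stats

def compare(method_a, method_b, split_seeds):
    """Run both methods on identical splits; paired t-test on test errors."""
    errors_a = np.array([method_a(seed) for seed in split_seeds])
    errors_b = np.array([method_b(seed) for seed in split_seeds])
    t_stat, p_value = stats.ttest_rel(errors_a, errors_b)  # paired: same splits
    return errors_a.mean(), errors_b.mean(), p_value

# Hypothetical stand-ins for real training runs (illustration only).
def fake_method_a(seed):
    return 0.12 + 0.01 * np.random.default_rng(seed).standard_normal()

def fake_method_b(seed):
    return 0.13 + 0.01 * np.random.default_rng(seed + 10_000).standard_normal()

split_seeds = list(range(10))            # ten independent resampling seeds
mean_a, mean_b, p = compare(fake_method_a, fake_method_b, split_seeds)
print(f"mean error A={mean_a:.3f}, B={mean_b:.3f}, paired p-value={p:.3f}")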
There are various reasons why we still find this situation:
o unawareness of the importance of proper
benchmarking;
o insufficient pressure from reviewers towards
good benchmarking;
o unavailability of a sufficient number of standard
benchmarking datasets;
o lack of standard benchmarking procedures.
The purpose of the workshop is to address these issues in
order to improve research practices, in particular to
encourage more benchmarking with more and better datasets,
better reproducibility, and better comparability. Specific
questions to be addressed at the workshop are:
[Concerning the data:]
o What benchmarking facilities (in particular:
datasets) are publicly available? For which kinds
of domains? How suitable are they?
o What facilities would we like to have? Who is
willing to prepare and maintain them?
o Where and how can we get new datasets from real
applications?
[Concerning the methodology:]
o When and why would we prefer artificial datasets
over real ones and vice versa?
o What data representation is acceptable for general
benchmarks?
o What are the most common errors in performing
benchmarks? How can we avoid them?
o Real-life benchmarking war stories and lessons
learned
o What must be reported for proper
reproducibility? (A sketch follows this list.)
o What are useful general benchmark approaches
(broad vs. deep etc.)?
o Can we agree on a small number of standard
benchmark setup styles in order to improve
comparability? Which styles?
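As one hedged answer sketch to the reproducibility question
above (every field name and value is purely illustrative, not
part of any proposed standard), a benchmark result could be
accompanied by a machine-readable record of everything needed
to repeat the run:

# Sketch of a self-describing benchmark record; every field name and value
# below is illustrative only, not a proposed standard.
import json

record = {
    "dataset": "example-task-v1",                      # hypothetical identifier
    "split": {"train": 0.70, "validation": 0.15, "test": 0.15, "seed": 42},
    "preprocessing": ["z-score inputs", "one-of-n encode class labels"],
    "model": {"type": "mlp", "hidden_units": [16], "activation": "tanh"},
    "training": {"algorithm": "rprop", "max_epochs": 200,
                 "early_stopping": "on validation error"},
    "repetitions": 10,                                 # runs with distinct seeds
    "metric": "test set classification error",
}

with open("benchmark_record.json", "w") as f:
    json.dump(record, f, indent=2)                     # archive with the results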
The workshop will focus on two things: launching a
new benchmark database that is currently being prepared
by some of the workshop chairs, and discussing the above
questions both in general and in the context of this database.
The benchmark database facility is planned to comprise
o datasets,
o data format conversion tools (a small sketch
follows this list),
o terminological and methodological suggestions,
and
o a results database.
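To illustrate the data format conversion item (a sketch only;
the whitespace-separated input format, the missing-value
marker, and the function names are assumptions, not the
planned facility's actual interface):

# Sketch of a tiny conversion helper: read a whitespace-separated attribute
# file into a numeric matrix and normalize its columns. The format assumption
# ('?' marks a missing value) is illustrative, not the facility's real format.
import numpy as np

def load_table(path, missing="?"):
    """Read whitespace-separated numeric data; missing entries become NaN."""
    rows = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields:                     # skip blank lines
                continue
            rows.append([np.nan if x == missing else float(x) for x in fields])
    return np.array(rows)

def normalize_columns(data):
    """Scale each column to zero mean and unit variance (NaNs ignored)."""
    mean = np.nanmean(data, axis=0)
    std = np.nanstd(data, axis=0)
    std[std == 0] = 1.0                        # guard against constant columns
    return (data - mean) / std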
Workshop format
===============
We invite anyone who is interested in the above issues to
participate in the discussions at the workshop. The
workshop will consist of a few talks by invited speakers
and extensive discussion periods. The purpose of the
discussion is to refine the design and setup of the
benchmark collection, to explore questions about its
scope, format, and purpose, to motivate potential users
and contributors of the facility, and to discuss
benchmarking in general.
Workshop program
================
The following talks will be given at the workshop [the
list is still preliminary]. After each talk there will be
time for discussion. In the morning session we will focus
on assessing the current state of benchmarking practice and
discussing an abstract ideal of it. In the afternoon session
we will discuss concretely how that ideal might be
realized.
o Lutz Prechelt. A quantitative study of current benchmarking practices.
A quantitative survey of 400 journal articles on
NN algorithms. (15 minutes)
o Tom Dietterich. Experimental Methodology.
Benchmarking goals, measures of behavior,
correct statistical testing, synthetic versus
real-world data. (15 minutes)
o Brian Ripley. What can we learn from the study of the design
of experiments? (15 minutes)
o Lutz Prechelt. Available NN benchmarking data collections.
CMU nnbench, UCI machine learning databases
archive, Proben1, Statlog data, ELENA data (10 minutes).
o Tom Dietterich. Available benchmarking data generators.
(10 minutes)
o Break.
o Carl Rasmussen and Geoffrey Hinton.
A thoroughly designed benchmark collection.
A proposal of data, terminology, and procedures
and a facility for the collection of benchmarking
results. (45 minutes)
o Panel discussion. The future of benchmarking:
purpose and procedures
The WWW address for this announcement is
http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html
Lutz
Dr. Lutz Prechelt (http://wwwipd.ira.uka.de/~prechelt/) | Whenever you
Institut fuer Programmstrukturen und Datenorganisation | complicate things,
Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get
(Phone: +49/721/608-4068, FAX: +49/721/694092) | less simple.