Topic and purpose of the workshop
=================================

Proper benchmarking of neural networks on non-toy
examples is needed from an application perspective in
order to evaluate the relative strengths and weaknesses of
proposed algorithms and from a theoretical perspective in
order to validate theoretical predictions and see how they
relate to realistic learning tasks. Despite this important
role, NN benchmarking is rarely done well enough today:

 o Learning tasks: Most researchers use only toy
   problems and, perhaps, one at least somewhat
   realistic problem. While this shows that an
   algorithm works at all, it cannot reveal its
   strengths and weaknesses. 
 o Design: Often the experimental setup is flawed
   and cannot produce statistically valid results (a
   minimal illustration of a sounder setup follows
   this list). 
 o Reproducibility: In many cases, the setup is not
   described precisely enough to reproduce the
   experiments. This violates scientific principles. 
 o Comparability: Hardly ever are the setups of two
   different researchers similar enough that their
   experimental results could be compared directly.
   As a consequence, even after a large number of
   experiments with certain algorithms, the
   differences in their learning results may remain
   unclear. 
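
As a purely illustrative sketch (not part of the planned
facility; the dataset and learners are arbitrary
placeholders), the following Python fragment shows the kind
of statistically defensible setup meant above: two learners
are trained and tested on the same sequence of random
splits, and their per-split score differences are judged
with a paired t-test rather than by comparing two single
runs.

    # Hypothetical sketch: compare two learners on one task
    # with repeated train/test splits and a paired t-test.
    import numpy as np
    from scipy.stats import ttest_rel
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_breast_cancer(return_X_y=True)
    scores_a, scores_b = [], []
    for seed in range(20):            # 20 independent splits
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, random_state=seed)
        a = make_pipeline(StandardScaler(),
                          MLPClassifier(hidden_layer_sizes=(16,),
                                        max_iter=2000,
                                        random_state=seed))
        b = make_pipeline(StandardScaler(),
                          LogisticRegression(max_iter=2000))
        scores_a.append(a.fit(X_tr, y_tr).score(X_te, y_te))
        scores_b.append(b.fit(X_tr, y_tr).score(X_te, y_te))

    # Paired test: both learners saw exactly the same splits.
    t, p = ttest_rel(scores_a, scores_b)
    print("mean acc A=%.3f  B=%.3f  t=%.2f  p=%.3f"
          % (np.mean(scores_a), np.mean(scores_b), t, p))

Reporting the individual split results (or at least their
mean and standard deviation) together with the exact split
procedure also addresses the reproducibility and
comparability points above.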

There are several reasons why this situation persists:

 o unawareness of the importance of proper
   benchmarking; 
 o insufficient pressure from reviewers towards
   good benchmarking; 
 o unavailability of a sufficient number of standard
   benchmarking datasets; 
 o lack of standard benchmarking procedures. 

The purpose of the workshop is to address these issues in
order to improve research practices, in particular more
benchmarking with more and better datasets, better
reproducibility, and better comparability. Specific
questions to be addressed at the workshop are 

[Concerning the data:] 

 o What benchmarking facilities (in particular:
   datasets) are publicly available? For which kinds
   of domains? How suitable are they? 
 o What facilities would we like to have? Who is
   willing to prepare and maintain them? 
 o Where and how can we get new datasets from real
   applications? 

[Concerning the methodology:] 

 o When and why would we prefer artificial datasets
   over real ones and vice versa? 
 o What data representation is acceptable for general
   benchmarks? 
 o What are the most common errors in performing
   benchmarks? How can we avoid them? 
 o Real-life benchmarking war stories and lessons
   learned 
 o What must be reported for proper
   reproducibility? 
 o What are useful general benchmark approaches
   (broad vs. deep etc.)? 
 o Can we agree on a small number of standard
   benchmark setup styles in order to improve
   comparability? Which styles? 

The workshop will focus on two things: launching a
new benchmark database that is currently being prepared
by some of the workshop chairs, and discussing the above
questions both in general and in the context of this
database. The benchmark database facility is planned to
comprise 

 o datasets, 
 o data format conversion tools (an illustrative
   sketch follows this list), 
 o terminological and methodological suggestions,
   and 
 o a results database. 
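
As a purely illustrative example (the actual tools and file
format of the planned facility are not specified here), a
data format conversion tool could look like the following
Python sketch, which turns a plain CSV file with the class
label in the last column into a whitespace-separated file
with a small self-describing header:

    # Hypothetical sketch of a format conversion tool; the
    # real facility's format may differ.  Input: CSV, class
    # label in the last column.  Output: numeric class
    # indices plus a header documenting example count,
    # attribute count, and class labels.
    import csv
    import sys

    def convert(csv_path, out_path):
        with open(csv_path, newline="") as f:
            rows = [r for r in csv.reader(f) if r]
        labels = sorted({r[-1] for r in rows})
        with open(out_path, "w") as out:
            out.write("# examples: %d\n" % len(rows))
            out.write("# attributes: %d\n" % (len(rows[0]) - 1))
            out.write("# classes: %s\n" % " ".join(labels))
            for r in rows:
                out.write(" ".join(r[:-1] +
                          [str(labels.index(r[-1]))]) + "\n")

    if __name__ == "__main__":
        convert(sys.argv[1], sys.argv[2])

Recording such metadata alongside the data itself is one
way the conversion tools could support the reproducibility
goals discussed above.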

Workshop format
===============

We invite anyone who is interested in the above issues to
participate in the discussions at the workshop. The
workshop will consist of a few talks by invited speakers
and extensive discussion periods. The purpose of the
discussion is to refine the design and setup of the
benchmark collection, to explore questions about its
scope, format, and purpose, to motivate potential users
and contributors of the facility, and to discuss
benchmarking in general. 


Workshop program
================

The following talks will be given at the workshop [the
list is still preliminary]. After each talk there will be
time for discussion. In the morning session we will focus
on assessing the state of the practice of benchmarking
and on discussing an abstract ideal of it. In the
afternoon session we will turn to concrete ways in which
that ideal might be realized. 

 o Lutz Prechelt. A quantitative study of current benchmarking practices.
   A quantitative survey of 400 journal articles on
   NN algorithms. (15 minutes) 
 o Tom Dietterich. Experimental Methodology.
   Benchmarking goals, measures of behavior,
   correct statistical testing, synthetic versus
   real-world data. (15 minutes) 
 o Brian Ripley. What can we learn from the study of the design
   of experiments? (15 minutes)
 o Lutz Prechelt. Available NN benchmarking data collections.
   CMU nnbench, UCI machine learning databases
   archive, Proben1, Statlog data, ELENA data (10 minutes). 
 o Tom Dietterich. Available benchmarking data generators.
   (10 minutes)
 o Break. 
 o Carl Rasmussen and Geoffrey Hinton.
   A thoroughly designed benchmark collection.
   A proposal of data, terminology, and procedures
   and a facility for the collection of benchmarking
   results. (45 minutes) 
 o Panel discussion. The future of benchmarking:
   purpose and procedures 

The WWW address for this announcement is
http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html

 Lutz

Dr. Lutz Prechelt (http://wwwipd.ira.uka.de/~prechelt/) | Whenever you 
Institut fuer Programmstrukturen und Datenorganisation  | complicate things,
Universitaet Karlsruhe;  D-76128 Karlsruhe;  Germany    | they get
(Phone: +49/721/608-4068, FAX: +49/721/694092)          | less simple.

