Mon Jun 5 16:42:55 EDT 2006

The ELENA project investigates several aspects of classification by
neural networks, including links between neural networks and Bayesian
statistical classification, incremental learning, ...
The project includes theoretical work on classification algorithms, 
simulations and benchmarks, especially on realistic industrial
data. Hardware implementation, especially the VLSI option, is the 
final objective. 

The set of databases available is to be used for tests and benchmarks 
of machine-learning classification algorithms.
The databases are split into two parts: ARTIFICIALly generated
databases, mainly used for preliminary tests, and REAL ones, used for
objective benchmarks and comparisons of methods.

The choice of the databases has been guided by various parameters, such
as availability of published results concerning conventional
classification algorithms, size of the database, number of attributes,
number of classes, overlap between classes and non-linearity of
the borders, ...  Results of PCA and DFA preprocessing of the REAL
databases are also included, together with several measures useful for
characterizing the databases (statistics, fractal dimension,
dispersion, ...).

All these databases and their preprocessed versions are available
together with two PostScript technical reports: one describing the
different databases in detail ('Databases.ps.Z' - 45 pages - 777781
bytes) and one on the comparative benchmarking studies
('Benchmarks.ps.Z' - 113 pages - 1927571 bytes) of various algorithms,
either well known in the Statistical and Neural Network communities
(MLP, RCE, LVQ, k_NN, GQC) or developed in the framework of the Elena
project (IRVQ, PLS).

A LaTeX bibfile containing more than 90 entries corresponding to
the Elena partners' bibliography related to the project is also
available ('Elena.bib') in the same directory.  

All files are available by anonymous ftp from the following directory:

  ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases

The databases are split into two parts: the 'ARTIFICIAL' ones,
generated in order to obtain some defined characteristics and for
which the theoretical Bayes error can be computed, and the 'REAL'
ones, collected in existing real-world applications.

The ARTIFICIAL databases ('Gaussian', 'Clouds' and 'Concentric') 
were generated according to the following requirements:   
  - heavy intersection of the class distributions,
  - high degree of nonlinearity of the class boundaries,
  - various dimensions of the vectors,
  - already published results on these databases.     
They are restricted to two-class problems, since we believe this yields
answers to the most essential questions.  
The ARTIFICIAL databases are mainly used for rapid testing of newly 
developed algorithms.

The REAL databases ('Satimage', 'Texture', 'Iris' and 'Phoneme') were 
selected according to the following requirements:
  - classical databases in the field of classification (Iris),
  - already published results on these databases ('Phoneme' from the
      ROARS ESPRIT project and 'Satimage' from the STATLOG ESPRIT 
      project), 
  - various dimensions of the vectors,
  - sufficient number of vectors (to avoid the ``empty space phenomenon''),
  - a high number of classes: the 'Texture' database, generated at INPG
      for the Elena project, has 11 classes.

##############################################################################

				###########
				# DETAILS #
				###########

The 'Benchmarks' technical report
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The 'Benchmarks.ps' Elena report covers the benchmarking studies of
various classifiers.  Most of the classifiers used for the comparative
benchmark studies are well known in the neural network and machine
learning communities.  These are the k-Nearest Neighbour (k_NN)
classifier, selected for its powerful probability density estimation
properties; the Gaussian Quadratic Classifier (GQC), the most classical
simple parametric statistical classification method; the Learning
Vector Quantizer (LVQ), a powerful non-linear iterative learning
algorithm proposed by Kohonen; and the Reduced Coulomb Energy (RCE)
algorithm, an incremental Region Of Influence algorithm.  The Inertia
Rated Vector Quantizer (IRVQ) and the Piecewise Linear Separation (PLS)
classifiers were developed in the framework of the Elena project.

The main objectives of the 'Benchmarks.ps' Elena report are the 
following:
- to provide an overall comprehensive view of the general problem of
    comparative benchmarking studies and to propose a useful common 
    test basis for existing and future classification methods,
- to obtain objective comparisons of the different chosen classifiers on 
    the set of databases described in this report (each classifier being 
    used with its optimal configuration for each particular database),
- to study the possible links between the data structures of the databases,
    as described by some parameters, and the behavior of the studied
    classifiers (mainly the evolution of their optimal configuration
    parameters),
- to study the links between the preprocessing methods and the 
    classification algorithms from the point of view of performance and
    hardware constraints (especially computation times and memory
    requirements).

Databases format
~~~~~~~~~~~~~~~~

All the available databases are in the following format (after
decompression):

 - All files containing the databases are stored as ASCII files so
    that they are easy to edit and check. 
 - In a file, each of the n lines holds one sample vector (instance)
    and consists of d floating-point numbers (the attributes) followed
    by the class label (which must be an integer).

  Example:

 1.51768 12.65 3.56 1.30 73.08 0.61 8.69 0.00 0.14 1
 1.51747 12.84 3.50 1.14 73.27 0.56 8.55 0.00 0.00 0
 1.51775 12.85 3.48 1.23 72.97 0.61 8.56 0.09 0.22 1
 1.51753 12.57 3.47 1.38 73.39 0.60 8.55 0.00 0.06 1
 1.51783 12.69 3.54 1.34 72.95 0.57 8.75 0.00 0.00 3
 1.51567 13.29 3.45 1.21 72.74 0.56 8.57 0.00 0.00 1

 There are NO missing values. 
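
 The format described above can be read with a few lines of code. The
 following Python sketch is only an illustration: the function names
 'parse_elena' and 'load_elena' are hypothetical and are not part of the
 Elena distribution. It assumes whitespace-separated fields, with the
 last field of each line being the integer class label.

```python
def parse_elena(text):
    """Parse Elena-format text into (samples, labels).

    Each non-empty line holds d floating-point attributes followed by
    one integer class label.
    """
    samples, labels = [], []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:                      # tolerate blank lines
            continue
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: expected attributes and a label")
        *attributes, label = fields
        samples.append([float(a) for a in attributes])
        labels.append(int(label))           # fails loudly on a non-integer label
    # All samples of one database must have the same dimension d.
    if len({len(s) for s in samples}) > 1:
        raise ValueError("inconsistent number of attributes between lines")
    return samples, labels


def load_elena(path):
    """Read an uncompressed Elena '.dat' file (hypothetical helper)."""
    with open(path, encoding="ascii") as handle:
        return parse_elena(handle.read())
```

 For instance, 'load_elena("phoneme.dat")' would return the attribute
 vectors and the class labels as two parallel lists, once the file has
 been uncompressed.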

To get a database, you MUST transfer it in ftp binary mode. 
If you are not already in this mode, simply type 'binary' at the ftp prompt.

         EXAMPLE: to get the "phoneme" database :

                      cd REAL
                      cd phoneme
                      binary
                      get phoneme.txt
                      get phoneme.dat.Z
                      get ...
                      cd ...
                      ...
                      quit

          After your ftp session, you simply have to type 
                'uncompress phoneme.dat.Z' 
          to get the uncompressed datafile.
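
          The same session can be scripted. The sketch below uses the
          Python standard 'ftplib' module and assumes that the server
          named in the ftp URL above is still reachable and accepts
          anonymous logins; 'remote_files' and 'fetch_database' are
          hypothetical helper names. Only the '.txt' and '.dat.Z' files
          are fetched, and the 'uncompress' step is still done
          separately afterwards.

```python
from ftplib import FTP

HOST = "ftp.dice.ucl.ac.be"                     # taken from the ftp URL above
BASE_DIR = "/pub/neural-nets/ELENA/databases"   # taken from the ftp URL above


def remote_files(name):
    """Names of the description file and compressed data file."""
    return [name + ".txt", name + ".dat.Z"]


def fetch_database(group, name):
    """Download one database, e.g. fetch_database("REAL", "phoneme")."""
    with FTP(HOST) as ftp:
        ftp.login()                             # anonymous login
        ftp.cwd(f"{BASE_DIR}/{group}/{name}")
        for filename in remote_files(name):
            with open(filename, "wb") as out:
                # retrbinary always transfers in binary mode, which the
                # compressed .Z files require.
                ftp.retrbinary("RETR " + filename, out.write)
```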

 Contents of the 'ARTIFICIAL' directory 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The databases of this directory contain only the 'ARTIFICIAL'
  classification problems.
  The present 'ARTIFICIAL' databases are only two-class problems, since it 
  yields answers to the most essential questions. 
  For each problem, the confusion matrix corresponding to the theoretical 
  Bayes boundary is provided with the confusion matrix obtained by a k_NN 
  classifier (k chosen to reach the minimum of the total Leave-One-Out error).
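
  To illustrate how such a k_NN confusion matrix can be obtained, here is
  a small Python sketch. It is not the code used for the Elena reports:
  the Euclidean distance, the tie-breaking rule (smallest label wins) and
  the function names are assumptions, and class labels are assumed to be
  the integers 0 to c-1. The brute-force search is quadratic in the
  number of patterns, so it is only meant for small experiments.

```python
from collections import Counter


def loo_knn_predictions(samples, labels, k):
    """Leave-One-Out k_NN: classify each sample using all the others."""
    predictions = []
    for i, x in enumerate(samples):
        # Squared Euclidean distance to every other sample, nearest first.
        neighbours = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, other)), labels[j])
            for j, other in enumerate(samples)
            if j != i
        )
        votes = Counter(label for _, label in neighbours[:k])
        best = max(votes.values())
        # Majority vote; ties go to the smallest label (an assumption).
        predictions.append(min(l for l, v in votes.items() if v == best))
    return predictions


def confusion_matrix(labels, predictions, n_classes):
    """matrix[t][p] counts samples of true class t predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(labels, predictions):
        matrix[t][p] += 1
    return matrix


def best_k(samples, labels, candidates):
    """Return (k, errors) minimising the total Leave-One-Out error."""
    errors = {}
    for k in candidates:
        predictions = loo_knn_predictions(samples, labels, k)
        errors[k] = sum(p != t for p, t in zip(predictions, labels))
    k = min(errors, key=lambda k: (errors[k], k))
    return k, errors[k]
```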

  These databases were selected for preliminary tests and to study the
  behavior of the implemented algorithms on some particular problems:

  - Overlapping classes: 
     The classifier should have the ability to form a decision boundary 
     that minimizes the amount of misclassification for all of the overlapping
     classes.

  - Nonlinear separability:
     The classifier should be able to build decision regions that separate
     classes of any shape and size. 

  There is one subdirectory for each database. This subdirectory 
  contains:

  - A text file providing detailed information about the related database     
    ('databasename.txt').

  - The compressed database ('databasename.dat.Z').
    The different patterns of each database are presented in a random order.

  - For bidimensional databases, a PostScript file representing the 2-D
    datasets (those files are in eps format).

  For each subdirectory, the directoryname is the same as the name chosen
  for the concerned database.  Here are the directorynames with a brief
  description. 

  - 'clouds'

    Bidimensional distributions: class 0 is the sum of three different
    normal distributions while class 1 is another normal distribution,
    overlapping class 0.
      5000 patterns, 2500 in each class.
    This allows the study of the classifier behavior for heavy intersection 
    of the class distributions and for a high degree of nonlinearity of the 
    class boundaries.

  - 'gaussian'

    A set of seven databases corresponding to the same problem, but with 
    dimensionality ranging from 2 to 8.
    This allows the study of the classifier behavior for different 
    dimensionalities of the input vectors, for heavily overlapping
    distributions and for nonlinear separability.
    These databases were already studied by Kohonen in:
      Kohonen, T. and Barna, G. and Chrisley, R., "Statistical Pattern 
      Recognition with Neural Networks: Benchmarking Studies", 
      IEEE Int. Conf. on Neural Networks, SOS Printing, San Diego, 1988.
    In this paper, the performance of three basic types of neural-like 
    networks (Backpropagation network, Boltzmann machine and Learning 
    Vector Quantization) is evaluated and compared to the theoretical limit.

  - 'concentric' 

    Bidimensional uniform concentric circular distributions.
        2500 instances, 1579 in class 1, 921 in class 0. 
    This database may be used to study the behavior of the classifier
    with respect to linear separability when some classes are nested in
    others without overlapping.

Contents of the 'REAL' directory 
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The databases of this directory contain only the real
classification problem sets selected for the Elena benchmarking studies.
There is one subdirectory for each database. In this subdirectory, 
there are:

- a text file giving detailed information about the related database     
  (`databasename.txt'),
- the compressed original database in the Elena format 
  (`databasename.dat.Z'), with the different patterns of each database
  presented in a random order,
- the compressed normalized (``CR'') database
  (``databasename_CR.dat.Z'').  Normalization gives each original
  feature the same importance in a subsequent classification process.
  A typical method is first to center each feature separately and then 
  to reduce it to unit variance; this process has been applied to all 
  the REAL Elena databases in order to build the ``CR'' databases.
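
The centering and reduction step can be sketched as follows in Python.
This is an illustration only, not the code that produced the ``CR''
files: the function name is hypothetical, and the use of the population
standard deviation (division by n rather than n-1) is an assumption,
since the convention is not stated here.

```python
from statistics import fmean, pstdev


def center_and_reduce(samples):
    """Return a copy of the samples with zero mean and unit variance
    for each feature (column)."""
    columns = list(zip(*samples))
    means = [fmean(column) for column in columns]
    # Population standard deviation: an assumption, see the text above.
    deviations = [pstdev(column) for column in columns]
    if any(s == 0 for s in deviations):
        raise ValueError("a constant feature cannot be reduced to unit variance")
    return [
        [(value - m) / s for value, m, s in zip(row, means, deviations)]
        for row in samples
    ]
```

After this step a feature measured in large units no longer dominates
distance-based classifiers such as k_NN.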

Principal Components Analysis (PCA) is a very classical method in pattern
recognition [Duda73].  PCA reduces the sample dimension in a linear way
to give the best representation in lower dimensions, keeping the maximum
of inertia. The best axis for representation is however not necessarily
the best axis for discrimination. After PCA, features are selected
according to the percentage of initial inertia covered by the
different axes, and the number of features is determined by
the percentage of initial inertia to keep for the classification
process. This selection method has been applied to every REAL database
after centering and reduction (thus to the databasename_CR.dat files).
When quasi-linear correlations exist between some initial features,
these redundant dimensions are removed by PCA and this preprocessing is
then recommended. In this case, before PCA, the determinant of the
data covariance matrix is near zero; the database is thus badly
conditioned for all processes that use this information (the quadratic
classifier for example).
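
The inertia-based selection described above can be sketched in a few
lines of Python. The sketch assumes that the eigenvalues of the
covariance matrix are already known and passed as a plain list; the
function names and the target percentage are hypothetical, and no
particular threshold is claimed to be the one used in the Elena
reports.

```python
def inertia_percentages(eigenvalues):
    """Percentage of the total inertia carried by each principal axis,
    sorted in decreasing order."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in sorted(eigenvalues, reverse=True)]


def components_to_keep(eigenvalues, target_percent):
    """Smallest number of leading principal components whose cumulative
    inertia reaches target_percent."""
    cumulative = 0.0
    for count, percent in enumerate(inertia_percentages(eigenvalues), start=1):
        cumulative += percent
        if cumulative >= target_percent - 1e-9:   # tolerate rounding errors
            return count
    return len(eigenvalues)
```

For example, with eigenvalues 4, 3, 2 and 1 the axes carry 40, 30, 20
and 10 percent of the inertia, so keeping 90 percent requires the first
three components. Eigenvalues close to zero reveal the quasi-linear
correlations mentioned above.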

The following files, related to PCA, are also available for the REAL databases:
- ``databasename_PCA.dat.Z'', the projection of the ``CR'' database on its 
   principal components (sorted in decreasing order of the related 
   inertia percentage),
- ``databasename_corr_circle.ps.Z'', a graphical representation of the 
    correlation between the initial attributes and the first two 
    principal components,
- ``databasename_proj_PCA.ps.Z'', a graphical representation of the 
    projection of the initial database on the first two principal 
    components,
- ``databasename_EV.dat'', a file with the eigenvalues and associated
    inertia percentages.

Discriminant Factorial Analysis (DFA) can be applied to a learning
database where each learning sample belongs to a particular class
[Duda73]. The number of discriminant features selected by DFA is
determined by the number of classes (c) and the number of input
dimensions (d); this number is equal to the minimum of d and c-1.
In the usual case where d is greater than c, the output dimension is
thus equal to the number of classes minus one, and the discriminant
axes are selected in order to maximize the between-class variance and to
minimize the within-class variance. The discrimination power
(ratio of the projected between-class variance to the projected
within-class variance) is not the same for each discriminant axis: this
ratio decreases from one axis to the next. So for a problem with many
classes, this preprocessing will not always be efficient, as the last
output features will not be very discriminant. This analysis uses the
inverse of the global covariance matrix, so the covariance matrix must
be well conditioned (for example, a preliminary PCA must be applied to
remove the linearly correlated dimensions). The DFA preprocessing
method has been applied to the first 18 principal components of the
'satimage_PCA' and 'texture_PCA' databases (thus keeping only the
first 18 attributes of these databases before applying the DFA
preprocessing) in order to build the 'satimage_DFA.dat.Z' and
'texture_DFA.dat.Z' database files, having respectively 5 and 10
dimensions (the 'satimage' database having 6 classes and 'texture'
11).
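
The two quantities discussed above, the DFA output dimension and the
discrimination power of one axis, can be checked with the short Python
sketch below. The function names are hypothetical. The discrimination
power is computed for values already projected on a single axis; both
variances are divided by the total number of samples, an assumption
that does not change their ratio.

```python
def dfa_output_dimension(d, c):
    """Number of discriminant features: the minimum of d and c-1."""
    if d < 1 or c < 2:
        raise ValueError("need at least one input dimension and two classes")
    return min(d, c - 1)


def discrimination_power(values, labels):
    """Ratio of between-class to within-class variance of 1-D values.

    Raises ZeroDivisionError if every class has zero internal spread.
    """
    n = len(values)
    mean = sum(values) / n
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    between = sum(
        len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values()
    ) / n
    within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    ) / n
    return between / within
```

With the figures quoted above, 'dfa_output_dimension(18, 6)' gives 5
for 'satimage' and 'dfa_output_dimension(18, 11)' gives 10 for
'texture'.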

  For each subdirectory, the directoryname is the same as the name chosen
for the contained database.  Here are the directorynames with a brief
numerical description of the available databases. 

  - phoneme

    French and Spanish phoneme recognition problem. 
  The aim is to distinguish between nasal (AN, IN, ON) and oral 
  (A, I, O, E, E') vowels.

      5404 patterns, 5 attributes (the normalized amplitudes of the first 
     five harmonics), 2 classes.

     This database was in use in the European ESPRIT 5516 project ROARS.
  The aim of this project is the development and the implementation of a
  real-time analytical system for French and Spanish phoneme
  recognition.

  - texture

    The aim is to distinguish between 11 different textures (Grass lawn, 
  Pressed calf leather, Handmade paper, Raffia looped to a high pile, Cotton 
  canvas, ...), each pattern (pixel) being characterised by 40 attributes 
  built by the estimation of fourth order modified moments in four orientations:
  0, 45, 90 and 135 degrees.

    5500 patterns, 11 classes of 500 instances (each class refers to a type 
    of texture in the Brodatz album).

    The original source of this database is:
  P. Brodatz "Textures: A Photographic Album for Artists and Designers",
  Dover Publications, Inc., New York, 1966.
    This database was generated by the Laboratory of Image Processing 
  and Pattern Recognition (INPG-LTIRF Grenoble, France) in the development 
  of the Esprit project ELENA No. 6891 and the Esprit working group ATHOS
  No. 6620. 

  - satimage (*)

    Classification of the multi-spectral values of an image of the Landsat
  satellite. Each line contains the pixel values in four spectral bands 
  of each of the 9 pixels in a 3x3 neighbourhood and a number indicating 
  the classification label of the central pixel (corresponding to the type 
  of soil: red soil, cotton crop, grey soil, ...).
  The aim is to predict this classification, given the multi-spectral     
  values.

     6435 instances, 36 attributes (4 spectral bands x 9 pixels in  
    neighbourhood), 6 classes. 

    This database was in use in the European StatLog project, which
  involves comparing the performances of machine learning,
  statistical, and neural network algorithms on data sets from real-world
  industrial areas including medicine, finance, image analysis, and
  engineering design:

    D. Michie, D.J. Spiegelhalter, and C.C. Taylor, editors.
    Machine learning, Neural and Statistical Classification.
    Ellis Horwood Series In Artificial Intelligence,
    England, 1994.

  - iris (*)

   This is perhaps the best known database to be found in the pattern
   recognition literature.  Fisher's paper is a classic in the field
   and is referenced frequently to this day.  (See Duda & Hart, for
   example.)  The data set contains 3 classes of 50 instances each,
   where each class refers to a type of iris plant.  One class is
   linearly separable from the other 2; the latter are NOT linearly
   separable from each other.
   4 attributes (sepal length, sepal width, petal length and petal width).

 (*) These databases are taken from the anonymous ftp "UCI Repository Of 
     Machine Learning Databases and Domain Theories" 
     (ics.uci.edu: pub/machine-learning-databases):
  Murphy, P. M. and Aha, D. W. (1992). "UCI Repository of machine
  learning databases" [Machine-readable data repository]. Irvine, CA:
  University of California, Department of Information and Computer Science.

 [Duda73]
 Duda, R.O. and Hart, P.E.,
 Pattern Classification and Scene Analysis,
 John Wiley & Sons, 1973.

##############################################################################

The ELENA PROJECT
~~~~~~~~~~~~~~~~~                  

  Neural networks are now known as powerful methods for empirical
  data analysis, especially for approximation (identification,
  control, prediction) and classification problems. The ELENA project
  investigates several aspects of classification by neural networks,
  including links between neural networks and Bayesian statistical
  classification, incremental learning (control of the network size
  by adding or removing neurons),...

  URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/ELENA.html

  ELENA is an ESPRIT III Basic Research Action project (No. 6891).
  It involves:
	INPG (Grenoble, F),
	UPC (Barcelona, E), 
	EPFL (Lausanne, CH),
	UCL (Louvain-la-Neuve, B), 
	Thomson-Sintra ASM (Sophia Antipolis, F),
	EERIE (Nimes, F).  

  The coordinator of the project can be
  contacted at: 

      Prof. Christian Jutten, 
      INPG-LTIRF, 
      46 av. Félix Viallet, 
      F-38031 Grenoble Cedex, 
      France 

      Phone: +33 76 57 45 48, 
      Fax: +33 76 57 47 90, 
      e-mail: chris at tirf.inpg.fr  

A simulation environment (PACKLIB) has been developed in the project;
it is a smart graphical tool allowing fast programming and
interactive analysis. The PACKLIB environment greatly simplifies the
user's task by requiring only that the basic code of the algorithms
be written, while the whole graphical input, output and relationship
framework is handled by the environment itself.  PACKLIB is used for
extensive benchmarks in the ELENA project and in other situations
(image processing, control of mobile robots, ...). Currently, PACKLIB
is being tested by beta users and a demo version is available in the
public domain.
  URL: http://www.dice.ucl.ac.be/neural-nets/ELENA/Packlib.html

##############################################################################

IF YOU HAVE ANY PROBLEM, QUESTION OR PROPOSITION, PLEASE E_MAIL the following.

  VOZ Jean-Luc or Michel Verleysen
  Universite Catholique de Louvain
  DICE - Lab. de Microelectronique
  3, place du Levant
  B-1348 LOUVAIN-LA-NEUVE

  E_mail : voz at dice.ucl.ac.be
	   verleysen at dice.ucl.ac.be