Dissertation announcement

ali@almaden.ibm.com
Fri Jan 5 20:43:02 EST 1996


The following dissertation is available via anonymous FTP and through
http://www.ics.uci.edu/~ali (either as a whole or by chapters).

Title: "Learning Probabilistic Relational Concept Descriptions"

By Kamal Ali

Key words: learning probabilistic concepts, multiple models, multiple
classifiers, combining classifiers, evidence combination, relational
learning, first-order learning, noise-tolerant learning, learning of
small disjuncts, Inductive Logic Programming.

                         A B S T R A C T

This dissertation presents results in the area of multiple models
(multiple classifiers), learning probabilistic relational (first-order)
rules from noisy, "real-world" data, and reducing the small-disjuncts
problem - the problem whereby learned rules that cover few training examples
have high error rates on test data.

Several results are presented in the arena of multiple models.  The
multiple models approach is relevant to the problem of making accurate
classifications in "real-world" domains since it facilitates the evidence
combination needed to learn accurately in such domains.
It is also useful when learning from small training samples, in which
many models appear to be equally "good" with respect to the given
evaluation metric.  Such models often have quite different error rates on
test data, so in these situations the single-model method runs into
problems.  Increasing search only partly addresses this problem, whereas
the multiple models approach has the potential to be much more useful.

The most important result of the multiple models research is that the
*amount* of error reduction afforded by the multiple models approach is
linearly correlated with the degree to which the individual models make
errors in an uncorrelated manner. This work is the first to model the degree
of error reduction due to the use of multiple models.  It is also shown that
it is possible to learn models that make less correlated errors in domains
in which there are many ties in the search evaluation metric during
learning.  The third major result of the research on multiple models is
the realization that the models learned should make errors in a
negatively-correlated manner, rather than merely in an uncorrelated
(statistically independent) manner.
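As a rough illustration of the correlation result (a toy simulation, not
code from the thesis), the sketch below compares majority voting over three
classifiers whose errors are statistically independent against three whose
errors are fully shared; the error rate, the trial count, and the mixing
scheme are all hypothetical choices for the demo:

```python
import random

random.seed(0)

def vote_error(n_trials, p_err, rho_shared):
    """Estimate the majority-vote error of 3 classifiers whose individual
    error rate is p_err.  With probability rho_shared an error event is
    decided by one shared coin flip (fully correlated errors); otherwise
    each classifier errs independently."""
    wrong = 0
    for _ in range(n_trials):
        if random.random() < rho_shared:
            # correlated case: one flip decides all three together
            errs = [random.random() < p_err] * 3
        else:
            errs = [random.random() < p_err for _ in range(3)]
        if sum(errs) >= 2:  # a majority of the models are wrong
            wrong += 1
    return wrong / n_trials

# More correlated errors -> less benefit from combining models.
print(vote_error(100_000, 0.2, 0.0))  # independent: well below 0.2
print(vote_error(100_000, 0.2, 1.0))  # fully correlated: roughly 0.2
```

When errors are independent, the vote fails only if two of the three
models err at once, so the combined error drops well below the individual
rate; when errors are perfectly correlated, combining buys nothing.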

The thesis also presents results on learning probabilistic first-order rules
from relational data.  It is shown that learning a class description for
each class in the data - the one-per-class approach - and attaching
probabilistic estimates to the learned rules allows accurate classifications
to be made on real-world data sets.  The thesis presents the system HYDRA
which implements this approach.  It is shown that the resulting
classifications are often more accurate than those made by three existing
methods for learning from noisy, relational data.  Furthermore, the learned
rules are relational and so are more expressive than the attribute-value
rules learned by most induction systems.
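To make the one-per-class idea concrete, here is a minimal
propositional sketch (the rule set, the Laplace weighting, and all names
are hypothetical illustrations, not HYDRA's actual code or data): one rule
list per class, each rule carrying a probabilistic accuracy estimate, with
the class of the strongest firing rule predicted:

```python
def laplace(correct, covered, n_classes=2):
    # Laplace-corrected accuracy estimate for a rule's training coverage
    return (correct + 1) / (covered + n_classes)

# One rule list per class; each rule is a condition over an example
# plus a weight derived from its (hypothetical) training statistics.
rules = {
    "bird":   [(lambda x: x.get("has_feathers"), laplace(9, 10))],
    "mammal": [(lambda x: x.get("has_fur"),      laplace(7, 8))],
}

def classify(example):
    # For each class, take the best estimate among the rules that fire,
    # then predict the class with the strongest evidence.
    scores = {}
    for cls, rule_list in rules.items():
        fired = [w for cond, w in rule_list if cond(example)]
        scores[cls] = max(fired, default=0.0)
    return max(scores, key=scores.get)

print(classify({"has_feathers": True}))  # -> bird
```

The attached estimates let rule conflicts be resolved by weight of
evidence rather than by rule order, which is what allows accurate
classification even when the learned rules are imperfect.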

Finally, results are presented on the small-disjuncts problem, in which
rules that apply to rare subclasses have high error rates.
The thesis presents the first approach that is simultaneously successful
at reducing the error rates of small disjuncts while also reducing the
overall error rate by a statistically significant margin.  The previous
approach that aimed to reduce small-disjunct error rates did so only at
the expense of increasing the error rates of large disjuncts.
It is shown that the one-per-class approach reduces error rates for such
rare rules without sacrificing the error rates of the other rules.

The dissertation is approximately 180 pages long (single spaced) (~590K).

ftp ftp.ics.uci.edu
logname:  anonymous
password:  your email address
cd /pub/ali
binary
get thesis.ps.Z
quit

============================================================================
I am now with the IBM Data Mining group at Almaden (San Jose) - we are
looking for good people for data analysis (data mining) and consulting
so please feel free to call me at (408) 365 8736. My address is:

        Kamal Ali,
        Room D3-250
        IBM Almaden Research Center
        650 Harry Rd
        San Jose, CA 95120

==============================================================================
Kamal Mahmood Ali, Ph.D.                                Phone:    408 927 1354
Consultant and data mining analyst,                     Fax:      408 927 3025
Data Mining Solutions,                                  Office: ARC D3-250
     IBM                                     http://www.almaden.ibm.com/stss/
==============================================================================


More information about the Connectionists mailing list