PhD Thesis: "Learning with MIXTURES OF TREES" - Marina Meila - MIT

Marina Meila <mmp@ai.mit.edu>
Sat Feb 13 18:12:27 EST 1999


Dear Connectionists,

My thesis on learning tree graphical models and mixtures thereof is
now available via my web page:

      http://www.ai.mit.edu/~mmp

or directly from 

      ftp://ftp.ai.mit.edu/pub/mmp/thesis/thesis.ps.Z

The thesis presents:
   * Density estimation in multidimensional domains and graphical models
   * Tree distributions and their properties
   * Mixtures of trees: definition, properties, an efficient learning
     algorithm
   * Learning mixtures of trees in the Bayesian framework
   * Very fast tree learning algorithms for sparse discrete data
   * A new approach to hidden variable discovery, an empirical method for
     estimating the description length of a mixture model, and a novel
     independence test based on large deviation theory
   * Experiments: density estimation and classification with mixtures of
     trees

Abstract and Table of Contents follow:
---------------------------------------------------------------------------
Abstract

One of the challenges of density estimation as it is used in machine
learning is that the data are usually multivariate and the dimensionality
is often large. Operating with joint distributions over multidimensional
domains raises specific problems that are not encountered in the
univariate case. Graphical models are representations of joint densities
that are specifically tailored to address these problems. They take
advantage of the (conditional) independencies between subsets of variables
in the domain, which they represent by means of a graph. When the graph is
sparse, graphical models provide excellent support for human intuition and
allow for efficient inference algorithms. However, learning the underlying
dependence graph from data is generally NP-hard.

The purpose of this thesis is to propose and to study a class of models
that admits tractable inference and learning algorithms yet is rich enough
for practical applications: the class of mixtures of trees. Mixtures of
trees inherit the excellent computational properties of tree distributions
(themselves a subset of graphical models) but combine several trees in
order to augment their modeling power, thereby going beyond the standard
graphical model framework.
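
(A sketch in my notation, not quoted from the thesis: a mixture of m
trees over variables x_1,...,x_n has the form

    Q(x) = \sum_{k=1}^{m} \lambda_k T^k(x),
           \lambda_k \ge 0,   \sum_{k=1}^{m} \lambda_k = 1,

where each component T^k is a tree distribution, factoring over the
edge set E of its spanning tree as

    T(x) = \prod_{(u,v) \in E} T_{uv}(x_u, x_v)
           \Big/ \prod_{v} T_v(x_v)^{\deg(v) - 1} .

The mixture index k acts as a hidden choice variable, which is what
takes the model outside the single-graph graphical model family.)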

The thesis demonstrates the performance of mixtures of trees in density
estimation and classification tasks. At the same time, it deepens the
understanding of the properties of the tree distribution as a multivariate
density model. Among other things, it shows that the tree classifier
implements an implicit variable selection mechanism.

Learning mixtures of trees from data is a central subject of this thesis.
The learning algorithm that is introduced here is based on the EM and the
Maximum Weight Spanning Tree algorithms and is quadratic in the dimension
of the domain.
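
For concreteness, here is a minimal sketch (my own Python, not code from
the thesis) of the tree-fitting step for one mixture component with binary
data: weighted pairwise mutual informations followed by a maximum weight
spanning tree, i.e. the Chow-Liu procedure. The O(n^2) pairwise
computation is the quadratic cost mentioned above.

    import numpy as np

    def max_weight_spanning_tree(W):
        # Kruskal's algorithm with union-find; maximizes total edge
        # weight. Only the upper triangle of W is consulted.
        n = W.shape[0]
        parent = list(range(n))
        def find(a):
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a
        edges = sorted(((W[u, v], u, v)
                        for u in range(n) for v in range(u + 1, n)),
                       reverse=True)
        tree = []
        for _, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                tree.append((u, v))
        return tree                        # the n-1 edges of the tree

    def chow_liu_tree(data, gamma):
        # data:  (N, n) array of 0/1 values
        # gamma: (N,) posterior weights of this component (E step output)
        N, n = data.shape
        w = gamma / gamma.sum()
        p1 = w @ data                      # weighted P(x_v = 1)
        I = np.zeros((n, n))
        for u in range(n):
            for v in range(u + 1, n):
                p11 = w @ (data[:, u] * data[:, v])
                joint = np.array([[1 - p1[u] - p1[v] + p11, p1[v] - p11],
                                  [p1[u] - p11, p11]])
                marg = np.outer([1 - p1[u], p1[u]], [1 - p1[v], p1[v]])
                ok = joint > 0
                I[u, v] = np.sum(joint[ok] * np.log(joint[ok] / marg[ok]))
        return max_weight_spanning_tree(I)

A full EM iteration would compute the posteriors gamma over the m
components (E step), then call chow_liu_tree once per component and set
the edge parameters to the weighted pair marginals (M step).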

This algorithm can serve as a tool for discovering hidden variables in a
special but important class of models where, conditioned on the hidden
variable, the dependencies between the observed variables become sparse.
Finally, it is shown that in the case of sparse discrete data, the original
learning algorithm can be transformed into an algorithm that is
subquadratic and that achieves speedup factors of up to a thousand in
simulations.
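
To give a flavor of where the savings come from (my illustration, not the
thesis's code): for 0/1 data, the pairwise cooccurrence counts that enter
the mutual informations can be accumulated by visiting only the nonzero
entries of each datapoint, so counting costs O(N * s^2) for an average of
s ones per datapoint, rather than O(N * n^2).

    from collections import defaultdict

    def cooccurrence_counts(rows):
        # rows: one list per datapoint holding the indices of its
        # nonzero variables -- a sparse ("bipartite graph") view of
        # a binary dataset.
        counts = defaultdict(int)          # (u, v), u < v  ->  N_uv
        for nz in rows:
            for i, u in enumerate(nz):
                for v in nz[i + 1:]:
                    a, b = (u, v) if u < v else (v, u)
                    counts[(a, b)] += 1
        return counts

Pairs that never cooccur are exactly those absent from counts; the thesis
then exploits the special form of the mutual information of two
non-cooccurring variables, which can be presorted rather than computed
pair by pair (see the corresponding sections in the table of contents
below).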
----------------------------------------------------------------------------

Learning with Mixtures of Trees

Marina Meila-Predoviciu

Table of Contents

   * Cover, etc (postscript, 14 pages)
        o Abstract
        o Acknowledgements
        o Contents
        o List of Figures
        o List of Tables
        o Index of Algorithms
   * Introduction (postscript, 12 pages)
        o Density estimation in multidimensional domains
        o Introduction by example
        o Graphical models of conditional independence
             + Examples of established belief network classes
             + Advantages of graphical models
             + Structure learning in belief networks
             + Inference and decomposable models
        o Why, what and where? Goal, contributions and road map of the
          thesis
             + Contributions
                  + The mixture of trees model
                  + An efficient learning algorithm
                  + An accelerated learning algorithm for sparse data
                  + A top-down approach to hidden variable discovery
             + A road map for the reader
   * Trees and their properties (postscript, 14 pages)
        o Tree distributions
        o Inference, sampling and marginalization in a tree distribution
             + Inference
             + Marginalization
             + Sampling
        o Learning trees in the Maximum Likelihood framework
             + Problem formulation
             + Fitting a tree to a distribution
             + Solving the ML learning problem
        o Representation capabilities
        o Appendix: The Junction Tree algorithm for trees
   * Mixtures of trees (postscript, 12 pages)
        o Representation power of mixtures of trees
        o Basic operations with mixtures of trees
             + Marginalization
             + Inference
             + Sampling
        o Learning mixtures of trees in the ML framework
             + The basic algorithm
             + Running time and storage requirements
             + Learning mixtures of trees with shared structure
             + Remarks on the learning algorithms
        o Summary and related work
   * Learning mixtures of trees in the Bayesian framework (postscript, 9
     pages)
        o MAP estimation by the EM algorithm
        o Decomposable priors for tree distributions
             + Decomposable priors over tree structures
             + Priors for tree parameters: the Dirichlet prior
                  + The Dirichlet prior in natural coordinates
                  + Dirichlet priors for trees and mixtures
   * Accelerating the tree learning algorithm (postscript, 22 pages)
        o Introduction
        o Assumptions
        o Accelerated CL algorithms
             + First idea: comparing mutual informations between binary
               variables
             + Second idea: computing cooccurrences in a bipartite graph
               data representation
             + Putting it all together: the aCL-I algorithm and its data
               structures
             + Time and storage requirements
             + The aCL-II algorithm
             + Time and memory requirements for aCL-II
        o Generalization to discrete variables of arbitrary arity
             + Computing cooccurrences
             + Presorting mutual informations
                   + A "chain rule" expression for the entropy of a
                     discrete variable
                  + The mutual information of two non-cooccurring variables
        o Using the aCL algorithms with EM
        o Decomposable priors and the aCL algorithm
        o Experiments
        o Concluding remarks
        o Appendix: Bounding the number of lists
   * An approach to hidden variable discovery (postscript, 21 pages)
        o Structure learning paradigms
        o The problem of variable partitioning
        o The tree H model
        o Variable partitioning in the general case
             + Outline of the procedure
             + Defining structure as simple explanation
        o Experiments
             + Experimental procedure
             + Experiments with tree H models
             + General H models
        o Approximating the description length of a model
             + Encoding a multinomial distribution
        o Model validation by independence testing
             + An alternative independence test
             + A threshold for mixtures
             + Validating graphical models with hidden variables
        o Discussion
   * Experimental results (postscript, 17 pages)
        o Recovering the structure
        o Density estimation experiments
             + Digits and digit pairs images
             + The ALARM network and data set
        o Classification with mixtures of trees
             + Using a mixture of trees as a classifier
             + The AUSTRALIAN data set
             + The MUSHROOM data set
             + The SPLICE data set. Classification and structure discovery
             + The single tree classifier as an automatic feature selector
   * Conclusion (postscript, 2 pages)
   * References (postscript, 5 pages) or (html)

----------------------------------------------------------------------------