NIPS95 workshop

David Heckerman heckerma at microsoft.com
Tue Nov 21 14:55:14 EST 1995


                    **** TENTATIVE SCHEDULE ****

                          NIPS*95 Workshop

       Learning in Bayesian Networks and Other Graphical Models

               Friday and Saturday, December 1-2, 1995
               Marriott Vail Mountain Resort, Colorado

  http://www.cs.cmu.edu/afs/cs/project/cnbc/nips/NIPS.html (conference)
  http://www.ai.mit.edu/people/jordan/workshop.html (workshop)


Topic and Purpose of the Workshop:

A Bayesian network is a directed graphical representation of
probabilistic relationships that people find easy to understand and
use, often because the relationships have a causal interpretation.  A
network for a given domain defines a joint probability distribution
for that domain, and algorithms exist for efficiently manipulating the
joint distribution to determine probability distributions of
interest.  Over the last decade, the Bayesian network has become a
popular representation for encoding uncertain expert knowledge in
expert systems.  More recently, researchers have developed methods for
learning Bayesian networks from data.  These approaches will be the
focus of this workshop.
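To make the factorization concrete, here is a toy network of my own devising (not from the workshop materials; all probabilities are invented). The joint distribution is the product of each variable's probability given its parents:

```python
# Illustrative toy network: Rain -> Sprinkler, and both Rain and
# Sprinkler -> Wet.  All numbers are invented for the example.
P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {  # P(Sprinkler | Rain)
    True: {True: 0.01, False: 0.99},
    False: {True: 0.4, False: 0.6},
}
P_wet = {  # P(Wet | Sprinkler, Rain)
    (True, True): {True: 0.99, False: 0.01},
    (True, False): {True: 0.9, False: 0.1},
    (False, True): {True: 0.8, False: 0.2},
    (False, False): {True: 0.0, False: 1.0},
}

def joint(rain, sprinkler, wet):
    """P(r, s, w) = P(r) * P(s | r) * P(w | s, r)."""
    return P_rain[rain] * P_sprinkler[rain][sprinkler] * P_wet[(sprinkler, rain)][wet]

# The eight joint probabilities sum to one.
total = sum(joint(r, s, w) for r in (True, False)
            for s in (True, False) for w in (True, False))
```

Efficient inference algorithms exploit exactly this factorization to avoid enumerating the full joint; the toy version above simply enumerates it.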

Issues to be discussed include (1) the opposing roles of prediction
and explanation; (2) search, model selection, and capacity control;
(3) representation issues, including extensions of the Bayesian
network (e.g., chain graphs), the role of "hidden" or "latent"
variables in learning, the modeling of temporal systems, and the
assessment of priors; (4) optimization and approximation methods,
including gradient-based methods, EM algorithms, stochastic sampling,
and mean-field algorithms.  In addition, we plan to discuss known
relationships among Bayesian networks, Markov random fields, Boltzmann
machines, log-linear models for contingency tables, hidden Markov
models, decision trees, and feedforward neural networks, as well as to
uncover previously unknown ties.


Tentative Schedule:

Friday, Dec 1
-------------
7:30am - 8:50am
Tutorial on Graphical Models
Ross Shachter, Stanford University
Bruce D'Ambrosio, Oregon State University
Michael Jordan, MIT

8:50am - 9:30am
Decomposable graphical models and their use in learning algorithms
Steffen Lauritzen, Aalborg University

Decomposable, or triangulated, graphical models occur, for example,
as basic computational structures in probabilistic expert systems.  I
will first give a brief description of properties of decomposable
graphical models and then present some unfinished ideas about their
possible use in automatic learning procedures.

9:30am - 9:40am
break

9:40am - 10:20am
Likelihoods and Priors for Learning Bayesian Networks
David Heckerman, Microsoft
Dan Geiger, Technion

I will discuss simple methods for constructing likelihoods and
parameter priors for learning about the parameters and structure of a
Bayesian network.  In particular, I will introduce several assumptions
that permit the construction of likelihoods and parameter priors for a
large number of Bayesian-network structures from a small set of
assessments.  Two notable assumptions are parameter independence,
which says that the parameters associated with each variable in a
structure are independent, and likelihood equivalence, which (roughly
speaking) says that data should not help to discriminate structures
that represent the same assertions of conditional independence.  In
addition to explicating methods for likelihood and prior construction,
I will show how the assumptions lead to characterizations of
well-known prior distributions for the parameters of multivariate
distributions.  For example, when the joint likelihood is an
unrestricted discrete distribution, parameter independence and
likelihood equivalence imply that the parameter prior must be a
Dirichlet distribution.
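As a small sketch of my own (not the speakers' code) of the Dirichlet machinery: with a Dirichlet parameter prior, the marginal likelihood of discrete data has a closed form, the quantity at the heart of Bayesian structure scoring.

```python
import math

def dirichlet_marginal_likelihood(alphas, counts):
    """P(data) = [G(sum a) / G(sum a + N)] * prod_k G(a_k + n_k) / G(a_k),
    where G is the gamma function, a_k the Dirichlet hyperparameters,
    and n_k the observed counts."""
    a_sum, n_sum = sum(alphas), sum(counts)
    log_ml = math.lgamma(a_sum) - math.lgamma(a_sum + n_sum)
    for a, n in zip(alphas, counts):
        log_ml += math.lgamma(a + n) - math.lgamma(a)
    return math.exp(log_ml)

# Uniform prior (all alphas = 1) over a binary variable; data = 3 of one
# outcome and 1 of the other.  Here the formula gives 3!*1!/5! = 0.05.
ml = dirichlet_marginal_likelihood([1.0, 1.0], [3, 1])
```

In structure scoring, this quantity is computed once per variable-parent configuration and the results multiplied, which is what makes the Dirichlet family computationally attractive.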

10:20am - 11:00am
Bayesian model averaging for Markov equivalence classes of acyclic digraphs
David Madigan, Michael D. Perlman, and Chris T. Volinsky,
University of Washington, Seattle

Acyclic digraphs (ADGs) are widely used to describe dependencies among
variables in multivariate distributions. There may, however, be many
ADGs that determine the same dependence (= Markov) model.  Thus, the
family of all ADGs with a given set of vertices is naturally
partitioned into Markov-equivalence classes, each class being
associated with a unique statistical model.  Statistical procedures,
such as model selection or model averaging, that fail to take these
equivalence classes into account may incur substantial computational
or other inefficiencies.  Recent results have shown that each
Markov-equivalence class is uniquely determined by a single chain
graph, the essential graph, that is itself Markov-equivalent
simultaneously to all ADGs in the equivalence class.  Here we propose
two stochastic Bayesian model averaging and selection algorithms for
essential graphs and apply them to the analysis of a number of
discrete-variable data sets.

11:00am - 11:40am
discussion

----mid-day break----

4:30pm - 5:10pm
Automated Causal Inference
Peter Spirtes, Carnegie Mellon University

Directed acyclic graphs can be used to represent both families of
probability distributions and causal relationships. We introduce two
axioms (Markov and Faithfulness) that are widely but implicitly
assumed by statisticians and that relate causal structures with
families of probability distributions. We then use these axioms to
develop algorithms that infer some features of causal graphs given a
probability distribution and optional background knowledge as
input. The algorithms are correct in the large sample limit, even when
latent variables and selection bias may be present. In the worst case,
the algorithms are exponential, but in many cases they have been able
to handle up to 100 variables. We will also present Monte Carlo
simulation results on various sample sizes.

5:10pm - 5:50pm
Helmholtz Machines
Geoffrey E. Hinton, Peter Dayan, Brendan Frey, Radford Neal
University of Toronto and MIT

For hierarchical generative models that use distributed
representations in their hidden variables, there are exponentially
many ways in which the model can produce each data point.  It is
therefore intractable to compute the posterior distribution over the
hidden distributed representations given a data point and so there is
no obvious way to use EM or gradient methods for fitting the model to
data.  A Helmholtz machine consists of a generative model that uses
distributed representations and a recognition model that computes an
approximation to the posterior distribution over representations.  The
machine is trained to minimize a Helmholtz free energy which is equal
to the negative log probability of the data if the recognition model
computes the correct posterior distribution.  If the recognition model
computes a more tractable, but incorrect distribution, the Helmholtz
free energy is an upper bound on the negative log probability of the
data, so it acts as a tractable and useful Lyapunov function for
learning a good generative model.  It also encourages generative
models that give rise to nice simple posterior distributions, which
makes perception a lot easier.  Several different methods have been
developed for minimizing the Helmholtz free energy.  I will focus on
the "wake-sleep" algorithm, which is easy to implement with neurons,
and give some examples of it learning probability density functions in
high dimensional spaces.
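The key inequality can be checked numerically on the smallest possible model, one binary hidden variable and a fixed datum (an illustration of mine, not the Helmholtz machine implementation):

```python
import math

# Hypothetical joint P(h, x) for one binary hidden variable h and fixed x.
P_joint = {0: 0.1, 1: 0.3}
P_x = sum(P_joint.values())                   # P(x) = 0.4

def free_energy(q1):
    """Helmholtz free energy F(Q) = sum_h Q(h) [log Q(h) - log P(h, x)]
    for a recognition distribution with Q(h=1) = q1."""
    F = 0.0
    for h, q in ((0, 1.0 - q1), (1, q1)):
        if q > 0:
            F += q * (math.log(q) - math.log(P_joint[h]))
    return F

neg_log_px = -math.log(P_x)
F_exact = free_energy(P_joint[1] / P_x)   # Q = true posterior: bound is tight
F_wrong = free_energy(0.5)                # any other Q: strictly above -log P(x)
```

Minimizing F over both the generative and recognition parameters therefore pushes the model toward higher data likelihood while keeping the approximate posterior close to the true one.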

5:50pm - 6:30pm
Bounding Log Likelihoods in Sigmoid Belief Networks
Lawrence K. Saul, Tommi Jaakkola, and Michael I. Jordan, MIT

Sigmoid belief nets define graphical models with useful probabilistic
semantics.  We show how to calculate a lower bound on the log
likelihood of any partial instantiation of a sigmoid belief net.  The
bound can be used as a basis for inference and learning provided it is
sufficiently tight; in practice we have found this often to be the
case.  The bound is computed by approximating the true posterior
distribution over uninstantiated nodes, $P$, by a more tractable
distribution, $Q$.  Parameterized forms for $Q$ include factorial
distributions, mixture models, and hierarchical distributions that
exploit the presence of tractable substructures in the original belief
net.


Saturday, Dec 2
---------------
7:30am - 8:10am
A Method for Learning the Structure of a Neural Network from Data
Gregory F. Cooper and Sankaran Rajagopalan, University of Pittsburgh

A neural network can be viewed as consisting of a set of arcs and a
parameterization of those arcs. Most neural network learning has
focused on parameterizing a user-specified set of arcs. Relatively
less work has addressed automatically learning from data which arcs to
include in the network (i.e., the neural network structure). We will
present a method for learning neural network structures from data
(call it LNNS). The method takes as input a database and a set of
priors over possible neural network structures, and it outputs the
neural network structure that is most probable as found by a heuristic
search procedure. We will describe the close relationship between the
LNNS method and current Bayesian methods for learning Bayesian belief
networks. We also will show how we can apply the LNNS method to learn
hidden nodes in Bayesian belief networks.

8:10am - 8:50am
Local learning in probabilistic networks with hidden variables
Stuart Russell, John Binder, Daphne Koller, Keiji Kanazawa
University of California, Berkeley

We show that general probabilistic (Bayesian) networks with fixed
structure containing hidden variables can be learned automatically
from data using a gradient-descent mechanism similar to that used in
neural networks.  The gradient can be computed locally from
information available as a direct by-product of the normal inference
process in the network. Because probabilistic networks provide
explicit representations of causal structure, human experts can easily
contribute prior knowledge to the training process, thereby
significantly improving the sample complexity. The method can be
extended to networks with intensionally represented distributions,
including networks with continuous variables and dynamic probabilistic
networks (DPNs). Because DPNs provide a decomposed representation of
state, they may have some advantages over HMMs as a way of learning
certain types of stochastic temporal processes with hidden state.
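The locality of the gradient can be seen in the smallest possible case. The following sketch (my own numbers, a hypothetical two-node network H -> X with H hidden) checks that the gradient of the log-likelihood with respect to a conditional probability comes directly from the posterior that ordinary inference already computes:

```python
import math

p_h = 0.6               # P(H = 1)
theta = 0.7             # P(X = 1 | H = 1), the parameter being learned
p_x_h0 = 0.2            # P(X = 1 | H = 0), held fixed

def log_lik(t):
    """log P(X = 1) as a function of theta."""
    return math.log(p_h * t + (1.0 - p_h) * p_x_h0)

# Inference by-product: the posterior P(H = 1 | X = 1).
p_x = p_h * theta + (1.0 - p_h) * p_x_h0
posterior_h1 = p_h * theta / p_x

# Local gradient rule: d log P(X=1) / d theta = P(H=1 | X=1) / theta.
local_grad = posterior_h1 / theta

# Finite-difference check of the same gradient.
eps = 1e-6
numeric_grad = (log_lik(theta + eps) - log_lik(theta - eps)) / (2.0 * eps)
```

In a full network the same pattern holds per conditional-probability entry, so gradient descent needs no machinery beyond the inference procedure itself.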

8:50am - 9:30am
Asymptotic Bayes Factors for Directed Networks with Hidden Variables
Dan Geiger, Technion
David Heckerman, Microsoft
Chris Meek, Carnegie Mellon University

9:30am - 9:40am
break

9:40am - 10:20am
Bayesian Estimation of Gaussian Bayes Networks
Richard Scheines, Carnegie Mellon University

The Gibbs sampler can be used to draw a sample from the posterior
distribution over the parameters in a linear causal model (Gaussian
network, structural equation model, or LISREL model).  I show how this
can be done, and provide several examples that demonstrate its
utility.  These include estimating under-identified models and making
correct small-sample inferences about the parameters when the
likelihood surface is non-normal, contrary to the assumptions of the
asymptotic theory on which all current techniques rely.  In fact the
likelihood surface for structural equation models (SEMs) with latent
variables is often multi-modal.  I give an example of such a case, and
show how the Gibbs sampler is still informative and useful in this
case in contrast to standard SEM software like LISREL.
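As a hedged sketch of the idea (my construction, not the speaker's software), a Gibbs sampler for the slope of a one-equation linear Gaussian model y = b*x + e alternates draws from the two full conditionals:

```python
import random
import statistics

random.seed(0)

# Synthetic data from a hypothetical one-equation model y = b * x + noise.
x = [0.5, 1.0, 1.5, 2.0, 2.5]
true_b = 2.0
y = [true_b * xi + random.gauss(0.0, 0.5) for xi in x]

n = len(x)
sxx = sum(xi * xi for xi in x)
sxy = sum(xi * yi for xi, yi in zip(x, y))

b, tau = 0.0, 1.0
b_samples = []
for t in range(6000):
    # b | tau, y ~ Normal(sxy / sxx, 1 / (tau * sxx))   (flat prior on b)
    b = random.gauss(sxy / sxx, (tau * sxx) ** -0.5)
    # tau | b, y ~ Gamma(shape = n / 2, scale = 2 / SSE) (Jeffreys-style prior)
    sse = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    tau = random.gammavariate(n / 2.0, 2.0 / sse)
    if t >= 1000:                       # discard burn-in
        b_samples.append(b)

posterior_mean_b = statistics.mean(b_samples)   # near the least-squares slope
```

With conjugate conditionals the sampler mixes quickly; real structural equation models add more parameters but follow the same alternating pattern, and the sampled posterior remains informative even when the likelihood surface is multi-modal.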

10:20am - 11:00am
Learning Stochastic Grammars
Stephen M. Omohundro

Bayesian networks represent the distribution of a fixed set of random
variables using conditional independence information. Many important
application domains (e.g., speech, vision, and planning) have state
spaces that don't naturally decompose into a fixed set of random
variables. In this talk, I'll present stochastic grammars as a
tractable class of probabilistic models over this kind of domain. I'll
describe an algorithm by Stolcke and myself for learning hidden
Markov models and stochastic regular grammars from training strings,
one that is closely related to algorithms for learning Bayesian
networks. I'll
discuss connections between stochastic grammars and graphical
probability models and argue for the need for richer structures which
encompass certain properties of both classes of model.

11:00am - 11:20am
Variable selection using the theory of Bayesian networks
Christopher Meek, Carnegie Mellon University

This talk will describe two approaches to variable selection based
upon the theory of Bayesian networks. In a study of the prediction
of mortality in hospital patients with pneumonia, these two methods
were used to select a set of variables upon which predictive models
were developed. In addition, several neural network methods were used
to develop predictive models from the same large database of
cases. The mortality models developed in this study are distinguished
more by the number of variables and parameters that they contain than
by their error rates. I will offer an explanation of why the two
methods described lead to small sets of variables and models with
fewer parameters. In addition, the variable sets selected by the
Bayesian network approaches are amenable to future implementation in a
paper-based form and, for several of the models, are strikingly
similar to variable sets hand selected by physicians.

11:20am - 11:40am
Methods for Learning Hidden Variables
Joel Martin

Hidden variables can be learned in many ways.  Which method should be
used?  Some have better theoretical justifications, some seem to have
better pragmatic justifications.  I will describe a simple class of
probabilistic models and will compare several methods for learning hidden
variables from data.  The methods compared are EM, gradient descent,
simulated annealing, genetic search, and a variety of incremental
techniques.  I will discuss the results by considering particular
applications.


----mid-day break----

4:30pm - 5:10pm
Brains, Nets, Feedback and Time Series
Clark Glymour, Thomas Richardson and Peter Spirtes, Carnegie Mellon

Neural networks have been used extensively to model the differences
between normal cognitive behavior and the behavior of brain damaged
subjects. A network is trained to simulate normal behavior and then
lesioned; the lesioned network, sometimes after retraining, is taken
to simulate brain-damaged behavior. A common objection to this
explanatory strategy (see for
example comments on Martha Farah's recent contribution in Behavioral
and Brain Sciences) is that it can "explain anything." The complaint
alleges that for any mathematically possible pairing of normal and
brain-damaged behavior, there exists a neural net that simulates the
normal and when lesioned simulates the abnormal. Is that so?

We can model the normal and abnormal behaviors as probability
distributions on a set of nodes, and the network as a set of
simultaneous (generally non-recursive) equations with independent
noises. The network and joint probability distribution then describe a
cyclic graph and associated probability distribution. What is the
connection between the probability distribution and the graph
topology?  It is easy to show that the (local) Markov condition fails
for linear networks of this sort. Spirtes (and independently
J. Koster, in a forthcoming paper in Annals of Statistics) has shown
that the conditional independencies implied by a linear cyclic network
are characterized by d-separation (in either Pearl's or Lauritzen's
versions). Spirtes has also shown that d-separation fails for
non-linear cyclic networks with independent errors, but has given
another characterization for the non-linear case.

These results can be directly applied to the methodological disputes
about brains and neural net models. Assuming faithfulness, it follows
that a lesioned neural net represented as a cyclic Bayes net, whether
linear or non-linear, must preserve the conditional independence
relations in the original network. Hence not every pairing of normal
and abnormal behavior is possible according to the neural net
hypothesis as formulated.

This connection between Bayes nets and work on neural models suggests
that we might look for other applications of Bayesian networks in
cognitive neuropsychology. We might hope to see applications of
discovery methods for Bayes nets to multiple single cell recordings,
or to functional MRI data. Such applications would be aided by
discovery methods for cyclic Bayes nets as models of recurrent neural
nets.  Several important theoretical steps have been taken towards
that goal by Richardson, who has found a polynomial-time decision
procedure for the equivalence of linear cyclic graphs, and a
procedure, polynomial-time in sparse graphs, that is asymptotically
correct and complete for discovering equivalence classes of linear
cyclic graphs (without latent variables).  Research is under way on
search procedures that parallel the FCI procedure of Spirtes, Glymour,
and Scheines (1993) in allowing latent variables. The questions of
equivalence and discovery for non-linear systems are relatively
untouched.

It may be that the proper way to model a recurrent neural net is not
by a cyclic graphical model, but by a time series. That suggestion
raises the question of the connections between time-series and cyclic
graphical models, a matter under investigation by Richardson. Results
in this area would have implications for econometrics as well as for
neuropsychology, since the econometric tradition has treated feedback
systems in both ways, by simultaneous linear equations and by time
series, without fully characterizing the relations between the
representations. While there are situations in which equilibrium
models, such as cyclic graphical models, appear applicable, these
models, since they are not dynamic, make no predictions about the
dynamic behavior of a system (time series) if it is pushed out of
equilibrium.

5:10pm - 5:50pm
Compiling Probabilistic Networks and Some Questions this Poses
Wray Buntine

Probabilistic networks (and similar formalisms) provide a high-level language that
can be used as the input to a compiler for generating a learning or
inference algorithm.  Example compilers are BUGS (inputs a Bayes
net with plates) by Gilks, Spiegelhalter, et al., and MultiClass (inputs
a dataflow graph) by Roy.  This talk will cover three parts:  (1) an
outline of the arguments for such compilers for probabilistic
networks, (2) an introduction to some compilation techniques, and
(3) the presentation of some theoretical challenges that compilation
poses.

High-level language compilers are usually justified as a rapid
prototyping tool.  In learning, rapid prototyping arises for the
following reasons:  good priors for complex networks are not obvious
and experimentation can be required to understand them;  several
algorithms may suggest themselves and experimentation is required
for comparative evaluation.  These and other justifications will be
described in the context of some current research on learning
probabilistic networks, and past research on learning classification
trees and feed-forward neural networks.  Techniques for compilation
include the data flow graph, automatic differentiation, Markov chain
Monte Carlo samplers of various kinds, and the generation of C
code for certain exact inference tasks.  With this background, I will
then pose a number of important research questions to the audience.

5:50pm - 6:30pm
discussion


Organizers:
Wray Buntine
Greg Cooper
Dan Geiger
Clark Glymour
David Heckerman
Geoffrey Hinton
Mike Jordan
Steffen Lauritzen
David Madigan
Radford Neal
Steve Omohundro
Judea Pearl
Stuart Russell
Richard Scheines
Peter Spirtes







