New Book
Dirk Husmeier
d.husmeier at ic.ac.uk
Wed Mar 10 09:32:34 EST 1999
The following book is now available:
Dirk Husmeier
NEURAL NETWORKS FOR CONDITIONAL
PROBABILITY ESTIMATION
Forecasting Beyond Point Predictions
Perspectives in Neural Computing
Springer-Verlag
ISBN 1-85233-095-3
275 pages
http://www.springer.co.uk
--------------------------------------------------
SYNOPSIS
--------------------------------------------------
Neural networks have been extensively applied
to regression, forecasting, and system modelling.
However, most of the conventional approaches
predict only a single value as a function of the
network inputs, which is inappropriate when
the underlying conditional probability density
is skewed or multi-modal.
The objective of this book is to study the
application of neural networks to
predicting the entire conditional probability
distribution of an unknown data-generating process.
In the first part, the structure of a
universal approximator architecture is discussed,
and a backpropagation-like training scheme is
derived from a maximum likelihood approach.
More advanced chapters address the problems
of training speed and generalisation performance.
Several recent learning and regularisation methods
are reviewed and adapted to the problem of predicting
conditional probabilities:
a combination of the random vector functional link net
approach with the expectation maximisation algorithm,
a generalisation of the Bayesian evidence scheme to
mixture models, the derivation of an appropriate
weighting scheme in network ensembles,
and a discussion of why the over-fitting of individual
networks may lead to an improved prediction
performance of a network committee.
All techniques and algorithms are applied to a variety of
synthetic and real-world benchmark problems, and numerous graphs
and diagrams provide deeper insight into the nature of the
learning and regularisation processes.
Presupposing only a basic knowledge of
probability and calculus, this book should
be of interest to graduate students, researchers
and practitioners in statistics, econometrics and
artificial intelligence.
--------------------------------------------------
OVERVIEW
--------------------------------------------------
Conventional applications of neural networks usually
predict a single value as a function of given inputs.
In forecasting, for example,
a standard objective is to predict the future value
of some entity of interest on the basis of a time
series of past measurements or observations.
Typical training schemes aim to minimise the sum of
squared deviations between predicted and actual
values (the 'targets'), whereby, ideally, the network
learns the conditional mean of the target given the input.
If the underlying conditional distribution is Gaussian,
or at least unimodal,
this may be a satisfactory approach.
However, for a multimodal distribution the conditional
mean does not capture the relevant features of the
system, and the prediction performance will, in general,
be very poor. This calls for a more powerful model
that can learn the whole conditional probability
distribution.
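As a reminder of the standard result behind this statement (not
specific to this book): the minimiser of the expected squared error
is the conditional mean,

   f^*(x) = \arg\min_f E[(t - f(x))^2 \mid x] = E[t \mid x],

which for a multimodal p(t|x) may fall in a region of very low
probability and is then a poor forecast.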
Chapter 1 demonstrates that
even for a deterministic system with
'benign' Gaussian observational noise,
the distribution of a future observation,
conditioned on a set of past observations, can
become strongly skewed and multimodal.
In Chapter 2, a general neural network structure
for modelling conditional probability densities
is derived, and it is shown that a universal
approximator for this extended task requires
at least two hidden layers.
A training scheme is developed from a
maximum likelihood
approach in Chapter 3, and the performance
of this method is demonstrated on
three stochastic time series in Chapters 4
and 5.
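To make the maximum-likelihood idea concrete: a common way of
parameterising a conditional density (given here purely as an
illustration; the architecture developed in the book differs in its
details) is a mixture of Gaussians whose mixing coefficients,
centres and widths are all functions of the network input,

   p(t \mid x, w) = \sum_{k=1}^{K} a_k(x, w) \,
                    N(t; \mu_k(x, w), \sigma_k^2(x, w)),

and training then minimises the negative log-likelihood of the data,

   E(w) = - \sum_{n=1}^{N} \ln p(t_n \mid x_n, w),

whose gradient can be computed layer by layer, which is what makes
the training scheme 'backpropagation-like'.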
Several extensions of this basic paradigm are studied
in the following chapters, aiming at both
increased training speed and better generalisation
performance.
Chapter 7 shows that a straightforward application
of the expectation maximisation (EM) algorithm does
not by itself improve the training scheme,
but that in combination with the
random vector functional link (RVFL) net approach,
reviewed in Chapter 6, the training
process can be accelerated by about two orders of magnitude.
Empirical corroboration of this speed-up
is given in Chapter 8.
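For readers unfamiliar with RVFL networks, the following minimal
sketch (in present-day Python/numpy, purely illustrative) shows the
basic principle: the input-to-hidden weights are drawn at random and
frozen, so only the output weights need to be fitted. Here this is
done by ordinary least squares for a plain regression target; the
book instead fits the output parameters of a conditional density
model with EM, which this sketch does not attempt.

   import numpy as np

   def rvfl_fit(X, t, n_hidden=50, seed=0):
       # Random, frozen input-to-hidden weights and biases.
       rng = np.random.default_rng(seed)
       W = rng.normal(size=(X.shape[1], n_hidden))
       b = rng.normal(size=n_hidden)
       H = np.tanh(X @ W + b)
       # RVFL keeps a direct input-to-output link, so the output
       # layer sees the hidden activations, the raw inputs and a bias.
       F = np.hstack([H, X, np.ones((len(X), 1))])
       # Only the output weights are trained, here in closed form.
       beta, *_ = np.linalg.lstsq(F, t, rcond=None)
       return W, b, beta

   def rvfl_predict(X, W, b, beta):
       H = np.tanh(X @ W + b)
       F = np.hstack([H, X, np.ones((len(X), 1))])
       return F @ beta

   # Toy usage: fit a noisy sine wave.
   X = np.linspace(-3, 3, 200).reshape(-1, 1)
   t = np.sin(2 * X[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=200)
   W, b, beta = rvfl_fit(X, t)
   print(np.mean((rvfl_predict(X, W, b, beta) - t) ** 2))

Because the expensive non-linear optimisation of the hidden layer is
replaced by a single linear solve (or, in the book's setting, by EM
updates of the output parameters), training can be far cheaper than
full backpropagation through all layers.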
Chapter 9 discusses a simple
Bayesian approach to network training,
where a conjugate prior distribution on the network
parameters naturally results in a penalty term
for regularisation.
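In the simplest such case, a zero-mean Gaussian prior over the
weights turns the negative log-posterior into the familiar data
error plus a weight-decay term (a standard textbook identity, quoted
here only for orientation):

   - \ln p(w \mid D) = E_D(w) + \frac{\alpha}{2} \|w\|^2 + \mathrm{const},

where E_D(w) is the negative log-likelihood and \alpha is the
hyperparameter controlling the strength of the penalty.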
However, the hyperparameters still
need to be set by intuition or cross-validation,
so a natural extension is presented in
Chapters 10 and 11,
where the Bayesian evidence scheme,
introduced to the neural network
community by MacKay for regularisation and model selection
in the simple case of Gaussian homoscedastic noise,
is generalised to arbitrary
conditional probability densities. The Hessian matrix of the
error function is calculated with an extended version of the
EM algorithm.
The resulting update equations for the hyperparameters
and the expression for the model evidence
are found to reduce to
MacKay's results in the above limit of Gaussian noise,
and thus provide a consistent generalisation
of these earlier results.
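For orientation, MacKay's re-estimation formula for the weight-decay
hyperparameter in that Gaussian limit (a published result, quoted
here for reference) is

   \alpha_{new} = \gamma / \|w_{MP}\|^2, with
   \gamma = \sum_i \lambda_i / (\lambda_i + \alpha),

where w_{MP} is the posterior mode, the \lambda_i are the eigenvalues
of the Hessian of the data error at w_{MP}, and \gamma counts the
well-determined parameters; the generalised scheme recovers this form
when the noise is Gaussian and homoscedastic.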
An empirical test of the evidence-based regularisation scheme,
presented in Chapter 12, confirms that the problem of
overfitting can be considerably reduced,
and that the training process is stabilised with
respect to changes in the training time.
A further improvement of the generalisation
performance can be achieved by
employing network committees, for which two weighting
schemes, based on either the evidence or the
cross-validation performance, are derived
in Chapter 13.
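Schematically, such a committee predicts with a convex combination
of the individual predictive densities,

   \hat p(t \mid x) = \sum_k \pi_k \, p_k(t \mid x),
   \pi_k \ge 0, \ \sum_k \pi_k = 1,

with the weights \pi_k derived from either the model evidence or the
cross-validation performance of the k-th network; the precise
weighting schemes are derived in the book.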
Chapters 14 and 16 report the results
of extensive simulations on a synthetic and a
real-world problem,
where the intriguing observation is made that in
network committees, overfitting of the individual
models can be useful and may lead to better prediction results
than those obtained with an ensemble of properly regularised networks.
An explanation for this curiosity can be given
in terms of a modified bias-variance dilemma,
as expounded in Chapter 13.
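The flavour of that argument can be conveyed with the familiar
squared-error decomposition, stated here under the simplifying
assumption of uncorrelated member errors with common bias B and
variance V (the book develops the argument for predictive
densities): for an equally weighted average of K networks,

   E[(\bar f(x) - E[t \mid x])^2] = B^2 + V / K,

so averaging suppresses the variance contribution by a factor of K
while leaving the bias untouched; under-regularised (low-bias,
high-variance) members can therefore combine into a committee that
outperforms one built from strongly regularised, higher-bias members.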
The subject of Chapter 15 is the problem of feature
selection and the identification of irrelevant inputs.
To this end, the automatic relevance
determination (ARD) scheme of MacKay and Neal is adapted
to learning in committees of probability-predicting RVFL networks.
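In ARD, the weights fanning out of each input are given their own
prior precision, schematically

   p(w) \propto \exp( - \frac{1}{2} \sum_g \alpha_g \|w_g\|^2 ),

with w_g the weights attached to input g; inputs whose precision
\alpha_g is driven to large values during training are effectively
switched off and thereby identified as irrelevant. How this carries
over to committees of RVFL networks is worked out in the book.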
This method is applied in Chapter 16 to a
real-world benchmark problem, where the objective is
the prediction of housing prices
in the Boston metropolitan area on the basis of various
socio-economic explanatory variables.
The book concludes in Chapter 17
with a brief summary.