New Book
Dirk Husmeier
d.husmeier at ic.ac.uk
Wed Mar 10 09:32:34 EST 1999
The following book is now available:
Dirk Husmeier
NEURAL NETWORKS FOR CONDITIONAL
PROBABILITY ESTIMATION
Forecasting Beyond Point Predictions
Perspectives in Neural Computing
Springer-Verlag
ISBN 1-85233-095-3
275 pages
http://www.springer.co.uk
--------------------------------------------------
SYNOPSIS
--------------------------------------------------
Neural networks have been extensively applied
to regression, forecasting, and system modelling.
However, most of the conventional approaches
predict only a single value as a function of the
network inputs, which is inappropriate when
the underlying conditional probability density
is skewed or multi-modal.
The objective of this book is to study the
application of neural networks to
predicting the entire conditional probability
distribution of an unknown data-generating process.
In the first part, the structure of a
universal approximator architecture is discussed,
and a backpropagation-like training scheme is
derived from a maximum likelihood approach.
More advanced chapters address the problems
of training speed and generalisation performance.
Several recent learning and regularisation methods
are reviewed and adapted to the problem of predicting
conditional probabilities:
a combination of the random vector functional link net
approach with the expectation maximisation algorithm,
a generalisation of the Bayesian evidence scheme to
mixture models, the derivation of an appropriate
weighting scheme in network ensembles,
and a discussion of why the over-fitting of individual
networks may lead to an improved prediction
performance of a network committee.
All techniques and algorithms are applied to a variety of
synthetic and real-world benchmark problems, and numerous graphs
and diagrams provide deeper insight into the nature of the
learning and regularisation processes.
Presupposing only a basic knowledge of
probability and calculus, this book should
be of interest to graduate students, researchers
and practitioners in statistics, econometrics and
artificial intelligence.
--------------------------------------------------
OVERVIEW
--------------------------------------------------
Conventional applications of neural networks usually
predict a single value as a function of given inputs.
In forecasting, for example,
a standard objective is to predict the future value
of some entity of interest on the basis of a time
series of past measurements or observations.
Typical training schemes aim to minimise the sum of
squared deviations between predicted and actual
values (the 'targets'), whereby, ideally, the network
learns the conditional mean of the target given the input.
If the underlying conditional distribution is Gaussian,
or at least unimodal,
this may be a satisfactory approach.
However, for a multimodal distribution the conditional
mean does not capture the relevant features of the
system, and the prediction performance will, in general,
be very poor. This calls for a more powerful model
that can learn the whole conditional probability
distribution.
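As a reminder of the standard result behind this statement (not
specific to this book): the minimiser of the expected squared error
is the conditional mean,

   f^*(x) = \arg\min_f E[(t - f(x))^2 \mid x] = E[t \mid x],

which for a multimodal p(t|x) may fall in a region of very low
probability and is then a poor forecast.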
Chapter 1 demonstrates that
even for a deterministic system with
'benign' Gaussian observational noise,
the distribution of a future observation,
conditioned on a set of past observations, can
become strongly skewed and multimodal.
In Chapter 2, a general neural network structure
for modelling conditional probability densities
is derived, and it is shown that a universal
approximator for this extended task requires
at least two hidden layers.
A training scheme is developed from a
maximum likelihood
approach in Chapter 3, and the performance
of this method is demonstrated on
three stochastic time series in Chapters 4
and 5.
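To make the maximum-likelihood idea concrete: a common way of
parameterising a conditional density (given here purely as an
illustration; the architecture developed in the book differs in its
details) is a mixture of Gaussians whose mixing coefficients,
centres and widths are all functions of the network input,

   p(t \mid x, w) = \sum_{k=1}^{K} a_k(x, w) \,
                    N(t; \mu_k(x, w), \sigma_k^2(x, w)),

and training then minimises the negative log-likelihood of the data,

   E(w) = - \sum_{n=1}^{N} \ln p(t_n \mid x_n, w),

whose gradient can be computed layer by layer, which is what makes
the training scheme 'backpropagation-like'.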
Several extensions of this basic paradigm are studied
in the following chapters, aiming at both
increased training speed and better generalisation
performance.
Chapter 7 shows that a straightforward application
of the expectation maximisation (EM) algorithm does
not by itself improve the training scheme,
but that in combination with the
random vector functional link (RVFL) net approach,
reviewed in Chapter 6, the training
process can be accelerated by about two orders of magnitude.
Empirical corroboration of this speed-up
is given in Chapter 8.
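For readers unfamiliar with RVFL networks, the following minimal
sketch (in present-day Python/numpy, purely illustrative) shows the
basic principle: the input-to-hidden weights are drawn at random and
frozen, so only the output weights need to be fitted. Here this is
done by ordinary least squares for a plain regression target; the
book instead fits the output parameters of a conditional density
model with EM, which this sketch does not attempt.

   import numpy as np

   def rvfl_fit(X, t, n_hidden=50, seed=0):
       # Random, frozen input-to-hidden weights and biases.
       rng = np.random.default_rng(seed)
       W = rng.normal(size=(X.shape[1], n_hidden))
       b = rng.normal(size=n_hidden)
       H = np.tanh(X @ W + b)
       # RVFL keeps a direct input-to-output link, so the output
       # layer sees the hidden activations, the raw inputs and a bias.
       F = np.hstack([H, X, np.ones((len(X), 1))])
       # Only the output weights are trained, here in closed form.
       beta, *_ = np.linalg.lstsq(F, t, rcond=None)
       return W, b, beta

   def rvfl_predict(X, W, b, beta):
       H = np.tanh(X @ W + b)
       F = np.hstack([H, X, np.ones((len(X), 1))])
       return F @ beta

   # Toy usage: fit a noisy sine wave.
   X = np.linspace(-3, 3, 200).reshape(-1, 1)
   t = np.sin(2 * X[:, 0]) + 0.1 * np.random.default_rng(1).normal(size=200)
   W, b, beta = rvfl_fit(X, t)
   print(np.mean((rvfl_predict(X, W, b, beta) - t) ** 2))

Because the expensive non-linear optimisation of the hidden layer is
replaced by a single linear solve (or, in the book's setting, by EM
updates of the output parameters), training can be far cheaper than
full backpropagation through all layers.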
Chapter 9 discusses a simple
Bayesian approach to network training,
where a conjugate prior distribution on the network
parameters naturally results in a penalty term
for regularisation.
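In the simplest such case, a zero-mean Gaussian prior over the
weights turns the negative log-posterior into the familiar data
error plus a weight-decay term (a standard textbook identity, quoted
here only for orientation):

   - \ln p(w \mid D) = E_D(w) + \frac{\alpha}{2} \|w\|^2 + \mathrm{const},

where E_D(w) is the negative log-likelihood and \alpha is the
hyperparameter controlling the strength of the penalty.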
However, the hyperparameters still
need to be set by intuition or cross-validation,
so a natural extension is presented in
Chapters 10 and 11,
where the Bayesian evidence scheme,
introduced to the neural network
community by MacKay for regularisation and model selection
in the simple case of Gaussian homoscedastic noise,
is generalised to arbitrary
conditional probability densities. The Hessian matrix of the
error function is calculated with an extended version of the
EM algorithm.
The resulting update equations for the hyperparameters
and the expression for the model evidence
are found to reduce to
MacKay's results in the above limit of Gaussian noise,
and thus provide a consistent generalisation
of these earlier results.
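For orientation, MacKay's re-estimation formula for the weight-decay
hyperparameter in that Gaussian limit (a published result, quoted
here for reference) is

   \alpha_{new} = \gamma / \|w_{MP}\|^2, with
   \gamma = \sum_i \lambda_i / (\lambda_i + \alpha),

where w_{MP} is the posterior mode, the \lambda_i are the eigenvalues
of the Hessian of the data error at w_{MP}, and \gamma counts the
well-determined parameters; the generalised scheme recovers this form
when the noise is Gaussian and homoscedastic.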
An empirical test of the evidence-based regularisation scheme,
presented in Chapter 12, confirms that the problem of
overfitting can be considerably reduced,
and that the training process is stabilised with
respect to changes in the training time.
A further improvement of the generalisation
performance can be achieved by
employing network committees, for which two weighting
schemes, based on either the evidence or the
cross-validation performance, are derived
in Chapter 13.
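Schematically, such a committee predicts with a convex combination
of the individual predictive densities,

   \hat p(t \mid x) = \sum_k \pi_k \, p_k(t \mid x),
   \pi_k \ge 0, \ \sum_k \pi_k = 1,

with the weights \pi_k derived from either the model evidence or the
cross-validation performance of the k-th network; the precise
weighting schemes are derived in the book.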
Chapters 14 and 16 report the results
of extensive simulations on a synthetic and a
real-world problem,
where the intriguing observation is made that in
network committees, overfitting of the individual
models can be useful and may lead to better prediction results
than those obtained with an ensemble of properly regularised networks.
An explanation for this curiosity can be given
in terms of a modified bias-variance dilemma,
as expounded in Chapter 13.
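The flavour of that argument can be conveyed with the familiar
squared-error decomposition, stated here under the simplifying
assumption of uncorrelated member errors with common bias B and
variance V (the book develops the argument for predictive
densities): for an equally weighted average of K networks,

   E[(\bar f(x) - E[t \mid x])^2] = B^2 + V / K,

so averaging suppresses the variance contribution by a factor of K
while leaving the bias untouched; under-regularised (low-bias,
high-variance) members can therefore combine into a committee that
outperforms one built from strongly regularised, higher-bias members.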
The subject of Chapter 15 is the problem of feature
selection and the identification of irrelevant inputs.
To this end, the automatic relevance
determination (ARD) scheme of MacKay and Neal is adapted
to learning in committees of probability-predicting RVFL networks.
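In ARD, the weights fanning out of each input are given their own
prior precision, schematically

   p(w) \propto \exp( - \frac{1}{2} \sum_g \alpha_g \|w_g\|^2 ),

with w_g the weights attached to input g; inputs whose precision
\alpha_g is driven to large values during training are effectively
switched off and thereby identified as irrelevant. How this carries
over to committees of RVFL networks is worked out in the book.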
This method is applied in Chapter 16 to a
real-world benchmark problem, where the objective is
the prediction of housing prices
in the Boston metropolitan area on the basis of various
socio-economic explanatory variables.
The book concludes in Chapter 17
with a brief summary.