The sigmoid is the posterior distribution from Gaussian likelihoods

Tony Robinson ajr at eng.cam.ac.uk
Tue Sep 15 15:42:51 EDT 1992


The subject line says it all.  Given N classes each of which has a Gaussian
distribution in the input space (with common covariance matrix), it is
reasonably well known that the discriminant function is a hyperplane (e.g.
Kohonen's book, section 7.2).  But what I didn't know until a month ago is
that if you calculate the posterior probabilities using Bayes rule from the
Gaussian likelihoods, then you end up with a weighted sum computation and
the Potts/softmax activation function for N classes or the sigmoid for the
two-class case.  This is exactly the same function as that computed in the last
layer of a multi-layer perceptron used for classification.  One nice
corollary is that the "bias" weight contains the log of the prior for the
class, and so may be adjusted to compensate for different training/testing
environments.  Another is that, provided the data near the class boundary can
be accurately modelled as Gaussian, the sigmoid gives a good estimate of the
posterior probabilities.  From this viewpoint, the function of the lower
layers of a multi-layer perceptron is to generate Gaussian distributions
with identical covariance matrices.
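For concreteness, here is a small numerical sketch of the two-class case
(written in Python with NumPy; the means, covariance and priors below are
made-up illustration values, not taken from the paper).  It checks that the
exact Bayes posterior computed from the two Gaussian likelihoods agrees with
a sigmoid applied to a weighted sum w.x + b, where the bias b absorbs the
log prior ratio:

  import numpy as np

  mu0 = np.array([0.0, 0.0])          # class 0 mean (illustration value)
  mu1 = np.array([2.0, 1.0])          # class 1 mean (illustration value)
  Sigma = np.array([[1.0, 0.3],       # common covariance matrix
                    [0.3, 2.0]])
  p1, p0 = 0.3, 0.7                   # class priors (illustration values)

  Sinv = np.linalg.inv(Sigma)
  norm = np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))

  def likelihood(x, mu):
      d = x - mu
      return np.exp(-0.5 * d @ Sinv @ d) / norm

  def bayes_posterior(x):
      # exact posterior P(class 1 | x) from Bayes rule
      a1 = p1 * likelihood(x, mu1)
      a0 = p0 * likelihood(x, mu0)
      return a1 / (a1 + a0)

  # The same posterior written as a sigmoid of a weighted sum:
  w = Sinv @ (mu1 - mu0)
  b = -0.5 * (mu1 @ Sinv @ mu1 - mu0 @ Sinv @ mu0) + np.log(p1 / p0)

  def sigmoid_posterior(x):
      return 1.0 / (1.0 + np.exp(-(w @ x + b)))

  x = np.array([1.5, -0.5])
  print(bayes_posterior(x))           # both print the same probability
  print(sigmoid_posterior(x))

Note that only the bias term depends on the priors, so changing the
training/testing class frequencies shifts b by the change in log(p1/p0),
which is the adjustment mentioned in the corollary above.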

Feedback from veterans of the field has been "yes, of course I knew that",
but in case this is new to you, as it was to me, I have written it up as
part of a tutorial paper which is available from the anonymous ftp site
svr-ftp.eng.cam.ac.uk as file reports/robinson_cnnss92.ps.Z.  The same
directory carries an INDEX file detailing other reports which may be of
interest.

Tony [Robinson]

