cross entropy and training time in Connectionist Nets and HMM's

thanasis kehagias ST401843%BROWNVM.BITNET at VMA.CC.CMU.EDU
Wed Mar 15 18:23:24 EST 1989


these are some random thoughts on the issue of training in HMM and
Connectionist networks. i focus on the cross entropy function and
follow a different line of thinking than in my paper, which i cite
further down. this note is the outcome of an exchange between Tony
Robinson and me; i thought some netters might be interested. so i want
to thank Tony for posing interesting ideas and questions, and also all
the people who replied to my request for information on the cross
entropy function.

-----------------------
the starting point for this discussion is the following question:

"why is HMM training so much faster than Connectionist Network
training?"

to put the question in perspective, let me first remark that, from a
certain point of view, HMM and CN are very similar objects:
specifically, they use similar architectures to optimize appropriate
cost functions. for further explanation of this point, see [Kehagias],
also [Kung]. the similarity is even more obvious when CN are used to
solve speech recognition problems.

the question remains: why, in attempting to solve the same problem,
do CN require so much more training?

1. cost functions
-----------------

it appears that a (partial) explanation is the nature of the cost
function used in each case. in CN speech recognizers, the cost
function of choice is the quadratic error (the error being the
difference of appropriate vectors). however, in most of what follows
i will consider CN that minimize the cross entropy function. a short
discussion of the relationship between cross entropy and square error
is included at the end. in HMM the function MAXIMIZED is the
likelihood (of the observations).

however, HMM are a bit more subtle. using the Markov Model, one can
write down the likelihood of the observations used for training; call
it L(q). here q is a vector that contains the transition and emission
probabilities (usually called a_ij and b_kj, respectively). to keep
the discussion simple, let us take the only unknown parameters to be
the a_ij's; that is, the elements of q are the a_ij's. now, q is a
vector, but a more general view is that it is a function (specifically,
a probability density function), so we will treat q as a vector or a
function interchangeably. (of course, any vector is a function of its
index!)

Now, to maximize L is not a trivial task: it is a polynomial of
(n*T)-th order in the elements of q (where n is the order of the
Markov model and T the number of observations); furthermore, the
elements of q are probabilities, so they must satisfy certain
positivity and add-up-to-one conditions.
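
to make the polynomial character of L concrete, here is a minimal
numpy sketch of the forward computation of L(q); the names A, B and pi
are my own notation for the a_ij's, the emission probabilities and the
initial state distribution:

import numpy as np

def likelihood(A, B, pi, obs):
    """forward algorithm: returns L(q) = P(obs | A, B, pi).
    A[i, j] = a_ij, B[j, k] = prob. of emitting symbol k in state j,
    pi[i] = initial probability of state i, obs = list of symbols."""
    alpha = pi * B[:, obs[0]]          # alpha_1(i) = pi_i * b_i(o_1)
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]  # one more factor of a_ij per step
    return alpha.sum()                 # hence a polynomial of degree ~T in q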

2. Likelihood maximin, Backward-Forward, EM algorithm
-----------------------------------------------------

so the HMM people have found a way to make the optimization problem
easier: consider an auxiliary function, call it Q(q,q'), to be defined
presently, which can be maximized much more easily. then they prove
the remarkable inequality:

(1)       L(q) * log( L(q') / L(q) )  >=  Q(q,q') - Q(q,q).

the consequence of (1) is the following: we can implement an iterative
algorithm that goes as follows:

Step 0: choose q(0).

.....

Step k: choose q(k) such that Q(q(k-1),q(k)) is maximized.

        if the gain Q(q(k-1),q(k)) - Q(q(k-1),q(k-1)) = 0, terminate;
        if the gain is > 0, go to step k+1.

.....

                    REMARKS:
1) observe that no provision is made for the case where the gain is
negative. this is because the maximum gain is always nonnegative (the
choice q(k)=q(k-1) already achieves zero gain), as proved in
[Baum 1968] or [Dempster].

2) of course, in practice, the termination condition will be replaced
by: if the gain < epsilon, terminate (epsilon small).

3) it is easy to see (by use of Calculus methods) that EXACT
maximization of Q(q(k-1),q(k)) with respect to q(k) is possible and,
in fact, easy. the maximizing values q(k) are given in terms of q(k-1)
by the reestimation formulas (see [Baum 1970]), the form of which
guarantees that they retain the probability properties (positivity,
add-up-to-one).

4) finally, note that the algorithm is a maximin algorithm, since it
maximizes the minimum gain in likelihood.

this algorithm in its general form is known as the EM algorithm
[Dempster]. in the HMM context it is known as the Backward-Forward
(BF) algorithm [Baum 1970]. it is a greedy algorithm that produces
a sequence of parameter values q(0),q(1),q(2),... such that:

(2)     Q(q(k-1),q(k))>Q(q(k-1),q(k-1)).

From (1) and (2) and Remark (1) it follows that

(3)     L(q(k)) > L(q(k-1)).
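
the whole iteration is short enough to write down; here is a minimal
python sketch, where argmax_Q and Q are hypothetical callables
standing for the reestimation formulas and the auxiliary function:

def em_loop(q0, argmax_Q, Q, eps=1e-8, max_iter=1000):
    """generic EM / BF iteration as described above.
    argmax_Q(q) returns the q' that maximizes Q(q, q');
    Q(q, qp) evaluates the auxiliary function."""
    q = q0
    for _ in range(max_iter):
        q_new = argmax_Q(q)           # step k: exact maximization (Remark 3)
        gain = Q(q, q_new) - Q(q, q)  # always >= 0 (Remark 1)
        if gain < eps:                # epsilon termination (Remark 2)
            break
        q = q_new                     # by (1) and (2), L(q_new) > L(q)
    return q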

3. Connection of EM with cross entropy and neural networks
----------------------------------------------------------

Now we will define the function Q, relate it to the cross entropy G,
and point out the relationship to CN.

The function Q(q,q') can be defined in quite a general setting. q and
q' are probability densities; as such, they are functions themselves,
and we write q(x), q'(x), where x takes values in an appropriate
range. e.g., in the HMM model, x ranges over all the state transition
pairs (i,j), with q(x) giving the probability of a certain state
transition. now, define Q:

(4)       Q(q,q') = sum{over all x} q(x) log(q'(x)).

Then the difference Q(q,q) - Q(q,q') is:

(5)       Q(q,q) - Q(q,q') = G(q,q') = sum{over all x} q(x) log(q(x)/q'(x)).

G is the cross entropy between q and q', well known to connectionists
(and statisticians) as a measure of distance between these two
probability densities.
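
for discrete densities stored as numpy arrays with positive entries
summing to one, (4) and (5) translate directly (a minimal sketch):

import numpy as np

def Q(q, qp):
    """eq. (4): Q(q, q') = sum over x of q(x) log q'(x)."""
    return float(np.sum(q * np.log(qp)))

def G(q, qp):
    """eq. (5): G(q, q') = Q(q, q) - Q(q, q'), the cross entropy
    (Kullback-Leibler distance) between q and q'."""
    return Q(q, q) - Q(q, qp)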

now we recognize two things:

I. there have been cases where G minimization has been proposed as a
CN training procedure; see [Hinton]. In these cases, a desired
probability density was known, and the goal was to minimize the
distance between the desired and the actual probability density of
the CN output. in some of these cases, there was concurrent
maximization of likelihood; this is noted in [Ackley]. indeed, it
follows necessarily from (1) that minimizing the cross entropy
maximizes the minimum improvement in likelihood.


II. it is clear that the BF algorithm does a similar thing: likelihood
maximization via cross entropy minimization. as noted in [Baum 1968]
and also in [Levinson], the difference q(k)-q(k-1) points in the same
direction as grad L(q) evaluated at q(k-1). that is, q(k-1) is changed
in the direction of steepest ascent of L, and of all the possible
steps (choices of q(k)), the one is chosen that minimizes the distance
between q(k-1) and q(k) in the cross entropy sense.

4. Comparison of training in HMM and CN
---------------------------------------

now we can make a comparison of the performance of CN and HMM's. this
comparison is between G-optimizing CN's and HMM's; the square-error CN
is not discussed here.

firstly, we see that the main focus of attention is different in the
two cases. in CN we want to minimize cross entropy; in HMM we want to
maximize likelihood. however, likelihood maximinimization is an
automatic consequence of G minimization for CN's, and local G
minimization is built into the BF algorithm. in that sense, the two
tasks are very similar, and so the question is once again raised: why
are HMM's faster to train?

at this point the answers are many and easy. even though HMM's use the
observations in a nonlinear way, the state vector of the adjoint
network (see [Kehagias]) evolves linearly; not so for CN's. the HMM
adjoint network is sparsely connected; not necessarily so for the CN
(pointed out by [Tony Robinson]). and though both cost functions are
nonlinear, BF is a much more efficient method for optimizing the HMM
cost function than Back Propagation is for CN's.

the last answer is the really important one. due to the special nature
of the Hidden Markov Model, we can use the BF algorithm. this
algorithm allows us to take large steps (large changes from q(k-1) to
q(k)) in the Euclidean distance, without moving too far away in the
cross entropy distance. of all the probability distributions, we
consider only the ones that are "relevant", in that they are close to
the current one; and yet, even though we take conservative steps, we
are guaranteed to maximize the minimum improvement in likelihood.
indeed, the maximin is a conservative attitude. the rationale is the
following: "you want to maximize L. you know the steepest ascent
direction; you want to go in that direction, but you do not know how
far to go. BF will tell you how far you can go (and it will not be an
infinitesimal step) so that you maximize the minimum improvement."

another way to look at this is that the Euclidean distance imposes a
structure (topology) on the space of probability distributions. the
cross entropy distance imposes a different structure which,
apparently, is more relevant to the problem.

in contrast, in BP we do not have much choice in the change we bring
to q. we have control over w, the weights of the connections, and we
usually change them in the steepest descent direction, by a step small
enough that we actually get an improvement. but it is not clear that
the cross entropy between distributions imposes a suitable structure
on the space of weights; apparently it does not. even a relatively
small step in the weight space can change the cost function by a lot.
we have to tread more carefully.
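
for contrast with the BF step, here is a minimal sketch of the
BP-style update; grad_E is a hypothetical callable returning the
gradient of the cost with respect to the weights:

def bp_step(w, grad_E, eta=0.01):
    """one steepest-descent update of the weight vector w. the step
    size eta is fixed in advance and must be kept small by hand;
    unlike BF, nothing here bounds how much the cost can change for
    a step of a given Euclidean length."""
    return w - eta * grad_E(w)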

of course, BF can be used only because of the very special structure
of the HMM problem (which is probably a good argument for the
usefulness of the HM Model). BF is applicable when the cost function
is a homogeneous polynomial with additive constraints on the variables
(see [Baum 1968]). the CN problem is characterized by harder
nonlinearities (e.g. the sigmoid function), which induce a warped
relationship between the weights and the cost function.
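
the growth transformation that BF instantiates can be stated in a few
lines; a minimal sketch, where grad_P is a hypothetical callable for
the gradient of the polynomial cost and q is stored as a matrix whose
rows carry the add-up-to-one constraints:

import numpy as np

def growth_step(q, grad_P):
    """one Baum-Eagon growth transformation (see [Baum 1968]) for a
    homogeneous polynomial P with nonnegative coefficients:
        q'_ij = q_ij dP/dq_ij / sum_j q_ij dP/dq_ij.
    the update guarantees P(q') >= P(q), and each row of q' remains
    a probability vector."""
    g = q * grad_P(q)                        # elementwise q_ij * dP/dq_ij
    return g / g.sum(axis=1, keepdims=True)  # renormalize each row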

in short, the CN problem is more general and harder.

5. square error cost function
-----------------------------

first, a general observation: the square error cost function can be
introduced under two assumptions. in the one case, we assume the error
to be deterministic, and we want to minimize a deterministic sum of
square errors (the sum is over all training patterns; the error is the
difference between desired and actual response) by appropriate choice
of weights; there is nothing probabilistic here. alternatively, we can
assume that the training patterns are selected randomly (according to
some probability density), that the test patterns will come from the
same probability density, and that we choose the weights to minimize
the expected square error.

even though the two points of view are distinct, they are not that
different, since in both cases we can define inner products, distance
functions etc., and so get a Hilbert space structure that is
practically the same in both cases. of course, this would involve some
ergodicity assumption.

at any rate, assume here the probabilistic point of view of square
error. what, then, are the connections between the two cost functions,
cross entropy and expected (or mean) square error?

i have seen some remarks on this problem in the literature, but i do
not know enough about it at this point. however, judging from training
times, i would say that the nonlinear nature of CN with sigmoids again
maps the weight space to the cost function in a very warped way. it
would be interesting to examine the shape of the cost function
contours in the weight space. have such studies been made?
visualization seems to be a problem for high dimensional networks.
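
lacking such a study, one can at least probe a 1-D slice of the cost
surface; a minimal sketch for a single sigmoid unit (the data here are
illustrative, not from any experiment):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, -1.0])   # inputs of three training patterns
t = np.array([1.0, 0.0, 0.0])    # desired responses

ws = np.linspace(-10.0, 10.0, 201)          # a 1-D slice of weight space
sq = [np.sum((t - sigmoid(w * x)) ** 2) for w in ws]
ce = [-np.sum(t * np.log(sigmoid(w * x)) +
              (1 - t) * np.log(1.0 - sigmoid(w * x))) for w in ws]
# the square-error slice flattens where the sigmoids saturate, while
# the cross-entropy slice keeps a usable slope: one view of the
# "warped" weight-to-cost mapping discussed above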

6. cross entropy maximization and some loose ends
------------------------------------------------
an interesting variation is G maximization. this usually occurs in
unsupervised learning; see [Linsker], [Plumbley]. it appears under the
name of transinformation maximization, or error information
minimization, but these quantities can be interpreted as the cross
entropy between the joint input-output probability density induced by
the CN (for given weights) and the probability density where input
and output have the same marginals but are independent (so the joint
density is the product of the two marginals). i guess a way to explain
this in terms of cross entropy is: even though we have no prior
information on the best input-output density, there is one density we
certainly want to avoid as much as possible, and this is the one where
input and output are independent (so the input gives no information as
to what the output is). hence we want to maximize the cross entropy
distance between this product distribution and the CN-induced
distribution. there is also a possible interpretation along the lines
of the maximum entropy principle.
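
in code, this interpretation is one line away from the G of eq. (5);
a minimal sketch for a discrete joint input-output density given as a
2-D array:

import numpy as np

def transinformation(p_xy):
    """mutual information between input and output, computed as the
    cross entropy G between the joint density p_xy and the product
    of its marginals (i.e. the density with input and output
    independent). p_xy[i, j] = P(input=i, output=j)."""
    p_x = p_xy.sum(axis=1, keepdims=True)    # input marginal
    p_y = p_xy.sum(axis=0, keepdims=True)    # output marginal
    mask = p_xy > 0                          # skip zero-probability cells
    pxpy = (p_x * p_y)[mask]
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / pxpy)))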

i must say that these interpretations do not seem (yet) to me as
appealing as maximum transinformation. however, they are possible, and
indeed statisticians have been considering them for many years now.

another interesting connection is between cross entropy and rate of
convergence (obviously, rate of convergence is connected to training
time). [Ellis] gives an excellent analysis of the connection between
rate of convergence and cross entropy. application of his results to
computational problems is not obvious.

finally, an interesting example (of statistical work that relates to
this line of connectionist research) is [Rissanen]; there the linear
regression model is considered, which of course can be interpreted as
a linear perceptron. in [Rissanen], selection of the optimal model is
based on a minmax entropy criterion.

References:
-----------

D.H. Ackley et al.:  "A Learning Algorithm for Boltzmann Machines",
                     Cognitive Science, Vol. 9, 1985.

L.E. Baum &          "Growth Transformations for Functions on
G.R. Sell:           Manifolds", Pacific Journal of Mathematics,
                     Vol. 27, No. 2, 1968.

L.E. Baum et al.:    "A Maximization Technique Occurring in the
                     Statistical Analysis of Probabilistic Functions
                     of Markov Chains", The Annals of Mathematical
                     Statistics, Vol. 41, No. 1, 1970.

A.P. Dempster        "Maximum Likelihood from Incomplete Data via the
et al.:              EM Algorithm", Journal of the Royal Statistical
                     Society, Series B, Vol. 39, No. 1, 1977.

R. Ellis:            "Entropy, Large Deviations and Statistical
                     Mechanics", Springer, New York, 1985.

G. Hinton:           "Connectionist Learning Procedures", Technical
                     Report CMU-CS-87-115, Carnegie Mellon University,
                     June 1987.

A. Kehagias:         "Optimal Control for Training: The Missing Link
                     between HMM and Connectionist Networks",
                     submitted to the 7th Int. Conf. on Mathematical
                     and Computer Modelling, Chicago, Illinois,
                     August 1989.

S.Y. Kung &          "A Unifying Viewpoint of Multilayer Perceptrons
J.N. Hwang:          and Hidden Markov Models", IEEE Int. Symposium on
                     Circuits and Systems, Portland, Oregon, 1989.

S.E. Levinson        "An Introduction to the Application of the Theory
et al.:              of Probabilistic Functions of a Markov Process to
                     Automatic Speech Recognition", The Bell System
                     Technical Journal, Vol. 62, No. 4, April 1983.

R. Linsker:          "Self-Organization in a Perceptual Network",
                     IEEE Computer, Vol. 21, No. 3, March 1988.

M. Plumbley &        "An Information Theoretic Approach to Unsupervised
F. Fallside:         Connectionist Models", Proceedings of the 1988
                     Connectionist Models Summer School, Pittsburgh,
                     1988.

T. Robinson:         personal communication.

