training time in HMM and CN

Tue Mar 28 09:41:36 EST 1989

Two comments on Thanasis' post on the relative training speed of HMM vs CN
for sequential problems such as speech recognition:

1. The BF algorithm is highly optimized, while vanilla BP doesn't
   implement anything that a numerical analyst would consider a real
   descent procedure (not even steepest descent). If you were to use a
   reasonably powerful numerical optimization technique, such as one of
   the Broyden methods, you might find CN convergence extremely fast. Ray
   Watrous has in fact shown this sort of speedup for speech problems [1].

2. A more subtle, but probably more important, difference is how targets
   are specified over an input sequence. The BF algorithm specifies
   targets for intermediate steps in an input sequence based on
   expectations of the final outcome of that sequence, collected from
   many similar sequences. It is not clear how to specify output targets
   for intermediate points of an input sequence in a CN, although Watrous
   has shown that an intelligent choice of such targets can markedly
   improve CN convergence and performance. Of interest in this regard is
   the work by Sutton on Temporal Difference (TD) methods [2]. One can
   view this work as specifying a target function over a sequence
   dynamically, so that the target function reflects the experience of
   the system to date. Sutton [2] has shown an equivalence between one
   form of linear TD method and the maximum likelihood estimates of the
   parameters for an absorbing Markov chain model of the same process.
   This seems much closer in flavour to what the BF algorithm is doing,
   and when applied to a non-linear system may in fact be an interesting
   generalization of BF.
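
To make comment 1 concrete, here is a minimal sketch comparing a fixed-step
gradient update (the vanilla BP style of weight change) with BFGS, one member
of the Broyden family. Everything in it is an assumption made for
illustration: the Rosenbrock function stands in for a network error surface,
and the names loss, grad, fixed_step_descent and bfgs are hypothetical. It is
not the procedure used in [1].

```python
import math

def loss(w):
    # Rosenbrock function: an ill-conditioned test surface used here as a
    # stand-in for a network error surface (an assumption, not a real CN).
    x, y = w
    return (1 - x) ** 2 + 100 * (y - x * x) ** 2

def grad(w):
    # Analytic gradient of the Rosenbrock function.
    x, y = w
    return [-2 * (1 - x) - 400 * x * (y - x * x), 200 * (y - x * x)]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def fixed_step_descent(w, rate=1e-3, tol=1e-6, max_iter=100000):
    # Vanilla-BP-style update: a fixed-size step along the negative
    # gradient, with no line search and no curvature information.
    for it in range(max_iter):
        g = grad(w)
        if norm(g) < tol:
            return w, it
        w = [wi - rate * gi for wi, gi in zip(w, g)]
    return w, max_iter

def bfgs(w, tol=1e-6, max_iter=1000):
    # BFGS quasi-Newton method with a backtracking (Armijo) line search.
    n = len(w)
    # H is the running estimate of the inverse Hessian, started at identity.
    H = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    g = grad(w)
    for it in range(max_iter):
        if norm(g) < tol:
            return w, it
        d = [-sum(H[i][j] * g[j] for j in range(n)) for i in range(n)]
        slope = sum(di * gi for di, gi in zip(d, g))
        step, f0 = 1.0, loss(w)
        # Halve the step until the sufficient-decrease condition holds.
        while step > 1e-12 and loss(
            [wi + step * di for wi, di in zip(w, d)]
        ) > f0 + 1e-4 * step * slope:
            step *= 0.5
        w_new = [wi + step * di for wi, di in zip(w, d)]
        g_new = grad(w_new)
        s = [a - b for a, b in zip(w_new, w)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = sum(si * yi for si, yi in zip(s, y))
        if sy > 1e-12:  # update only when the curvature condition holds
            Hy = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]
            yHy = sum(yi * hi for yi, hi in zip(y, Hy))
            for i in range(n):
                for j in range(n):
                    H[i][j] += ((1 + yHy / sy) * s[i] * s[j]
                                - Hy[i] * s[j] - s[i] * Hy[j]) / sy
        w, g = w_new, g_new
    return w, max_iter

if __name__ == "__main__":
    start = [-1.2, 1.0]
    print("fixed step:", fixed_step_descent(start))
    print("BFGS:      ", bfgs(start))
```

Each function returns the final point and the number of iterations used, so
the two iteration counts can be compared directly. How much of any gap
carries over to a real network depends on the error surface and on the cost
of each line search, which this toy surface does not capture.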
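
The idea in comment 2 of a target that is specified dynamically over a
sequence can be sketched with a tabular TD(0) learner on a small absorbing
Markov chain. This is only an illustration under stated assumptions: the
five-state random walk, the learning rate, and the names random_walk and td0
are hypothetical choices, not taken from [2] verbatim, and the table is the
simplest linear case (one weight per state).

```python
import random

def random_walk(rng, n_states=5):
    # One sequence through an absorbing chain: start in the middle state,
    # step left or right with equal probability, and stop on leaving either
    # end. The final outcome is 1.0 for the right end and 0.0 for the left.
    s = n_states // 2
    visited = []
    while 0 <= s < n_states:
        visited.append(s)
        s += rng.choice((-1, 1))
    return visited, (1.0 if s == n_states else 0.0)

def td0(episodes, n_states=5, rate=0.02):
    # v[s] is the current prediction of the final outcome from state s.
    v = [0.5] * n_states
    for visited, outcome in episodes:
        for t, s in enumerate(visited):
            # The target at an intermediate step is the prediction made at
            # the next step; only the last step sees the actual outcome.
            if t + 1 < len(visited):
                target = v[visited[t + 1]]
            else:
                target = outcome
            v[s] += rate * (target - v[s])
    return v

if __name__ == "__main__":
    rng = random.Random(0)
    episodes = [random_walk(rng) for _ in range(5000)]
    print("TD(0) predictions:", td0(episodes))
    print("true probabilities:", [(i + 1) / 6 for i in range(5)])
```

No target is ever supplied for an intermediate state. Each one is built from
the system's own later predictions, so the learned values summarize the
outcomes of many similar sequences. For this chain the true probability of
ending at the right-hand side from state i (counting from 0) is (i + 1) / 6,
which gives something to compare the predictions against.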

Please direct comments and requests for clarification to me, not to
Connectionists.

 	- Steve Nowlan
 	  nowlan at ai.toronto.edu

References:

 [1]  Watrous, Raymond L. "Speech Recognition Using Connectionist Networks",
      TR MS-CIS-88-96, Department of Computer and Information Science,
      University of Pennsylvania, Philadelphia, 1988.

 [2]  Sutton, Richard S. "Learning to Predict by the Methods of Temporal
      Differences", GTE Technical Report TR87-509.1, GTE Laboratories Inc.,
      Waltham, Mass., 1987.