Shift Invariance

omlinc@cs.rpi.edu
Tue Mar 12 09:54:08 EST 1996



In his message <9602281000.ZM15421@ICSI.Berkeley.edu>, Jerry Feldman
<jfeldman@ICSI.Berkeley.EDU> wrote:

>3) Shift invariance in time and recurrent networks.
>
> I threw in some (even more cryptic) comments on this anticipating that some
>readers would morph the original task into this form. The 0*1010* problem is
>an easy one for FSA induction and many simple techniques might work for this.
>But consider a task that is only slightly more general, and much more natural.
>Suppose the task is to learn any FSL from the class b*pb* where b and p are
>fixed for each case and might overlap. Any learning technique that just
>tried to predict (the probability of) successors will fail because there
>are three distinct regimes and the learning algorithm needs to learn this.
>I don't have a way to characterize all recurrent net learning algorithms to
>show that they can't do this and it will be interesting to see if one can.
>There are a variety of non-connectionist FSA induction methods that can
>effectively learn such languages, but they all depend on some overall measure
>of simplicity of the machine and its fit to the data - and are thus non-local.
>

This isn't really correct. First, any DFA can be represented in a
recurrent neural network with sigmoidal discriminant functions, i.e.
a network can be constructed such that the languages recognized by the DFA
and its network implementation are identical (this implies stability of the
internal DFA state representation for strings of arbitrary length) [1,2].
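
To make the encoding idea concrete, here is a minimal sketch roughly in the
spirit of the constructions in [1,2] (the exact weight values and stability
bounds are given in those papers): the DFA transitions are programmed into
second-order weights of strength +H/-H over one-hot state neurons, and for
sufficiently large H the sigmoid keeps the state representation close to the
one-hot corners. The example language, the value of H, and the acceptance
test are illustrative choices of mine, not the published construction.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# DFA for 1*01* (the `simplest' b*pb* language discussed below):
# state 0 = no 0 seen yet, state 1 = exactly one 0 seen (accepting), state 2 = dead.
ALPHABET = "01"
DELTA = {(0, '1'): 0, (0, '0'): 1,
         (1, '1'): 1, (1, '0'): 2,
         (2, '1'): 2, (2, '0'): 2}
ACCEPT = {1}
N = 3                  # one state neuron per DFA state
H = 8.0                # weight strength; large H keeps the state representation near-binary

# Second-order weights W[j, i, k]: +H if the DFA maps (state i, symbol k) -> state j,
# -H otherwise; a negative bias pushes all non-target neurons towards 0.
W = -H * np.ones((N, N, len(ALPHABET)))
for (i, a), j in DELTA.items():
    W[j, i, ALPHABET.index(a)] = +H
BIAS = -H / 2.0

def accepts(string):
    S = np.zeros(N)
    S[0] = 1.0                         # start in DFA state 0
    for c in string:
        I = np.zeros(len(ALPHABET))
        I[ALPHABET.index(c)] = 1.0
        S = sigmoid(np.einsum('jik,i,k->j', W, S, I) + BIAS)
    return max(S[q] for q in ACCEPT) > 0.5

for s in ["0", "1101", "10111", "", "00", "1001"]:
    print(repr(s), accepts(s))

On these test strings the encoded network agrees with the DFA; the point of
[1,2] is that this agreement provably holds for strings of arbitrary length.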

As far as learning DFAs with recurrent networks is concerned: in my experience,
the success or failure of a network in learning a particular grammar depends on the
size of the DFA, its complexity (simple self-loops as opposed to orbits of arbitrary
length), the training data, and the order in which the training data is presented.
For instance, we found that incremental learning, where the network is first
trained on the shortest strings [8], is often crucial to successful convergence,
since it is a means of overcoming the problem of learning long-term dependencies
with gradient descent [4] (for methods for overcoming that problem see [5,6,7]).
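
As an illustration of this shortest-strings-first presentation (in the spirit
of [8]), the sketch below builds a sequence of growing training sets ordered by
string length. The staging schedule, the length cutoff, and the example
language are assumptions made purely for illustration.

import itertools
import re

def strings_up_to(max_len, alphabet="01"):
    # all strings over the alphabet up to max_len, shortest first
    for L in range(max_len + 1):
        for tup in itertools.product(alphabet, repeat=L):
            yield "".join(tup)

def incremental_stages(max_len, label):
    # growing training sets: stage k holds all labelled strings of length <= k
    for k in range(1, max_len + 1):
        yield [(s, label(s)) for s in strings_up_to(k)]

# example: stages for the language 1*01* (label = membership)
in_lang = lambda s: re.fullmatch(r"1*01*", s) is not None
for k, stage in enumerate(incremental_stages(6, in_lang), start=1):
    print(f"stage {k}: {len(stage)} strings, {sum(y for _, y in stage)} positive")

A network is then trained on stage 1 first; longer strings are added only
after the shorter ones have been learned.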

The `simplest' language of the form b*pb* might be 1*01*. A network with
second-order weights and a single recurrent state neuron can learn that language
within 100 epochs when trained on the first 100 strings in alphabetical order.
Furthermore, the ideal DFA can also be extracted from the trained network [3].
See for example [9,10,11,12] for other extraction approaches.
For the language 1*011*, which is of the form b*pb* (notice the overlap of b and p),
a second-order network with 3 recurrent state neurons easily converges within
200 epochs, and the ideal DFA can be extracted as well.
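
For concreteness, here is a minimal sketch of the kind of setup described
above: a second-order recurrent network trained on-line with RTRL-style
gradients on strings presented shortest-first. The number of state neurons,
the choice of response neuron, the initial-state convention, the learning
rate, the length cutoff, and the epoch count are all assumptions for
illustration; whether a particular run converges as quickly as reported above
depends on these choices and on the weight initialization.

import itertools
import re

import numpy as np

rng = np.random.default_rng(0)

ALPHABET = "01"
N_STATE = 3                      # recurrent state neurons (cf. the 1*011* experiment)
N_SYM = len(ALPHABET)

def in_language(s):
    # membership test for the example language 1*011*
    return re.fullmatch(r"1*011*", s) is not None

def one_hot(c):
    v = np.zeros(N_SYM)
    v[ALPHABET.index(c)] = 1.0
    return v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Second-order weights W[j, i, k]: state neuron j <- (state neuron i, symbol k).
# State index N_STATE is an extra bias unit clamped to 1.
W = rng.normal(scale=0.5, size=(N_STATE, N_STATE + 1, N_SYM))

def run(string, W, collect_grad=False):
    # Forward pass; optionally the RTRL sensitivities P[j, a, b, c] = dS_j/dW_abc.
    S = np.zeros(N_STATE + 1)
    S[0] = 1.0                   # response/start neuron initially on (assumption)
    S[-1] = 1.0                  # bias unit, always 1
    P = np.zeros((N_STATE + 1,) + W.shape)
    for c in string:
        I = one_hot(c)
        net = np.einsum('jik,i,k->j', W, S, I)
        S_new = np.zeros_like(S)
        S_new[:N_STATE] = sigmoid(net)
        S_new[-1] = 1.0
        if collect_grad:
            gprime = S_new[:N_STATE] * (1.0 - S_new[:N_STATE])
            direct = np.outer(S, I)                          # S_b(t) * I_c(t)
            recur = np.einsum('jik,k,iabc->jabc', W, I, P)   # recurrent RTRL term
            P_new = np.zeros_like(P)
            for j in range(N_STATE):
                P_new[j] = gprime[j] * recur[j]
                P_new[j, j] += gprime[j] * direct
            P = P_new
        S = S_new
    return S, P

# Training data: all strings up to length 7, shortest strings first.
data = []
for L in range(1, 8):
    for tup in itertools.product(ALPHABET, repeat=L):
        s = "".join(tup)
        data.append((s, 1.0 if in_language(s) else 0.0))

lr = 0.3
for epoch in range(200):
    total_err = 0.0
    for s, y in data:
        S, P = run(s, W, collect_grad=True)
        out = S[0]                        # response neuron after the last symbol
        total_err += 0.5 * (out - y) ** 2
        W -= lr * (out - y) * P[0]        # on-line gradient step
    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}   total error {total_err:.3f}")

correct = sum((run(s, W)[0][0] > 0.5) == (y > 0.5) for s, y in data)
print(f"{correct}/{len(data)} training strings classified correctly")

A DFA can then be extracted from the trained network, e.g. by quantizing or
clustering the state neuron activations and following the induced transitions,
along the lines of the extraction methods in [3,9,10,11,12].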

So, here are at least two examples which contradict the claim that 

        "Any learning technique that just tried to predict (the probability of)
         successors will fail because there are three distinct regimes and the
         learning algorithm needs to learn this. I don't have a way to characterize
         all recurrent net learning algorithms to show that they can't do this and
         it will be interesting to see if one can."


Christian


-------------------------------------------------------------------
Christian W. Omlin, Ph.D.       Phone (609) 951-2691
NEC Research Institute          Fax:  (609) 951-2438
4 Independence Way              E-mail: omlinc at research.nj.nec.com
Princeton, NJ 08540                     omlinc at cs.rpi.edu
URL: http://www.neci.nj.nec.com/homepages/omlin/omlin.html
-------------------------------------------------------------------


=================================== Bibliography =======================================


[1] P. Frasconi, M. Gori, M. Maggini, G. Soda,
    "Representation of Finite State Automata in Recurrent Radial Basis Function Networks",
    Machine Learning, to be published, 1996.

[2] C.W. Omlin, C.L. Giles,
    "Stable Encoding of Large Finite-State Automata in Recurrent Neural Networks with
    Sigmoid Discriminants", Neural Computation, to be published, 1996. 

[3] C.W. Omlin, C.L. Giles, 
    "Extraction of Rules from Discrete-Time Recurrent Neural Networks",
    Neural Networks, Vol. 9, No. 1, p. 41-52, 1996.

[4] Y. Bengio, P. Simard, P. Frasconi,
    "Learning Long-Term Dependencies with Gradient Descent is Difficult",
    IEEE Transactions on Neural Networks (Special Issue on Recurrent Neural Networks),
    Vol. 5, p. 157-166, 1994.

[5] T. Lin, B.G. Horne, P. Tino, C.L. Giles,
    "Learning Long-Term Dependencies with NARX Recurrent Neural Networks",
    IEEE Transactions on Neural Networks, accepted for publication.

[6] S. El Hihi, Y. Bengio,
    "Hierarchical Recurrent Neural Networks for Long-Term Dependencies",
    Advances in Neural Information Processing Systems 8, MIT Press, 1996.

[7] S. Hochreiter, J. Schmidhuber,
    "Long Short Term Memory", Technical Report, Institut fuer Informatik,
    Technische Universitaet Muenchen, FKI-207-95, 1995.

[8] J.L. Elman,
    "Incremental Learning, or the Importance of Starting Small"
    Technical Report, Center for Research in Language, University of
    California at San Diego, CRL Tech Report 9101, 1991.

[9] S. Das, M.C. Mozer,
    "A Unified Gradient-descent/Clustering Architecture for Finite State 
    for Finite State Machine Induction",
    Advances in Neural Information Processing Systems 6, 
    J.D. Cowan , G. Tesauro, J. Alspector (Eds.), p. 19-26, 1994. 

[10] M.P. Casey,
     "Computation in Discrete-Time Dynamical Systems",
     Ph.D. Thesis, Department of Mathematics, University of California, 
     San Diego, 1995.
 
[11] P. Tino, J. Sajda,
     "Learning  and Extracting  Initial Mealy  Machines With  a Modular
     Neural Network Model}",
     Neural Computation, Vol. 7, No. 4, p. 822-844, 1995.

[12] R.L. Watrous, G.M. Kuhn,
     "Induction of Finite-State Languages Using Second-Order Recurrent Networks",
     Neural Computation, Vol. 4, No. 5, p. 406, 1992.




