Sequential learning.

Jonathan Baxter jon at maths.flinders.edu.au
Mon Jan 9 13:07:21 EST 1995


Danny Silver writes:
>
>For me the significance of interference in neurally inspired learning systems
>is the message that an effective learner must not only be capable
>of learning a single task from a set of examples but must also be
>capable of effectively integrating variant task knowledge at a meta-
>level.  This falls in line with McClelland's recent papers on consolidation
>of hippocampal memories into cortical regions; his "interleaved learning".
>This is a delicate and complex process which undoubtedly occurs during sleep.

>In tune with Sebastian Thrun and Tom Mitchell's efforts on "Life Long
>Learning" I feel the next great step in learning theory will be the discovery
>of methods which allow our machine learning algorithms to take advantage of
>previously acquired task knowledge.

I could not agree more. And with all modesty, the 'next great step' has 
already begun with the work in my recently completed PhD thesis, entitled 
'Learning Internal Representations'. The thesis can be retrieved via 
anonymous ftp from the neuroprose archive (Thesis subdirectory)--
baxter.thesis.ps.Z (112 pages)

In the thesis I examine in detail one important method of enabling machine learning
algorithms to take advantage of previously acquired task knowledge, namely
learning an internal representation. The idea is that for many common machine
learning problems (such as character and speech recognition) there exists a
transformation from the input space of the problem (the space of all images of
characters, or the space of speech signals) into some other space that makes the
learning problem much easier. For example, in character recognition, if a
map from the input space can be found that is insensitive to rotations,
dilations and even writer-dependent distortions of the characters, and such
a map is used to 'preprocess' the input data, then the learning problem becomes
quite trivial: the learner only needs to see one positive example of each
character to be able to classify all future characters perfectly. I argue
in the thesis that the information required to learn such a representation
cannot in general be contained in a single task: many learning tasks are
required to learn a good representation. 
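
To make the 'one positive example per character' point concrete, here is a
toy sketch (purely illustrative; the map phi below, the flatten stand-in for
it, and every name in it are my own hypothetical choices, not anything from
the thesis). If phi really were insensitive to the nuisance transformations,
a nearest-prototype classifier storing a single example per class would do:

import numpy as np

def phi(image):
    # Stand-in for a learned internal representation; here just a flatten.
    return np.asarray(image, dtype=float).reshape(-1)

class OnePrototypeClassifier:
    """Nearest-prototype classifier in representation space."""

    def __init__(self, representation):
        self.phi = representation
        self.prototypes = {}   # label -> representation of the single example

    def fit(self, one_example_per_label):
        # One positive example per class suffices if phi is truly invariant.
        for label, image in one_example_per_label.items():
            self.prototypes[label] = self.phi(image)

    def predict(self, image):
        z = self.phi(image)
        return min(self.prototypes,
                   key=lambda label: np.linalg.norm(z - self.prototypes[label]))

# Hypothetical usage:
#   clf = OnePrototypeClassifier(phi)
#   clf.fit({'a': img_a, 'b': img_b})
#   clf.predict(new_distorted_img)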

Thus, the idea is to sample from many similar learning tasks, first to learn
a representation for a particular learning domain, and then to use that 
representation to learn future tasks from the same domain. Examples of similar
tasks in the character recognition domain are classifiers for individual
characters (including characters from other alphabets); in the speech
recognition domain, individual word classifiers constitute the similar tasks. 

It is proven in chapter three of the thesis that for suitable learning domains
(of which speech and character recognition should be two examples), the
number of examples of each task required for good generalisation decreases 
linearly with the number of tasks being learnt, and that once a representation
has been learnt for the learning domain, far fewer examples of any novel
task will be required for good generalisation. In fact, depending on the domain,
there is no limit to the speedup in learning that can be achieved by first
learning an internal representation.
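
To give a feel for the form of these results (this is only a schematic
version; the precise statement, constants, log factors and capacity
definitions are in chapter three, and the symbols below are illustrative
rather than the thesis's notation), the bounds say roughly that

    m = O\left( \frac{1}{\varepsilon} \left( a + \frac{b}{n} \right) \right)

examples per task suffice, where m is the number of examples of each task,
n is the number of tasks being learnt together, \varepsilon is the target
generalisation error, b measures the capacity of the class of representations
(a cost shared across all n tasks) and a measures the capacity of the
task-specific output functions. As n grows the shared b/n term washes out,
and once the representation is fixed a novel task costs only the
O(a/\varepsilon) term.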

There are two levels at which representation learning can be viewed as 
applying to human learning. At the bottom level, we can assume that 
the tasks our evolutionary ancestors had to learn in order to survive
have resulted in humans being born with built-in representations that are
useful for learning the kinds of tasks necessary for survival. An example of 
this is the edge-detection processing that takes place early in the visual
pathway; among other things, this should be useful for identifying the 
boundaries of surfaces in our environment and hence provides a big boost
to the process of learning not to bump into those surfaces. At a higher
level, it is clear that we build representations on top of these lower-level
representations during our lifetimes. For example, I grew up surrounded by
predominantly caucasian faces and hence learnt a representation that allows
me to learn individual caucasian faces quickly (in fact from one example in 
most cases). However, when I was first presented with images of negro faces
I was far less able to distinguish them, although I am more able now. Thus
I have had to re-learn my 'face recognition' representation to accommodate 
learning of negro faces. 

In chapter four of the thesis I show how gradient descent may be used to 
learn internal representations, and present several experiments supporting
the theoretical conclusions that learning more tasks from a domain reduces
the number of examples required per task, and that once an effective
representation has been learnt, the number of examples required for future
tasks is greatly reduced. 
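
For concreteness, here is a toy sketch of this style of training (it is not
the thesis's experimental setup; the network sizes, the tanh and squared-error
choices and every variable name below are my own illustrative assumptions).
A single hidden layer W is shared by all tasks, each task t has its own output
weights V[t], and both are trained together by gradient descent on the summed
error; at the end, a 'novel' task is fitted from only a handful of examples on
top of the frozen representation:

import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_examples, d_in, d_rep = 4, 50, 10, 5

# Synthetic 'domain': every task is a different linear function of the same
# low-dimensional feature of the input, so a shared representation is useful.
true_F = rng.normal(size=(d_in, d_rep)) / np.sqrt(d_in)
true_v = rng.normal(size=(n_tasks, d_rep))
X = rng.normal(size=(n_tasks, n_examples, d_in))
Y = np.einsum('tnd,dr,tr->tn', X, true_F, true_v)

W = rng.normal(scale=0.1, size=(d_in, d_rep))      # shared representation
V = rng.normal(scale=0.1, size=(n_tasks, d_rep))   # per-task output weights
lr = 0.05

for step in range(5000):
    H = np.tanh(X @ W)                             # (tasks, examples, d_rep)
    pred = np.einsum('tnr,tr->tn', H, V)
    err = pred - Y
    grad_V = np.einsum('tn,tnr->tr', err, H) / n_examples
    dH = err[..., None] * V[:, None, :] * (1.0 - H ** 2)
    grad_W = np.einsum('tnd,tnr->dr', X, dH) / (n_tasks * n_examples)
    V -= lr * grad_V
    W -= lr * grad_W

print('training mean squared error:', np.mean(err ** 2))

# Once the representation W is learnt, a novel task from the same domain can
# be learnt from very few examples by fitting only its output weights.
v_new = rng.normal(size=d_rep)
X_new = rng.normal(size=(8, d_in))                 # just 8 examples
y_new = X_new @ true_F @ v_new
w_out, *_ = np.linalg.lstsq(np.tanh(X_new @ W), y_new, rcond=None)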

It also turns out that the ideas involved in representation learning can 
be used to solve an old problem in vector quantization: namely, how to
choose an appropriate distortion measure for the quantization process.
This is discussed in chapter five, in which the canonical distortion measure
is defined and shown to be optimal in a very general sense. It is also shown
how a distortion measure may be learnt using the representation learning
techniques introduced in the previous chapters. 
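
Roughly speaking (this is a paraphrase, not the thesis's exact notation), the
canonical distortion between two inputs is the expected disagreement between
the tasks' outputs on those inputs, averaged over the environment of tasks:

    \rho(x, x') = \int L\big( f(x), f(x') \big) \, dQ(f)

where Q is the distribution over tasks f in the environment and L is the loss
used to compare their outputs. Two inputs count as 'close' exactly when the
tasks one cares about treat them the same way, which is what makes this the
natural measure to quantize, or build a representation, with respect to.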

In the final chapter, the ideas of chapter five are applied back to the
problem of representation learning to yield an improved error measure
for the representation learning process, and experiments are presented 
demonstrating the improvement.

Although learning an internal representation is only one way of enabling
information from a body of tasks to be used when learning a new task, I
believe it is the one employed extensively by our brains. Hence the work
in this thesis should provide an appropriate theoretical framework in which
to address problems of sequential learning in humans, as well as a practical
framework and set of techniques for tackling artificial learning problems
for which there exists a body of similar tasks. However, it 
is likely that other methods are at play in human sequential 
learning and may also be useful in artificial learning, so at the end
of chapter three of the thesis I present a general theoretical
framework for tackling any kind of learning problem for which 
prior information is available in the form of a body of similar
learning tasks.

Jonathan Baxter
Department of Mathematics and Statistics,
The Flinders University of South Australia.
jon at maths.flinders.edu.au


