sequential learning - lifelong learning

Sebastian Thrun thrun at uran.informatik.uni-bonn.de
Fri Jan 13 15:33:11 EST 1995




The recent discussions on sequential learning brought up some very
interesting points about learning, which I'd like to comment on.

Much of current machine learning and neural network learning research
makes the assumption that the only available data is a set of
input-output examples of the target function (or, in the case of
unsupervised learning, a set of unlabeled points which characterize an
unknown probability distribution). There is a huge variety of
algorithms (Backprop, ID3, MARS, Cascade Correlation to name a few
famous ones) which all generalize from such data in somewhat different
ways.  Despite the exciting progress in understanding these approaches
in more depth and coming up with better algorithms (like the work on
complexity control, avoiding the over-fitting of noise, model
selection, mixtures of experts, committees and related issues), I think
there are intrinsic limitations to the view of the learning problem as
an isolated function fitting problem, where all the available data
consists of a set of examples of the target function.

If we consider human learning, there is usually much more data
available for generalization than just a task-specific set of
input-output data.  As Jon Baxter's face recognition example
convincingly illustrates, we often learn to recognize highly complex
patterns or complex motor strategies from an impressively small number
of training examples.  Humans somehow successfully manage to transfer
big chunks of knowledge across learning tasks.  If we face a new
learning task, much of the "training data" which we use for
generalization actually stems from other tasks, which we might have
faced in our previous lifetime. Consider for example Jon's task of
recognizing faces. Once one has learned that the shape of the nose
matters for identifying a person but facial expressions do not, one can
transfer this knowledge to new faces and generalize much more
accurately from fewer training examples.

To apply these ideas in the context of artificial neural network
learning, one might think of learning as a lifelong assignment, in
which a learner faces a whole collection of learning tasks over its
entire "lifetime." Hence, what has been observed and/or learned in the
first n tasks can be reused in the (n+1)st task. There is a lot of
potential leverage in such a scenario. For example, in a recent study,
Tom Mitchell and I investigated the problem of learning to recognize
simple objects from a very small number of camera images using
Backpropagation. We found that after seeing as few as one example of
each target object, the recognition rate increased from 50% (random) to
59.7%. However, by learning invariances up front based on images of
*other* objects, and by transferring these learned invariances to the
target recognition task, we achieved a recognition rate of 74.8%. After
seeing a second training example of each target object, the standard
neural network approach led to 64.8% accuracy, which could be improved
to 82.9% if knowledge about the invariances was transferred. These
results match our experience in other domains (robot control,
reinforcement learning, robot perception).
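To give a concrete flavor of this kind of transfer, here is a small self-contained sketch. It is a toy of my own, not the actual Backpropagation setup of the study above: per-feature relevance weights (a Fisher-style score; the feature layout and all numbers are illustrative assumptions) are estimated from examples of *other* objects, and then reused by a weighted nearest-neighbour classifier that sees only one example of each new object.

```python
import numpy as np

def relevance_weights(X, y, eps=1e-6):
    """Learned from support tasks (other objects): variance of the
    class means divided by the mean within-class variance. Features
    that are invariant within an object but differ between objects
    (like nose shape) get large weights; features that vary freely
    within an object (like expression) get small ones."""
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    within = np.array([X[y == c].var(axis=0) for c in classes]).mean(axis=0)
    between = means.var(axis=0)
    return between / (within + eps)

def predict_1nn(x, prototypes, labels, w):
    """Weighted 1-nearest-neighbour with one stored example per class."""
    d = [np.sum(w * (x - p) ** 2) for p in prototypes]
    return labels[int(np.argmin(d))]

# Support data: two *other* objects. Dims 0-2 identify the object;
# dims 3-5 vary freely within an object (like facial expression).
X_sup = np.array([
    [1, 1, 1, 4.5, 0.5, 2.5],
    [1, 1, 1, 0.5, 4.0, 1.0],
    [0, 0, 0, 4.5, 1.0, 3.0],
    [0, 0, 0, 1.0, 3.5, 0.5],
])
y_sup = np.array([0, 0, 1, 1])
w = relevance_weights(X_sup, y_sup)

# Target task: one example each of two *new* objects.
prototypes = np.array([[1, 0, 1, 5.0, 5.0, 5.0],
                       [0, 1, 0, 0.0, 0.0, 0.0]])
labels = ["C", "D"]

# A view of object C with a very different "expression":
x = np.array([1, 0, 1, 0.0, 0.0, 0.0])
print(predict_1nn(x, prototypes, labels, w))           # -> C
print(predict_1nn(x, prototypes, labels, np.ones(6)))  # -> D (no transfer)
```

Without the transferred weights, the large but irrelevant "expression" distance swamps the identity features and the single-example classifier fails; with them, one example per object suffices.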

As the discussion on this mailing list illustrates, there is a bunch of
people working on knowledge transfer and related issues; I have seen
quite a few exciting approaches. For example, Lori Pratt, Steve
Suddarth, Jon Baxter, Rich Caruana and many others have proposed
approaches which develop more robust internal representations in
Backprop networks based on learning multiple tasks (sequentially or in
parallel). Others, like Satinder Singh, Steve Whitehead, Anton Schwartz
have studied the issue of transfer in the context of reinforcement
learning. Basically, they have proposed ways to transfer action
policies (which are the result of reinforcement learning) across tasks.
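To make this mechanism concrete, here is a small self-contained toy of my own, not the specific algorithms of Singh, Whitehead, or Schwartz: a tabular Q-function computed for one goal in a deterministic corridor world is reused as the starting point for a nearby goal, and the greedy policy it carries over is already correct in most states.

```python
import numpy as np

N, GAMMA = 10, 0.9
ACTIONS = (-1, +1)  # move left / move right along a corridor of N cells

def q_iteration(goal, Q=None, sweeps=200):
    """Tabular Q-value iteration: reward 1 for stepping into the
    (absorbing) goal cell, 0 otherwise; moves are deterministic."""
    if Q is None:
        Q = np.zeros((N, len(ACTIONS)))
    for _ in range(sweeps):
        Q_new = np.zeros_like(Q)
        for s in range(N):
            if s == goal:
                continue  # absorbing goal: no further reward
            for a, step in enumerate(ACTIONS):
                s2 = min(max(s + step, 0), N - 1)
                if s2 == goal:
                    Q_new[s, a] = 1.0
                else:
                    Q_new[s, a] = GAMMA * Q[s2].max()
        Q = Q_new
    return Q

Q_A = q_iteration(goal=9)                  # source task
Q_B = q_iteration(goal=7)                  # related target task, from scratch
Q_B_transfer = q_iteration(goal=7, Q=Q_A)  # target task, initialized from Q_A

# The greedy policy carried over from task A already picks the optimal
# task-B action in most states, and continued iteration converges to
# the same solution as learning from scratch.
agree = sum(int(np.argmax(Q_A[s]) == np.argmax(Q_B[s]))
            for s in range(N) if s != 7)
print(agree, "of 9 non-goal states already optimal")
```

Only the states between the old and the new goal need their actions revised; everywhere else the transferred policy is a correct head start.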
There is a whole variety of other approaches (like Chris Atkeson's
variable distance metrics in memory-based learning), which could
potentially be applied in a lifelong learning context. However, I feel
that the area of knowledge transfer is still largely unexplored.  To
scale up learning algorithms, I believe it is really helpful not to
restrict oneself to looking at a single training set in isolation but
to consider all possible sources of knowledge about the target
function.
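
The parallel-transfer idea mentioned above can also be sketched in a few lines. This is a minimal illustration of my own, not any published setup: two related tasks are trained through one shared hidden layer with plain batch Backprop, so each task's error signal shapes a common internal representation (the data, layer sizes, and learning rate are all illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two tasks that depend on the same underlying feature (x0 + x1):
X = rng.normal(size=(200, 4))
Y = np.stack([(X[:, 0] + X[:, 1] > 0).astype(float),   # task 1
              (X[:, 0] + X[:, 1] > 1).astype(float)],  # task 2, related
             axis=1)

W1 = rng.normal(scale=0.5, size=(4, 6))  # shared hidden layer
W2 = rng.normal(scale=0.5, size=(6, 2))  # one output unit per task

def cross_entropy(P):
    return -np.mean(Y * np.log(P + 1e-9) + (1 - Y) * np.log(1 - P + 1e-9))

loss_start = cross_entropy(sigmoid(sigmoid(X @ W1) @ W2))
for _ in range(2000):                 # plain batch Backprop
    H = sigmoid(X @ W1)
    P = sigmoid(H @ W2)
    d2 = (P - Y) / len(X)             # output-layer error, both tasks
    d1 = (d2 @ W2.T) * H * (1 - H)    # back-propagated hidden error
    W2 -= 0.5 * H.T @ d2
    W1 -= 0.5 * X.T @ d1

P = sigmoid(sigmoid(X @ W1) @ W2)
loss_end = cross_entropy(P)
accuracy = ((P > 0.5) == Y).mean()
```

Because both output units pull on the same hidden weights W1, whatever one task learns about the shared feature is immediately available to the other.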

Sebastian


