interpolation vs. generalization

Sun Sep 8 16:30:24 EDT 1991

On  Sun, 08 Sep 91 13:36:50 ADT  	Zoubin Ghahramani <zoubin at learning.
siemens.com> writes:

>
> Just to throw a monkey wrench in:
>
> How does one interpret generalization as interpolation in a problem
> like n-bit parity? For any given data point, the n nearest neighbours
> in input space would all predict an incorrect classification. However,
> I wouldn't say that a problem like parity is ungeneralizable.
>
>
>                Zoubin Ghahramani
>                Dept. of Brain & Cognitive Sciences
>                MIT

Dear Zoubin,

I am happy to inform you that the above is no "monkey wrench". Here is
the solution to your parity problem within the transformation systems
model.
To simplify the exposition, I will ignore the positions of 1's and delete
all 0's.
Let C+ (the positive learning set) be as follows: ||||, ||, ||||||||;
and let C- (the negative learning set) be as follows: |||||, |, |||.
For this problem the initial transformation system T0 is
          S0 = {s1}
    the initial set of operations consists of a single operation s1 -
    deletion-insertion of |,
          CR = {R1}
    the set of composition rules consists of a single rule R1 -
    concatenation of two strings of |'s.
The competing family of distance functions is defined by means of
the shortest weighted path between two strings of |'s - the smallest
weighted number of current operations necessary to transform one string
into the other. Since we require that the sum of the operation weights
must be 1, and S0 consists of a single operation, the initial distance
between two patterns is just the difference between the number of |'s.
The following function is maximized during learning
                     F1 (w)
            F(w) = -----------,
                   F2 (w) + c
where w is the weight vector w = (w^1, w^2, . . ., w^m) , m is the
number of current operations, sum of w^i's is 1, F1 is the (shortest)
distance between C+ and C-, F2 is the average distance in C+, and c
is a very small positive constant (to prevent the overflow).
For T0 the maximum of F is 1/4. Applying rule R1 to the only
existing operation, we obtain new operation s2 - deletion-insertion
of ||. Thus, in the evolved transformation system T1 set S1 = {s1, s2},
and the optimization is over the 1-d simplex

                   w^2
                      |
                     1| .
                      |    .
                      |       .
                      |__________1______
                                        w^1

For T1 the maximum of F is a very large number 1/c and it is achieved at
w* = (1, 0). Moreover, F1 (w*) = 1  and F2 (w*) = 0. The learning stage
is completed.
The recognition stage and the propositional class description readily
follow from the above.

-- Lev Goldfarb