weighting of estimates

hicks at cs.titech.ac.jp
Thu Aug 5 11:33:00 EDT 1993


Jost Bernasch writes:
>
>James Franklin writes:
> > If you have a fairly accurate and a fairly inaccurate way of estimating
> >something, it is obviously not good to take their simple average (that
> >is, half of one plus half of the other). The correct weighting of the
> >estimates is in inverse proportion to their variances (that is, keep
> >closer to the more accurate one).
>
>Of course this is the correct weighting. Since the 60s this has been done
>very successfully with the well-known "Kalman Filter". In this theory
>the optimal combination of knowledge sources is described and
>proved in detail.
>
> >(At least, that is the correct
> >weighting if the estimates are independent: if they are correlated,
> >it is more complicated, but not much more). Proofs are easy, and included
> >in the ref below:
>
>For proofs and extensions to non-linear filtering and correlated
>weights, see the control theory literature. A lot of work has already
>been done!
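
(For concreteness, here is a minimal sketch of the inverse-variance weighting
described above; the two estimates and their variances are made-up numbers.)

    # Combine two independent, unbiased estimates, weighting each in
    # inverse proportion to its variance (hypothetical numbers).
    x1, var1 = 10.2, 1.0        # fairly accurate estimate
    x2, var2 = 11.5, 4.0        # fairly inaccurate estimate
    w1 = (1.0/var1) / (1.0/var1 + 1.0/var2)
    w2 = (1.0/var2) / (1.0/var1 + 1.0/var2)
    combined     = w1*x1 + w2*x2                 # = 10.46, closer to x1
    combined_var = 1.0 / (1.0/var1 + 1.0/var2)   # = 0.8, below both variances
    print(combined, combined_var)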

I think the comments about the Kalman filter are a bit off the mark.  The
Kalman filter is based on the mathematics of conditional expectation.
However, the Kalman filter is designed to be used for time series.  What makes
the Kalman filter particularly useful is its recursive nature; a stream of
observations may be processed (often in real time) to produce a stream of
current estimates (or next estimates if you're trying to beat the stock
market).  
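
(As an aside, here is a minimal scalar sketch of the recursive update I mean;
the random-walk model and the noise levels q and r are made-up assumptions,
not anyone's actual filter.)

    # Minimal scalar Kalman filter for a random-walk state: each observation
    # updates the running estimate, so a stream of data yields a stream of
    # current estimates.  (Hypothetical process noise q, measurement noise r.)
    def kalman_stream(observations, q=0.01, r=1.0, x0=0.0, p0=1.0):
        x, p = x0, p0
        for z in observations:
            p = p + q               # predict: uncertainty grows by process noise
            k = p / (p + r)         # Kalman gain
            x = x + k * (z - x)     # correct the estimate toward the observation
            p = (1.0 - k) * p       # reduced uncertainty after the correction
            yield x                 # current estimate, available in real time

    print(list(kalman_stream([1.0, 1.2, 0.9, 1.1])))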

Committees of networks may also use conditional expectation, but combining
networks is not the same as processing time series of data.  I think it is
appropriate at this point to bring up 2 classical results concerning
probability theory, conditional expectation, and wide sense conditional
expectation.  (Wide sense conditional expectation uses the same formulas as
conditional expectation.  "Wide sense" merely serves to emphasize that the
distribution is not assumed to be normal.  "Conditional expectation" is used
in the case where the underlying distribution is assumed to be normal.)

(1) When the objective function is to minimize the mean squared error over the
training data, the wide sense conditional expectation is the best linear
predictor, regardless of the original distribution.

(2) If the original distribution is normal, and the objective function is to minimize
the MSE over the >entire< distribution (both on-training and off-training), then
the conditional expectation is the best predictor, linear or otherwise.
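
(For reference, the wide sense conditional expectation of y given x is the
linear predictor E[y] + Cov(y,x) Cov(x,x)^(-1) (x - E[x]).  Below is a minimal
sketch of estimating it from training data; the committee outputs X and the
targets y are placeholders, not anyone's actual data.)

    import numpy as np

    # Wide sense conditional expectation of y given x, estimated from
    # training samples: the linear predictor built from first and second
    # moments.  By (1) it minimizes the MSE over these samples among all
    # linear predictors, whatever the underlying distribution.
    def wse_predictor(X, y):        # X: (n_cases, n_networks), y: (n_cases,)
        mx, my = X.mean(axis=0), y.mean()
        Xc, yc = X - mx, y - my
        Cxx = Xc.T @ Xc / len(y)    # Cov(x,x)
        cxy = Xc.T @ yc / len(y)    # Cov(x,y)
        w = np.linalg.solve(Cxx, cxy)   # use a pseudo-inverse if nearly collinear
        return lambda Xnew: my + (np.asarray(Xnew) - mx) @ w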

There are 3 important factors here.
[1]: Underlying distribution (of network outputs):  normal?  not normal?
[2]: Objective function (assume MSE):               on-training?  off-training?
[3]: Predictor:                                     linear?  non-linear?

{1}
[1:normal] => [2:off-training],[3:linear]
Neural nets (as opposed to systolic arrays) are needed because the world
is full of non-normal distributions.  But that doesn't mean that the outputs of
non-linear networks don't have joint normal distributions (over off-training
data).  Perhaps the non-linearities have been successfully ironed out by the
non-linear networks, leaving only linear (or nearly linear) errors to be
corrected.  In that case we can refer to result (2) to build the optimal
off-training predictor for the given committee of networks.

{2}
[1:not normal] and [2:on-training] and [3:linear] => best predictor is WSE. 
If the distribution of network outputs is not normal, and we use an
on-training criterion, then by virtue of (1), the best linear predictor is the
wide sense conditional expectation.  

{3}
[1:not normal] and [2:off-training] and [3:non-linear] => research
In case {2}, since [1:not normal],
<1> better on-training results may be obtained using some non-linear predictor,
<2> better on-or-off-training results may be obtained using some different criterion, or
<3> <1> and <2> together.
The problem, of course, is to find such criteria and non-linear predictors.
A priori knowledge can play an important role here; for example, adding a
term to penalize the complexity of the output functions.
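
(One minimal sketch of such a penalty, using a linear combiner only to show the
mechanics: add a quadratic complexity term to the MSE criterion.  The penalty
weight lam is a hypothetical knob, and this is just one of many possible
criteria, not the specific one intended above.)

    import numpy as np

    # Fit combining weights by minimizing  ||y - X w||^2 + lam * ||w||^2 :
    # the extra term penalizes "complex" (large-weight) output functions.
    def penalized_weights(X, y, lam=0.1):
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)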



In conclusion, if {1} is true, that is, the networks have captured the
non-linearities and the network outputs have a joint normal (or nearly normal)
distribution, we're home free.  Otherwise we ought to think about {3},
non-linear predictors and alternative criteria.  {2}, using the WSE (the linear
predictor with the smallest MSE over the on-training data), is useful for
getting the job done, but it is only optimal in a limited sense.


Craig Hicks           hicks at cs.titech.ac.jp
Ogawa Laboratory, Dept. of Computer Science
Tokyo Institute of Technology, Tokyo, Japan
lab: 03-3726-1111 ext. 2190  		home: 03-3785-1974
fax: +81(3)3729-0685 (from abroad), 03-3729-0685  (from Japan)






