Robustness

Ali Ahmad Minai aam9n at hagar1.acc.Virginia.EDU
Sat Aug 3 22:14:31 EDT 1991



In <9108020132.AA12736 at nsis86.cl.nec.co.jp>, Guido Bugmann writes:

> Robustness is a vague but often-used concept. Is there an
> accepted method for determining the robustness of a NN?

> In an application of a FF Backprop net to the reproduction
> of a function f(x,y) [1], we measured the robustness in
> the following way:
> After training was completed, each weight was set to zero in turn,
> and the root mean square (RMS) of the relative errors (the relative
> differences between the actual outputs of the net and the outputs
> defined in the training set) was measured, with the mean taken over
> all the examples in the training set.
> In our case, the largest RMS induced by the loss of one connection
> was 1600%. We used this "worst possible damage" as a measure
> of the (non-)robustness of the network.
> 
> [1] Bugmann, G., Lister, J.B. and von Stockar, U. (1989)
>  "The standard deviation method: Data Analysis by Classical Means
>   and by Neural Networks", Lab Report LRP-384/89, CRPP, Swiss Federal
>   Institute of Technology, CH-1015 Lausanne.
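
For concreteness, the lesioning procedure described above amounts to
roughly the following sketch, assuming a trained single-hidden-layer
net (the names, shapes and activation functions are illustrative
assumptions, not taken from [1]):

import numpy as np

def forward(X, W1, b1, W2, b2):
    # Feed-forward pass: tanh hidden layer, linear output layer.
    return np.tanh(X @ W1 + b1) @ W2 + b2

def worst_lesion_rms(X, Y, W1, b1, W2, b2, eps=1e-12):
    # Set each weight to zero in turn; return the largest RMS relative
    # error over the training set (X, Y).
    worst = 0.0
    for W in (W1, W2):
        for idx in np.ndindex(W.shape):
            saved = W[idx]
            W[idx] = 0.0                    # lesion one connection
            rel = (forward(X, W1, b1, W2, b2) - Y) / (np.abs(Y) + eps)
            worst = max(worst, np.sqrt(np.mean(rel ** 2)))
            W[idx] = saved                  # restore the connection
    return worst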

Indeed, robustness of neural networks could do with considerably
greater investigation. I am just finishing up a dissertation on
the robustness of feed-forward networks with real-valued inputs
and outputs. I have looked at a very simple case, probably the
simplest possible. I define a perturbation process over all the
non-output neurons of the network, with the major restriction that
only one neuron's output is perturbed during the presentation of
any one input vector. The neuron to be perturbed is selected according
to a distribution q(i), and the magnitude of the perturbation is an
independent random variable with zero-mean distribution p(d). For
simplicity of analysis, I take both q and p to be uniform, but
that can be relaxed. The robustness of the network over a data
set T is defined as the average deviation in the output of the
network operating under the perturbation process, relative to some
appropriate parameter of the distribution p (e.g., its spread or its
variance). The deviation can be measured in many ways: for simplicity,
I use the sum of absolute deviations over all network outputs.
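
To make the setup concrete, the perturbation process and deviation
measure can be estimated empirically along these lines (a Monte Carlo
sketch for a single-hidden-layer net, perturbing hidden-neuron outputs
only, with uniform p and q; all names are hypothetical):

import numpy as np

rng = np.random.default_rng(0)

def perturbed_deviation(X, W1, b1, W2, b2, b=0.1, passes=100):
    # Average sum of absolute output deviations when, for each input
    # vector, one hidden neuron (chosen uniformly: q) has zero-mean
    # uniform noise (p = U[-b, b]) added to its output.
    H = np.tanh(X @ W1 + b1)              # unperturbed hidden outputs
    clean = np.tanh(H @ W2 + b2)          # unperturbed network outputs
    n, h = H.shape
    total = 0.0
    for _ in range(passes):
        Hp = H.copy()
        neuron = rng.integers(h, size=n)  # q: uniform neuron choice
        d = rng.uniform(-b, b, size=n)    # p: zero-mean uniform magnitude
        Hp[np.arange(n), neuron] += d     # one neuron per input vector
        dev = np.abs(np.tanh(Hp @ W2 + b2) - clean).sum(axis=1)
        total += dev.mean()
    return total / passes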

The main thing is to predict the average deviation without making
a hundred passes over the data set, and without actually perturbing
the network. This is easily done using a power series approximation
of the relationship between each neuron output and each network output.
The required derivatives can be calculated using dynamic feedback a la
Werbos (back-propagation, if you like). As long as the weight vectors
of individual neurons in the network are not huge (i.e., if no neuron's
activation function is so steep as to be effectively discontinuous), the
approximation I make is quite reasonable. Of course, since all activation
and composition functions in the network are continuous and continuously
differentiable everywhere, there is always a perturbation process with a
bounded distribution that satisfies the convergence criteria of the power
series.
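
For a small net of this kind, the required derivatives reduce to one
application of the chain rule per layer. A sketch (tanh hidden and
output layers, matching the toy net above; illustrative only):

import numpy as np

def output_sensitivities(x, W1, b1, W2, b2):
    # J[i, k] = dO_k / dh_i at one input vector x, obtained by
    # back-propagating through the output layer.
    h = np.tanh(x @ W1 + b1)              # hidden-neuron outputs
    o = np.tanh(h @ W2 + b2)              # network outputs
    return W2 * (1.0 - o ** 2)            # chain rule; shape (hidden, outputs)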

Using uniform distributions for p and q, and retaining only the
linear term in the power series expansions, the analysis, applied to
any network, yields a characteristic measure that directly scales the
expected output deviation; i.e., given that p is U[0,b], the expected
output deviation is b/(2r), where r is the characteristic measure of
robustness for the network. Once r is determined, the network's response
to perturbation distributions with various spreads can be predicted
(within limits). Indeed, with hardly any extra effort (and no extra
computational expense), even the variance of the output deviation can
be predicted in a similar way. The computational complexity of determining
r is O(|W|*|T|), where W is the set of weights and T is the data set. Of
course, the predictive accuracy of r over data sets other than T depends
on how representative T is: the usual generalization issue. As T grows,
however, r's predictive accuracy should converge over all data sets chosen
under the same sampling distribution. The empirical results I have are
very good.
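
One consistent reading of r in code, under the same assumptions (uniform
p and q, linear term only), takes r to be the reciprocal of the mean
sensitivity over neurons and training examples, so that the predicted
expected deviation is b/(2r); this builds on output_sensitivities above
and is a sketch, not the exact formulation:

def characteristic_r(X, W1, b1, W2, b2):
    # r such that the predicted expected output deviation is b / (2 r);
    # the cost of the loop is O(|W| * |T|) overall.
    s = 0.0
    for x in X:
        J = output_sensitivities(x, W1, b1, W2, b2)
        s += np.abs(J).sum(axis=1).mean()  # mean_i sum_k |dO_k / dh_i|
    return len(X) / s

# Predicted vs. empirically measured deviation for a given spread b:
#   r = characteristic_r(X, W1, b1, W2, b2)
#   print(b / (2 * r), perturbed_deviation(X, W1, b1, W2, b2, b=b))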

One interesting aspect of this analysis is that it also provides a measure
of the sensitivity of the network output to perturbations in each neuron
output, which is a natural way of measuring the relevance of individual
neurons. This can be used either for pruning or (more consistently, I
think, with the connectionist philosophy) to encourage the emergence of
distributed representations. I am working on incorporating such
distribution imperatives into back-propagation and related algorithms,
and should have some results in a few months.
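
In the same vein, the per-neuron sensitivities give a ready relevance
ranking, e.g. as a pruning criterion (a hypothetical helper built on the
sketch above):

def neuron_relevance(X, W1, b1, W2, b2):
    # Mean sensitivity of the network outputs to each hidden neuron;
    # the lowest-scoring neurons are the natural pruning candidates.
    S = np.zeros(W2.shape[0])
    for x in X:
        S += np.abs(output_sensitivities(x, W1, b1, W2, b2)).sum(axis=1)
    return S / len(X)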

The work described above is now being written up as papers, and should
be submitted over the next month or two. I would be delighted to discuss
this and related issues with anyone who is interested. There is much work
to be done in extending this formulation to the case where multiple
perturbations are permitted simultaneously. Since the effects are neither
additive nor subadditive, I'm not sure how important the higher-order
terms are; probably quite important. Still, I have a few ideas, which I'll
be working on in the next few months.

Ali Minai
Electrical Engineering
University of Virginia

