How to Scale
n karunanithi
karunani at CS.ColoState.EDU
Wed Oct 16 22:23:31 EDT 1991
Dear Connectionists,
Some time back I posted the following problem to this newsgroup, and
many people responded with suggestions and references. I am thankful to
all of them. I have summarized their responses and am posting the summary
here for others who might find it interesting. For completeness' sake
I have included my original posting as well.
******Issue raised:
Background:
-----------
I have been using neural network models
(both feed-forward nets and recurrent nets) in a prediction
application and I am getting pretty good results. In fact,
the neural network approach outperformed many well-known analytic
models. Similar results have been reported by many researchers
in (chaotic) time series prediction.
Suppose that X is the independent variable and Y is the
dependent variable. Let (x(i), y(i)) represent a sequence
of actual input/output values observed at
time i = 0,1,2,...,t of a temporal process. Further, let both
the input and the output variables be one-dimensional and
take on positive integer values up to a maximum of 2000.
Once we train a network with the
history of the system up to time "t", we can use the network
to predict outputs y(t+h), h = 1,...,n for any future input x(t+h).
In my application I already have the complete sequence, and
hence I know the maximum values of x and y.
Using these maxima I normalized both X and Y to the 0.1 to 0.9 range.
(Here I call such normalization the "scaled representation".)
Since I have the complete sequence, it is possible for me to evaluate
how good the networks' predictions are.
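For concreteness, here is a minimal Python sketch of the scaling I mean
(the function names are mine, and the maximum of 2000 is the one from my
data):

    def scale(v, v_max, lo=0.1, hi=0.9):
        # Min-max scale a value in [0, v_max] into [lo, hi].
        # Presupposes that v_max is known in advance.
        return lo + (hi - lo) * v / v_max

    def unscale(s, v_max, lo=0.1, hi=0.9):
        # Invert the scaling to recover the original value.
        return (s - lo) * v_max / (hi - lo)

    print(scale(1000, 2000))   # 0.5
    print(scale(2000, 2000))   # 0.9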
Now some basic issues:
---------------------
1) How do we represent these variables if we don't know in advance
what the maximum values are?
Scaled representation presupposes the existence of a maximum value.
Some may suggest that linear units can be used at the output layer
to get rid of scaling. If so, how do I represent the input variable?
The standard sigmoidal unit (with temp = 1.0) gets saturated (or railed
to 1.0) when the sum is >= 14. One may suggest that changing
the output range of the sigmoid can help to
get rid of the saturation effect. Is that a correct approach?
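As a quick numerical check of the saturation claim (a sketch assuming the
standard logistic sigmoid, not code from any particular simulator):

    import math

    def sigmoid(x, temp=1.0):
        # Standard logistic sigmoid with temperature.
        return 1.0 / (1.0 + math.exp(-x / temp))

    print(sigmoid(14.0))    # 0.99999917... -- effectively railed to 1.0
    print(sigmoid(2000.0))  # 1.0 to machine precision; a raw, unscaled
                            # input pins the unit completely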
2) In such prediction applications, people (including me)
compare the predictive accuracy of neural networks with
that of parametric models (which are based on analytical reasoning).
But one main advantage of the parametric models is that
their parameters can be calculated using any of the following
parameter estimation techniques: least squares,
maximum likelihood, Bayesian methods, genetic algorithms, or any other
method. These parameter estimation techniques do not require
any scaling, and hence there is no need to guess the maximum values in advance.
With the scaled representation in neural networks, however, one
cannot proceed without making guesses about the maximum (or a future)
input and/or output. In many real-life situations such guesses are
infeasible or dangerous. How do we address this situation?
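To illustrate the contrast, here is a minimal least-squares sketch; the
linear model is only a stand-in for the analytic models I have in mind,
and the data are made up:

    import numpy as np

    # Fit y = a*x + b by ordinary least squares on raw, unscaled data.
    # No maximum value need be known or guessed in advance.
    x = np.array([1.0, 5.0, 40.0, 300.0, 1800.0])
    y = np.array([3.0, 11.0, 81.0, 601.0, 3601.0])

    A = np.vstack([x, np.ones_like(x)]).T
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(a, b)   # recovers a ~= 2, b ~= 1 at any scale of x and y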
____________________________________________________________________________
N. KARUNANITHI E-Mail: karunani at CS.ColoState.EDU
Computer Science Dept,
Colorado State University,
Fort Collins, CO 80523.
____________________________________________________________________________
******Responses Received:
1) Dr Huang at CMU
Date: Thu, 26 Sep 1991 11:40-EDT
From: Xuedong.Huang at SPEECH2.CS.CMU.EDU
I have several papers addressing the issues you raised. See for example:
[1] Huang, X.: "A Study on Speaker-Adaptive Speech Recognition", DARPA Speech
and Language Workshop, Feb. 1991, pp. 278-283.
[2] Huang, X., K. Lee and A. Waibel: "Connectionist Speaker Normalization and
Its Applications to Speech Recognition", IEEE Workshop on NNSP,
Princeton, Sept. 1991.
X.D. Huang, PhD
Research Computer Scientist
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
School of Computer Science Tel: (412) 268 2329
Carnegie Mellon University Fax: (412) 681 5739
Pittsburgh, PA 15213 Email: xdh at cs.cmu.edu
=============================================================================
2) From Alexander at CUNY
Date: Thu, 26 Sep 91 14:45 EDT
From: TWOMBLY%JHUBARD.BITNET at CUNYVM.CUNY.EDU
In response to your question about scaling for sigmoidal units.....
I ran into the same problem of not knowing the maximum value that my
input/output data would take at any particular time. There were no a priori
bounds that could be reasonably set, so the solution (in this case) was to get
rid of the sigmoidal activation function and replace it with one that did not
require any scaling. The function I used was a clipped linear function, that
is, f(x) = 0 for x < 0 and f(x) = x for x > 0. For my data this activation
function worked as well as the sigmoidal units (in some cases better) because
the hidden units never took advantage of the non-linearity in the upper range
of the sigmoid function.
The only difficulty with this function is that it does not have a continuous
derivative at 0. You can get around this problem by tacking on a 1/x-type
function for x < 0 that drops off very quickly. This will provide a
well-behaved, non-zero derivative for all parts of the activation function
while adding a negligible value to the output for x < 0. The actual function
I use is:
f(x) = x;                    x > 0
f(x) = 1/(10**2 - x*10**4);  x < 0
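In code, a sketch of this activation and its derivative (the derivative of
the negative branch follows by differentiating 1/(10**2 - x*10**4)):

    def act(x):
        # Clipped linear activation with a smooth, quickly decaying
        # negative tail in place of the hard zero.
        return x if x > 0.0 else 1.0 / (1e2 - x * 1e4)

    def act_deriv(x):
        # Well-behaved, non-zero derivative everywhere; it equals 1.0 at
        # x = 0 from both sides, so backpropagation sees no kink.
        return 1.0 if x > 0.0 else 1e4 / (1e2 - x * 1e4) ** 2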
I hope this helps.
-Alexander
=============================================================================
3) Dr. Fahlman at CMU
Date: Thu, 26 Sep 91 22:20:14 -0400
From: Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
1) How do we represent these variables if we don't know in advance
what the maximum values are?
Scaled representation presupposes the existence of a maximum value.
Some may suggest that linear units can be used at the output layer
to get rid of scaling.
Right, I was about to suggest that.
If so, how do I represent the input variable?
The standard sigmoidal unit (with temp = 1.0) gets saturated (or railed
to 1.0) when the sum is >= 14. One may suggest that changing
the output range of the sigmoid can help to
get rid of the saturation effect. Is that a correct approach?
For a non-recurrent network, the first layer of weights can and usually
will scale the inputs for you. You save some learning time and avoid possible
traps if the inputs are in some reasonable range, but it really isn't
essential. I'd advise adding a small constant (0.1 works well) to the
derivative of the sigmoid for all units so that you can recover if a unit
gets pinned to an extreme value.
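A sketch of that modification, with the usual sigmoid derivative written in
terms of the unit's output:

    def sigmoid_prime(y, offset=0.1):
        # y is the unit's output, sigmoid(x). The true derivative y*(1-y)
        # vanishes when y is pinned near 0 or 1; the small constant keeps
        # a usable gradient so the unit can recover.
        return y * (1.0 - y) + offset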
I don't understand your second point, so I won't try to reply to it.
Scott Fahlman
Carnegie Mellon University
=============================================================================
4) Ian Fitchet at Birmingham University
Date: Fri, 27 Sep 91 03:43:40 +0100
From: Ian Fitchet <I.D.Fitchet at computer-science.birmingham.ac.uk>
I'm no expert, but how about having two outputs: one is a control and
has a (mostly) fixed value; the other is the output y(i), which is
adjusted such that the one divided by the other gives the required
result. Off the top of my head: have the control output 0.9 most of
the time; when the value of y(i) goes above unity, have y(i) = 0.9 and
the control decrease, so that if the control equalled 0.45, say, then
the real value of the output would be 0.9/0.45 = 2.0.
Of course the question is then: how do I train the network to set
the value of the control? But I leave that as an exercise... :-)
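The decoding step might look like this (a sketch of the idea only; training
the control is the exercise above):

    def decode(y_out, control):
        # Recover the real-valued prediction from the two network outputs.
        return y_out / control

    print(decode(0.9, 0.9))    # 1.0 -- control at its usual fixed value
    print(decode(0.9, 0.45))   # 2.0 -- control halved to signal a value
                               # above unity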
Cheers,
Ian
--
Ian Fitchet I.D.Fitchet at cs.bham.ac.uk
School of Computer Science
Univ. of Birmingham, UK, B15 2TT
"You run and you run to catch up with the sun, but it's sinking" Pink Floyd
=============================================================================
5) From Dermot O'Brien at the University of Edinburgh
Date: Fri, 27 Sep 91 10:32:31 WET DST
Sender: dob at castle.edinburgh.ac.uk
You may be interested in the following references (if you haven't read them
already):
@techreport{Lapedes:87,
  Author      = "Alan S. Lapedes and Robert M. Farber",
  Title       = "Nonlinear signal processing using neural networks:
                 prediction and system modelling",
  Institution = "Los Alamos National Laboratory",
  Number      = "LA-UR-87-2662",
  Year        = 1987}
@incollection{Lapedes:88,
  Author    = "Alan S. Lapedes and Robert M. Farber",
  Title     = "How Neural Nets Work",
  BookTitle = "Evolution, Learning, and Cognition",
  Pages     = "331--346",
  Editor    = "Y. C. Lee",
  Publisher = "World Scientific",
  Address   = "Singapore",
  Year      = 1988}
The above papers analyse the behaviour of feed-forward neural networks
applied to the problem of time series prediction, and make an
interesting analogy with Fourier decomposition.
Cheers,
Dermot O'Brien
Physics Department
University of Edinburgh
The King's Buildings
Mayfield Road
Edinburgh EH9 3JZ
Scotland
=============================================================================
6) From: Tony Robinson <ajr at eng.cam.ac.uk>
Date: Fri, 27 Sep 91 12:23:23 BST
My immediate advice is:
Don't put the input through a nonlinearity at the start of the network.
Use linear output units.
Allow a linear path through the system, so that if a linear solution to the
problem is possible then it is also a possible network solution.
Then you will have no problems with maximum values.
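A sketch of what such a network might compute (hypothetical weight names;
the point is the direct linear path from input to output):

    import numpy as np

    def forward(x, W_ih, W_ho, W_io):
        # No input nonlinearity: x enters the network raw and unscaled.
        h = np.tanh(W_ih @ x)        # nonlinearity only in the hidden layer
        return W_ho @ h + W_io @ x   # linear output units plus a direct
                                     # linear path, so a purely linear
                                     # solution (W_ho = 0) stays available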
Tony [Robinson]
=============================================================================
End of summary.
____________________________________________________________________________
N. KARUNANITHI E-Mail: karunani at CS.ColoState.EDU
Computer Science Dept,
Colorado State University,
Fort Collins, CO 80523.
____________________________________________________________________________