How to Scale
n karunanithi
karunani at CS.ColoState.EDU
Wed Oct 16 22:23:31 EDT 1991
Dear Connectionists,
Some time back I posted the following problem to this newsgroup, and
many people responded with suggestions and references. I am thankful to
all of them. I have summarized their responses and am posting the summary
here for others who might find it interesting. For completeness' sake
I have included my original posting as well.
******Issue raised:
Background:
-----------
I have been using neural network models
(both feed-forward nets and recurrent nets) in a prediction
application and I am getting pretty good results. In fact,
the neural network approach outperformed many well-known analytic
models. Similar results have been reported by many researchers
in (chaotic) time series prediction.
Suppose that X is the independent variable and Y is the
dependent variable. Let (x(i), y(i)) represent a sequence
of actual input/output values observed at
time i = 0,1,2,...,t of a temporal process. Further, let both
the input and the output variables be one-dimensional and
take on positive integer values up to a maximum of 2000.
Once we train a network with the
history of the system up to time "t", we can use the network
to predict outputs y(t+h), h = 1,...,n for any future input x(t+h).
In my application I already have the complete sequence, and
hence I know the maximum values of x and y.
Using these maxima I normalized both X and Y to the 0.1 to 0.9 range.
(Here I call such normalization the "scaled representation".)
Since I have the complete sequence, it is possible for me to evaluate
how good the networks' predictions are.
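For concreteness, here is a minimal Python sketch of the scaling I mean
(the function names are mine, and the maximum of 2000 is the one from my
data):

    def scale(v, v_max, lo=0.1, hi=0.9):
        # Min-max scale a value in [0, v_max] into [lo, hi].
        # Presupposes that v_max is known in advance.
        return lo + (hi - lo) * v / v_max

    def unscale(s, v_max, lo=0.1, hi=0.9):
        # Invert the scaling to recover the original value.
        return (s - lo) * v_max / (hi - lo)

    print(scale(1000, 2000))   # 0.5
    print(scale(2000, 2000))   # 0.9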
Now some basic issues:
---------------------
1) How do we represent these variables if we don't know in advance
what the maximum values are?
Scaled representation presupposes the existence of a maximum value.
Some may suggest that linear units can be used at the output layer
to get rid of scaling. If so, how do I represent the input variable?
The standard sigmoidal unit (with temp = 1.0) gets saturated (or railed
to 1.0) when the sum is >= 14. One may suggest that changing
the output range of the sigmoid can help to
get rid of the saturation effect. Is that a correct approach?
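As a quick numerical check of the saturation claim (a sketch assuming the
standard logistic sigmoid, not code from any particular simulator):

    import math

    def sigmoid(x, temp=1.0):
        # Standard logistic sigmoid with temperature.
        return 1.0 / (1.0 + math.exp(-x / temp))

    print(sigmoid(14.0))    # 0.99999917... -- effectively railed to 1.0
    print(sigmoid(2000.0))  # 1.0 to machine precision; a raw, unscaled
                            # input pins the unit completely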
2) In such prediction applications, people (including me)
compare the predictive accuracy of neural networks with
that of parametric models (which are based on analytical reasoning).
But one main advantage of the parametric models is that
their parameters can be calculated using any of the following
parameter estimation techniques: least squares,
maximum likelihood, Bayesian methods, genetic algorithms, or any other
method. These parameter estimation techniques do not require
any scaling, and hence there is no need to guess the maximum values in advance.
With the scaled representation in neural networks, however, one
cannot proceed without making guesses about the maximum (or a future)
input and/or output. In many real-life situations such guesses are
infeasible or dangerous. How do we address this situation?
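To illustrate the contrast, here is a minimal least-squares sketch; the
linear model is only a stand-in for the analytic models I have in mind,
and the data are made up:

    import numpy as np

    # Fit y = a*x + b by ordinary least squares on raw, unscaled data.
    # No maximum value need be known or guessed in advance.
    x = np.array([1.0, 5.0, 40.0, 300.0, 1800.0])
    y = np.array([3.0, 11.0, 81.0, 601.0, 3601.0])

    A = np.vstack([x, np.ones_like(x)]).T
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    print(a, b)   # recovers a ~= 2, b ~= 1 at any scale of x and y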
____________________________________________________________________________
N. KARUNANITHI E-Mail: karunani at CS.ColoState.EDU
Computer Science Dept,
Colorado State University,
Fort Collins, CO 80523.
____________________________________________________________________________
******Responses Received:
1) Dr Huang at CMU
Date: Thu, 26 Sep 1991 11:40-EDT
From: Xuedong.Huang at SPEECH2.CS.CMU.EDU
I have several papers addressing the issues you raised. See for example:
[1] Huang, X.: "A Study on Speaker-Adaptive Speech Recognition", DARPA Speech
and Language Workshop, Feb. 1991, pp. 278-283.
[2] Huang, X., K. Lee and A. Waibel: "Connectionist Speaker Normalization and
Its Applications to Speech Recognition", IEEE Workshop on NNSP,
Princeton, Sept. 1991.
X.D. Huang, PhD
Research Computer Scientist
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
School of Computer Science Tel: (412) 268 2329
Carnegie Mellon University Fax: (412) 681 5739
Pittsburgh, PA 15213 Email: xdh at cs.cmu.edu
=============================================================================
2) From Alexander at CUNY
Date: Thu, 26 Sep 91 14:45 EDT
From: TWOMBLY%JHUBARD.BITNET at CUNYVM.CUNY.EDU
In response to your question about scaling for sigmoidal units.....
I ran into the same problem of not knowing the maximum value that my
input/output data would take at any particular time. There were no a priori
bounds that could be reasonably set, so the solution (in this case) was to get
rid of the sigmoidal activation function and replace it with one that did not
require any scaling. The function I used was a clipped linear function, that
is, f(x) = 0 for x < 0 and f(x) = x for x > 0. For my data this activation
function worked as well as the sigmoidal units (in some cases better) because
the hidden units never took advantage of the non-linearity in the upper range
of the sigmoid function.
The only difficulty with this function is that it does not have a continuous
derivative at 0. You can get around this problem by tacking on a 1/x-type
function for x < 0 that drops off very quickly. This will provide a
well-behaved, non-zero derivative for all parts of the activation function
while adding a negligible value to the output for x < 0. The actual function
I use is:
f(x) = x;                    x > 0
f(x) = 1/(10**2 - x*10**4);  x < 0
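In code, a sketch of this activation and its derivative (the derivative of
the negative branch follows by differentiating 1/(10**2 - x*10**4)):

    def act(x):
        # Clipped linear activation with a smooth, quickly decaying
        # negative tail in place of the hard zero.
        return x if x > 0.0 else 1.0 / (1e2 - x * 1e4)

    def act_deriv(x):
        # Well-behaved, non-zero derivative everywhere; it equals 1.0 at
        # x = 0 from both sides, so backpropagation sees no kink.
        return 1.0 if x > 0.0 else 1e4 / (1e2 - x * 1e4) ** 2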
I hope this helps.
-Alexander
=============================================================================
3) Dr. Fahlman at CMU
Date: Thu, 26 Sep 91 22:20:14 -0400
From: Scott_Fahlman at SEF-PMAX.SLISP.CS.CMU.EDU
1) How do we represent these variables if we don't know in advance
what the maximum values are?
Scaled representation presupposes the existence of a maximum value.
Some may suggest that linear units can be used at the output layer
to get rid of scaling.
Right, I was about to suggest that.
If so, how do I represent the input variable?
The standard sigmoidal unit (with temp = 1.0) gets saturated (or railed
to 1.0) when the sum is >= 14. One may suggest that changing
the output range of the sigmoid can help to
get rid of the saturation effect. Is that a correct approach?
For a non-recurrent network, the first layer of weights can and usually
will scale the inputs for you. You save some learning time and avoid possible
traps if the inputs are in some reasonable range, but it really isn't
essential. I'd advise adding a small constant (0.1 works well) to the
derivative of the sigmoid for all units so that you can recover if a unit
gets pinned to an extreme value.
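A sketch of that modification, with the usual sigmoid derivative written in
terms of the unit's output:

    def sigmoid_prime(y, offset=0.1):
        # y is the unit's output, sigmoid(x). The true derivative y*(1-y)
        # vanishes when y is pinned near 0 or 1; the small constant keeps
        # a usable gradient so the unit can recover.
        return y * (1.0 - y) + offset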
I don't understand your second point, so I won't try to reply to it.
Scott Fahlman
Carnegie Mellon University
=============================================================================
4) Ian Fitchet at Birmingham University
Date: Fri, 27 Sep 91 03:43:40 +0100
From: Ian Fitchet <I.D.Fitchet at computer-science.birmingham.ac.uk>
I'm no expert, but how about having two outputs: one is a control and
has a (mostly) fixed value; the other is the output y(i), which is
adjusted such that the one divided by the other gives the required
result. Off the top of my head: have the control output 0.9 most of
the time; when the value of y(i) goes above unity, have y(i) = 0.9 and
the control decrease, so that if the control equalled 0.45, say, then
the real value of the output would be 0.9/0.45 = 2.0.
Of course the question is then: how do I train the network to set
the value of the control? But I leave that as an exercise... :-)
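The decoding step might look like this (a sketch of the idea only; training
the control is the exercise above):

    def decode(y_out, control):
        # Recover the real-valued prediction from the two network outputs.
        return y_out / control

    print(decode(0.9, 0.9))    # 1.0 -- control at its usual fixed value
    print(decode(0.9, 0.45))   # 2.0 -- control halved to signal a value
                               # above unity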
Cheers,
Ian
--
Ian Fitchet I.D.Fitchet at cs.bham.ac.uk
School of Computer Science
Univ. of Birmingham, UK, B15 2TT
"You run and you run to catch up with the sun, but it's sinking" Pink Floyd
=============================================================================
5) From Dermot O'Brien at the University of Edinburgh
Date: Fri, 27 Sep 91 10:32:31 WET DST
Sender: dob at castle.edinburgh.ac.uk
You may be interested in the following references (if you haven't read them
already):
@techreport{Lapedes:87,
  Author      = "Alan S. Lapedes and Robert M. Farber",
  Title       = "Nonlinear signal processing using neural networks:
                 prediction and system modelling",
  Institution = "Los Alamos National Laboratory",
  Number      = "LA-UR-87-2662",
  Year        = 1987}
@incollection{Lapedes:88,
  Author    = "Alan S. Lapedes and Robert M. Farber",
  Title     = "How Neural Nets Work",
  BookTitle = "Evolution, Learning, and Cognition",
  Pages     = "331--346",
  Editor    = "Y. C. Lee",
  Publisher = "World Scientific",
  Address   = "Singapore",
  Year      = 1988}
The above papers analyse the behaviour of feed-forward neural networks
applied to the problem of time series prediction, and make an
interesting analogy with Fourier decomposition.
Cheers,
Dermot O'Brien
Physics Department
University of Edinburgh
The King's Buildings
Mayfield Road
Edinburgh EH9 3JZ
Scotland
=============================================================================
6) From: Tony Robinson <ajr at eng.cam.ac.uk>
Date: Fri, 27 Sep 91 12:23:23 BST
My immediate advice is:
Don't put the input through a nonlinearity at the start of the network.
Use linear output units.
Allow a linear path through the system, so that if a linear solution to the
problem is possible then it is also a possible network solution.
Then you will have no problems with maximum values.
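A sketch of what such a network might compute (hypothetical weight names;
the point is the direct linear path from input to output):

    import numpy as np

    def forward(x, W_ih, W_ho, W_io):
        # No input nonlinearity: x enters the network raw and unscaled.
        h = np.tanh(W_ih @ x)        # nonlinearity only in the hidden layer
        return W_ho @ h + W_io @ x   # linear output units plus a direct
                                     # linear path, so a purely linear
                                     # solution (W_ho = 0) stays available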
Tony [Robinson]
=============================================================================
End of summary.
____________________________________________________________________________
N. KARUNANITHI E-Mail: karunani at CS.ColoState.EDU
Computer Science Dept,
Colorado State University,
Fort Collins, CO 80523.
____________________________________________________________________________