Post-processing of neural net output (SUMMARY)

mesard@BBN.COM
Fri Feb 3 14:18:51 EST 1989


About a month ago, I asked for information about post-processing of
output activation of a (trained or semi-trained) network solving a
classification task.

My specific interest was what additional information can be extracted
from the output vector, and what techniques are being used to improve
performance and/or adjust the classification criteria (i.e., how the
output is interpreted).

I've been thinking about how Signal Detection Theory (SDT; cf. Green and
Swets, 1966) could be applied to NN classification systems.  Three areas
I am concerned about are:

 1) Typically, interpretation of a net's classifications ignores the
    cost/payoff matrix associated with the classification decision.  SDT
    provides a way to take this into account.

 2) A "point-5 threshold interpretation" of output vectors is in some
    sense arbitrary, both because of (1) and because the network may have
    developed a "bias" (predisposition) towards producing a particular
    response (or responses) as an artifact of its training.

 3) The standard interpretation does not take into account the a priori
    probability (likelihood) of an input of a particular type being
    observed.

SDT may also provide an interesting way to compare two networks.
Specifically, the d' ("D-prime") measure and ROC (receiver operating
characteristic) curves, which have been used successfully to analyze
human decision making, may be quite useful in understanding NN behavior.
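
As a concrete illustration of points (1)-(3), here is a minimal sketch
(Python with NumPy/SciPy; the hit/false-alarm rates, priors, and payoff
values are made-up placeholders, not results from any respondent) of
computing d' and of shifting the decision criterion with the standard
SDT likelihood-ratio rule:

    import numpy as np
    from scipy.stats import norm

    # Hit and false-alarm rates measured on a validation set (toy numbers).
    hit_rate, fa_rate = 0.85, 0.20

    # Sensitivity: d' = z(hit rate) - z(false-alarm rate).
    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)

    # Optimal likelihood-ratio criterion from SDT:
    #   beta = [P(noise)/P(signal)] * [(V_correct_rejection + C_false_alarm) /
    #                                  (V_hit + C_miss)]
    p_signal, p_noise = 0.3, 0.7                    # a priori probabilities (point 3)
    v_cr, c_fa, v_hit, c_miss = 1.0, 2.0, 1.0, 5.0  # payoff matrix (point 1)
    beta = (p_noise / p_signal) * ((v_cr + c_fa) / (v_hit + c_miss))

    # Respond "signal" when the likelihood ratio of the observed output
    # exceeds beta, rather than thresholding a unit at 0.5 (point 2).
    print(d_prime, beta)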

---

The enclosed summary covers only responses that addressed these specific
issues.  (The 19 messages I received totaled 27.5K.  This summary is
just under 8K.  I endeavored to preserve all the non-redundant
information and citations.)

Thanks to all who replied.

-- 
void Wayne_Mesard();   Mesard at BBN.COM   Bolt Beranek and Newman, Cambridge, MA
--

Summary of citation respondents:
------- -- -------- ------------
The following two papers discuss interpretation of multi-layer
perceptron outputs using probabilistic or entropy-like formulations:

@TECHREPORT{Bourlard88,
	AUTHOR = "H. Bourlard and C. J. Wellekens",
	YEAR = "1988",
	TITLE = "Links Between {M}arkov Models and Multilayer Perceptrons",
	INSTITUTION = "Philips Research Laboratory",
	MONTH = "October",
	NUMBER = "Manuscript M 263",
	ADDRESS = "Brussels, Belgium"
	}

@INPROCEEDINGS{Golden88,
   AUTHOR = "R. M. Golden",
   TITLE = "Probabilistic Characterization of Neural Model Computations",
   EDITOR = "D. Anderson",
   BOOKTITLE = "Neural Information Processing Systems",
   PUBLISHER = "American Institute of Physics",
   YEAR = "1988",
   ADDRESS = "New York",
   PAGES = "310-316"
   }


Geoffrey Hinton (and others) cites Hinton, G. E. (1987) "Connectionist
Learning Procedures", CMU-CS-87-115 (version 2) as a review of some
post-processing techniques.  He said that this tech report will
eventually appear in the AI journal.

He also says:

    The central idea is that any gradient descent learning procedure works
    just fine if the "neural net" has a non-adaptive post processing stage
    which is invertible -- i.e. it must be possible to back-propagate the
    difference between the desired and actual outputs through the post
    processing.  [...] The most sophisticated post-processing I know of is
    Herve Bourlard's use of dynamic time warping to map the output of a net
    onto a desired string of elements.  The error is back-propagated through
    the best time warp to get error derivatives for the detection of the
    individual elements in the sequence.

The paper by Kaplan and Johnson in the 1988 ICNN Proceedings addressed the
problem.

A couple of people said that Michael Jordan has done interesting work
in the area of post-processing, but no citations were provided.  (His
work from 2-3 years ago does discuss interpretation of output when
trained with "don't care"s in the target vector.  I don't know if this
is what they were referring to.)

"Best Guess"
 ---- -----
This involves looking at the set of valid output vectors, V(), and the
observed output, O, and interpreting O as V(i), where i minimizes
|V(i) - O|.

For one-unit-on-the-rest-off output vectors, this is the same thing as
taking the unit with the largest activation, but when classifying along
multiple dimensions simultaneously, this technique may be quite useful.
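
A minimal sketch of the rule above (my own NumPy illustration; the
target vectors and the observed output are made up):

    import numpy as np

    # Valid target vectors V(i), one row per class (one-on-rest-off here,
    # but any fixed set of valid targets works).
    V = np.eye(4)

    # Observed output vector O from the trained network.
    O = np.array([0.7, 0.2, 0.6, 0.1])

    # Interpret O as the V(i) that minimizes |V(i) - O|.
    i_best = int(np.argmin(np.linalg.norm(V - O, axis=1)))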

----

J.E. Roberts sent me a paper by A.P. Doohovskoy called "Metatemplates,"
presented at ICASSP Dallas, 1987 (no, I don't know what that is).  He
(Roberts) suggests using "a trained or semi-trained neural net to
produce one 'typical' output for each type of input class.  These
vectors would be saved as 'metatemplates'."  Then classification can be
done by comparing (via Euclidean distance or dot product) observed
output vectors with the metatemplates (where the closest metatemplate
wins).  This uses the information from the entire network output vector
for classification.
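
A minimal sketch of the metatemplate idea as I understand it (my own
NumPy illustration; using the per-class mean output as the "typical"
output is an assumption, not a detail from the paper):

    import numpy as np

    # outputs: network output vectors for a labeled calibration set;
    # labels:  the class index of each corresponding input.
    outputs = np.random.rand(100, 8)             # placeholder data
    labels = np.random.randint(0, 4, size=100)   # placeholder labels

    # One "typical" output (metatemplate) per class: here, the class mean.
    metatemplates = np.stack([outputs[labels == c].mean(axis=0)
                              for c in range(4)])

    # Classify a new output O by the nearest metatemplate (Euclidean here;
    # a dot product comparison could be used instead).
    O = np.random.rand(8)
    predicted = int(np.argmin(np.linalg.norm(metatemplates - O, axis=1)))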


Probability Measures
----------- --------

Terry Sejnowski writes:

    The value of an output unit is highly correlated with the
    confidence of a binary categorization.  In our study of 
    predicting protein secondary structure (Qian and Sejnowski,
    J. Molec. Biol., 202, 865-884) we have trained a network
    to perform a three-way classification.  Recently we have
    found that the real value of the output unit is highly
    correlated with the probability of correct classification
    of new, testing sequences.  Thus, 25% of the sequences
    could be predicted correctly with 80% or greater probability
    even though the average performance on the training set was
    only 64%.  The highest value among the output units is also
    highly correlated with the difference between the largest and
    second largest values.  We are preparing a paper for publication
    on these results.
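
One way to check this kind of output/accuracy relationship on one's own
net (my sketch, not a procedure from the Qian and Sejnowski paper) is to
bin test items by the value of the winning output unit and look at the
fraction correct within each bin:

    import numpy as np

    # outputs: (n_items, n_classes) test-set activations;
    # targets: (n_items,) true class indices.  Placeholder data here.
    outputs = np.random.rand(1000, 3)
    targets = np.random.randint(0, 3, size=1000)

    winner = outputs.argmax(axis=1)
    strength = outputs.max(axis=1)
    correct = (winner == targets)

    # Fraction correct within each band of winning-unit activation.
    for lo in np.arange(0.0, 1.0, 0.2):
        mask = (strength >= lo) & (strength < lo + 0.2)
        if mask.any():
            print(f"{lo:.1f}-{lo + 0.2:.1f}: {correct[mask].mean():.2f}")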
---

Mark Gluck writes:
    In our recent JEP:General paper (Gluck & Bower, 1988) we showed how
    the activations could be converted to choice probabilities using
    an exponential ratio function. This leads to good quantitative fits
    to human choice performance both at asymptote and during learning.
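
For reference, an exponential-ratio choice rule of the general kind
described is a softmax over scaled activations (the scaling constant c
and the example activations below are my own, not Gluck & Bower's
fitted values):

    import numpy as np

    def choice_probabilities(activations, c=2.0):
        # Exponential ratio rule: P(i) = exp(c * a_i) / sum_j exp(c * a_j).
        scores = np.exp(c * np.asarray(activations, dtype=float))
        return scores / scores.sum()

    print(choice_probabilities([0.9, 0.3, 0.1]))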
---

Tony Robinson states that the summed squared difference between the
actual output vector and the relevant target vector provides a measure
of the probability of belonging to each class [in a
one-bit-on-others-off output set].  [See "Best Guess" above.]

Confidence Measures
---------- --------

John Denker says:
   Yes, we've been using the activation level of the runner-up neurons
   to provide confidence information in our character recognizer for some time.
   The work was reported at the last San Diego mtg and at the last Denver mtg.
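
One simple way to use the runner-up activation (my own sketch; the
character recognizer Denker mentions presumably does something more
elaborate) is to report the margin between the winner and the runner-up
as a confidence score:

    import numpy as np

    def margin_confidence(output):
        # Difference between the winning activation and the runner-up;
        # small margins flag low-confidence classifications for rejection.
        top_two = np.sort(np.asarray(output, dtype=float))[-2:]
        return float(top_two[1] - top_two[0])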
---

Mike Rossen describes the speech recognition system that he and Jim
Anderson are working on.  The target vectors are real-valued, with each
phoneme represented by several units whose activations lie in [-1, 1]:

    Our retrieval method is a discretized dynamical system
    in which system output is fed back into the system using appropriate
    feedback and decay parameters.  Our scoring method is based on an
    average activation threshold, but the number of iterations the
->  system takes to reach this threshold -- the system reaction time --
->  serves as a confidence measure.
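
A minimal sketch of that reaction-time idea (my own illustration of an
iterate-to-threshold readout; the feedback matrix, decay, gain, and
threshold are placeholders, not the parameters of their model):

    import numpy as np

    def reaction_time(output, W_fb, decay=0.9, gain=0.3,
                      threshold=0.8, max_iters=100):
        # Feed the output back through itself, clipping to [-1, 1], and
        # count the iterations until the mean absolute activation crosses
        # the threshold.  The count is the "reaction time": fewer
        # iterations, higher confidence.
        x = np.asarray(output, dtype=float)
        for t in range(1, max_iters + 1):
            x = np.clip(decay * x + gain * (W_fb @ x), -1.0, 1.0)
            if np.mean(np.abs(x)) >= threshold:
                return x, t
        return x, max_iters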

[He also reports on intra-layer connections on the outputs (otherwise,
he's using a vanilla feedforward net), which sounds like a groovy idea,
although it seems to me that this would have pros and cons in his
application.]
    After the feedforward network is trained, connections AMONG THE OUTPUT
    UNITS are trained.  This "post-processing" reduces both omission and
    confusion errors by the system.

Some preliminary results of the speech model are reported in:

  Rossen, M.L., Niles, L.T., Tajchman, G.N., Bush, M.A., & Anderson, J.A.
   (1988).  Training methods for a connectionist model of CV syllable
   recognition.  Proceedings of the Second Annual International
   Conference on Neural Networks, 239-246.

  Rossen, M.L., Niles, L.T., Tajchman, G.N., Bush, M.A., Anderson, J.A., &
   Blumstein, S.E. (1988).  A connectionist model for consonant-vowel syllable
   recognition.  ICASSP-88, 59-66.

Improving Discriminability
--------- ----------------

Ralph Linsker says:
    You may be interested in an issue related, but not identical,
    to the one you raised; namely, how can one tailor the
    network's response so that the output optimally discriminates among
    the set of input vectors, i.e. so that the output provides maximum
    information about what the input vector was?  This is addressed in:
    R. Linsker, Computer 21(3)105-117 (March 1988); and in my papers in the
    1987 and 1988 Denver NIPS conferences.  The quantity being maximized
    is the Shannon information rate (from input to output), or equivalently
    the average mutual information between input and output.
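
For concreteness, the quantity being maximized, the average mutual
information I(X;Y) = sum over x,y of p(x,y) log[ p(x,y) / (p(x)p(y)) ],
can be estimated for discretized inputs and outputs along these lines
(my own sketch of the objective, not Linsker's learning procedure):

    import numpy as np

    def mutual_information(x_codes, y_codes):
        # Estimate I(X;Y) in bits from paired integer-coded samples.
        x_codes, y_codes = np.asarray(x_codes), np.asarray(y_codes)
        joint = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
        np.add.at(joint, (x_codes, y_codes), 1)
        p_xy = joint / joint.sum()
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        nz = p_xy > 0
        return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())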
---

Dave Burr refers to 
  D. J. Burr, "Experiments with a Connectionist Text Reader," Proc. ICNN-87, 
  pp. IV717-IV724, San Diego, CA, June 1987.

The paper describes a post-processing routine that assigns a score to
every word in an English dictionary by summing log-compressed activations.
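
A minimal sketch of that kind of word scoring (my own illustration; the
per-position letter activations and the small floor value are
assumptions, not details from Burr's paper):

    import numpy as np

    def score_word(word, letter_activations, eps=1e-6):
        # Sum of log-compressed activations of each letter of the word at
        # its position; letter_activations[pos] maps a letter to its
        # network output value.
        return sum(np.log(letter_activations[pos].get(ch, 0.0) + eps)
                   for pos, ch in enumerate(word))

    # Pick the dictionary word with the highest score (toy example).
    letter_activations = [{'c': 0.9, 'e': 0.2}, {'a': 0.8, 'o': 0.3},
                          {'t': 0.7, 'r': 0.4}]
    dictionary = ["cat", "car", "cot", "ear"]
    best_word = max(dictionary, key=lambda w: score_word(w, letter_activations))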

-=-=-=-=-=-=-=-=-=-

