Subtractive methods / Cross validation (includes summary)

pratt@cs.rutgers.edu
Wed Nov 27 15:33:37 EST 1991


Hi,

FYI, I've summarized the recent discussion on subtractive methods below.

A couple of comments:
  o [Ramachandran and Pratt, 1992] presents a new subtractive method, called
    Information Measure Based Skeletonisation (IMBS).  IMBS induces a decision
    tree over the hidden unit hyperplanes of a learned network in order to
    detect which hidden units are superfluous.  Single train/test holdout
    experiments on three real-world problems (Deterding vowel recognition,
    Peterson-Barney vowel recognition, heart disease diagnosis) indicate that
    the method substantially reduces hidden unit counts without degrading
    generalization scores.  It's also very intuitive.  (A rough sketch of the
    idea follows the reference below.)

  o There seems to be some confusion between the very different goals of:

      (1) Evaluating the generalization ability of a network, and
      (2) Creating a network with the best possible generalization performance.

    Cross-validation is used for (1).  However, as P. Refenes points out, once
    the generalization score has been estimated, you should use *all* training
    data to build the best network possible.
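
    In code the distinction looks roughly like this (a minimal sketch, assuming
    scikit-learn and hypothetical data arrays X and y; any package offering
    k-fold cross-validation would do):

        from sklearn.model_selection import cross_val_score
        from sklearn.neural_network import MLPClassifier

        # (1) Estimate generalization ability with 10-fold cross-validation.
        net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
        scores = cross_val_score(net, X, y, cv=10)
        print("estimated generalization accuracy:", scores.mean())

        # (2) Then fit the network you will actually deploy on *all* the data.
        final_net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000).fit(X, y)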

--Lori

  @incollection{ ramachandran-92,
  MYKEY           = " ramachandran-92 : .con .bap",
  EDITOR          = "D. S. Touretzky",
  BOOKTITLE       = "{Advances in Neural Information Processing Systems 4}",
  AUTHOR          = "Sowmya Ramachandran and Lorien Pratt",
  TITLE           = "Discriminability Based Skeletonisation",
  ADDRESS         = "San Mateo, CA",
  PUBLISHER       = "Morgan Kaufmann",
  YEAR            = 1992,
  NOTE            = "(To appear)"
  }
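
For concreteness, here is one possible reading of the IMBS idea, written as a
small scikit-learn sketch purely for illustration.  The data arrays X and y and
all parameter values are hypothetical, and the actual procedure is the one
specified in the paper above; this is not the authors' code.

    import numpy as np
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier

    # Train an ordinary one-hidden-layer network on the task (X, y hypothetical).
    net = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X, y)

    # Each hidden unit defines a hyperplane in input space; recover the hidden
    # activations (MLPClassifier's default hidden activation is relu).
    H = np.maximum(0.0, X @ net.coefs_[0] + net.intercepts_[0])

    # Induce a decision tree that predicts the class from the hidden units.
    tree = DecisionTreeClassifier(max_depth=5).fit(H, y)

    # Hidden units the tree never splits on carry little discriminative
    # information and are candidates for removal.
    used = {f for f in tree.tree_.feature if f >= 0}
    superfluous = [i for i in range(H.shape[1]) if i not in used]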


Summary of discussion so far:

hht: Hans Henrik Thodberg <thodberg@nn.meatre.dk>
 sf: Scott_Fahlman@sef-pmax.slisp.cs.cmu.edu
jkk: John K. Kruschke <KRUSCHKE@ucs.indiana.edu>
 rs: R Srikanth <srikanth@cs.tulane.edu>
 pr: P.Refenes@cs.ucl.ac.uk
 gh: Geoffrey Hinton <hinton@ai.toronto.edu>
 kl: Ken Laws <LAWS@ai.sri.com>
 js: Jude Shavlik <shavlik@cs.wisc.edu>

hht~~: Request for discussion.  Goal is good generalisation: achievable
hht~~: if nets are of minimal size.    Advocates subtractive methods
hht~~: over additive ones.  Gives Thodberg, Lecun, Weigend
hht~~: references.

 sf~~: restricting complexity ==> better generalization only when
 sf~~: ``signal components are larger and more coherent than the noise''
 sf~~: Describes what cascade correlation does.
 sf~~: Questions why a subtractive method should be superior to this.
 sf~~: Gives reasons to believe that subtractive methods might be slower
 sf~~: (because you have to train, chop, train, instead of just train)

jkk~~: Distinguishes between removing a node and just removing its
jkk~~: participation (by zeroing weights, for example).  When nodes
jkk~~: are indeed removed, subtractive schemes can be more expensive,
jkk~~: since we are training nodes which will later be removed.
jkk~~: Cites his work (w/ Movellan) on schemes which are both additive
jkk~~: and subtractive.
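
   [An aside for concreteness: in numpy, the distinction between zeroing a
    hidden unit's weights and actually removing the unit looks roughly like
    this; the shapes and the pruned index below are made up.]

      import numpy as np
      rng = np.random.default_rng(0)
      W_in  = rng.normal(size=(4, 3))    # input  -> hidden weights
      W_out = rng.normal(size=(3, 2))    # hidden -> output weights
      dead = 1                           # index of the unit judged superfluous

      # "Zeroing" the unit's participation: the unit stays in the
      # architecture, so every later train/chop/train pass still carries it.
      W_in_z, W_out_z = W_in.copy(), W_out.copy()
      W_in_z[:, dead]  = 0.0
      W_out_z[dead, :] = 0.0

      # Actually removing the unit: delete its column and row, so the
      # network really shrinks.
      W_in_r  = np.delete(W_in,  dead, axis=1)
      W_out_r = np.delete(W_out, dead, axis=0)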

 rs~~:  Says that overgeneralization is bad: distinguishes best fit from
 rs~~:  most general fit as potentially competing criteria.

 pr~~: Points out that pruning techniques are able to remove redundant
 pr~~: parts of the network.  Also points out that using a cross-validation
 pr~~: set without a third set is ``training on the testing data''.

 gh~~: Points out that, though you might be doing some training on the testing
 gh~~: set, since you only get a single number as feedback from it, you aren't
 gh~~: really fully training on this set.
 gh~~: Also points out that techniques such as his work on soft weight-sharing
 gh~~: seem to work noticeably better than using a validation set to decide
 gh~~: when to stop training.
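
   [Another aside: the train / validation / test split that P. Refenes and
    Geoff Hinton are discussing looks roughly like this.  scikit-learn, the
    data arrays X and y, and the candidate hidden-layer sizes are illustrative
    assumptions, not anything proposed in the thread; the same split applies
    when the validation score is used to decide when to stop training.]

      from sklearn.model_selection import train_test_split
      from sklearn.neural_network import MLPClassifier

      # Train / validation / test: the test set is touched exactly once.
      X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
      X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

      best_score, best_net = -1.0, None
      for n_hidden in (2, 4, 8, 16):
          net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=2000).fit(X_tr, y_tr)
          score = net.score(X_val, y_val)   # the single number of feedback per candidate
          if score > best_score:
              best_score, best_net = score, net

      # Only now look at the held-out test set, for an unbiased estimate.
      print("test accuracy:", best_net.score(X_te, y_te))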

hht~~: Agrees that comparative studies between subtractive and additive
hht~~: methods would be a good thing.  Describes a brute-force subtractive
hht~~: method.  Argues, by analogy to automobile construction and idea
hht~~: generation, why subtractive methods are more appealing than additive
hht~~: ones.

 ~~pr: Argues that you'd get better generalization if you used more
 ~~pr: examples for training; in particular, not just a subset of the
 ~~pr: available training examples.

 ~~kl: Points out the similarity between the additive/subtractive debate
 ~~kl: and stepwise-inclusion vs stepwise-deletion issues in multiple
 ~~kl: regression.  

 ~~js: Points out that when reporting the number of examples used for
 ~~js: training, it's important to include the cross-validation examples
 ~~js: as well.

