From dhw at santafe.edu Fri Dec 1 11:18:19 1995
From: dhw at santafe.edu (David Wolpert)
Date: Fri, 1 Dec 95 09:18:19 MST
Subject: Correcting misunderstandings about NFL
Message-ID: <9512011618.AA27395@sfi.santafe.edu>

This posting is to correct some misunderstandings that were recently posted concerning the NFL theorems. I also draw attention to some of the incorrect interpretations commonly ascribed to certain COLT results.

***

Joerg Lemm writes:

>>> 1.) If there is no relation between the function values on the test and training set (i.e. P(f(x_j)=y|Data) equal to the unconditional P(f(x_j)=y) ), then, having only training examples y_i = f(x_i) (=data) from a given function, it is clear that I cannot learn anything about values of the function at different arguments, (i.e. for f(x_j), with x_j not equal to any x_i = nonoverlapping test set). >>>

Well put. Now here's the tough question: Vapnik *proves* that it is unlikely (for large enough training sets and small enough VC dimension generalizers) for error on the training set and full "generalization error" to be greatly different. Regardless of the target. Using this, Baum and Haussler even wrote a paper "What size net gives valid generalization?" in which no assumptions whatsoever are made about the target, and yet the authors are able to provide a response to the question of their title.

HOW IS THAT POSSIBLE GIVEN WHAT YOU JUST WROTE????

NFL is "obvious". And so are VC bounds on generalization error (well, maybe not "obvious"). And so is the PAC "proof" of Occam's razor. And yet the latter two bound generalization error (for those cases where training set error is small enough) without making any assumptions about the target. What gives?

The answer: The math of those works is correct. But far more care must be exercised in the interpretation of that math than you will find in those works.
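A toy preview of the kind of care at issue (my own construction, not from the papers under discussion; it uses a Hoeffding bound as a simpler stand-in for a VC bound, and all the numbers are arbitrary). The bound is valid over random draws of the data, yet the moment you condition on having noticed a low empirical error, its guarantee evaporates:

```python
import math
import random

random.seed(0)

m, eps, true_err = 20, 0.3, 0.5
# Hoeffding: over random size-m samples, P(|emp - true| > eps) <= 2*exp(-2*m*eps^2)
bound = 2 * math.exp(-2 * m * eps * eps)

trials = 20000
bad = 0                # unconditional failures of the bound's event
picked = 0             # trials where we "noticed" a low empirical error
bad_given_picked = 0   # failures among the picked trials
for _ in range(trials):
    emp = sum(random.random() < true_err for _ in range(m)) / m
    if abs(emp - true_err) > eps:
        bad += 1
    if emp <= 0.15:    # selection: only invoke the bound when error looks small
        picked += 1
        if abs(emp - true_err) > eps:
            bad_given_picked += 1

# The unconditional guarantee holds comfortably...
print(bad / trials, "<=", bound)
# ...but conditioned on the selection, the "unlikely" event happens every time,
# because emp <= 0.15 already forces |emp - 0.5| > 0.3.
if picked:
    print(bad_given_picked / picked)
```

The selection step is the "We-Learn-It Inc." trap in miniature: the probability the bound controls has the random training set on the left of the conditioning bar, not the observed training error on the right.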
The care involves paying attention to what goes on the right-hand side of the conditioning bars in one's probabilities, and the implications of what goes there. Unfortunately, such conditioning bars are completely absent in those works... (In fact, the sum-total of the difference between Bayesian and COLT approaches to supervised batch learning lies in what's on the right-hand side of those bars, but that's another story. See [2].)

As an example, it is widely realized that VC bounds suffer from being worst-case. However there is another hugely important caveat to those bounds. The community as a whole simply is not aware of that caveat, because the caveat concerns what goes on the right-hand side of the conditioning bar, and this is NEVER made explicit. This caveat is the fact that VC bounds do NOT concern Pr(IID generalization error | observed error on the training set, training set size, VC dimension of the generalizer). But you wouldn't know that to read the claims made on behalf of those bounds ...

To give one simple example of the ramifications of this: Let's say you have a favorite low-VC generalizer. And in the course of your career you parse through learning problems, either explicitly or (far more commonly) without even thinking about it. When you come across one with a large training set on which your generalizer has small error on the training set, you want to invoke Vapnik to say you have assurances about full generalization error. Well, sorry. You don't and you can't. You simply can't escape Bayes by using confidence intervals. Confidence intervals in general (not just in VC work) have the annoying property that as soon as you try to use them, very often you contradict the underlying statistical assumptions behind them. Details are in [1] and in the discussion of "We-Learn-It Inc." in [2].

>>> 2.)
We are considering two of those (influence) relations P(f(x_j)=y|Data): one, named A, for the true nature (=target) and one, named B, for our model under study (=generalizer). Let P(A and B) be the joint probability distribution for the influence relations for target and generalizer.

3.) Of course, we do not know P(A and B), but in good old Bayesian tradition, we can construct a (hyper-)prior P(C) over the family of probability distributions of the joint distributions C = P(A and B).

4.) NFL now uses the very special prior assumption P(A and B) = P(A)P(B) >>>

If I understand you correctly, I would have to disagree. NFL also holds with your P(C) being any prior assumption - more formally, averaging over all priors, you get NFL. So the set of priors for which your favorite algorithm does *worse than random* is just as large as the set for which it does better. (In this sense, the uniform prior is a typical prior, not a pathological one, out on the edge of the space. It is certainly not a "very special prior".)

In fact, that's one of the major points of NFL - it's not to see what life would be like if this or that were uniform, but to use such uniformity as a mathematical tool, to get a handle on the underlying geometry of inference, the size of the various spaces (e.g., the size of the space of priors for which you lose to random), etc. The math *starts* with NFL, and then goes on to many other things (see [1]). It's only the beginning chapter of the textbook.

>>> I say that it is rational to believe (and David does so too, I think) that in real life cross-validation works better in more cases than anti-cross-validation. >>>

Oh, most definitely. There are several issues here: 1) what gives with all the "prior-free" general proofs of COLT, given NFL, 2) purely theoretical issues (e.g., as mentioned before, characterizing the relationship between target and generalizers needed for xval. to beat anti-xval.)
and 3) perhaps most provocatively of all, seeing if NFL (and the associated mathematical structure) can help you generalize in the real world (e.g., with head-to-head minimax distinctions between generalizers).

***

Finally, Eric Baum weighs in:

>>> Barak Pearlmutter remarked that saying We have *no* a priori reason to believe that targets with "low Kolmogorov complexity" (or anything else) are/not likely to occur in the real world. (which I gather was a quote from David Wolpert?) is akin to saying we have no a priori reason to believe there is non-random structure in the world, which is not true, since we make great predictions about the world. >>>

Well, let's get a bit formal here. Take all the problems we've ever tried to make "great predictions" on. Let's even say that these problems were randomly chosen from those in the real world (i.e., no selection effects of people simply not reporting when their predictions were not so great). And let's for simplicity say that all the predictions were generated by the same generalizer - the algorithm in the brain of Eric Baum will do as a straw man.

Okay. Now take all those problems together and view them as one huge training set. Better still, add in all the problems that Eric's ancestors addressed, so that the success of his DNA is also taken into account. That's still one training set. It's a huge one, but it's tiny in comparison to the full spaces it lives in.

Saying that we (Eric) make "great predictions" simply means that the xvalidation error of our generalizer (Eric) on that training set is small. (You train on part of the data, and predict on the rest.) Formally (!!!!!), this gives no assurances whatsoever about any behavior off-training-set. As I've stated before, without assumptions, you cannot conclude that low xvalidation error leads to low off-training-set generalization error. And of course, each passing second, each new scene you view, is "off-training-set".
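In a small enough setting this can be checked exhaustively (a toy sketch of my own, not the formal framework of the NFL papers): take a four-point input space, all sixteen Boolean targets, and a learner together with its deliberately perverse "anti" counterpart. Averaged uniformly over targets, their off-training-set errors are identical:

```python
from itertools import product

X = [0, 1, 2, 3]    # tiny input space
train_x = [0, 1]    # training inputs
test_x = [2, 3]     # off-training-set inputs

def majority(train_y):
    # predict the commonest training label at every off-training-set point
    return 1 if 2 * sum(train_y) >= len(train_y) else 0

def anti_majority(train_y):
    # the deliberately perverse counterpart
    return 1 - majority(train_y)

def avg_ots_error(alg):
    # zero-one off-training-set error, averaged uniformly over all 2**4 targets
    targets = list(product([0, 1], repeat=len(X)))
    total = 0.0
    for f in targets:
        pred = alg([f[x] for x in train_x])
        total += sum(pred != f[x] for x in test_x) / len(test_x)
    return total / len(targets)

print(avg_ots_error(majority), avg_ots_error(anti_majority))  # -> 0.5 0.5
```

Off the training set, the uniform average over targets washes out everything the training data could tell you, so both learners land at exactly chance.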
The fallacy in Eric's claim was noted all the way back by Hume. Success at inductive inference cannot formally establish the utility of using inductive inference. To claim that it can, you have to invoke inductive inference, and that, as any second grader can tell you, is circular reasoning.

Practically speaking of course, none of this is a concern in the real world. We are all (me included) quite willing to conclude there is structure in the real world. But as was noted above, what we do in practice is not the issue. The issue is one of theory.

***

It's very similar to high-energy physics. There are a bunch of physical constants that, if only slightly varied, would (seem to) make life impossible. Why do they have the values they have? Some invoke the anthropic principle to answer this - we wouldn't be around if they had other values. QED. But many find this a bit of a cop-out, and search for something more fundamental. After all, you could have stopped the progress of physics at any point in the past if you had simply gotten everyone to buy into the anthropic principle at that point in time.

Similarly with inductive inference. You could just cop out and say "anthropic principle" - if inference were not possible, we wouldn't be having this debate. But that's hardly a satisfying answer.

***

Eric goes on:

>>> Consider the problem of learning to predict the pressure of a gas from its temperature. Wolpert's theorem, and his faith in our lack of prior about the world, predict, that any learning algorithm whatever is as likely to be good as any other. This is not correct. >>>

To give two examples from just the past month, I'm sure MCI and Coca-Cola would be astonished to know that the algorithms they're so pleased with were designed for them by someone having "faith in our lack of prior about the world". Less glibly, let me address this claim about my "faith" with two quotes from the NFL for supervised learning paper.
The first is in the introduction, and the second in a section entitled "On uniform averaging". So neither is exactly hidden ...

1) "It cannot be emphasized enough that no claim is being made .. that all algorithms are equivalent in the real world."

2) "The uniform sums over targets ... weren't chosen because there is strong reason to believe that all targets are equally likely to arise in practice. Indeed, in many respects it is absurd to ascribe such a uniformity over possible targets to the real world. Rather the uniform sums were chosen because such sums are a useful theoretical tool with which to analyze supervised learning."

Finally, given that I'm mixing it up with Eric on NFL, I can't help but quote the following from his "What size net gives valid generalization" paper: "We have given bounds (independent of the target) on the training set size vs. neural net size need such that valid generalization can be expected." (Parenthetical comment added - and true.)

Nowhere in the paper is there any discussion whatsoever of the apparent contradiction between this statement and NFL-type concerns. Indeed, as mentioned above, with only the conditioning-bar-free mathematics in Eric's paper, there is no way to resolve the contradiction. In this particular sense, that paper is extremely misleading. (See discussion above on misinterpretations of Vapnik's results.)

>>>> Creatures evolving in this "play world" would exploit this structure and understand their world in terms of it. There are other things they would find hard to predict. In fact, it may be mathematically valid to say that one could mathematically construct equally many functions on which these creatures would fail to make good predictions. But so what? So would their competition. This is not relevant to looking for one's key, which is best done under the lamppost, where one has a hope of finding it. In fact, it doesn't seem that the play world creatures would care about all these other functions at all.
>>>

I'm not sure I quite follow this. In particular, the comment about the "competition" seems to be wrong. Let me just carry Eric's metaphor further, though, and point out that it makes a hell of a lot more sense to pull out a flashlight and explore into the surrounding territory for your key than it does to spend all your time with your head down, banging into the lamppost. And NFL is such a flashlight.

David Wolpert

[1] The current versions of the NFL for supervised learning papers, nfl.ps.1.Z and nfl.ps.2.Z, at ftp.santafe.edu, in pub/dhw_ftp.

[2] "The Relationship between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in *The Mathematics of Generalization*, D. Wolpert Ed., Addison-Wesley, 1995.

From marco at McCulloch.Ing.UniFI.IT Fri Dec 1 12:21:43 1995
From: marco at McCulloch.Ing.UniFI.IT (Marco Gori)
Date: Fri, 01 Dec 1995 18:21:43 +0100
Subject: Italian Neural Network Society
Message-ID: <9512011721.AA09634@McCulloch.Ing.UniFI.IT>

==============================================================
This is to announce a new web page describing the aims and the activities of the Italian Neural Network Society. The page is hosted at the DSI Web server of the Dipartimento di Sistemi e Informatica, Universita' di Firenze, at the following address:

http://www-dsi.ing.unifi.it/neural/siren

-- marco gori.
===============================================================

From schmidhu at informatik.tu-muenchen.de Sun Dec 3 06:40:25 1995
From: schmidhu at informatik.tu-muenchen.de (Juergen Schmidhuber)
Date: Sun, 3 Dec 1995 12:40:25 +0100
Subject: compressibility and generalization
Message-ID: <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de>

Eric Baum wrote:

>>> (1) While it may be that in classical Lattice gas models, a gas does not have high Kolmogorov complexity, this is not the origin of the predictability exploited by physicists.
Statistical mechanics follows simply from the assumption that the gas is in a random one of the accessible states, i.e. the states with a given amount of energy. So *define* a *theoretical* gas as follows: Every time you observe it, it is in a random accessible state. Then its Kolmogorov complexity is huge (there are many accessible states) but its macroscopic behavior is predictable. (Actually this is an excellent description of a real gas, given quantum mechanics.) <<<

(1) The key expression here is ``the assumption that the gas is in a random one of the *accessible* states''. Since the accessible states are defined to be those with equal energy, this greatly restricts the number of possible states. By definition, it is trivial to make a macro-level prediction like ``the total energy will remain constant''. In turn, there are relatively short descriptions of a given history of such a gas. With a truly random gas, however, there are no invariants eliminating most of the possible states. This makes its history incompressible.

(2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''. Loosely speaking, in a universe whose history is compressible, we may expect to generalize well. A simple, old counting argument shows: most computable universes are incompressible. Therefore, in most computable universes you won't generalize well (this is related to what has been (re)discovered in NFL).

(3) Hence, the best we may hope for is a learning technique with good expected generalization performance in *arbitrary* compressible universes. Actually, another restriction is necessary: the time required for compression and decompression should be ``tolerable''. To formalize the expression ``tolerable'' is the subject of ongoing research.
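The counting argument in (2) is short enough to state in code (a sketch of my own; the program count is the standard bound, not anything specific to the setting above). There are 2^n binary histories of length n, but fewer than 2^(n-c) binary programs shorter than n-c bits, so strictly less than a 2^-c fraction of histories can be compressed by even c bits:

```python
from fractions import Fraction

def fraction_compressible(n, c):
    # histories of length n: 2**n of them;
    # descriptions (programs) shorter than n-c bits: 2**(n-c) - 1 of them
    strings = 2 ** n
    short_programs = 2 ** (n - c) - 1
    return Fraction(short_programs, strings)   # exact, no float rounding

for c in (1, 10, 20):
    frac = fraction_compressible(100, c)
    print(c, float(frac))
    assert frac < Fraction(1, 2 ** c)   # strictly below 2**-c
```

Even a 10-bit saving is available to less than one string in a thousand, and the fraction halves with every further bit demanded.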
Juergen Schmidhuber
IDSIA
juergen at idsia.ch

From hicks at cs.titech.ac.jp Sun Dec 3 00:32:43 1995
From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp)
Date: Sun, 3 Dec 1995 14:32:43 +0900
Subject: Is the universe finite?
Message-ID: <199512030532.OAA02207@euclid.cs.titech.ac.jp>

I would like to make 2 points. One concerns a clarification of David Wolpert's definition of the universe. The second one is a thought problem meant to be an illustration of the inevitability of structure.

Point 1: David Wolpert writes:

(1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>anscestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some meaning. "(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well. On the other hand "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average be the same. What is the universe about which we are talking? Since mathematical models exist in our minds and on paper in this universe, are they included? I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes.

Point 2: Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=(1/2).
Apparently there is no structure in this universe. Now let us consider a universe which is a binary valued N by M matrix random variable AA whose elements are also iid with p(0)=p(1)=(1/2). Let us draw a random instance A from AA. Now we define an M-dimensional integer random variable Y = AX (the ordinary matrix-vector product over the integers), where x and y are instances of X and Y respectively. If A happens to be chosen such that y is merely a subset of the elements of x, then the prior p(y), like the prior p(x), will be uniform. But for most choices of A, p(y) will not be uniform at all. So, out of all the possible universes Y, most of them have structure. This happens even though Y and AA have no structure. The structure that Y will have is drawn from a uniform distribution (over AA), but we are only concerned with whether there will be structure or not.

Of course, this proves nothing. And now I am going to make a giant leap of analogy. The following statements are not contradictory.

(a) In a universe drawn at random from the ensemble of all possible universes, we cannot expect to see any particular structure to be more likely than any other structure.

(b) In any given universe, we can expect structure to be present.

Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable?

Craig Hicks

Craig Hicks hicks at cs.titech.ac.jp       | Hisakata no, hikari nodokeki
Ogawa Laboratory, Dept. of Computer Science | Haru no hi ni, Shizu kokoro naku
Tokyo Institute of Technology, Tokyo, Japan | Hana no chiruran
lab:03-5734-2187 home:03-3785-1974          | Spring smiles with sun beams
fax (from abroad):                          | sifting down through cloudy dreams
+81(3)5734-2905 OGAWA LAB                   | towards the anxious hearts
03-5734-2905 OGAWA LAB (from Japan)         | beating pitter pat
[ Poem from Hyaku-nin i-syuu ->             | while flower petals scatter.

From arbib at pollux.usc.edu Sun Dec 3 14:28:26 1995
From: arbib at pollux.usc.edu (Michael A.
Arbib)
Date: Sun, 3 Dec 1995 11:28:26 -0800 (PST)
Subject: VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES
Message-ID: <199512031928.LAA10890@pollux.usc.edu>

PRELIMINARY CALL FOR PAPERS

Workshop on VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES

Sedona, Arizona, November 22-24, 1996

Co-Directors: Kiisa Nishikawa (Northern Arizona University, Flagstaff) and Michael Arbib (University of Southern California, Los Angeles).

Program Committee: Kiisa Nishikawa (Chair), Michael Arbib, Emilio Bizzi, Chris Comer, Peter Ewert, Simon Giszter, Mel Goodale, Ananda Weerasuriya, Walt Wilczynski, and Phil Zeigler.

Local Arrangements Chair: Kiisa Nishikawa.

This workshop is the sequel to four earlier workshops on the general theme of "Visuomotor Coordination in Frog and Toad: Models and Experiments". The first two were organized by Rolando Lara and Michael Arbib at the University of Massachusetts, Amherst (1981) and Mexico City (1982). The next two were organized by Peter Ewert and Arbib in Kassel and Los Angeles, respectively, with the Proceedings published as follows:

Ewert, J.-P. and Arbib, M.A., Eds., 1989, Visuomotor Coordination: Amphibians, Comparisons, Models and Robots, New York: Plenum Press.

Arbib, M.A. and J.-P. Ewert, Eds., 1991, Visual Structures and Integrated Functions, Research Notes in Neural Computing 3, Heidelberg, New York: Springer-Verlag.

The time is ripe for a fifth Workshop on this theme, with the more generic title "Visuomotor Coordination: Amphibians, Models, and Comparative Studies". The Workshop will be held in Sedona - a beautiful small resort town set in dramatic red hills in Arizona - straight after the Society for Neuroscience meeting in 1996.
Next year, Neuroscience ends on Thursday, November 21, 1996, in Washington, DC, so people can fly to Phoenix that evening, meet Friday, Saturday, and Sunday, and fly home Monday November 25th (so that US types not going to Neuroscience get the Saturday stopover that they could not get if we met before Neuroscience).

The aim is to study the neural mechanisms of visuomotor coordination in frog and toad both for their intrinsic interest and as a target for developments in computational neuroscience, and also as a basis for comparative and evolutionary studies. The list of subsidiary themes given below is meant to be representative of this comparative dimension, but is not intended to be exhaustive. In each case, the emphasis (but not the exclusive emphasis) will be on papers which contribute to the development of both modeling and experimentation.

Central Theme: Visuomotor Coordination in Frog and Toad

Subsidiary Themes:
Visuomotor Coordination: Comparative and Evolutionary Perspectives
Reaching and Grasping in Frog, Pigeon, and Primate
Cognitive Maps
Auditory Communication (with emphasis on spatial behavior and sensory integration)
Sensory Control of Motor Pattern Generators

Formal registration information will be available in March of 1996. Scientists who wish to present papers are asked to send three copies of extended abstracts no later than March 31st, 1996 to:

Kiisa Nishikawa
Department of Biological Sciences
Northern Arizona University
Flagstaff, AZ 86011-5640

Notification of the Program Committee's decision will be sent out no later than May 31st, 1996. A decision as to whether or not to publish a proceedings is still pending.
From theresa at umiacs.UMD.EDU Mon Dec 4 10:13:47 1995
From: theresa at umiacs.UMD.EDU (Theresa)
Date: Mon, 04 Dec 1995 10:13:47 -0500
Subject: Postdoc Position in Neural Modeling
Message-ID: <199512041513.KAA05125@skippy.umiacs.UMD.EDU>

The University of Maryland Institute for Advanced Computer Studies (UMIACS) invites applications for postdoctoral positions, beginning summer/fall '96, in the following areas: Real-time Video Indexing, Natural Language Processing, and Neural Modeling. Exceptionally strong candidates from other areas will also be considered.

UMIACS, a state-supported research unit, has been the focal point for interdisciplinary and applications-oriented research activities in computing on the College Park campus. The Institute's 40 faculty members conduct research in high performance computing, software engineering, artificial intelligence, systems, combinatorial algorithms, scientific computing, and computer vision.

Qualified applicants should send a 1-page statement of research interests, curriculum vitae, and the names and addresses of 3 references to:

Prof. Joseph Ja'Ja'
UMIACS
A.V. Williams Building
University of Maryland
College Park, MD 20742

by April 1. UMIACS strongly encourages applications from minorities and women. EOE/AA

From howse at eece.unm.edu Mon Dec 4 11:12:34 1995
From: howse at eece.unm.edu (James W. Howse)
Date: Mon, 04 Dec 1995 09:12:34 -0700
Subject: Dissertation Available
Message-ID: <9512041612.AA27407@opus.eece.unm.edu>

The following PhD dissertation is available by FTP:

Gradient and Hamiltonian Dynamics: Some Applications to Neural Network Analysis and System Identification

James W. Howse

Abstract

The work in this dissertation is based on decomposing system dynamics into the sum of dissipative (e.g., convergent) and conservative (e.g., periodic) components. Intuitively, this can be viewed as decomposing the dynamics into a component normal to some surface and components tangent to other surfaces.
First, this decomposition was applied to existing neural network architectures to analyze their dynamic behavior. Second, this formalism was employed to create models which learn to emulate the behavior of actual systems. The premise of this approach is that the process of system identification can be considered in two stages: model selection and parameter estimation. In this dissertation a technique is presented for constructing dynamical systems with desired qualitative properties. Thus, the model selection stage consists of choosing the dissipative and conservative portions appropriately so that a certain behavior is obtainable. By choosing the parametrization of the models properly, a learning algorithm has been devised and proven to always converge to a set of parameters for which the error between the output of the actual system and the model vanishes. So these models and the associated learning algorithm are guaranteed to solve certain types of nonlinear identification problems.

Retrieval:

ftp ftp.eece.unm.edu
login as anonymous
cd howse
get dissertation.ps.Z

This is a PostScript file compressed with compress. The dissertation is 133 pages long and formatted to print single-sided. If there are any retrieval or printing problems please let me know. I would welcome any comments or suggestions regarding the dissertation. No hardcopies are available.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
James Howse - howse at eece.unm.edu
__ __ __ __ _ _ /\ \/\ \/\ \/\ \/\ `\_/ `\ University of New Mexico \ \ \ \ \ \ `\\ \ \ \ Department of EECE, 224D \ \ \ \ \ , ` \ \ `\_/\ \ Albuquerque, NM 87131-1356 \ \ \_\ \ \ \`\ \ \ \_',\ \ Telephone: (505) 277-0805 \ \_____\ \_\ \_\ \_\ \ \_\ FAX: (505) 277-1413 or (505) 277-1439 \/_____/\/_/\/_/\/_/ \/_/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From zhuh at helios.ASTON.ac.uk Mon Dec 4 15:33:50 1995
From: zhuh at helios.ASTON.ac.uk (zhuh)
Date: Mon, 4 Dec 1995 20:33:50 +0000
Subject: compressibility and generalization
Message-ID: <28443.9512042033@sun.aston.ac.uk>

On the implications of the No Free Lunch Theorem(s) by David Wolpert,

> From: Juergen Schmidhuber
>
> (3) Hence, the best we may hope for is a learning technique with
> good expected generalization performance in *arbitrary* compressible
> universes. Actually, another restriction is necessary: the time
> required for compression and decompression should be ``tolerable''.
> To formalize the expression ``tolerable'' is subject of ongoing
> research.

However, the deeper NFL Theorem states that this is still impossible:

1. The *non-existence* of structure guarantees any algorithm will neither win nor lose, compared with the "random algorithm", in the long run. If this were all that is there, then NFL would be just a tautology.

2. The *mere existence* of structure guarantees a (not uniformly-random) algorithm is as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good.

3. This is by far one of the most important implications of NFL, yet my sample from Connectionists shows that it is safe to make the posterior prediction that if someone criticises NFL as irrelevant, then he has not got this far yet.
In conclusion: "for arbitrary environment there is an optimal algorithm" is drastically different from "there is an optimal algorithm for arbitrary environment", whatever restrictions you make on the word "arbitrary".

--
Huaiyu Zhu, PhD                  email: H.Zhu at aston.ac.uk
Neural Computing Research Group  http://neural-server.aston.ac.uk/People/zhuh
Dept of Computer Science         ftp://cs.aston.ac.uk/neural/zhuh
and Applied Mathematics          tel: +44 121 359 3611 x 5427
Aston University,                fax: +44 121 333 6215
Birmingham B4 7ET, UK

From dhw at santafe.edu Mon Dec 4 19:49:47 1995
From: dhw at santafe.edu (David Wolpert)
Date: Mon, 4 Dec 95 17:49:47 MST
Subject: Non-randomness is no panacea
Message-ID: <9512050049.AA16646@sfi.santafe.edu>

Craig Hicks writes:

>>>>> (1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>anscestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some meaning. >>>>

Not at all. The second statement is concerned with theoretical issues, whereas the first one is concerned with practical issues. The distinction is ubiquitous in science and engineering. Even in the little corner of academia known as supervised learning, most people are content to distinguish the concerns of COLT (theory) from those of what-works-in-practice.

>>> "(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well.
>>>

Were you referring to (2) instead? Neither statement says anything like "we should not concern ourselves with the real universe".

>>> On the other hand "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average be the same. >>>

Again, I would have hoped that nothing I have said could be construed as saying something like that. It may or may not be true, but you said it, not me. :-) I am sorry if you were somehow given the wrong impression.

>>>> I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes. >>>>

This is a time-worn concern. Read up on the past two centuries' worth of battles between Bayesians and non-Bayesians...

>>>> Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=(1/2). Apparently there is no structure in this universe. >>>>

NO!!! Forgive my ... passion, but as I've said many times now, even in a purely random universe, there are many very deep distinctions between the behavior of different learning algorithms (and in this sense there is plenty of "structure"). Like head-to-head minimax distinctions. (Or uniform convergence theory a la Vapnik.) Please read the relevant papers! ftp.santafe.edu, pub/dhw_ftp, nfl.ps.1.Z and nfl.ps.2.Z.

>>>> (b) In any given universe, we can expect structure to be present. Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable? >>>>

Nope. The structure can just as easily negate the usefulness of xvalidation as establish it. And in fact, the version of NFL in which one fixes the target and then averages over generalizers says that the state of the universe is (in a certain precise sense), by itself, irrelevant. Structure or not; that fact alone cannot determine the utility of xvalidation.
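The point that structure can cut either way admits a concrete toy construction (entirely my own; the models and targets are arbitrary choices). Two targets agree on the training inputs, so leave-one-out cross-validation prefers the same model for both; on one target that preference wins off-training-set, and on the other it is exactly wrong:

```python
def f(x):          # a benign target
    return x * x

def g(x):          # a deceptive target: agrees with f on the integer
    return x * x if x == int(x) else 0.0   # training grid, 0 elsewhere

train_x = list(range(10))
test_x = [x + 0.5 for x in range(10)]      # off-training-set points

def nn(data, x):   # 1-nearest-neighbour predictor
    return min(data, key=lambda p: abs(p[0] - x))[1]

def zero(data, x): # a trivial always-0 predictor
    return 0.0

def loo(model, target):   # leave-one-out cross-validation error
    data = [(x, target(x)) for x in train_x]
    return sum((model(data[:i] + data[i + 1:], x) - y) ** 2
               for i, (x, y) in enumerate(data)) / len(data)

def ots(model, target):   # off-training-set error
    data = [(x, target(x)) for x in train_x]
    return sum((model(data, x) - target(x)) ** 2 for x in test_x) / len(test_x)

print(loo(nn, g) < loo(zero, g))   # True: xval prefers 1-NN on g just as on f
print(ots(nn, f) < ots(zero, f))   # True: on f that preference pays off
print(ots(nn, g) > ots(zero, g))   # True: on g the same preference loses badly
```

Both g and f are perfectly "structured" (each has a short description); the training data alone cannot tell you which one you are in.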
***

Although I think it is at best tangential to further discuss Kolmogorov complexity, Juergen Schmidhuber's recent comment deserves a response. He writes:

>>>>> (2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''. Loosely speaking, in a universe whose history is compressible, we may expect to generalize well. >>>>

How could this be true? Nothing has been specified in Juergen's statement about the loss function, how test sets are generated (IID vs. off-training-set vs. who knows what), the generalizer used, how it is related (if at all) to the prior over targets (a prior which, I take it, Juergen wishes to be "compressible"), the noise process, whether there is noise in the inputs as well as the outputs, etc., etc. Yet all of those factors are crucial in determining the efficacy of the generalizer.

Obviously if your generalizer *knows* the "compression scheme of the universe", knows the noise process, etc., then it will generalize well. Is that what you're saying, Juergen? It reduces to saying that if you know the prior, you can perform Bayes-optimally. There is certainly no disputing that statement.

It is worth bearing in mind though that NFL can be cast in terms of averages over priors. In that guise, it says that there are just as many priors - just as many ways of having a universe be "compressible", loosely speaking - for which your favorite algorithm dies as there are for which it shines. In fact, it's not hard to show that an average over only those priors that are more than a certain distance from the uniform prior results in NFL - under such an average, for OTS error, etc., all algorithms have the same expected performance. The simple fact of having a non-uniform prior does not mean that better-than-random generalization arises.

***

Structure, compressibility, whatever you want to call it; it can hurt just as readily as it can help.
The simple claim that there is non-randomness in the universe does not establish that any particular algorithm performs better than random guessing. To all those who dispute this, I ask that they present a theorem relating generalization error to "compressibility". (To do this, of course, they will have to specify the loss function, noise, etc.) Not words, but math, and not just math concerning Kolmogorov complexity considered in isolation. Math presenting a formal relationship between generalization error and "compressibility". (A relationship that doesn't reduce to the statement that if you have information concerning the prior, you can exploit it to generalize well - no rediscovery of the wheel, please.) David Wolpert From hicks at cs.titech.ac.jp Mon Dec 4 20:40:08 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 5 Dec 1995 10:40:08 +0900 Subject: compressibility and generalization In-Reply-To: Juergen Schmidhuber's message of Sun, 3 Dec 1995 12:40:25 +0100 <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de> Message-ID: <199512050140.KAA05180@euclid.cs.titech.ac.jp> On Sun, 3 Dec 1995 12:40:25, Juergen Schmidhuber wrote: >(2) Back to: what does this have to do with machine learning? As a >first step, we may simply apply Solomonoff's theory of inductive >inference to a dynamic system or ``universe''. Loosely speaking, >in a universe whose history is compressible, we may expect to >generalize well. A simple, old counting argument shows: most >computable universes are incompressible. Therefore, in most >computable universes you won't generalize well (this is related >to what has been (re)discovered in NFL). In an earlier communication I hypothesized that a typical universe would have structure that could be exploited by cross-validation.
This communication from Juergen Schmidhuber contradicts my hypothesis, I think, because of the existence of the "simple, old counting argument" showing that "most computable universes are incompressible". I stand corrected. The point I really wanted clarified was what was meant by the assertion that in a typical universe (A) cross-validation works as well as anti-cross-validation. I will just talk about the problem of (deterministic or stochastic) function estimation. I can accept that for any set of model functions, there will be an infinity of problems where cross-validation will be of no assistance, because that model does not have the capacity to predict future input/output relations from any finite set of examples from the past. This could be either because the true function is pure noise, or because it looks like pure noise from the perspective of any function from the set of candidate model functions. In this case there will be no correlation between predictions and samples, and cross-validation will do its job of telling us that the generalization error is not decreasing. However, I interpret the assertion that anti-cross-validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst-case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should.
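[Editorial note: assertion (A) can be made concrete in a deliberately tiny setting. The sketch below is a construction of mine, not from the thread; the two constant hypotheses and all names are illustrative. It selects between two models by cross-validation or by anti-cross-validation, then averages off-training-set error uniformly over every target on four points. The two averages coincide, which is all that (A) asserts; it says nothing about any single randomly sampled function.]

```python
from itertools import product

train_x = [0, 1, 2]
test_x = 3
models = {0: lambda x: 0, 1: lambda x: 1}   # two constant hypotheses

def cv_score(m, labels):
    # for a constant model, the leave-one-out score equals plain training error
    return sum(int(models[m](x) != y) for x, y in zip(train_x, labels))

def select(labels, anti=False):
    best = min(models, key=lambda m: cv_score(m, labels))  # cross-validation pick
    return (1 - best) if anti else best                    # anti-CV picks the other

def avg_ots_error(anti):
    # uniform average over all 2^4 targets f: {0,1,2,3} -> {0,1}
    errors = []
    for f in product([0, 1], repeat=4):
        m = select([f[x] for x in train_x], anti)
        errors.append(int(models[m](test_x) != f[test_x]))
    return sum(errors) / len(errors)

print(avg_ots_error(anti=False), avg_ots_error(anti=True))  # 0.5 0.5
```

The equality holds because, averaged uniformly over targets, the test label is independent of the training labels that drove the selection.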
I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). Does assertion (A) mean that this principle fails in alternative universes? Respectfully Yours, Craig Hicks Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan From juergen at idsia.ch Tue Dec 5 12:50:01 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Tue, 5 Dec 95 18:50:01 +0100 Subject: Compressibility and Generalization Message-ID: <9512051750.AA00953@fava.idsia.ch> Shahab Mohaghegh requested a definition of ``compressibility of the history of a universe''. Let S(t) denote the state of a computable universe at discrete time step t. Let's suppose S(t) can be described by n bits. The history of the universe between time step 1 (big bang) and time step t is compressible if it can be computed by an algorithm whose size is clearly less than tn bits. Given a particular computing device, most histories are incompressible: there are 2^tn possible histories, but there are fewer than (1/2)^c * 2^tn = 2^(tn-c) algorithms with fewer than tn-c bits (c is a small positive constant). In most possible universes, the mutual algorithmic information between past and future is zero, and previous experience won't help to generalize well in the future. There are a few compressible or ``regular'' universes, however. To use ML terminology, some of them allow for ``generalization by analogy''. Some of them allow for ``generalization by chunking''. Some of them allow for ``generalization by exploiting invariants''. Etc. It would be nice to have a method that can generalize well in *arbitrary* regular universes. Juergen Schmidhuber IDSIA
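[Editorial note: the counting argument in the post above is easy to verify numerically. In this sketch of mine (the function names are invented), a history of tn = n bits counts as compressible by c bits if some program of fewer than n - c bits computes it; since each program computes at most one history, the compressible fraction is strictly below 2^-c.]

```python
def num_programs_shorter_than(m):
    # binary programs with fewer than m bits: lengths 0 .. m-1
    return sum(2 ** k for k in range(m))  # = 2^m - 1

def compressible_fraction_bound(n, c):
    # each program computes at most one history, so at most
    # num_programs_shorter_than(n - c) of the 2^n histories of
    # length n can be compressed by c or more bits
    return num_programs_shorter_than(n - c) / 2 ** n

for c in (1, 5, 10):
    print(c, compressible_fraction_bound(20, c))  # each value is below 2 ** -c
```

Already at c = 10, fewer than one history in a thousand is compressible by ten bits, which is the sense in which "most computable universes are incompressible".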
From gluck at pavlov.rutgers.edu Tue Dec 5 16:52:15 1995 From: gluck at pavlov.rutgers.edu (Mark Gluck) Date: Tue, 5 Dec 1995 16:52:15 -0500 Subject: Faculty Openings at Rutgers-Newark for Connectionist Modelers Interested in Cog Sci/Cog Neuro Message-ID: <199512052152.QAA16557@pavlov.rutgers.edu> The following junior faculty openings at Rutgers-Newark may be of interest to connectionist modelers working in the area of Cognitive Psychology and Cognitive Neuroscience. Although a purely theoretical researcher would be considered, someone who combines both theoretical/computational modeling and experimental research would be preferred: - Mark Gluck CENTER FOR MOLECULAR AND BEHAVIORAL NEUROSCIENCE COGNITIVE NEUROSCIENCE One faculty position in human cognitive neuroscience is available at the assistant to full professor level. Scientists with a research focus on the neurobiological basis of higher cortical function in humans, who would be stimulated by the integrative focus and collaborative research environment of the Center for Molecular and Behavioral Neuroscience, are encouraged to apply. Research areas include (but are not limited to) human experimental neuropsychology, neuropsychiatry, brain imaging and neuroplasticity, cognitive neuroscience, neurolinguistics, development, human electrophysiology, computational neuroscience, neural basis of speech, attention, memory, perception, emotion, psychophysics and behavioral genetics. State-of-the-art laboratories and equipment for human research, and a doctoral program in Behavioral and Neural Science, are available in the Center. Additional information on our program, research facilities, and faculty can be obtained over the internet at: http://www.cmbn.rutgers.edu/bns-home.html. Neuroscientists interested in brain/behavior relationships in normal and/or clinical populations should send CV, names of three references and a brief letter of research goals and philosophy to: Dr.
Paula Tallal, Center for Molecular and Behavioral Neuroscience, Rutgers University, 197 University Avenue, Newark, New Jersey, 07102. Phone: (201) 648-1080 x3200. Fax: (201) 648-1272. Email: tallal at axon.rutgers.edu. COGNITIVE PSYCHOLOGY, ASSISTANT PROFESSOR (TWO POSITIONS) The Department of Psychology at the Newark Campus of Rutgers University invites Ph.D. applications for one tenure track and one term (non-tenure track) Assistant Professor position to expand its program in Cognitive Experimental Psychology. One position is in the area of Attention and the second is in Social Cognition or Cognitive Development. The positions call for candidates with active research programs who are effective teachers at both the graduate and undergraduate levels. Candidates must be prepared to teach a variety of undergraduate courses. Send CV and three letters of recommendation to Professor Harold I. Siegel, Acting Chair, Department of Psychology-Cognitive Search, Rutgers University, Newark, NJ 07102. ----- End Included Message ----- From juergen at idsia.ch Wed Dec 6 04:39:11 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Wed, 6 Dec 95 10:39:11 +0100 Subject: Non-randomness is no panacea. Message-ID: <9512060939.AA02202@fava.idsia.ch> In response to David's response dated Mon, 4 Dec 95: I wrote ``Loosely speaking, in a universe whose history is compressible, we may expect to generalize well.''. To make this more precise, let us consider a very simple 1-bit universe --- suppose the problem is to extrapolate a sequence of symbols (bits, without loss of generality). We have already observed a bitstring s and would like to predict the next bit. Let si denote the event ``s is followed by symbol i'' for i in {0,1}. David is absolutely right to remind us that we need a prior before applying Bayes. And he is right to point out that only if we have information concerning the prior can we exploit it to generalize well.
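[Editorial note: in this 1-bit universe, the value of knowing the prior is easy to see with a toy two-hypothesis prior. The construction below is entirely my own illustration, not Juergen's: either the universe emits all zeros, or it emits fair coin flips. After observing k zeros, Bayes' rule gives the predictive probability that the next bit is zero.]

```python
def p_next_zero(k, p_regular=0.5):
    # prior: with probability p_regular the universe is the all-zeros
    # sequence; otherwise every bit is an independent fair coin flip.
    p_data_given_regular = 1.0       # all-zeros universe always emits 0^k
    p_data_given_random = 0.5 ** k   # coin-flip universe emits 0^k w.p. 2^-k
    evidence = (p_regular * p_data_given_regular
                + (1 - p_regular) * p_data_given_random)
    posterior_regular = p_regular * p_data_given_regular / evidence
    # predictive probability that the next bit is zero, mixing the
    # two hypotheses by their posterior weights
    return posterior_regular * 1.0 + (1 - posterior_regular) * 0.5

for k in (1, 5, 20):
    print(k, p_next_zero(k))
```

The predictive probability climbs toward 1 as the observed history grows more "regular" relative to the coin-flip alternative; with a prior that put no extra weight on the regular universe, no such climb would occur.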
In the context of the present discussion, however, an interesting point is: there is a special prior that is biased towards *arbitrary* compressibility/structure/regularity. Following Solomonoff/Levin/Chaitin/Li&Vitanyi, define P(s), the a priori probability of a bitstring s, as the probability of guessing a (halting) program that computes s on a universal Turing machine U. Here, the way of guessing is defined by the following procedure: initially, the input tape consists of a single square. Whenever the scanning head of the input tape shifts to the right, do: (1) Append a new square. (2) With probability 1/2 fill it with a 0; with probability 1/2 fill it with a 1. Bayes tells us P(s0|s) = P(s|s0)P(s0)/P(s) = P(s0)/P(s), since P(s|s0) = 1 (s0 extends s); likewise P(s1|s) = P(s1)/P(s). We are going to predict ``the next bit will be 0'' if P(s0) > P(s1), and vice versa. By the coding theorem (Levin 74, Chaitin 75), P(si) = O((1/2)^K(si)) for i in {0,1} (K(x) denotes the Kolmogorov complexity of x), so the continuation with lower Kolmogorov complexity will (in general) be more likely. If s is ``noisy'' then this will be reflected by its relatively high Kolmogorov complexity. I am not saying anything new here. I'd just like to point out that if you know nothing about your universe except that it is regular in some way, then P is of interest. Sadly, most possible universes are completely irregular and incompressible. But for the few (but infinitely many) that are not, P is a prior to consider (at least if we don't care about computing time and constant factors). Perhaps there are too many threads in the current discussion. I'll shut up for a while. Juergen Schmidhuber IDSIA From goldfarb at unb.ca Wed Dec 6 15:54:00 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Wed, 6 Dec 1995 16:54:00 -0400 (AST) Subject: Compressibility and Generalization In-Reply-To: <9512051750.AA00953@fava.idsia.ch> Message-ID: On Tue, 5 Dec 1995, Juergen Schmidhuber wrote: > ``compressibility of the history of a universe''.
> > There are a few compressible or ``regular'' universes, > however. To use ML terminology, some of them allow for > ``generalization by analogy''. Some of them allow for > ``generalization by chunking''. Some of them allow for > ``generalization by exploiting invariants''. Etc. It > would be nice to have a method that can generalize well > in *arbitrary* regular universes. For a proposal on how to formally capture the concept of an "arbitrary regular universe" for the purposes of inductive learning (and generalization), i.e. the concept of a "combinative" representation in a universe, see the two references below as well as the original two papers published in Pattern Recognition (and mentioned in each of the two references). The structure of objects in the universe was discussed on the INDUCTIVE list. It appears that the concept of a "symbolic" representation has to be formalized first (via the concept of a transformation system), and the fundamentally new concept of *inductive class structure*, not present in other ML models, becomes of critical importance. The issue of dynamic object representation, so conspicuously (and not surprisingly) absent from the ongoing (classical) "statistical" discussion of inductive learning, is also brought to the fore. 1. L. Goldfarb and S. Nigam, The unified learning paradigm: A foundation for AI, in V. Honavar and L. Uhr, eds., Artificial Intelligence and Neural Networks: Steps toward Principled Integration, Academic Press, 1994. 2. L. Goldfarb, J. Abela, V.C. Bhavsar, V.N. Kamat, Can a vector space based learning model discover inductive class generalization in a symbolic environment? Pattern Recognition Letters 16, 719-726, 1995.
-- Lev Goldfarb From N.Sharkey at dcs.shef.ac.uk Thu Dec 7 07:24:09 1995 From: N.Sharkey at dcs.shef.ac.uk (N.Sharkey@dcs.shef.ac.uk) Date: Thu, 7 Dec 95 12:24:09 GMT Subject: CALL FOR ROBOTICS PAPERS Message-ID: <9512071224.AA11298@entropy.dcs.shef.ac.uk> CALL FOR PAPERS ** LEARNING IN ROBOTS AND ANIMALS ** An AISB-96 two-day workshop University of Sussex, Brighton, UK: April, 1st & 2nd, 1996 Co-Sponsored by IEE Professional Group C4 (Artificial Intelligence) WORKSHOP ORGANISERS: Noel Sharkey (chair), University of Sheffield, UK. Gillian Hayes, University of Edinburgh, UK. Jan Heemskerk, University of Sheffield, UK. Tony Prescott, University of Sheffield, UK. PROGRAMME COMMITTEE: Dave Cliff, UK. Marco Dorigo, Italy. Frans Groen, Netherlands. John Hallam, UK. John Mayhew, UK. Martin Nillson, Sweden Claude Touzet, France Barbara Webb, UK. Uwe Zimmer, Germany. Maja Mataric, USA. For Registration Information: alisonw at cogs.susx.ac.uk In the last five years there has been an explosion of research on Neural Networks and Robotics from both a self-learning and an evolutionary perspective. Within this movement there is also a growing interest in natural adaptive systems as a source of ideas for the design of robots, while robots are beginning to be seen as an effective means of evaluating theories of animal learning and behaviour. A fascinating interchange of ideas has begun between a number of hitherto disparate areas of research and a shared science of adaptive autonomous agents is emerging. This two-day workshop proposes to bring together an international group to both present papers of their most recent research, and to discuss the direction of this emerging field. WORKSHOP FORMAT: The workshop will consist of half-hour presentations with at least 15 minutes being allowed for discussion at the end of each presentation. Short videos of mobile robot systems may be included in presentations. Proposals for robot demonstrations are also welcome. 
Please contact the workshop organisers if you are considering bringing a robot, as some local assistance can be arranged. The workshop format may change once the number of accepted papers is known; in particular, there may be some poster presentations. WORKSHOP CONTRIBUTIONS: Contributions are sought from researchers in any field with an interest in the issues outlined above. Areas of particular interest include the following: * Reinforcement, supervised, and imitation learning methods for autonomous robots * Evolutionary methods for robotics * The development of modular architectures and reusable representations * Computational models of animal learning with relevance to robots, robot control systems modelled on animal behaviour * Reviews or position papers on learning in autonomous agents Papers will ideally emphasise real-world problems, robot implementations, or show clear relevance to the understanding of learning in both natural and artificial systems. Papers should not exceed 5000 words in length. Please submit four hard copies to the Workshop Chair (address below) by 30th January, 1996. All papers will be refereed by the Workshop Committee and other specialists. Authors of accepted papers will be notified by 24th February, 1996. Final versions of accepted papers must be submitted by 10th March, 1996. A collated set of workshop papers will be distributed to workshop attendees. We are currently negotiating to publish the workshop proceedings as a book. SUBMISSIONS TO: Noel Sharkey Department of Computer Science Regent Court University of Sheffield S1 4DP, Sheffield, UK email: n.sharkey at dcs.sheffield.ac.uk For further information about AISB96 ftp ftp.cogs.susx.ac.uk login as Password: cd pub/aisb/aisb96 From mkearns at research.att.com Thu Dec 7 13:39:00 1995 From: mkearns at research.att.com (Michael J.
Kearns) Date: Thu, 7 Dec 95 13:39 EST Subject: COLT 96 Call for Papers, ASCII Message-ID: ______________________________________________________________________ CALL FOR PAPERS---COLT '96 Ninth Conference on Computational Learning Theory Desenzano del Garda, Italy June 28 -- July 1, 1996 ______________________________________________________________________ The Ninth Conference on Computational Learning Theory (COLT '96) will be held in the town of Desenzano del Garda, Italy, from Friday, June 28, through Monday, July 1, 1996. COLT '96 is sponsored by the Universita` degli Studi di Milano. We invite papers in all areas that relate directly to the analysis of learning algorithms and the theory of machine learning, including neural networks, statistics, statistical physics, Bayesian/MDL estimation, reinforcement learning, inductive inference, knowledge discovery in databases, robotics, and pattern recognition. We also encourage the submission of papers describing experimental results that are supported by theoretical analysis. ABSTRACT SUBMISSION. Authors should submit fifteen copies (preferably two-sided) of an extended abstract to: Michael Kearns --- COLT '96 AT&T Bell Laboratories, Room 2A-423 600 Mountain Avenue Murray Hill, New Jersey 07974-0636 Telephone(for overnight mail): (908) 582-4017 Abstracts must be RECEIVED by FRIDAY JANUARY 12, 1996. This deadline is firm. We are also allowing electronic submissions as an alternative to submitting hardcopy. Instructions for how to submit papers electronically can be obtained by sending email to colt96 at cs.cmu.edu with subject "help", or from our web site: http://www.cs.cmu.edu/~avrim/colt96.html which will also be used to provide other program-related information. Authors will be notified of acceptance or rejection on or before Friday, March 15, 1996. Final camera-ready papers will be due by Friday, April 5. 
Papers that have appeared in journals or other conferences, or that are being submitted to other conferences, are not appropriate for submission to COLT. An exception to this policy is that COLT and STOC have agreed that a paper can be submitted to both conferences, with the understanding that a paper will be automatically withdrawn from COLT if accepted to STOC. ABSTRACT FORMAT. The extended abstract should include a clear definition of the theoretical model used and a clear description of the results, as well as a discussion of their significance, including comparison to other work. Proofs or proof sketches should be included. If the abstract exceeds 10 pages, only the first 10 pages may be examined. A cover letter specifying the contact author and his or her email address should accompany the abstract. PROGRAM FORMAT. At the discretion of the program committee, the program may consist of both long and short talks, corresponding to longer and shorter papers in the proceedings. The short talks will also be coupled with a poster presentation. PROGRAM CHAIRS. Avrim Blum (Carnegie Mellon University) and Michael Kearns (AT&T Bell Laboratories). CONFERENCE AND LOCAL ARRANGEMENTS CHAIRS. Nicolo` Cesa-Bianchi (Universita` di Milano) and Giancarlo Mauri (Universita` di Milano). PROGRAM COMMITTEE. Martin Anthony (London School of Economics), Avrim Blum (Carnegie Mellon University), Bill Gasarch (University of Maryland), Lisa Hellerstein (Northwestern University), Robert Holte (University of Ottawa), Sanjay Jain (National University of Singapore), Michael Kearns (AT&T Bell Laboratories), Nick Littlestone (NEC Research Institute), Yishay Mansour (Tel Aviv University), Steve Omohundro (NEC Research Institute), Manfred Opper (University of Wuerzburg), Lenny Pitt (University of Illinois), Dana Ron (Massachusetts Institute of Technology), Rich Sutton (University of Massachusetts) COLT, ML, AND EUROCOLT. 
The Thirteenth International Conference on Machine Learning (ML '96) will be held right after COLT '96, on July 3--7 in Bari, Italy. In cooperation with COLT, the EuroCOLT conference will not be held in 1996. STUDENT TRAVEL. We anticipate some funds will be available to partially support travel by student authors. Details will be distributed as they become available. From hicks at cs.titech.ac.jp Thu Dec 7 19:49:53 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Fri, 8 Dec 1995 09:49:53 +0900 Subject: compressibility and generalization In-Reply-To: William Finnoff's message of Thu, 7 Dec 95 15:55:52 MST <9512072255.AA25329@predict.com> Message-ID: <199512080049.JAA10560@euclid.cs.titech.ac.jp> finnoff at predict.com (William Finnoff) wrote: >Reading some of the recent postings concerning NFL theorems, it appears >that there are still some misunderstandings about what they refer to in >the versions dealing with statistical inference. For example, Craig >Hicks writes: >> (paraphrase: I want to clarify the meaning of the following assertion) >> (A) cross-validation works as well as anti-cross >> validation (paraphrase: on average) finnoff at predict.com (William Finnoff) continued: >An example of this >would be the case of a two-by-two contingency table >where the inputs are, say, 0=patient received treatment A, >1=patient received treatment B, and values of the dependent variable >are 0=patient died within three months, or 1=patient still alive >after three months. ... Using the example given above, this corresponds >to cases where the training data contains no examples >of a patient receiving one of the treatments (for example, where >the training data only contains examples of patients >that have received treatment A). Since there is no data for treatment B, how can we use cross-validation? In this case statement (A) above is not wrong, but it is implicitly occurring within a context where there is no data to use for cross-validation.
If so, isn't it rather a trivial statement? Possibly misleading? finnoff at predict.com (William Finnoff) continued: >The NFL theorems state that in this case, unless there is some other prior >information available about the performance of treatment B in keeping patients >alive, all predictions are equivalent in their average expected performance. I certainly wouldn't expect cross-validation to work when it can't even be used. And I think it would work just as well as anti-cross-validation, whatever that is, where anti-cross-validation is also not being used. In fact, both would score `0', not only on average, but every time, since they are not being used. ---- After further study and reading postings to this list, my current understanding is that (A) merely means that for any single problem, cross-validation scores >= 0, in the sense that it will never be deceptive (never < 0) when taking the average across the ensemble of samplings. However, by taking a straight average over a certain infinite (and arguably universal) ensemble of problems we can obtain Expectation[cross validation] = 0, because in this ensemble the positive-scoring problems are an infinitely small proportion. This is exciting, because in our universe at the present time evidently Expectation[cross validation] > 0, which implies a non-uniform prior over the ensemble of problems. Or are we just choosing our problems unfairly? And if so, what algorithm are we using (or is using us) to choose them? Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan P.S. I do not claim to be clear on all the issues, or to be free from misunderstandings by any means. P.P.S. What is anti-cross-validation? From WALTSCH at vms.cis.pitt.edu Thu Dec 7 22:27:49 1995 From: WALTSCH at vms.cis.pitt.edu (WALTSCH@vms.cis.pitt.edu) Date: Thu, 07 Dec 1995 23:27:49 -0400 (EDT) Subject: Faculty position in Cognitive Neuroscience Univ.
of Pittsburgh Message-ID: <01HYJKVPQW36AM35MW@vms.cis.pitt.edu> ********Faculty Opening in Cognitive Neuroscience************* The Department of Psychology at the University of Pittsburgh seeks a faculty member at the assistant professor level who studies human cognitive neuroscience. The faculty member must have a strong empirical background, a program of research that brings together neuroscience and behavioral techniques, and an interest in graduate and undergraduate teaching in this area. Candidates are likely to become affiliated with the Center for the Neural Basis of Cognition, shared between the University of Pittsburgh and Carnegie Mellon University. For additional information, see WWW http://neurocog.lrdc.pitt.edu/search Applications should be sent to: Cognitive Neuroscience Search 455 Langley Hall Psychology Department University of Pittsburgh PGH PA 15260. Applications should include: 1. a statement of research and teaching interest 2. a CV 3. copies of selected publications 4. three letters of reference. Initial consideration will begin January 15, 1996, though applications arriving after that date may be considered. The University of Pittsburgh is an Equal Opportunity/Affirmative Action Employer. Women and minority candidates are especially encouraged to apply. From esann at dice.ucl.ac.be Fri Dec 8 12:39:48 1995 From: esann at dice.ucl.ac.be (esann@dice.ucl.ac.be) Date: Fri, 8 Dec 1995 18:39:48 +0100 Subject: ESANN extended deadline Message-ID: <199512081737.SAA18067@ns1.dice.ucl.ac.be> Dear Colleagues, The deadline to submit papers to the ESANN'96 conference (the 4th European Symposium on Artificial Neural Networks, which will be held in Bruges, Belgium, on April 24-26, 1996) was December 8th, 1995 (today!) as announced in the call for papers.
However, as you know, there are important strikes in France and in other countries, and many of you have had problems meeting this deadline because of the post office strike (it is even worse because of the airport strike in Belgium...). So we are pleased to announce that we will accept submission of papers until Friday, December 15th, 1995 (so next Friday!). Please however ensure that the printed copies (no e-mail or fax please) will reach the conference secretariat (see address below), together with the required information (as described in the call for papers), before this date. Please use private mail delivery services if necessary, and don't forget that in most countries Chronopost is NOT a private mail service (for example, because of the strike, the French Chronopost service was not working this week...), while DHL, TNT Mailfast and other companies are private services, and so could be more efficient in the next few days... If you still have problems meeting the new deadline, please contact me personally at the following e-mail address: esann at dice.ucl.ac.be and we will try to arrange another way to transfer your paper. Please feel free to contact me if you need any other information about the submission of papers.
Sincerely yours, Michel Verleysen _____________________________ D facto publications - conference services 45 rue Masui 1210 Brussels Belgium tel: +32 2 245 43 63 fax: +32 2 245 46 94 _____________________________ From giles at research.nj.nec.com Fri Dec 8 14:18:39 1995 From: giles at research.nj.nec.com (Lee Giles) Date: Fri, 8 Dec 95 14:18:39 EST Subject: reprint available Message-ID: <9512081918.AA20599@alta> The following conference paper, published in the 2nd International IEEE Conference on "Massively Parallel Processing Using Optical Interconnections," October 1995, is now available via the NEC Research Institute archive: ____________________________________________________________________________________ "Predictive Control of Opto-Electronic Reconfigurable Interconnection Networks Using Neural Networks" Majd F. Sakr[1,2], Steven P. Levitan[2], C. Lee Giles[1,3], Bill G. Horne[1], Marco Maggini[4], Donald M. Chiarulli[5] [1] NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 [2] Electrical Engineering Department, U. of Pittsburgh, Pittsburgh, PA 15261 [3] UMIACS, U. of Maryland, College Park, MD 20742 [4] Universita` di Firenze, Dipartimento di Sistemi e Informatica, 50139 Firenze, Italy [5] Computer Science Department, U. of Pittsburgh, Pittsburgh, PA 15260 Abstract Opto-electronic reconfigurable interconnection networks are limited by significant control latency when used in large multiprocessor systems. This latency is the time required to analyze the current traffic and reconfigure the network to establish the required paths. The goal of latency hiding is to minimize the effect of this control overhead. In this paper, we introduce a technique that performs latency hiding by learning the patterns of communication traffic and using that information to anticipate the need for communication paths. Hence, the network provides the required communication paths before a request for a path is made.
In this study, the communication patterns (memory accesses) of a parallel program are used as input to a time delay neural network (TDNN) to perform on-line training and prediction. These predicted communication patterns are used by the interconnection network controller that provides routes for the memory requests. Based on our experiments, the neural network was able to learn highly repetitive communication patterns, and was thus able to predict the allocation of communication paths, resulting in a reduction of communication latency. ------------------------------------------------------------------------------ http://www.neci.nj.nec.com/homepages/giles.html ftp://external.nj.nec.com/pub/giles/papers/MPPOI.95.ps.Z ------------------------------------------------------------------------------ -- C. Lee Giles / Computer Sciences / NEC Research Institute / 4 Independence Way / Princeton, NJ 08540, USA / 609-951-2642 / Fax 2482 http://www.neci.nj.nec.com/homepages/giles.html == From mablume at sdcc10.ucsd.edu Fri Dec 8 17:03:18 1995 From: mablume at sdcc10.ucsd.edu (Matthias Blume) Date: Fri, 8 Dec 1995 14:03:18 -0800 (PST) Subject: Fuzzy ART architecture papers online Message-ID: <199512082203.OAA06153@e3329-4.ucsd.edu> Dear Connectionists, Two papers describing a simple and efficient architecture for Fuzzy ART and Fuzzy ARTMAP are now available online. (Sorry, hardcopies are not available.) ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, An efficient mapping of Fuzzy ART onto a neural architecture (5 pages), submitted to Neural Networks. A novel mapping of the Fuzzy ART algorithm onto a neural network architecture is described. The architecture does not utilize bi-directional synapses, weight transport, or weight duplication, and requires one fewer layer of processing elements than the architecture originally proposed by Carpenter, Grossberg, & Rosen (1991). 
In the new architecture, execution of the algorithm takes constant time per input vector regardless of the relationship between the input and existing templates, and several control signals are eliminated. This mapping facilitates hardware implementation of Fuzzy ART and furthermore serves as a tool for envisioning and understanding the algorithm. Keywords: Fuzzy ART, Fuzzy ARTMAP, parallel hardware, neural architecture. ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.fam_arch.ps.Z http://icse1.ucsd.edu/~mablume/nnletter.ps ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, Optoelectronic Fuzzy ARTMAP processor, Optical Computing, Vol. 10, 1995 OSA Technical Digest Series (Optical Society of America, Washington, DC, 1995), p. 213-215, March 1995. The Fuzzy ARTMAP algorithm can perform well even with weights truncated to 4 bits during training. Furthermore, only the weights corresponding to one processing element are updated after each training sample. Finally, it converges rapidly and relatively uniformly with little dependence on the particular choice of adjustable parameter values and initial state. These characteristics are particularly advantageous for parallel optoelectronic implementations. We map Fuzzy ARTMAP onto an architecture which satisfies the constraints of the hardware, and suggest an implementation which is an appropriate combination of optical and electronic technology. The proposed mapping of the algorithm onto a neural architecture is efficient, requiring only an input layer and one processing layer per fuzzy ART module, and requiring neither weight transport nor multiple copies of weights. The proposed optoelectronic system is simple, yet versatile, and relies on proven components. Keywords: Parallel optoelectronic hardware, Fuzzy ART, neural architecture. 
ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.oe_fam.ps.Z http://icse1.ucsd.edu/~mablume/OSA95.ps ------------------------------------------------------------------------------ - Matthias Blume ECE department, UCSD matthias at ucsd.edu http://icse1.ucsd.edu/~mablume From mpp at watson.ibm.com Fri Dec 8 19:27:29 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Fri, 8 Dec 1995 19:27:29 -0500 (EST) Subject: NFL Summary Message-ID: <9512090027.AA26165@austen.watson.ibm.com> Hi Everyone, There has been a lot of confusion regarding the "No Free Lunch" theorems. Below, I try to summarize what I feel to be the key points. NFL in a Nutshell: ------------------ If you make no assumptions about the target function then on average, all learning algorithms will have the same generalization performance. Apparent Contradiction and Resolution: -------------------------------------- Contradiction: Lots of theoretical results regarding generalization claim to make no assumptions about the target function. Resolution: These theoretical results DO make assumptions (which may or may not be explicit) regarding the target. Importance of NFL: ------------------ The NFL result in and of itself is not terribly interesting because its assumption (that we make no assumptions) is NEVER true. What makes NFL important is that it emphasizes in a very striking way that it is the ASSUMPTIONS that we make about our learning domains that MAKE ALL THE DIFFERENCE. Therefore, I see NFL *NOT* as a criticism of theoretical generalization results, but rather as a call to examine the assumptions underlying these results, because it is there that we can potentially learn the most about machine learning. Examples of Unstated Assumptions: --------------------------------- In practice, there are numerous assumptions that we as a community usually make when we attempt to learn a task using our favorite algorithm. Below, I list just a few obvious ones. 1) The training and testing data are IID.
2) The data distribution is "smooth" (i.e. "near" data points are in general more similar than "far" data points). This can also be interpreted as some differentiability conditions. 3) NN's approximate real-world functions reasonably well. 4) Starting with small initial weights is good. 5) Overfitting is bad - early stopping is good. 6) Gaussian error models are the best thing since machine sliced bread. REALLY INTERESTING STUFF: ------------------------- I think that the NFL results point towards what I feel are extremely interesting research topics: Exactly what are the assumptions that certain theoretical results require? Exactly how do these assumptions affect generalization? Which assumptions are necessary/sufficient? How do different assumptions compare? Can we identify a set of assumptions that are equivalent to the assumption that CV model selection improves generalization? Can we do the same for early stopping? Bagging? (You can be damn sure I can do this for averaging... :-) Etc, etc, ... Caveat: ------- All of the above is conditioned on the assumption that David Wolpert did his math correctly when deriving the NFL theorems... :-) I hope all of this helps clear things up. Comments? Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From jlm at crab.psy.cmu.edu Sat Dec 9 17:35:01 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Sat, 9 Dec 95 17:35:01 EST Subject: TR Announcement Message-ID: <9512092235.AA21814@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available both electronically from our own FTP server and in hard copy form. Instructions for obtaining copies may be found at the end of this post.
======================================================================== Stochastic Interactive Processing, Channel Separability, and Optimal Perceptual Inference: An Examination of Morton's Law Javier R. Movellan & James L. McClelland Technical Report PDP.CNS.95.4 December 1995 In this paper we examine a regularity found in human perception, called Morton's law, in which stimulus and context have independent influences on perception. This regularity has been used in the past to argue that perception is a feed-forward, non-interactive process. Building on earlier work by McClelland (Cognitive Psychology, 1991) we illustrate how Morton's law may emerge from stochastic interactions between simple processing units. To this end we consider the properties of interactive diffusion networks, the continuous stochastic limit of standard artificial neural models. If, as we believe, human information processing involves using noisy processing elements to process potentially noisy inputs, such models may ultimately serve as foundations for a theory of human information processing. We show that Morton's law emerges in recurrent diffusion networks when the units are organized into separable channels; feed-forward processing is not a necessary condition for Morton's law to hold. Failures to exhibit Morton's law provide evidence that the information channels are not separable. This result can be used to analyze cognitive models as well as actual brain structures. Finally, we illustrate how diffusion networks can be organized to implement optimal Bayesian perceptual inference. ======================================================================= Retrieval information for pdp.cns TRs: unix> ftp 128.2.248.152 # hydra.psy.cmu.edu Name: anonymous Password: ftp> cd pub/pdp.cns ftp> binary ftp> get pdp.cns.95.4.ps.Z # gets this tr ftp> quit unix> zcat pdp.cns.95.4.ps.Z | lpr # or however you print postscript NOTE: The compressed file is 255910 bytes long.
Uncompressed, the file is 727359 bytes long. The printed version is 66 total pages long. For those who do not have FTP access, physical copies can be requested from Barbara Dorney . For a list of available PDP.CNS Technical Reports: > get README For the titles and abstracts: > get ABSTRACTS From hicks at cs.titech.ac.jp Sun Dec 10 09:24:29 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Sun, 10 Dec 1995 23:24:29 +0900 Subject: NFL Summary In-Reply-To: Michael Perrone's message of Fri, 8 Dec 1995 19:27:29 -0500 (EST) <9512090027.AA26165@austen.watson.ibm.com> Message-ID: <199512101424.XAA13664@euclid.cs.titech.ac.jp> Michael Perrone writes: > I think that the NFL results point towards what I feel are extremely > interesting research topics: > ... > Can we identify a set of assumptions that are equivalent to the > assumption that CV model selection improves generalization? CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B. This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. (A quick note: Sometimes it is advocated that a complexity parameter be set by splitting the data set into training and testing, and using CV. Then with the complexity parameter fixed the whole data set can be used to train the other parameters. Behind this is an ASSUMPTION about the independence of the complexity from the other parameters. Of course it often works in practice, but it violates the principle in the above paragraph, so I do not count this as real CV here.) Two prerequisites exist to obtain a GAIN with CV: 1) The objective function must be "compressible". I.e., it cannot be noise.
2) We must have a model which can recognize the structure in the data. This structure might be quite hard to see, as in chaotic signals. I think NFL says that on average CV will not obtain GAINful results, because the chance that a randomly selected problem and a randomly selected algorithm will hit it off is vanishingly small. (Or even any fixed problem and a randomly selected algorithm.) But I think it tells us something more important as well. It tells us that not using CV means we are always implicitly trusting our a priori knowledge. Any reasonable learning algorithm can always predict the training data, or a "smoothed" version of it. But because of the NFL theorem, this, over the ensemble of all algorithms and problems, means nothing. On average there will be no improvement in the off training set error. Fortunately, CV will report this fact by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set datas; CV may be deceptive for a single off training set data.) Thus, we shouldn't think we can do away with CV unless we admit to having great faith in our prior. Going back to NFL, I think it poses another very interesting problem: Supposing we have "a foot in the door". That is, an algorithm which makes some sense of the data by showing some degree of prediction capability. Can we always use this prediction ability to gain better prediction ability? Is there some kind of ability to perform something like steepest descent over the space of algorithms, ONCE we are started on a slope? Is there a provable snowball effect? I think NFL reminds us that we are already rolling down the hill, and we shouldn't think otherwise. 
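[Editorial note: the "on average over all algorithms and problems" claim above can be checked directly on a toy domain. The following sketch is an illustration added here, not part of the original post; the domain size, training set, and learner names are arbitrary choices. Enumerating every boolean target function on a five-point input space shows that a fixed learner and its "anti" counterpart both score exactly chance off the training set.]

```python
from itertools import product

# Toy NFL check: averaged over ALL boolean targets on a small domain,
# any fixed learner's off-training-set accuracy is exactly chance.

domain = range(5)
train_x = [0, 1, 2]                                 # training inputs
test_x = [x for x in domain if x not in train_x]    # off-training-set inputs

def majority_learner(data):
    # predict the majority training label at every off-training-set point
    ones = sum(y for _, y in data)
    return 1 if 2 * ones >= len(data) else 0

def anti_majority_learner(data):
    # deliberately perverse learner: predict the opposite label
    return 1 - majority_learner(data)

def avg_ots_accuracy(learner):
    # average off-training-set accuracy over all 2^5 boolean targets
    total, count = 0.0, 0
    for target in product([0, 1], repeat=len(domain)):
        data = [(x, target[x]) for x in train_x]
        pred = learner(data)
        correct = sum(pred == target[x] for x in test_x)
        total += correct / len(test_x)
        count += 1
    return total / count

print(avg_ots_accuracy(majority_learner))       # 0.5
print(avg_ots_accuracy(anti_majority_learner))  # 0.5
```

Both learners land at exactly 0.5: with a uniform prior over targets, the off-training-set labels are independent of the training data, so no selection of learner can help — which is the NFL situation Hicks describes.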
Craig Hicks Tokyo Institute of Technology From goldfarb at unb.ca Sun Dec 10 10:52:29 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Sun, 10 Dec 1995 11:52:29 -0400 (AST) Subject: NFL Summary In-Reply-To: <9512090027.AA26165@austen.watson.ibm.com> Message-ID: On Fri, 8 Dec 1995, Michael Perrone wrote: > NFL in a Nutshell: > ------------------ > If you make no assumptions about the target function [specifically, about the axiomatic structure of the sample space and the inductive generalization, i.e. which ones are the most general for the purpose] Strangely as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. Objects in the universe do have structure, especially when they have to be represented, as we have learned from the data types in computer science: TO REPRESENT AN OBJECT IS TO MAKE SOME ASSUMPTIONS ABOUT THE OPERATIONS RELATED TO ITS MANIPULATION. Cheers, Lev Goldfarb From XIAODONG at rivendell.otago.ac.nz Sun Dec 10 20:46:21 1995 From: XIAODONG at rivendell.otago.ac.nz (Xiaodong Li, Otago University, New Zealand) Date: Mon, 11 Dec 1995 14:46:21 +1300 Subject: Paper available "Connectionist Model Based on an Optical Thin-Film Model" Message-ID: <01HYONVDU5GYLBVSXM@rivendell.otago.ac.nz> FTP-host: archive.cis.ohio-state.edu FTP-filename:/pub/neuroprose/xli.thinfilm.ps.Z The file xli.thinfilm.ps.Z is now available for ftp from Neuroprose repository. Connectionist Learning Using an Optical Thin-Film Model (4 pages) Martin Purvis and Xiaodong Li Computer and Information Science University of Otago Dunedin, New Zealand ABSTRACT: An alternative connectionist architecture to the one based on the neuroanatomy of biological organisms is described. The proposed architecture is based on an optical thin-film multilayer model, with the thicknesses of thin-film layers serving as adjustable 'weights' for the computation. 
Inputs are encoded into the corresponding refractive indices of individual thin-film layers, while the outputs are typically measured by the overall reflection coefficients off the thin-film layers, at different wavelengths. The nature of the model and some example calculations (a pattern recognition task and classification of the iris data set) that exhibit behaviour typical of conventional connectionist architectures are described. This model has also been used in solving the XOR and 16 four-bit parity problems, and it has demonstrated comparable performance to that of a conventional feed-forward neural network model using Back-propagation learning. This paper is also available in the proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems (ANNES'95), IEEE Computer Society Press, Los Alamitos, California, 1995, pp. 63-66. Comments are greatly appreciated. -- Xiaodong Li Email: Xiaodong at otago.ac.nz Http: http://divcom.otago.ac.nz:800/COM/INFOSCI/SECML/xdli/xiao.htm (Postscript file of this paper is also available here at my homepage) From prechelt at ira.uka.de Mon Dec 11 07:11:32 1995 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Mon, 11 Dec 1995 13:11:32 +0100 Subject: NN Benchmarking WWW homepage Message-ID: <"iraun1.ira.487:11.12.95.12.12.22"@ira.uka.de> The homepage of the very successful NIPS*95 workshop on benchmarking has now been converted into a repository for information about benchmarking issues: Status quo, methodology, facilities, and related info. I kindly ask everybody who has additional information that should be on the page (in particular sources or potential sources of learning data of all kinds) to submit that information to me. Other comments are also welcome. The URL is http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html The page is also still reachable over the benchmarking workshop link on the NIPS*95 homepage. Below is a textual version of the page.
Lutz Lutz Prechelt (http://wwwipd.ira.uka.de/~prechelt/) | Whenever you Institut f. Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get (Phone: +49/721/608-4068, FAX: +49/721/694092) | less simple. =============================================== Benchmarking of learning algorithms information repository page Abstract: Proper benchmarking of (neural network and other) learning architectures is a prerequisite for orderly progress in this field. In many published papers deficiencies can be observed in the benchmarking that is performed. A workshop about NN benchmarking at NIPS*95 addressed the status quo of benchmarking, common errors and how to avoid them, currently existing benchmark collections, and, most prominently, a new benchmarking facility including a results database. This page contains pointers to written versions or slides of most of the talks given at the workshop plus some related material. The page is intended to be a repository for such information to be used as a reference by researchers in the field. Note that most links lead to Postscript documents. Please send any additions or corrections you might have to Lutz Prechelt (prechelt at ira.uka.de). Workshop Chairs: Thomas G. Dietterich, Geoffrey Hinton, Wolfgang Maass, Lutz Prechelt [communicating chair], Terry Sejnowski Assessment of the status quo: * Lutz Prechelt. A quantitative study of current benchmarking practices. A quantitative survey of 400 journal articles of 1993 and 1994 on NN algorithms. Most articles used far too few problems during benchmarking. * Arthur Flexer. Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice. Argues that what is reported about the benchmarks, and how it is reported, is insufficient. Methodology: * Tom Dietterich.
Experimental Methodology Benchmarking types, correct statistical testing, synthetic versus real-world data, understanding via algorithm mutation or data mutation, data generators. * Lutz Prechelt. Some notes on neural learning algorithm benchmarking. A few general remarks about volume, validity, reproducibility, and comparability of benchmarking; DOs and DON'Ts. * Brian Ripley. What can we learn from the study of the design of experiments? (Only two slides, though). * Brian Ripley. Statistical Ideas for Selecting Network Architectures. (Also somewhat related to benchmarking.) Benchmarking facilities: * Previously available NN benchmarking data collections CMU nnbench, UCI machine learning databases archive, Proben1, StatLog data, ELENA data. Advantages of these: UCI is large and growing and popular, StatLog has the largest and most orderly collection of results available (in a book, though), and Proben1 is easiest to use and best supports reproducible experiments. ELENA and nnbench have no particular advantages. Disadvantages: UCI and Proben1 have too few and too unstructured results available, Proben1 is also inflexible and small, StatLog is partially confidential and neither data nor results collection are growing. * Carl Rasmussen and Geoffrey Hinton. DELVE: A thoroughly designed benchmark collection A proposal of data, terminology, and procedures and a facility for the collection of benchmarking results. This is the newly proposed standard for benchmarking NN (and other) learning algorithms. DELVE is currently still under construction at the University of Toronto. Other sources of data: (Thanks to Nici Schraudolph) There is a large amount of game data about the board game Go available on the net. One starting point is here. Others are the Go game database project, and the Go game server. The database holds several hundred thousand games of Go and could for instance be used for advanced reinforcement learning projects.
Last correction: 1995/12/11 Please send additions and corrections to Lutz Prechelt, prechelt at ira.uka.de. To NIPS homepage. To original homepage of this workshop. From mpp at watson.ibm.com Mon Dec 11 08:42:59 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Mon, 11 Dec 1995 08:42:59 -0500 (EST) Subject: compressibility and generalization In-Reply-To: <199512080049.JAA10560@euclid.cs.titech.ac.jp> from "hicks@cs.titech.ac.jp" at Dec 8, 95 09:49:53 am Message-ID: <9512111342.AA25646@austen.watson.ibm.com> [hicks at cs.titech.ac.jp wrote:] > PSS. What is anti-cross validation? Suppose we are given a set of functions and a crossvalidation data set. The CV and Anti-CV algorithms are as follows: CV: Choose the function with the best performance on the CV set. Anti-CV: Choose the function with the worst performance on the CV set. (And for this year's NIPS motif: Anti-EM: Dorothy? Dorothy? :-) Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From hicks at cs.titech.ac.jp Mon Dec 11 20:01:05 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 12 Dec 1995 10:01:05 +0900 Subject: compressibility and generalization In-Reply-To: "Michael Perrone"'s message of Mon, 11 Dec 1995 08:42:59 -0500 (EST) <9512111342.AA25646@austen.watson.ibm.com> Message-ID: <199512120101.KAA16136@euclid.cs.titech.ac.jp> "Michael Perrone" wrote: >[hicks at cs.titech.ac.jp wrote:] >> PSS. What is anti-cross validation? >Suppose we are given a set of functions and a crossvalidation data set. >The CV and Anti-CV algorithms are as follows: > CV: Choose the function with the best performance on the CV set. 
>Anti-CV: Choose the function with the worst performance on the CV set. case 1: * Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* In this case both Anti-CV and CV provide (ON AVERAGE) equal improvement in prediction ability: none. For that matter so will ANY method of selection. Moreover, if we plot a graph of the number of data used for training vs. the estimated error (using the residual data), we will (ON AVERAGE) see no decrease in estimated error. Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". case 2: * The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function. This is different from the ensemble and distribution of target functions, which is a much bigger question. We already know much about the ensemble of samples from a fixed target function. I am not avoiding the issue of the ensemble or distribution of target functions, but merely showing that we have 2 general cases, and that in both of them CV is never WORSE than anti-CV. It follows that whatever the distribution of targets is, CV is never worse (ON AVERAGE) than anti-CV. I don't believe this contradicts NFL in any way. It just clarifies the role that CV can play. Learning and monitoring prediction error go hand in hand. This is even more true for cases when the underlying function may be changing and the data has the form of an infinite stream.
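[Editorial note: case 2 above can be illustrated with a small simulation. The sketch below is an editorial addition, not part of the original post; the target, candidate slopes, noise level, and sample sizes are arbitrary choices. For a fixed compressible target, selection by CV beats anti-CV averaged over random samplings of the validation set.]

```python
import random

random.seed(0)

def target(x):
    # fixed, compressible target (Hicks's case 2)
    return 2.0 * x

# candidate models f_a(x) = a*x; a=2 matches the target
candidates = [lambda x, a=a: a * x for a in (0.0, 1.0, 2.0, 3.0)]

def val_error(f, xs):
    # squared error against noisy observations of the target
    return sum((f(x) - (target(x) + random.gauss(0, 0.5))) ** 2
               for x in xs) / len(xs)

def select(anti):
    # one random sampling of prediction ability: CV picks the best
    # validation score, anti-CV the worst
    xs = [random.uniform(-1, 1) for _ in range(10)]
    scored = [(val_error(f, xs), f) for f in candidates]
    return (max if anti else min)(scored, key=lambda t: t[0])[1]

def true_error(f, n=200):
    # noise-free generalization error of the selected model
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return sum((f(x) - target(x)) ** 2 for x in xs) / n

trials = 200
cv_err  = sum(true_error(select(anti=False)) for _ in range(trials)) / trials
acv_err = sum(true_error(select(anti=True))  for _ in range(trials)) / trials
print(cv_err < acv_err)  # CV is not worse on average for this structured target
```

Anti-CV ends up preferring the most anticorrelated slope almost every time, so its average true error is far larger; under case 1 (a pure-noise target) the two selectors would instead tie on average.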
Craig Hicks Tokyo Institute of Technology From GIOIELLO at cres.it Mon Dec 11 19:13:43 1995 From: GIOIELLO at cres.it (GIOIELLO) Date: Tue, 12 Dec 1995 01:13:43 +0100 Subject: A neural net based OCR demo for both Windows/DOS and Mac OS is available Message-ID: <01HYP9T0BSPU934ROD@cres.it> Dear Netters, An OCR demo for Mac OS is available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-demo.cpt.hqx A Windows and DOS version is also available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-Win.zip This latter version also offers a richer set of capabilities. The OCR is based on a three-layer MLP. Conjugate gradient descent was used to train the net. Training and test sets were those of NIST. The related papers can be found at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/papers/handwritten Several VLSI architectures to implement the OCR device using a digital implementation of the proposed MLP are also described in the papers. An overview of the activities we carry on can be found at the following URL: http://wwwcsai.diepa.unipa.it/research/projects/vlsinn/handcare/handcare.html Best Regards, Giuseppe A. M. Gioiello E-Mail: gioiello at diepa.unipa.it URL: http://wwwcsai.diepa.unipa.it/people/doctors/gioiello/gioiello.html From ernst at kuk.klab.caltech.edu Tue Dec 12 12:02:22 1995 From: ernst at kuk.klab.caltech.edu (Ernst Niebur) Date: 12 Dec 1995 17:02:22 GMT Subject: Training opportunities in Computational Neuroscience at Johns Hopkins University Message-ID: The Zanvyl Krieger Mind/Brain Institute at Johns Hopkins University is an interdisciplinary research center devoted to the investigation of the neural mechanisms of mental function and particularly to the mechanisms of perception: How is complex information represented and processed in the brain, how is it stored and retrieved, and which brain centers are critical for these operations?
The Institute intends to significantly enhance its research program in Computational Neuroscience and encourages students with interest in this domain to apply for the graduate program in the Neuroscience department. Research opportunities exist in all of the laboratories of the Institute. Interdisciplinary projects, involving the student in more than one laboratory, are particularly encouraged. At present, MBI faculty include (listed with primary field of interest and methodology used): C. Ed Connor, PhD: Visual selective attention (electrophysiology in the awake behaving monkey). Stewart Hendry, PhD: Organization and plasticity of mammalian cerebral cortex (primate neuroanatomy). Steve S. Hsiao, PhD: Neurophysiology of tactile perception (electrophysiology in the awake behaving monkey). Kenneth O. Johnson, PhD: Neurophysiology of the somatosensory system (electrophysiology in the awake behaving monkey). Guy McKhann, MD (Director of MBI): Cognitive and neurologic outcomes after cardiac surgery; immunologic attack on peripheral motor axonal membranes in the human and experimental animal (neurology). Ernst Niebur, PhD: Theoretical Neuroscience (computational and mathematical modeling). Gian F Poggio, PhD: Analysis of Stereopsis and Texture (electrophysiology in the awake behaving monkey). Michael A. Steinmetz, PhD: Neurophysiological mechanisms in visual-spatial perception (electrophysiology in the awake behaving monkey). Ruediger von der Heydt, PhD: Neural mechanisms of visual perception (electrophysiology in the awake behaving monkey). Additional research opportunities exist in collaborative work with faculty in the Psychology Department (located next door to the Mind/Brain Institute), in particular with Drs. Howard Egeth (attention, perception, cognition), Michael Rudd (computational vision, psychophysics), Trisha Van Zandt (mathematical modelling, neural networks and memory), and Steven Yantis (visual perception, attention, mathematical modeling). 
All students accepted to the PhD program of the Neuroscience department receive full tuition remission plus a stipend at or above the National Institutes of Health predoctoral level. The Mind/Brain Institute is located on the very attractive Homewood campus in Northern Baltimore. Applicants should have a B.S. or B.A. with a major in any of the biological or physical sciences. Applicants are required to take the Graduate Record Examination (GRE), both the aptitude tests and an advanced test, or the Medical College Admission Test. Further information on the admission procedure can be obtained from the Department of Neuroscience: Director of Graduate Studies Neuroscience Training Program Department of Neuroscience The Johns Hopkins University School of Medicine 725 Wolfe Street Baltimore, MD 21205 Completed applications (including three letters of recommendation and either GRE scores or Medical College Admission Test scores) must be _received_ by January 1, 1996 at the above address. Candidates for whom this is impossible, or those who need additional information, should immediately contact Prof. Ernst Niebur The Zanvyl Krieger Mind/Brain Institute Johns Hopkins University 3400 N. Charles Street Baltimore, MD 21218 niebur at jhu.edu -- Ernst Niebur Krieger Mind/Brain Institute Asst. Prof. of Neuroscience Johns Hopkins University niebur at jhu.edu 3400 N. Charles Street (410)516-8643, -8640 (secr), -8648 (fax) Baltimore, MD 21218 From dhw at santafe.edu Tue Dec 12 17:25:06 1995 From: dhw at santafe.edu (David Wolpert) Date: Tue, 12 Dec 95 15:25:06 MST Subject: The last of a dying thread Message-ID: <9512122225.AA00709@sfi.santafe.edu> Some comments on the NFL thread. Huaiyu Zhu writes >>> 2. The *mere existence* of structure guarantees a (not uniformly-random) algorithm as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good. >>> This is a crucial point. 
It also seems to be one lost on many of the contributors to this thread, even those subsequent to Zhu's posting. Please note in particular that the knowledge that "the universe is highly compressible" can NOT, by itself, be used to circumvent NFL. I can only plead again: Those who are interested in this issue should look at the papers directly, so they have at least passing familiarity with the subject before discussing it. :-) ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z. Craig Hicks then writes: >>> However, I interpret the assertion that anti-cross validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should. >>> That's part of the power of the NFL theorems - they prove that Hicks' intuition, an intuition many people share, is in fact wrong. >>> I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). >>> Nope. The central limit theorem is not directly germane. See all the previous discussion on NFL and Vapnik. >>>> CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B.
This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. Fortunately, CV will report this (failure to generalize) by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set datas; CV may be deceptive for a single off training set data.) >>> This is wrong (or at best misleading). Please read the NFL papers. In fact, if the head-to-head minimax hypothesis concerning xvalidation presented in those papers is correct, xvalidation is wrong more often than it is right. In which case CV is "deceptive" more often (!!!) than not. Lev Goldfarb wrote >>> Strangely as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. >>> It may help if people read some of the many papers (Cox, de Finetti, Erickson and Smith, etc., etc.) that prove that the only consistent way of dealing with uncertainty is via probability theory. In other words, there is nothing *but* statistics, in the real world. (Perhaps occurring in prior knowledge that you're looking for a group, but statistics nonetheless.) David Wolpert From lemm at LORENTZ.UNI-MUENSTER.DE Wed Dec 13 09:46:52 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Wed, 13 Dec 1995 15:46:52 +0100 Subject: NFL and practice Message-ID: <9512131446.AA13879@xtp141.uni-muenster.de> Some remarks on Craig Hicks' arguments on crossvalidation and NFL in general from my point of view: One may discuss NFL for theoretical reasons, but the conditions under which NFL-Theorems hold are not those which are normally met in practice. 1.) In short, NFL assumes that data, i.e.
information of the form y_i=f(x_i), do not contain information about function values on a non-overlapping test set. This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... (with respect to Craig's two cases this average would include a third case: target and model are anticorrelated so anticrossvalidation works better) and "vertical" likelihoods. So, in an NFL setting data never say anything about function values for new arguments. This seems rather trivial under this assumption and one has to ask how natural such an NFL situation is. 2.) Information of the form y_i=f(x_i) is rather special and not what we normally have. There is much information which is not of this "single sharp data" type. (See examples below.) There is absolutely no reason why information which depends on more than one f(x_i) should not be incorporated. (This can be done using nonuniform priors or in a way more symmetrical to "sharp data".) NFL just describes the situation in which we don't have any such information but much of the (then quite useless) "sharp data". But these sharp data are no less (maybe more) obscure than other forms of information. Information which is not of this "single sharp data" form but includes many or all f(x_i) to produce one answer normally induces correlations between target and generalizer if incorporated into the generalizer. At the same time there is no real off training set anymore! Examples: 3) Information such as symmetries (even if only approximate), maxima, Fourier components (and much, much more ...) involves more than one f(x_i). Fourier components, for example, can be seen as sharp data but for different basis vectors, i.e. asking for momentum instead of location. This shows again that the definition of "sharp data" corresponds to choosing a "basis of questions" and is not a natural entity!!! 4) Real measurements (especially of continuous variables) normally also do NOT have the form y_i=f(x_i)!
They mostly perform some averaging over f(x_i) or at least they have some noise on the x_i (as small as you like, but present). In the latter case of "sharp" noise, posing the same question several times also gives you an average of several (nearby) y with different x_i of the underlying true function. In both cases the averaging is equivalent to regularization for the "effective" function which we can observe!!! This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. There is no need to see it just as a subjective prior! (The same could be said on a quantum-mechanical level, but that's another story.) It follows that NFL results do NOT hold for the "effective" functions in such situations, even if assuming NFL for the underlying true functions. 5.) NFL again: Averaging or noise in the input space of the x_i requires a probability distribution in that space which can be defined independently of a specific function. Noise means that x_i is a random variable dependent on the actual question z_i, i.e. p(actual argument = x_i | question=z_i), and it is f(z_i) which we can observe. If you don't accept a given p(x_i|z_i), I am sure you can average over "all possible" such relations with unrestricted "uniform" priors to find that it is impossible to obtain any information about any function without assuming a priori that you know something about what you are asking. This could be seen as another NFL-Theorem for questions: You do not even get information about a single function value if you don't know (assume, define) a priori what you are asking! 6.) With respect to the underlying "true" function, off-training set error itself, an important concept for NFL, is in general no longer a measurable quantity if input noise or averaging is present!! (For simplicity let's assume that noise or averaging includes all questions x_i. 
Then in the case of noise you only have a probability for the x_i to belong to the "true" training set and averaging includes all questions x_i.) So for the "true" functions there remains nothing for NFL to say anything about, and for the "effective" functions NFL is not valid! To conclude: In many interesting cases "effective" function values contain information about other function values and NFL does not hold! The very special handling of "sharp data" in comparison to other information must be discussed in many more learning theories. Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From wray at ptolemy-ethernet.arc.nasa.gov Wed Dec 13 17:06:42 1995 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Wed, 13 Dec 95 14:06:42 PST Subject: one revised paper and NIPS slides by Buntine Message-ID: <9512132206.AA08307@ptolemy.arc.nasa.gov> Dear Connectionists, Please note the following two WWW resources. One, a forthcoming journal paper, and the other, slides from a NIPS'95 Workshop presentation. Also, please note my new address, email, and company. I am no longer at Heuristicrats. Wray Buntine Thinkbank, Inc. +1 (510) 540-6080 [voice] 1678 Shattuck Avenue, Suite 320 +1 (510) 540-6627 [fax] Berkeley, CA 94709 wray at Thinkbank.COM ============ Article URL: http://www.thinkbank.com/wray/graphbib.ps.Z (about 240Kb compressed) TITLE: A guide to the literature on learning probabilistic networks from data AUTHOR: Wray Buntine, Thinkbank JOURNAL: Accepted for IEEE Trans. on Knowledge and Data Eng., final draft submitted. ABSTRACT: This literature review discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks. Connections are drawn between the statistical, neural network, and uncertainty communities, and between the different methodological communities, such as Bayesian, description length, and classical statistics. 
Basic concepts for learning and Bayesian networks are introduced and methods are then reviewed. Methods are discussed for learning parameters of a probabilistic network, for learning the structure, and for learning hidden variables. The presentation avoids formal definitions and theorems, as these are plentiful in the literature, and instead illustrates key concepts with simplified examples. KEYWORDS: Bayesian networks, graphical models, hidden variables, learning, learning structure, probabilistic networks, knowledge discovery =========== Talk URL: http://www.thinkbank.com/wray/refs.html (and look under Talks for NIPS) TITLE: Compiling Probabilistic Networks and Some Questions this Poses. AUTHOR: Wray Buntine WORKSHOP: NIPS'95 Workshop on Learning Graphical Models ABSTRACT: Probabilistic networks (or similar) provide a high-level language that can be used as the input to a compiler for generating a learning or inference algorithm. Example compilers are BUGS (inputs a Bayes net with plates) by Gilks, Spiegelhalter, et al., and MultiClass (inputs a dataflow graph) by Roy. This talk will cover three parts: (1) an outline of the arguments for such compilers for probabilistic networks, (2) an introduction to some compilation techniques, and (3) the presentation of some theoretical challenges that compilation poses. High-level language compilers are usually justified as a rapid prototyping tool. In learning, rapid prototyping arises for the following reasons: good priors for complex networks are not obvious and experimentation can be required to understand them; several algorithms may suggest themselves and experimentation is required for comparative evaluation. These and other justifications will be described in the context of some current research on learning probabilistic networks, and past research on learning classification trees and feed-forward neural networks. 
Techniques for compilation include the data flow graph, automatic differentiation, Markov chain Monte Carlo samplers of various kinds, and the generation of C code for certain exact inference tasks. With this background, I will then pose a number of research questions to the audience. =========== From bernabe at cnm.us.es Tue Dec 12 07:39:41 1995 From: bernabe at cnm.us.es (Bernabe Linares B.) Date: Tue, 12 Dec 95 13:39:41 +0100 Subject: two papers in neuroprose Message-ID: <9512121239.AA17985@cnm1.cnm.us.es> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/bernabe.art1-nn.ps.Z (30 pages, 257846 bytes) pub/neuroprose/bernabe.art1-vlsi.ps.Z (26 pages, 311686 bytes) The files "bernabe.art1-nn.ps.Z" and "bernabe.art1-vlsi.ps.Z" are now available for copying from the Neuroprose repository. They contain two papers which have been accepted for publication in the following journals: PAPER1: Journal: IEEE Transactions on VLSI Systems Title: "A Real-Time Clustering Microchip Neural Engine" File: bernabe.art1-vlsi.ps.Z PAPER2: Journal: Neural Networks Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" File: bernabe.art1-nn.ps.Z Authors: Teresa Serrano-Gotarredona and Bernabe Linares-Barranco Affiliation: National Microelectronics Center (CNM), Sevilla, SPAIN. Sorry, no hardcopies available. Brief description of papers follows: -------------------------------------------------------------------- PAPER1: ------- File: bernabe.art1-vlsi.ps.Z, 26 pages, 311686 bytes. Title: "A Real-Time Clustering Microchip Neural Engine" Abstract This paper presents an analog current-mode VLSI implementation of an unsupervised clustering algorithm. The clustering algorithm is based on the popular ART1 algorithm [1], but has been modified, resulting in a more VLSI-friendly algorithm [2], [3] that allows a more efficient hardware implementation with simple circuit operators, low memory requirements, modular chip assembly capability, and higher speed figures. 
The chip described in this paper implements a network that can cluster input patterns of 100 binary pixels into up to 18 different categories. Modular expansibility of the system is directly possible by assembling an NxM array of chips without any extra interfacing circuitry, so that the maximum number of clusters is 18xM and the maximum number of bits of the input pattern is Nx100. Pattern classification and learning are performed in 1.8us, which is equivalent to a computing power of 4.4x10^9 connections per second plus connection-updates per second. The chip has been fabricated in a standard low-cost 1.6um double-metal single-poly CMOS process, has a die area of 1cm^2, and is mounted in a 120-pin PGA package. Although internally the chip is analog in nature, it interfaces to the outside world through digital signals, and thus has a true asynchronous digital behavior. Experimental chip test results are available, obtained through digital chip test equipment. Fault tolerance at the system level is demonstrated through the experimental testing of faulty chips. -------------------------------------------------------------------- PAPER2: ------- File: bernabe.art1-nn.ps.Z, 30 pages, 257846 bytes. Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" Abstract This paper presents a modification to the original ART1 algorithm [Carpenter, 1987a] that is conceptually similar, can be implemented in hardware with less sophisticated building blocks, and maintains the computational capabilities of the originally proposed algorithm. This modified ART1 algorithm (which we will call here ART1m) is the result of hardware-motivated simplifications investigated during the design of an actual ART1 chip [Serrano, 1994, 1996]. The purpose of this paper is simply to justify theoretically that the modified algorithm preserves the computational properties of the original one and to study the difference in behavior between the two approaches. 
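[Editor's illustration] For readers unfamiliar with the algorithm family being modified, here is a minimal fast-learning ART1 sketch in Python. This is plain textbook ART1 (choice parameter beta, vigilance rho, templates updated by intersection), NOT the ART1m variant of the paper; the function name and parameter defaults are the editor's own:

```python
def art1_cluster(patterns, rho=0.7, beta=1.0):
    """Cluster binary patterns (lists of 0/1) with fast-learning ART1.

    rho  : vigilance in (0, 1]; higher rho -> finer categories.
    beta : choice parameter (> 0), breaking ties towards larger templates.
    Returns (labels, templates).
    """
    templates = []  # one binary template (weight vector) per category
    labels = []
    for p in patterns:
        norm_p = sum(p)
        # Rank categories by the choice function T_j = |p AND w_j| / (beta + |w_j|).
        order = sorted(
            range(len(templates)),
            key=lambda j: -sum(pi & wi for pi, wi in zip(p, templates[j]))
            / (beta + sum(templates[j])),
        )
        for j in order:
            overlap = [pi & wi for pi, wi in zip(p, templates[j])]
            # Vigilance test: does the matched template cover enough of p?
            if norm_p and sum(overlap) / norm_p >= rho:
                templates[j] = overlap  # fast learning: w_j <- p AND w_j
                labels.append(j)
                break
        else:
            templates.append(list(p))  # no resonance: commit a new category
            labels.append(len(templates) - 1)
    return labels, templates
```

For example, `art1_cluster([[1,1,0,0],[1,1,0,0],[0,0,1,1]])` assigns the two identical patterns to one category and the disjoint pattern to a second.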
-------------------------------------------------------------------- ftp instructions are: % ftp archive.cis.ohio-state.edu Name : anonymous Password: ftp> cd pub/neuroprose ftp> binary ftp> get bernabe.art1-nn.ps.Z ftp> get bernabe.art1-vlsi.ps.Z ftp> quit % uncompress bernabe.art1-nn.ps.Z % uncompress bernabe.art1-vlsi.ps.Z % lpr -P bernabe.art1-nn.ps % lpr -P bernabe.art1-vlsi.ps These files are also available from the node "ftp.cnm.us.es", user "anonymous", directory /pub/bernabe/publications, files: "NN_art1theory_96.ps.Z" and "TVLSI_art1chip_96.ps.Z". Any feedback will be appreciated. Thanks, Dr. Bernabe Linares-Barranco National Microelectronics Center (CNM) Dept. of Analog Design Ed. CICA, Av. Reina Mercedes s/n, 41012 Sevilla, SPAIN. Phone: 34-5-4239923, Fax: 34-5-4624506, E-mail: bernabe at cnm.us.es From bishopc at helios.aston.ac.uk Wed Dec 13 14:52:48 1995 From: bishopc at helios.aston.ac.uk (Prof. Chris Bishop) Date: Wed, 13 Dec 1995 19:52:48 +0000 Subject: New Book: Neural Networks for Pattern Recognition Message-ID: <1400.9512131952@sun.aston.ac.uk> -------------------------------------------------------------------- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -------------------------------------------------------------------- "Neural Networks for Pattern Recognition" ----------------------------------------- Christopher M. Bishop (Oxford University Press) Full details at: http://neural-server.aston.ac.uk/NNPR/ This book provides the first comprehensive treatment of neural networks from the perspective of statistical pattern recognition. * 504 pages * 160 figures * 129 graded exercises * a self-contained introduction to statistical pattern recognition * an extensive treatment of Bayesian methods * paperback and hardback editions * 300 references Contents: --------- 1. Statistical Pattern Recognition 2. Probability Density Estimation 3. Single-layer Networks 4. The Multi-layer Perceptron 5. Radial Basis Functions 6. 
Error Functions 7. Parameter Optimization Algorithms 8. Pre-processing and Feature Extraction 9. Learning and Generalization 10. Bayesian Techniques ***** Instructors wishing to use this text as the basis for a course may request a complimentary examination copy from the publishers. (USA: fax request to 212-726-6442 with brief description of the course) ***** Ordering information: --------------------- ISBN 0-19-853864-2 paperback 0-19-853849-9 hardback USA: 45 dollars paperback ---- 98 dollars hardback Credit card orders: Tel: 1-800-451-7556 (toll free) By post, send payment to: Order Dept. Oxford University Press 2001 Evans Road Cary, NC 27513 USA (3 dollars shipping for first copy, 1 dollar each thereafter) Canada: Tel: 1-800-387-8020 (toll free) ------- UK: 25 pounds paperback --- 55 pounds hardback Tel: 01536 454 534 (from the UK) Tel: +44 1536 454 534 (from abroad) By post, send payment to: CWO Department Oxford University Press Saxon Way West, Corby Northants NN18 9ES, UK (3.53 pounds postage) By fax: 01536 746 337 (from the UK) +44 1536 746 337 (from abroad) ---------------------------------------------------------------------- Prof. Christopher M. Bishop Tel. +44 (0)121 333 4631 Neural Computing Research Group Fax. +44 (0)121 333 4586 Dept. of Computer Science c.m.bishop at aston.ac.uk & Applied Mathematics http://neural-server.aston.ac.uk/ Aston University Birmingham B4 7ET, UK ---------------------------------------------------------------------- From zhuh at helios.aston.ac.uk Thu Dec 14 13:12:43 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Thu, 14 Dec 1995 18:12:43 +0000 Subject: No free lunch for Cross Validation! 
Message-ID: <2237.9512141812@sun.aston.ac.uk> Dear Colleagues, A little while ago someone claimed that Cross validation will benefit from the presence of any structure, and if there is no structure it does no harm; yet NFL explicitly states that a structure can be equally good or bad for any given method, depending on how they match each other; yet it was further claimed that they do not conflict with each other. I was quite curious and did the following five-minute experiment to find out which is correct. Suppose we have a Gaussian variable x, with mean mu and unit variance. We have the following three estimators for estimating mu from a sample of size n. A: The sample mean. It is optimal both in the sense of Maximum Likelihood and Least Mean Squares. B: The maximum of the sample. It is a bad estimator in any reasonable sense. C: Cross validation to choose between A and B, with one extra data point. The numerical result, with n=16 and averaged over 10000 samples, gives mean squared error: A: 0.0627 B: 3.4418 C: 0.5646 This clearly shows that cross validation IS harmful in this case, despite the fact that it is based on a larger sample. NFL still wins! Many of you might jump on me at this point: But this is a very artificial example, which is not what normally occurs in practice. To this I have two answers, short and long. The short answer is one of principle. Any counter-example, however artificial it is, clearly demolishes the hope that cross validation is a "universally beneficial method". The longer answer is divided into several parts, which hopefully will answer any potential criticism from any aspect: 1. The cross validation is performed on extra data points. We are not requiring it to perform as well as the mean on 17 data points. If it cannot extract more information from the one extra data point, a minimum requirement is that it keeps the information in the original 16 points. But it can't even do this. 2. The maximum of a sample is the 100th percentile. 
The median is the 50th percentile, which is in fact a quite reasonable estimator. Let us use a larger cross validation set (of size k), and replace B with a different percentile. The result is that, for the median, CV needs k>2 to work. For the 70th percentile CV needs k>16. The required k increases dramatically with the percentile. 3. It is not true that we have set up a case in which cross validation can't win. There is indeed a small probability that a sample can be so bad that the sample maximum is even a better estimate than the sample mean. However, to utilise such rare chances to good effect k must be at least several hundred (maybe exponential) while n=16. We know such k exists since k=infinity certainly helps. Yet to adopt such a method is clearly absurd. 4. Although we have chosen estimator A to be the known optimal estimator in this case, it can be replaced by something else. For example, both A and B can be some reasonable averages over percentiles, so that without detailed analysis it may appear that doing cross validation might give a C which is better than both A and B. Such beliefs can be defeated by similar counter-examples. 5. The above scheme of cross validation may appear different from what is familiar, but here is a "practical example" which shows that it is indeed what people normally do. Suppose we have a random variable which is either Gaussian or Cauchy. Consider the following three estimators: A: Sample mean: It has 100% efficiency for Gaussian, and 0% efficiency for Cauchy. B: Sample median: It is 2/pi=63.66% efficient for Gaussian and 8/pi^2=81.06% efficient for Cauchy. C: Cross validation on an additional sample of size k, to choose between A and B. Intuitively it appears quite reasonable to expect cross validation to pick out the correct one most of the time, so that, if averaged over all samples, C ought to be superior to both A and B. But no!! This will depend on the PRIOR mixing probability of these two sub-models. 
If the variable is in fact always Gaussian, then we have just seen that if n=16, CV will be worse unless k>2. The same is even more true in the reverse case, since the mean is an essentially useless estimator for Cauchy. 6. In any of the above cases, "anti cross validation" would be even more disastrous. If you are not convinced by these arguments, or if you want to know more about efficiency, then maybe the following reference can help: Fisher, R.A.: Theory of statistical estimation, Proc. Camb. Phil. Soc., Vol. 22, pp. 700-725, 1925. If you are more or less convinced, I have the following speculation: Several centuries ago, the French Academy of Science (or is it the Royal Society?) made a decision that they would no longer examine inventions of "perpetual motion machines", on the grounds that the Law of Energy Conservation was so reliable that it would defeat any such attempt. History proved that this was a wise decision, which redirected effort towards designing machines that utilise the energy in fuel. Should we expect the same fate for "universally beneficial methods" in the face of NFL? Should we put more effort into designing methods which use prior information? posterior information <= prior information + data information. 
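[Editor's illustration] Zhu's five-minute experiment is easy to reproduce. Below is a minimal Python sketch; the rule "C picks whichever of A and B has the smaller squared error on the one held-out point" is the editor's reading of the post (the exact CV rule is not spelled out there), and the function name is the editor's own:

```python
import random

def mse_of_estimators(n=16, trials=10000, mu=0.0, seed=0):
    """Estimate mu of a unit-variance Gaussian from n points,
    comparing Zhu's three estimators by mean squared error."""
    rng = random.Random(seed)
    sse = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        a = sum(sample) / n  # A: sample mean
        b = max(sample)      # B: sample maximum
        # C: cross-validate on one extra point, then choose A or B.
        held_out = rng.gauss(mu, 1.0)
        c = a if (a - held_out) ** 2 <= (b - held_out) ** 2 else b
        for key, est in (("A", a), ("B", b), ("C", c)):
            sse[key] += (est - mu) ** 2
    return {k: v / trials for k, v in sse.items()}

mse = mse_of_estimators()
print(mse)  # A smallest (near 1/16), B largest, C in between, as in the post
```

With these choices the ordering A < C < B comes out as Zhu reports: CV on a 17th point does worse than simply using the sample mean, though far better than the sample maximum.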
-- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From C.Campbell at bristol.ac.uk Thu Dec 14 11:21:26 1995 From: C.Campbell at bristol.ac.uk (I C G Campbell) Date: Thu, 14 Dec 1995 16:21:26 +0000 (GMT) Subject: New Web Page (Bristol University, UK) Message-ID: <199512141621.QAA11250@zeus.bris.ac.uk> The Neural Computing Research Group at Bristol University, UK has recently set up a WWW page describing their interests at: http://www.fen.bris.ac.uk/engmaths/research/neural/neural.html Our interests cover three main areas: theory of neural computation, modelling simple neurobiological systems and applications of neural computing in engineering. Collectively we have produced in excess of 100 publications related to neural computing in these topic areas. Further details about these publications, current research interests and research grants may be found on the above page. Merry Xmas Colin Campbell University of Bristol From robert at fit.qut.edu.au Thu Dec 14 19:24:04 1995 From: robert at fit.qut.edu.au (Robert Andrews) Date: Fri, 15 Dec 1995 10:24:04 +1000 Subject: Rule Extraction Mailing List Message-ID: <199512150024.KAA15975@ocean.fit.qut.edu.au> =-=-=-=-= RULE EXTRACTION FROM ARTIFICIAL NEURAL NETWORKS =-=-=-=-=-=-=-=- ANNOUNCEMENT OF MAILING LIST Rule Extraction from Artificial Neural Networks and the related field of Rule Refinement are topics of increasing interest and importance. This is to announce the formation of a moderated mailing list for researchers and students interested in these areas. 
If you are interested in becoming a subscriber to this list please send the following information by return mail: Name: Organisation/Institution: E-mail Address: =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Mr Robert Andrews School of Information Systems robert at fit.qut.edu.au Faculty of Information Technology R.Andrews at qut.edu.au Queensland University of Technology +61 7 864 1656 (voice) GPO Box 2434 _--_|\ +61 7 864 1969 (fax) Brisbane Q 4001 / QUT Australia \_.--._/ http://www.fit.qut.edu.au/staff/~robert v =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From l.s.smith at cs.stir.ac.uk Fri Dec 15 05:12:09 1995 From: l.s.smith at cs.stir.ac.uk (Dr L S Smith (Staff)) Date: Fri, 15 Dec 1995 10:12:09 GMT Subject: TR on generalization available Message-ID: <19951215T101209Z.KAA27913@katrine.cs.stir.ac.uk> Dear all: We have a new TR available by ftp from here: CCCN Technical report CCCN-21, December 1995. A Theoretical Study of the Generalization Ability of Feed-Forward Neural Networks. M J Roberts. By making assumptions on the probability distribution of the potentials in a feed-forward neural network we have derived lower bounds for the generalization ability of the network in terms of the number of training patterns. The results are consistent with simulations carried out on a simple geometrical function. The URL is ftp://ftp.cs.stir.ac.uk/pub/tr/cccn/TR21.ps.Z If you really can't access this hard copies are available, but only as a last resort. Dr Leslie S. 
Smith Dept of Computing and Mathematics, Univ of Stirling Stirling FK9 4LA Scotland lss at cs.stir.ac.uk (NeXTmail welcome) Tel (44) 1786 467435 Fax (44) 1786 464551 www http://www.cs.stir.ac.uk/~lss/ From bastiane at irit.fr Fri Dec 15 09:07:57 1995 From: bastiane at irit.fr (bastiane@irit.fr) Date: Fri, 15 Dec 1995 15:07:57 +0100 Subject: Call for papers for DYNN'96 Message-ID: <199512151407.PAA05193@irit.irit.fr> CALL FOR PAPERS FOR DYNN'96 International workshop on NEURAL NETWORKS DYNAMICS AND PATTERN RECOGNITION. Toulouse - France 12 and 13 March 1996 Organized by ONERA-CERT Sponsored by DRET of French MOD, US Air Force Scientific Research and Pole Universitaire Europeen de Toulouse. Organizers: Manuel SAMUELIDES (ONERA-CERT), Bernard DOYON (INSERM), Gregory TARR (US AF), Simon THORPE (CNRS). Practical Information: Emmanuel DAUCE (dauce at cert.fr) *********************** OBJECTIVES OF THE WORKSHOP. *************************** This workshop is designed to allow information exchange and discussion between theoretical scientists working on models of neuronal dynamics and engineers who are looking for efficient devices to process sensor information. Continuous activation state units as well as Integrate and Fire neurons or oscillators are elementary components of Dynamical Neural Networks. Attractor neural networks as well as transitory data-driven dynamics will be considered. The common feature of these models is the conversion of spatial information into a spatio-temporal data flow which allows specific processing. Mathematical models involved use dynamical systems and stochastic processes. They will be compared to the results of numerical simulations and the latest neuro-physiological data concerning the dynamics of biological neural nets. The main aim of the workshop is to encourage significant advances concerning the dynamics of biologically plausible neural networks and their applications to pattern recognition. 
*********************** ORGANIZATION OF THE WORKSHOP. ***************************** Scheduled talks will take place on the 12th and the 13th of March. There will be invited talks as well as submitted contributions. About 24 talks of 30 minutes will be scheduled with time for discussion and panels. Informal discussion and collective work may be scheduled on the 14th. Extended abstracts (one or two pages) of submitted contributions have to be sent for acceptance by e-mail to dauce at cert.fr or by post to Manuel Samuelides, DERI ONERA-CERT, BP 4025, 31055 Toulouse CEDEX, FRANCE. Provisional list of invited lecturers: J.P.AUBIN, M.COTTRELL, J.DEMONGEOT, J.DAYHOFF, G.DREYFUS, M.HIRSCH, J.TAYLOR. (This list will be completed) The number of attendees at the workshop is limited to 40 in order to allow lively exchange and real discussion. Copies of abstracts and slides will be provided to participants. The registration fees amount to FF 1,200 including 2 nights with American breakfast (11th and 12th) at a first-class hotel in downtown Toulouse (Holiday Inn, Crown Plaza), two lunches on the site of the workshop, the workshop banquet, transportation to and from CERT, coffee breaks, and the general costs of the workshop facilities and equipment. Payment should be made either by check payable to "AGENT COMPTABLE DU CERT ONERA" in French francs only or by bank transfer to "AGENT COMPTABLE DU CERT ONERA" Bank: Societe Generale Ramonville Saint Agne Account No. 30003 /02117/ 00037291008/93 Please state the workshop reference: DYNN'96 on all transactions. *********************** IMPORTANT DATES: **************** 15th of January: Deadline for contributions and declarations of interest. 31st of January: Notification of accepted contributions and distribution of the final programme of the workshop 15th of February: Deadline for registration for the workshop. 
To avoid postage delay, e-mail will be accepted as a usual communication. If you want to attend DYNN'96 please use your computer to reply at once -------------------------------------------------------------------------------- Name Organization Address e-mail ( ) wishes the information about the final program ( ) wishes to attend DYNN'96 ( ) will submit a contribution entitled: ----------------------------------------------------------------------------- Please send your reply to the following e-mail dauce at cert.fr or to xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x Professor Manuel SAMUELIDES x x DERI ONERA-CERT x x BP 4025 x x 31055 Toulouse CEDEX x x FRANCE x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Manuel SAMUELIDES ----------------------------------------------------------------- for research: Chercheur a l'ONERA-CERT samuelid at cert.fr for Teaching: Professeur a l'ENSAE Manuel.Samuelides at supaero.fr Tel: (33) 62 17 81 06 Fax: (33) 62 17 83 30 From lemm at LORENTZ.UNI-MUENSTER.DE Fri Dec 15 09:28:49 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Fri, 15 Dec 1995 15:28:49 +0100 Subject: NFL and practice Message-ID: <9512151428.AA24811@xtp141.uni-muenster.de> Huaiyu Zhu responded to >> One may discuss NFL for theoretical reasons, but >> the conditions under which NFL-Theorems hold >> are not those which are normally met in practice. and wrote >Exactly the opposite. The theory behind NFL is trivial (in some sense). >The power of NFL is that it deals directly with what is routinely >practiced in the neural network community today. That depends on how you understand practice. E.g. in nearly all cases functions are somewhat smooth. This is a prior which exists in reality (for example because of input noise in the measuring process). And the situation would be hopeless if we did not use this fact in practice. (That is just what NFL also says.) But, if Huaiyu means that it is necessary to think about the priors in "practice" explicitly, then I fully agree! 
But what I wanted to say is: WE DO HAVE "PRIORS" (BETTER SAID: CORRELATIONS BETWEEN ANSWERS TO DIFFERENT QUESTIONS) IN MOST CASES and they are NOT obscure, but very often at least as MEASURABLE as "normal" sharp data y_i=f(x_i). Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >Joerg continued with examples of various priors of practical concern, >including smoothness, symmetry, positive correlation, iid samples, etc. >These are indeed very important priors which match the real world, >and they are the implicit assumptions behind most algorithms. > >What NFL tells us is: If your algorithm is designed for such a prior, >then say so explicitly so that a user can decide whether to use it. >You can't expect it to be also good for any other prior which you have >not considered. In fact, in a sense, you should expect it to perform >worse than a purely random algorithm on those other priors. Maybe the problem is that Huaiyu Zhu uses the word "PRIOR" for all information which is not of the sharp data form y_i=f(x_i). It suggests that we know something before starting our generalizer. NO, that is not the normal case!!! I mentioned many examples (like measurement with input noise) where "priors" are just normal information which should be used DURING learning, like sharp data! (Sharp data might not be available at all!) And of course using wrong "priors" is similar to using wrong sharp data. But I fully agree that most algorithms use "prior" information only implicitly and that there is a lot of theoretical work to do. In response to >> In many interesting cases "effective" function values contain information >> about other function values and NFL does not hold! 
Huaiyu Zhu continues >This is like saying "In many interesting cases we do have energy sources, >and we can make a machine running forever, so the natural laws against >`perpetual motion machines' do not hold." Indeed, it is a little bit like that, but a system without energy sources is a much better approximation for some real-world systems than a world without "priors" (i.e. without correlated answers over different questions)! So the energy law is useful, but models for worlds without correlated information are NOT, except maybe that they tell us to include the correlation properly! Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From shastri at ICSI.Berkeley.EDU Fri Dec 15 16:34:24 1995 From: shastri at ICSI.Berkeley.EDU (Lokendra Shastri) Date: Fri, 15 Dec 1995 13:34:24 PST Subject: Technical report --- negated knowledge and inconsistency Message-ID: <199512152134.NAA06683@kulfi.ICSI.Berkeley.EDU> Dealing with negated knowledge and inconsistency in a neurally motivated model of memory and reflexive reasoning. Lokendra Shastri and Dean J. Grannes TR-95-041 ICSI August 1995 Recently, SHRUTI has been proposed as a connectionist model of rapid reasoning. It demonstrates how a network of simple neuron-like elements can encode a large number of specific facts as well as systematic knowledge (rules) involving n-ary relations, quantification and concept hierarchies, and perform a class of reasoning with extreme efficiency. The model, however, does not deal with negated facts and rules involving negated antecedents and consequents. We describe an extension of SHRUTI that can encode positive as well as negated knowledge and use such knowledge during reflexive reasoning. 
The extended model explains how an agent can hold inconsistent knowledge in its long-term memory without being ``aware'' that its beliefs are inconsistent, but detect a contradiction whenever inconsistent beliefs that are within a certain inferential distance of each other become co-active during an episode of reasoning. Thus the model is not logically omniscient, but detects contradictions whenever it tries to use inconsistent knowledge. The extended model also explains how limited attentional focus or action under time pressure can lead an agent to produce an erroneous response. A biologically significant feature of the model is that it uses only local inhibition to encode negated knowledge. Like the basic model, the extended model encodes and propagates dynamic bindings using temporal synchrony. Key Words: long-term memory; rapid reasoning; dynamic bindings; synchrony; knowledge representation; neural oscillations; short-term memory; negation; inconsistent knowledge. ftp-server: ftp.icsi.berkeley.edu (128.32.201.55) ftp-file: /pub/techreports/1995/tr-95-041.ps.Z Lokendra Shastri International Computer Science Institute 1947 Center Street, Suite 600 Berkeley, CA 94704 http://www.icsi.berkeley.edu/~shastri ========================== Detailed instructions for retrieving the report: unix% ftp ftp.icsi.berkeley.edu Name (ftp.icsi.berkeley.edu:): anonymous Password: your_name at your_machine ftp> cd /pub/techreports/1995 ftp> binary ftp> get tr-95-041.ps.Z ftp> quit unix% uncompress tr-95-041.ps.Z unix% lpr tr-95-041.ps If your name server does not know about ftp.icsi.berkeley.edu, use 128.32.201.55 instead. All files in this archive can also be obtained through an e-mail interface in case direct ftp is not available. To obtain instructions, send mail containing the line `send help' to: ftpmail at ICSI.Berkeley.EDU As a last resort, hardcopies may be ordered for a small fee. Send mail to info at ICSI.Berkeley.EDU for more information. 
From cherkaue at cs.wisc.edu Fri Dec 15 19:03:15 1995 From: cherkaue at cs.wisc.edu (cherkaue@cs.wisc.edu) Date: Fri, 15 Dec 1995 18:03:15 -0600 Subject: No free lunch for Cross Validation! Message-ID: <199512160003.SAA03324@mozzarella.cs.wisc.edu> In reply to Huaiyu Zhu's message > ... > >A little while ago someone claimed that > Cross validation will benefit from the presence of any structure, > and if there is no structure it does no harm; > > ... > >Suppose we have a Gaussian variable x, with mean mu and unit variance. >We have the following three estimators for estimating mu from a >sample of size n. > A: The sample mean. It is optimal both in the sense of Maximum >Likelihood and Least Mean Squares. > B: The maximum of sample. It is a bad estimator in any reasonable sense. > C: Cross validation to choose between A and B, with one extra data point. > >The numerical result with n=16 and averaged over 10000 samples, gives >mean squared error: > A: 0.0627 B: 3.4418 C: 0.5646 >This clearly shows that cross validation IS harmful in this case, >despite the fact it is based on a larger sample. NFL still wins! You forgot D: Anti-cross validation to choose between A and B, with one extra data point. I don't understand your claim that "cross validation IS harmful in this case." You seem to equate "harmful" with "suboptimal." Cross validation is a technique we use to guess the answer when we don't already know the answer. You give technique A the benefit of your prior knowledge of the true answer, but C must operate without this knowledge. A fair comparison would pit C against D, not C against A. As you say: >6. In any of the above cases, "anti cross validation" would be even >more disastrous. Kevin Cherkauer Computer Sciences Dept. 
University of Wisconsin-Madison cherkauer at cs.wisc.edu From pkso at castle.ed.ac.uk Sat Dec 16 10:06:41 1995 From: pkso at castle.ed.ac.uk (P Sollich) Date: Sat, 16 Dec 95 15:06:41 GMT Subject: Thesis on Query Learning available Message-ID: <9512161506.aa29855@uk.ac.ed.castle> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/sollich.thesis.tar.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: ASKING INTELLIGENT QUESTIONS --- THE STATISTICAL MECHANICS OF QUERY LEARNING Peter Sollich Department of Physics University of Edinburgh, U.K. Abstract: This thesis analyses the capabilities and limitations of query learning by using the tools of statistical mechanics to study learning in feed-forward neural networks. In supervised learning, one of the central questions is the issue of generalization: Given a set of training examples in the form of input-output pairs produced by an unknown {\em teacher} rule, how can one generate a {\em student} which {\em generalizes}, i.e., which correctly predicts the outputs corresponding to inputs not contained in the training set? The traditional paradigm has been to study learning from {\em random examples}, where training inputs are sampled randomly from some given distribution. However, random examples contain redundant information, and generalization performance can thus be improved by {\em query learning}, where training inputs are chosen such that each new training example will be maximally `useful' as measured by a given {\em objective function}. We examine two common kinds of queries, chosen to optimize the objective functions, generalization error and entropy (or information), respectively. 
Within an extended Bayesian framework, we use the techniques of statistical mechanics to analyse the average case generalization performance achieved by such queries in a range of learning scenarios, in which the functional forms of student and teacher are inspired by models of neural networks. In particular, we study how the efficacy of query learning depends on the form of teacher and student, on the training algorithm used to generate students, and on the objective function used to select queries. The learning scenarios considered are simple but sufficiently generic to allow general conclusions to be drawn. We first study perfectly learnable problems, where the student can reproduce the teacher exactly. From an analysis of two simple model systems, the high-low game and the linear perceptron, we conclude that query learning is much less effective for rules with continuous outputs -- provided they are `invertible' in the sense that they can essentially be learned from a finite number of training examples -- than for rules with discrete outputs. Queries chosen to minimize the entropy generally achieve generalization performance close to the theoretical optimum afforded by minimum generalization error queries, but can perform worse than random examples in scenarios where the training algorithm is under-regularized, i.e., has too much `confidence' in corrupted training data. For imperfectly learnable problems, we first consider linear students learning from nonlinear perceptron teachers and show that in this case the structure of the student space determines the efficacy of queries chosen to minimize the entropy in {\em student} space. Minimum {\em teacher} space queries, on the other hand, perform worse than random examples due to lack of feedback about the progress of the student. 
For students with discrete outputs, we find that in the absence of information about the teacher space, query learning can lead to self-confirming hypotheses far from the truth, misleading the student to such an extent that it will not approximate the teacher optimally even for an infinite number of training examples. We investigate how this problem depends on the nature of the noise process corrupting the training data, and demonstrate that it can be alleviated by combining query learning with Bayesian techniques of model selection. Finally, we assess which of our conclusions carry over to more realistic neural networks, by calculating finite size corrections to the thermodynamic limit results and by analysing query learning in a simple two-layer neural network. The results suggest that the statistical mechanics analysis is often relevant to real-world learning problems, and that the potentially significant improvements in generalization performance achieved by query learning can be made available, in a computationally cheap manner, for realistic multi-layer neural networks.

Criticism, comments and suggestions are welcome. Merry Christmas everyone! Peter Sollich -------------------------------------------------------------------------- Peter Sollich Department of Physics University of Edinburgh e-mail: P.Sollich at ed.ac.uk Kings Buildings phone: +44 - (0)131 - 650 5236 Mayfield Road Edinburgh EH9 3JZ, U.K. --------------------------------------------------------------------------

RETRIEVAL INSTRUCTIONS: Get `sollich.thesis.tar.Z' from the `Thesis' subdirectory of the neuroprose archive. Uncompress, and unpack the resulting tar file (on UNIX: uncompress sollich.thesis.tar.Z; tar xf - < sollich.thesis.tar). This will yield the postscript files listed below. Contact me if there are any problems with retrieval and/or printing. QUICK GUIDE for busy readers: For a first look, see sollich_title.ps (has abstract and table of contents).
File sollich_chapter1.ps contains a general introduction to query learning and an overview of the literature. Finally, for a summary of the main results and open questions, see sollich_chapter9.ps.

LIST OF FILES:
------------------------------------------------------------------------------
Filename                Pages   KB (compr./uncompr.)   Contents
------------------------------------------------------------------------------
sollich_title.ps            8    37/  75   Title, Declaration, Acknowledgements,
                                           Publications, Abstract, Table of contents
sollich_chapter1.ps         8    48/  98   Introduction
sollich_chapter2.ps        10    48/ 101   A probabilistic framework for query selection
sollich_chapter3.ps        21   128/ 376   Perfectly learnable problems: Two simple examples
sollich_chapter4.ps        19   135/ 337   Imperfectly learnable problems: Linear students
sollich_chapter5.ps        40   228/ 565   Query learning assuming the inference model is correct
sollich_chapter6.ps        12   244/1050   Combining query learning and model selection
sollich_chapter7.ps        20   217/ 558   Towards realistic neural networks I: Finite size effects
sollich_chapter8.ps        24   136/ 299   Towards realistic neural networks II: Multi-layer networks
sollich_chapter9.ps         5    31/  59   Summary and Outlook
------------------------------------------------------------------------------
sollich_bib.ps              8    37/  68   Bibliography
------------------------------------------------------------------------------

From zhuh at helios.aston.ac.uk Mon Dec 18 08:11:50 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Mon, 18 Dec 1995 13:11:50 +0000 Subject: NFL and practice Message-ID: <4332.9512181311@sun.aston.ac.uk>

I accidentally sent my reply to Joerg Lemm instead of Connectionists. Since he replied to Connectionists, I'll reply here as well, and include my original posting at the end. I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use. The key difference is: Is it legitimate to be vague about the prior? Put another way: Do you claim the algorithm can pick up whatever prior holds automatically, instead of having it specified beforehand? My answer is NO to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without a prior, as NFL tells us. For purely cosmetic reasons, it might be helpful to translate the useless "No Free Lunch Theorem" :-) ("Without specifying a particular prior, any algorithm is as good as random guessing") into the equivalent, but infinitely more useful, "You Have to Pay for Lunch Theorem" :-) ("For an algorithm to perform better than random guessing, a particular prior must be specified"). On a more practical level, > E.g. in nearly all cases functions are somewhat smooth. Do you specify the scale on which it is smooth? > This is a prior which exists in reality (for example because > of input noise in the measuring process). If you average smoothness over all scales in a certain uniform way, you get a prior which contains no smoothness at all. If you average them in a non-uniform way, you actually specify a non-uniform prior, which is the crucial piece of information for any algorithm to work at all.
> And the situation would be hopeless > if we did not use this fact in practice. It would still be hopeless if we only used the fact of "somewhat smooth", instead of specifying how smooth. See the following for theory and examples: Zhu, H. and Rohwer, R.: Bayesian regression filters and the issue of priors, 1995. To appear in Neural Computing and Applications. ftp://cs.aston.ac.uk/neural/zhuh/reg_fil_prior.ps.Z My original posting is enclosed as the following: ----- Begin Included Message -----

From imlm at tuck.cs.fit.edu Mon Dec 18 16:39:40 1995 From: imlm at tuck.cs.fit.edu (IMLM Workshop (pkc)) Date: Mon, 18 Dec 1995 16:39:40 -0500 Subject: CFP: AAAI-96 Workshop on Integrating Multiple Learned Models Message-ID: <199512182139.QAA10740@tuck.cs.fit.edu>

CALL FOR PAPERS/PARTICIPATION INTEGRATING MULTIPLE LEARNED MODELS FOR IMPROVING AND SCALING MACHINE LEARNING ALGORITHMS to be held in conjunction with AAAI 1996 Portland, Oregon August 1996

Most modern machine learning research uses a single model or learning algorithm at a time, or at most selects one model from a set of candidate models. Recently, however, there has been considerable interest in techniques that integrate the collective predictions of a set of models in some principled fashion. With such techniques, the predictive accuracy and/or the training efficiency of the overall system can often be improved, since one can "mix and match" among the relative strengths of the models being combined. The goal of this workshop is to gather researchers actively working in the area of integrating multiple learned models, to exchange ideas and foster collaborations and new research directions. In particular, we seek to bring together researchers interested in this topic from the fields of Machine Learning, Knowledge Discovery in Databases, and Statistics. Any aspect of integrating multiple models is appropriate for the workshop.
However, we intend the focus of the workshop to be improving prediction accuracies, and improving training performance in the context of large training databases. More precisely, submissions are sought in, but not limited to, the following topics: 1) Techniques that generate and/or integrate multiple learned models. In particular, techniques that do so by: * using different training data distributions (in particular by training over different partitions of the data) * using different output classification schemes (for example using output codes) * using different hyperparameters or training heuristics (primarily as a tool for generating multiple models) 2) Systems and architectures to implement such strategies. In particular: * parallel and distributed multiple learning systems * multi-agent learning over inherently distributed data A paper need not be submitted to participate in the workshop, but space may be limited so contact the organizers as early as possible if you wish to participate. The workshop format is planned to encompass a full day of half hour presentations with discussion periods, ending with a brief period for summary and discussion of future activities. Notes or proceedings for the workshop may be provided, depending on the submissions received. Submission requirements: i) A short paper of not more than 2000 words detailing recent research results must be received by March 18, 1996. ii) The paper should include an abstract of not more than 150 words, and a list of keywords. Please include the name(s), email address(es), address(es), and phone number(s) of the author(s) on the first page. The first author will be the primary contact unless otherwise stated. iii) Electronic submissions in postscript or ASCII via email are preferred. Three printed copies (preferably double-sided) of your submission are also accepted. iv) Please also send the title, name(s) and email address(es) of the author(s), abstract, and keywords in ASCII via email.
Submission address: imlm at cs.fit.edu Philip Chan IMLM Workshop Computer Science Florida Institute of Technology 150 W. University Blvd. Melbourne, FL 32901-6988 407-768-8000 x7280 (x8062) 407-984-8461 (fax) Important Dates: Paper submission deadline: March 18, 1996 Notification of acceptance: April 15, 1996 Final copy: May 13, 1996 Chairs: Salvatore Stolfo, Columbia University sal at cs.columbia.edu David Wolpert, Santa Fe Institute dhw at santafe.edu Philip Chan, Florida Institute of Technology pkc at cs.fit.edu General Inquiries: Please address general inquiries to one of the co-chairs or send them to: imlm at cs.fit.edu Up-to-date workshop information is maintained on WWW at: http://cs.fit.edu/~imlm/ or http://www.cs.fit.edu/~imlm/ From ces at negi.riken.go.jp Mon Dec 18 20:36:45 1995 From: ces at negi.riken.go.jp (ces@negi.riken.go.jp) Date: Tue, 19 Dec 95 10:36:45 +0900 Subject: PhD Thesis Announcement : nonlinear filters Message-ID: <9512190136.AA21982@negi.riken.go.jp>  FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: (Sorry, no hardcopies available.) - ----------------------------------------------------------------------- Applications of nonlinear filters with the linear-in-the-parameter structure Eng-Siong CHNG Department of Electrical Engineering University of Edinburgh, U.K. Abstract: The subject of this thesis is the application of nonlinear filters, with the linear-in-the-parameter structure, to time series prediction and channel equalisation problems. In particular, the Volterra and the radial basis function (RBF) expansion techniques are considered to implement the nonlinear filter structures. These approaches, however, will generate filters with very large numbers of parameters. 
As large filter models require significant implementation complexity, they are undesirable for practical implementations. To reduce the size of the filter, the orthogonal least squares (OLS) algorithm is considered to perform model selection. Simulations were conducted to study the effectiveness of subset models found using this algorithm, and the results indicate that this selection technique is adequate for many practical applications. The other aspect of the OLS algorithm studied is its implementation requirements. Although the OLS algorithm is very efficient, the required computational complexity is still substantial. To reduce the processing requirement, some fast OLS methods are examined. Two major applications of nonlinear filters are considered in this thesis. The first involves the use of nonlinear filters to predict time series which possess nonlinear dynamics. To study the performance of the nonlinear predictors, simulations were conducted to compare the performance of these predictors with conventional linear predictors. The simulation results confirm that nonlinear predictors normally perform better than linear predictors. Within this study, the application of RBF predictors to time series that exhibit homogeneous nonstationarity is also considered. This type of time series possesses the same characteristic throughout the time sequence apart from local variations of mean and trend. The second application involves the use of filters for symbol-decision channel equalisation. The decision function of the optimal symbol-decision equaliser is first derived to show that it is nonlinear, and that it may be realised explicitly using a RBF filter. Analysis is then carried out to illustrate the difference between the optimum equaliser's performance and that of the conventional linear equaliser. In particular, the effects of delay order on the equaliser's decision boundaries and bit error rate (BER) performance are studied. 
The minimum mean square error (MMSE) optimisation criterion for training the linear equaliser is also examined to illustrate the sub-optimum nature of such a criterion. To improve the linear equaliser's performance, a method which adapts the equaliser by minimising the BER is proposed. Our results indicate that the linear equaliser's performance is normally improved by using the minimum BER criterion. The decision feedback equaliser (DFE) is also examined. We propose a transformation using the feedback inputs to change the DFE problem to a feedforward equaliser problem. This unifies the treatment of the equaliser structures with and without decision feedback. -----------------------------------------------------------

Criticism, comments and suggestions are welcome. Merry Christmas everyone! Eng Siong - -------------------------------------------------------------------------- Eng Siong CHNG Lab. for ABS, Frontier Research Programme, RIKEN, email : ces at negi.riken.go.jp 2-1 Hirosawa, Wako-Shi, Saitama 351-01, JAPAN. - --------------------------------------------------------------------------

RETRIEVAL INSTRUCTIONS: FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z File size : 1715073 bytes Number of pages : 165 pages unix> ftp archive.cis.ohio-state.edu Connected to archive.cis.ohio-state.edu. 220 archive.cis.ohio-state.edu FTP server ready. Name: anonymous 331 Guest login ok, send ident as password. Password:neuron 230 Guest login ok, access restrictions apply. ftp> binary 200 Type set to I. ftp> cd pub/neuroprose/Thesis 250 CWD command successful. ftp> get chng.thesis.ps.Z 200 PORT command successful. 150 Opening BINARY mode data connection for chng.thesis.ps.Z 226 Transfer complete. ftp> quit 221 Goodbye. unix> uncompress chng.thesis.ps.Z unix> lpr chng.thesis.ps (postscript printer) Contact me if there are any problems with retrieval and/or printing.
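To illustrate the linear-in-the-parameter structure and subset selection that the abstract describes, here is a minimal greedy forward selection over a Gaussian RBF expansion. This is a simplified sketch rather than the thesis's actual OLS algorithm: the centre placement, basis width, and the select-by-residual rule are all assumptions made for the example.

```python
import numpy as np

def rbf_design(x, centres, width=0.5):
    """Design matrix: one Gaussian basis function per candidate centre."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

def forward_select(Phi, y, n_terms):
    """Greedily add the candidate column that most reduces the residual
    sum of squares (a simplified stand-in for OLS forward selection)."""
    chosen = []
    for _ in range(n_terms):
        best_j, best_rss = None, np.inf
        for j in range(Phi.shape[1]):
            if j in chosen:
                continue
            cols = Phi[:, chosen + [j]]
            w, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = float(np.sum((y - cols @ w) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
    w, *_ = np.linalg.lstsq(Phi[:, chosen], y, rcond=None)
    return chosen, w

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 80)
y = np.sin(x) + 0.05 * rng.standard_normal(80)
centres = np.linspace(-3.0, 3.0, 20)   # 20 candidate RBF centres
Phi = rbf_design(x, centres)
chosen, w = forward_select(Phi, y, n_terms=6)
rss = float(np.sum((y - Phi[:, chosen] @ w) ** 2))
```

The point of the linear-in-the-parameter structure is visible here: once the basis functions are fixed, fitting the weights is an ordinary least-squares problem, and model-size reduction amounts to choosing a small subset of the candidate columns.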
------- End of Forwarded Message

From hag at santafe.edu Mon Dec 18 21:22:57 1995 From: hag at santafe.edu (Howard A. Gutowitz) Date: Mon, 18 Dec 1995 19:22:57 -0700 (MST) Subject: Exploring the Space of CA Message-ID: <9512190222.AA29140@sfi.santafe.edu>

Announcing: "Exploring the Space of Cellular Automata" Cellular automata can be thought of as a restricted kind of neural net, in which the cells take on only a finite set of values, and connections are local and regular. This is a set of interactive web pages designed to help you learn about CA, and the use of the lambda parameter to find critical regions in the space of CA. Credits: Concept: Chris Langton CA simulation program: Patrick Hayden. cgi interface: Eric Carr. Text: Chris Langton, Howard Gutowitz, and Eric Carr. Available from: http://alife.santafe.edu/alife/topics/ca/caweb -- Howard Gutowitz | hag at neurones.espci.fr ESPCI | http://www.santafe.edu/~hag Laboratoire d'Electronique | home: (331) 4707-3843 10 rue Vauquelin | office: (331) 4079-4697 75005 Paris, France | fax: (331) 4079-4425

From hicks at cs.titech.ac.jp Mon Dec 18 23:58:07 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 19 Dec 1995 13:58:07 +0900 Subject: NFL, practice, and CV Message-ID: <199512190458.NAA28669@euclid.cs.titech.ac.jp>

Huaiyu Zhu wrote: >You can't make every term positive in your balance sheet, if the grand >total is bound to be zero. There ARE functions which are always non-negative, but which under an appropriate measure integrate to 0. It only requires that 1) the support of the positive values is vanishingly small, 2) the positive values are bounded So the above statement by Dr. Zhu is not true. In fact I think this ability for pointwise positive values to disappear under integration is key to the "zero-sum" aspect of the NFL theorem holding true, despite the fact that we obviously see so many examples of working algorithms.
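Hicks's two conditions can be made concrete with a standard measure-theory example (an added illustration, not from the thread: the indicator function of the rationals):

```latex
f(x) =
\begin{cases}
  1, & x \in \mathbb{Q} \cap [0,1], \\
  0, & \text{otherwise},
\end{cases}
\qquad \text{yet} \qquad
\int_0^1 f \, d\mu = 0 \quad \text{(Lebesgue)},
```

since $f$ is non-negative and bounded, positive at infinitely many points, but the set on which it is positive has Lebesgue measure zero, so the integral vanishes.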
My key point: A zero-sum (infinite) universe doesn't require negative values. ---- There is another important issue which needs to be clarified, and that is the definition of CV and the kinds of problems to which it can be applied. Now anybody can make whatever definition they want, and then come to some conclusions based upon that definition, and that conclusion may be correct given that definition. However, there are also advantages to sharing a common intellectual currency. I quote below from "An Introduction to the Bootstrap" by Efron and Tibshirani, 1993, Chapter 17.1. It describes well what I meant when I talked about monitoring prediction error in a previous posting, and describes CV as a method for doing that. ================================================== In our discussion so far we have focused on a number of measures of statistical accuracy: standard errors, biases, and confidence intervals. All of these are measures of accuracy for parameters of a model. Prediction error is a different quantity that measures how well a model predicts the response value of a future observation. It is often used for model selection, since it is sensible to choose a model that has the lowest prediction error among a set of candidates. Cross-validation is a standard tool for estimating prediction error. It is an old idea (predating the bootstrap) that has enjoyed a comeback in recent years with the increase in available computing power and speed. In this chapter we discuss cross-validation, the bootstrap, and some other closely related techniques for estimation of prediction error. In regression models, prediction error refers to the expected squared difference between a future response and its prediction from the model: PE = E(y - \hat{y})^2. The expectation refers to repeated sampling from the true population. Prediction error also arises in the classification problem, where the response falls into one of k unordered classes.
For example, the possible responses might be Republican, Democrat, or Independent in a political survey. In classification problems prediction error is commonly defined as the probability of an incorrect classification PE = Prob(\hat{y} \neq y), also called the misclassification rate. The methods described in this chapter apply to both definitions of prediction error, and also to others. ================================================== Craig Hicks Tokyo Institute of Technology

From zhuh at helios.aston.ac.uk Tue Dec 19 10:14:20 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Tue, 19 Dec 1995 15:14:20 +0000 Subject: NFL, practice, and CV Message-ID: <8208.9512191514@sun.aston.ac.uk>

This is in reply to the criticism by Craig Hicks and Kevin Cherkauer, and will be my last posting in this thread. Craig Hicks thought that my statement (A) > >You can't make every term positive in your balance sheet, if the grand > >total is bound to be zero. contradicts his statement (B) > There ARE functions which are always non-negative, but which under > an appropriate measure integrate to 0. > It only requires that > > 1) the support of the positive values is vanishingly small, > 2) the positive values are bounded But they are actually talking about different things. There is a big difference between positive and non-negative. For all practical purposes, the functions described by (B) can be regarded as identically zero. Translating back to the original topic, statement (B) becomes (C) There are algorithms which are always no worse than random guessing, on any prior, provided that 1) the priors on which they perform better than random guessing have zero probability of occurring in practice, and 2) they cannot be infinitely better on these priors. It is true that something improbable may still be possible, but this is only of academic interest.
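Zhu's unit-Gaussian counter-example from earlier in this thread is easy to reproduce numerically. The sketch below is my reconstruction, not Zhu's code, and the exact figures vary with the random seed; it compares A (sample mean), B (sample maximum), and C (cross-validation choosing between A and B on one extra held-out point):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 0.0, 16, 10000
err_a = err_b = err_c = 0.0

for _ in range(trials):
    x = rng.normal(mu, 1.0, size=n)
    extra = rng.normal(mu, 1.0)      # the one extra point used by C
    a, b = x.mean(), x.max()         # estimator A and estimator B
    # C: cross-validate, i.e. keep whichever estimate fits the
    # held-out point better.
    c = a if abs(a - extra) <= abs(b - extra) else b
    err_a += (a - mu) ** 2
    err_b += (b - mu) ** 2
    err_c += (c - mu) ** 2

mse_a, mse_b, mse_c = err_a / trials, err_b / trials, err_c / trials
# mse_a sits near the theoretical 1/16 = 0.0625; mse_b is large;
# mse_c lands in between: worse than A despite the extra data point.
```

This reproduces the ordering Zhu reported (roughly 0.06, 3.4, and 0.56 for A, B, C): cross-validation is far better than always trusting B, but strictly worse than an estimator that simply uses the prior knowledge that the sample mean is optimal.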
In most modern treatments of function spaces, functions are only identified up to a set of measure zero, so that phrases like "almost everywhere" or "almost surely" are redundant. I suspect that, due to the way the NFL theorems are proved, even (C) is impossible, but this does not matter anyway, because (C) itself is of no practical interest whatsoever. > ---- Considering cross validation, Craig wrote > > There is another important issue which needs to be clarified, and that is the > definition of CV and the kinds of problems to which it can be applied. Now > anybody can make whatever definition they want, and then come to some > conclusions based upon that definition, and that conclusion may be correct > given that definition. However, there are also advantages to sharing a common > intellectual currency. Risking a little over-simplification, I would like to summarise the two usages of CV as the following: (CV1) A method for evaluating estimates. (CV2) A method for evaluating estimators. The key difference is that in (CV1) a decision is made for each sample, while in (CV2) a decision is made for all samples. If (CV1) is applied to two algorithms A and B, then we can always define a third algorithm C, by always choosing the estimate given by either A or B which is favoured by (CV1). But my previous counter-example shows that, averaging over all samples, C can be worse than A. One may seek refuge in statements like "optimal decision for each sample does not mean optimal decision for all samples". Well, such incoherent inference is the defining characteristic of non-Bayesian statistics. In Bayesian decision theory it is well known that a method is optimal iff it is optimal on almost all samples (excluding various measure-zero anomalies). The case of (CV2) is quite different. It is of a higher level than algorithms like A and B.
It is in fact a statistical estimator mapping (D,A,f) to a real number r, where D is a finite data set, A is a given algorithm, f is an objective function, and r is the predicted average performance. It should therefore be compared with other such methods. This appears not to be a topic considered in this discussion. -------------- Kevin Cherkauer wrote > > You forgot > > D: Anti-cross validation to choose between A and B, with one extra data > point. Well, I did not forget that, as you have quoted below, point 6. > > I don't understand your claim that "cross validation IS harmful in this case." > You seem to equate "harmful" with "suboptimal." See my original answer, points 1. and 4. > Cross validation is a technique > we use to guess the answer when we don't already know the answer. This is true for any statistical estimator. > You give > technique A the benefit of your prior knowledge of the true answer, but C must > operate without this knowledge. The prior knowledge is that the distribution is a unit Gaussian with unspecified mean; the true answer is its mean. No, they are not the same thing. C also operates with the knowledge that the distribution is a unit Gaussian, but it refuses to use this knowledge (which implies A is better than B). Instead, it insists on evaluating A and B on a cross-validation set. That's why it performs miserably. > A fair comparison would pit C against D, not C > against A. As you say: > > >6. In any of the above cases, "anti cross validation" would be even > >more disastrous. If the definition were that "An algorithm is good if it is no worse than the worst algorithm", then I would have no objection. Well, almost any algorithm would be good in this sense. However, if the phrase "in any of the above cases" is dropped without putting a prior restriction in as a remedy, then it is also true that every algorithm is as bad as the worst algorithm. Huaiyu PS.
I think I have already talked enough about this subject so I'll shut up from now on, unless there's anything new to say. More systematic treatment of these subjects instead of counter-examples can be found in the ftp site below. -- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From minton at ISI.EDU Tue Dec 19 14:53:27 1995 From: minton at ISI.EDU (minton@ISI.EDU) Date: Tue, 19 Dec 95 11:53:27 PST Subject: JAIR article Message-ID: <9512191953.AA11913@sungod.isi.edu> Readers of this mailing list may be interested in the following JAIR article, which was just published: Weiss, S.M. and Indurkhya, N. (1995) "Rule-based Machine Learning Methods for Functional Prediction", Volume 3, pages 383-403. PostScript: volume3/weiss95a.ps (527K) compressed, volume3/weiss95a.ps.Z (166K) Abstract: We describe a machine learning method for predicting the value of a real-valued function, given the values of multiple input variables. The method induces solutions from samples in the form of ordered disjunctive normal form (DNF) decision rules. A central objective of the method and representation is the induction of compact, easily interpretable solutions. This rule-based decision model can be extended to search efficiently for similar cases prior to approximating function values. Experimental results on real-world data demonstrate that the new techniques are competitive with existing machine learning and statistical methods and can sometimes yield superior regression performance. 
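As a toy illustration of the ordered-rule representation the Weiss and Indurkhya abstract describes (a sketch under my own assumptions, not their actual induction method or rule format), an ordered DNF rule list for functional prediction can be evaluated like this:

```python
def predict(rules, default, x):
    """Evaluate an ordered rule list: the first rule whose conjunction
    of interval tests is satisfied supplies the predicted value."""
    for conditions, value in rules:
        if all(low <= x[i] <= high for i, low, high in conditions):
            return value
    return default

# Hypothetical induced rules: (conditions, prediction), where each
# condition is an interval test (feature index, low, high).
rules = [
    ([(0, 0.0, 1.0), (1, 5.0, 10.0)], 3.2),
    ([(0, 1.0, 2.0)], 1.5),
]
first = predict(rules, 0.0, [0.5, 7.0])   # both tests of rule 1 pass
second = predict(rules, 0.0, [1.5, 0.0])  # rule 1 fails, rule 2 fires
none = predict(rules, 0.0, [9.0, 9.0])    # nothing fires, use default
```

The ordering is what makes such rule lists compact and easy to interpret: each rule only needs to describe the cases not already covered by the rules above it.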
The PostScript file is available via: -- comp.ai.jair.papers -- World Wide Web: The URL for our World Wide Web server is http://www.cs.washington.edu/research/jair/home.html -- Anonymous FTP from either of the two sites below: CMU: p.gp.cs.cmu.edu directory: /usr/jair/pub/volume3 Genoa: ftp.mrg.dist.unige.it directory: pub/jair/pub/volume3 -- automated email. Send mail to jair at cs.cmu.edu or jair at ftp.mrg.dist.unige.it with the subject AUTORESPOND, and the body GET VOLUME3/FILE-NM (e.g., GET VOLUME3/MOONEY95A.PS) Note: Your mailer might find our files too large to handle. Also, note that compressed files cannot be emailed, since they are binary files. -- JAIR Gopher server: At p.gp.cs.cmu.edu, port 70. For more information about JAIR, check out our WWW or FTP sites, or send electronic mail to jair at cs.cmu.edu with the subject AUTORESPOND and the message body HELP, or contact jair-ed at ptolemy.arc.nasa.gov. From lucas at scr.siemens.com Tue Dec 19 12:26:15 1995 From: lucas at scr.siemens.com (Lucas Parra) Date: Tue, 19 Dec 1995 12:26:15 -0500 (EST) Subject: Preprint: Symplectic Nonlinear Component Analysis Message-ID: <199512191726.MAA04146@owl.scr.siemens.com> Dear fellow connectionists, a preprint of the following NIPS*95 paper is available at: ftp://archive.cis.ohio-state.edu/pub/neuroprose/parra.nips95.ps.Z Symplectic Nonlinear Component Analysis Lucas C. Parra Siemens Corporate Research lucas at scr.siemens.com Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linear correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linear statistical dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis finally should find a factorial representation for nonlinear statistical dependent distributed signals. 
This paper proposes for this task a novel feed-forward, information-conserving nonlinear map: explicit symplectic transformations. It also solves the problem of non-Gaussian output distributions by considering single-coordinate higher-order statistics. From jlm at crab.psy.cmu.edu Wed Dec 20 18:16:31 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Wed, 20 Dec 95 18:16:31 EST Subject: Technical Report Available Message-ID: <9512202316.AA19275@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available electronically from our FTP server or in hard copy form. Instructions for obtaining copies may be found at the end of this post. ======================================================================== On the Time Course of Perceptual Choice: A Model Based on Principles of Neural Computation Marius Usher & James L. McClelland Carnegie Mellon University and the Center for the Neural Basis of Cognition Technical Report PDP.CNS.95.5 December 1995 The time course of information processing is discussed in a model based on leaky, stochastic, non-linear accumulation of activation in mutually inhibitory processing units. The model addresses data from choice tasks using both time-controlled (e.g., deadline or response signal) and standard reaction time paradigms, and accounts simultaneously for aspects of data from both paradigms. In special cases, the model becomes equivalent to a classical diffusion process, but in general a more complex type of diffusion occurs. Mutual inhibition counteracts the effects of information leakage, allows flexible choice behavior regardless of the number of alternatives, and contributes to accounts of additional data from tasks requiring choice with conflict stimuli and word identification tasks.
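The dynamics the abstract describes - leaky, stochastic accumulation of activation in mutually inhibitory units, raced to a threshold - can be sketched as a small simulation. This is a generic leaky competing accumulator in the spirit of the abstract, not the authors' code; all numerical parameter values here are invented for illustration:

```python
import random
random.seed(1)

def simulate(inputs, leak=0.2, inhibition=0.3, dt=0.1,
             noise=0.3, threshold=1.0, max_steps=10000):
    # One trial of leaky, stochastic, mutually inhibitory accumulation.
    # Returns (index of the winning unit, number of steps taken).
    x = [0.0] * len(inputs)
    for step in range(max_steps):
        total = sum(x)
        x = [max(0.0,  # activations are clipped at zero (the nonlinearity)
                 xi + (rho - leak * xi - inhibition * (total - xi)) * dt
                 + noise * random.gauss(0.0, 1.0) * dt ** 0.5)
             for xi, rho in zip(x, inputs)]
        if max(x) >= threshold:
            return x.index(max(x)), step
    return x.index(max(x)), max_steps

# Unit 0 receives twice the input of unit 1, so it should win most races.
wins = sum(simulate([0.4, 0.2])[0] == 0 for _ in range(500))
print(wins, "of 500 races won by the better-supported unit")
```

With equal inputs the two units win equally often by symmetry; the noise term is what produces a distribution of decision times rather than a fixed one, and the mutual-inhibition term is what makes the race competitive rather than two independent accumulators.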
====================================================================== Retrieval information for pdp.cns TRs: unix> ftp 128.2.248.152 # hydra.psy.cmu.edu Name: anonymous Password: ftp> cd pub/pdp.cns ftp> binary ftp> get pdp.cns.95.5.ps.Z # gets this tr ftp> quit unix> zcat pdp.cns.95.5.ps.Z | lpr # or however you print postscript NOTE: The compressed file is 567,075 bytes long. Uncompressed, the file is 1,768,398 bytes long. The printed version is 53 total pages long. For those who do not have FTP access, physical copies can be requested from Barbara Dorney. For a list of available PDP.CNS Technical Reports: > get README For the titles and abstracts: > get ABSTRACTS From dhw at santafe.edu Wed Dec 20 20:00:48 1995 From: dhw at santafe.edu (David Wolpert) Date: Wed, 20 Dec 95 18:00:48 MST Subject: NFL once again, I'm afraid Message-ID: <9512210100.AA06007@sfi.santafe.edu> First and foremost, I would like to request that this NFL thread fade out. It is only sowing confusion - people should read the papers on NFL to understand NFL. [[ Moderator's note: I concur. We've had enough "No Free Lunch" discussion for a while; people are starting to protest. Future discussion should be done in email. -- Dave Touretzky, CONNECTIONISTS moderator ]] Full stop. *After* that, after there is common grounding, we can all debate. There is much else that connectionist is more appropriate for in the meantime. (To repeat: ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z.) Please, I'm on my knees, use the time that would have been spent thrashing at connectionist in a more fruitful fashion. Like by reading the NFL papers.
:-) *** Hicks writes: >>> case 1: * Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". >>> This is wrong. Construct the following algorithm: "If CV says one of the algorithms under consideration has particularly low error in comparison to the other, use that algorithm. Otherwise, choose randomly among the algorithms." Averaged over all targets, this will do exactly as well as the algorithm that always guesses randomly among the algorithms. (For zero-one loss, either OTS error or IID error with a big input space, etc.) So you cannot rely on CV's error estimate *at all* (unless you impose a prior over targets or some such, etc.). Alternatively, keep in mind the following simple argument: In its uniform prior(targets) formulation, NFL holds even for error distributions conditioned on *any* property of the training set. So in particular, you can condition on having a training set for which CV says "yep, I'm sure; choose that one". And NFL still holds. So even in those cases where CV "is sure", by following CV, you'll die as often as not. >>> case 2: * The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). >>> This is wrong. As has been mentioned many times, having structure in the target, by itself, gains you nothing. And as has also been mentioned, if "the candidate functions are positively correlated with the target function", then in fact *anti-CV wins*. READ THE PAPERS. >>> By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function.
This is different from the ensemble and distribution of target functions, which is a much bigger question. >>> This distinction is irrelevant. There are versions of NFL that address both of these cases (as well as many others). READ THE PAPERS. ***** Lemm writes: >>> 1.) In short, NFL assumes that data, i.e. information of the form y_i=f(x_i), do not contain information about function values on a non-overlapping test set. >>> This is wrong. See all the previous discussion about how NFL holds even if you restrict yourself to targets with a lot of structure. The problem is that the structure can hurt just as easily as help. There is no need for the data set to contain no information about the test set - simply that the limited types of information can "confuse" the learning algorithm at hand. READ THE PAPERS. >>> This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... >>> This is wrong. There is (obviously) a version of NFL that holds for uniform priors. And there is another version in which one averages over all priors - so the uniform prior has measure 0. But one can also restrict oneself to average only over those priors "with a lot of structure", and again get NFL. And there are many other versions of NFL in which there is *no* prior, because things are conditioned on a fixed target. Exactly as in (non-Bayesian) sampling theory statistics. Some of those alternative NFL results involve saying "if you're conditioning on a target, there are as many such targets where you die as where you do well". Other NFL results never vary the target *in any sense*, even to compare different targets. Rather they vary something concerning the generalizer. This is the case with the more sophisticated xvalidation results, for example. READ THE PAPERS. >>> There is much information which is not of this "single sharp data" type. (For examples, see below.)
>>> *Obviously* if you have extra information and/or knowledge beyond that in the training set, you can (often) do better than randomly. That's what Bayesian analysis is all about. More generally, as I have proven in [1], the probability of error can be written as a non-Euclidean inner product between the learning algorithm and the posterior. So obviously if your posterior is structured in an appropriate manner, that can be exploited by the algorithm. This was never the issue, however. The issue had to do with "blind" supervised learning, in which one has no such additional information. Like in COLT, for example. You're arguing apples and oranges here. >>> 4) Real measurements (especially of continuous variables) normally do also NOT have the form y_i=f(x_i) ! They mostly perform some averaging over f(x_i) or at least they have some noise on the x_i (as small as you like, but present). >>> Again, this is obvious. And stated explicitly in the papers, moreover. And completely irrelevant to the current discussion. The issue at hand has *always* been "sharp" data. And if you look at what's done in the neural net community, or in COLT, 95% of it assumes "sharp data". Indeed, there are many other assumptions almost always made and almost never true that Lemm has missed. Like making a "weak filtering assumption": assume the target and the distribution over inputs are independent. But again, just like in COLT, we're starting simple here, with such assumptions intact. READ THE PAPERS. >>> This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. >>> To give one simple example, what about with categorical data, where there is not even a partial ordering over the inputs? What does "locally smooth" even mean then?
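The averaged-over-all-targets claims earlier in this message can be checked by brute force on a toy problem. The following sketch is my construction, not anything from the NFL papers: it enumerates all 32 binary targets on a five-point input space, trains on three fixed points with noise-free "sharp" data, and compares average off-training-set (OTS) error for a majority-vote learner, its "anti" counterpart, and a learner that chooses between them by leave-one-out cross-validation. All three come out at exactly 0.5:

```python
from itertools import product

TRAIN = [0, 1, 2]   # fixed training inputs ("sharp", noise-free data)
TEST = [3, 4]       # off-training-set (OTS) inputs

def majority(bits):
    # Majority vote over a list of 0/1 labels; ties go to 1.
    return int(2 * sum(bits) >= len(bits))

def alg_a(labels):      # predict the training-set majority on every OTS input
    return majority(labels)

def alg_b(labels):      # "anti" learner: predict the training-set minority
    return 1 - majority(labels)

def alg_cv(labels):     # choose between alg_a and alg_b by leave-one-out CV
    def loo_err(alg):
        return sum(alg(labels[:i] + labels[i + 1:]) != labels[i]
                   for i in range(len(labels)))
    return alg_a(labels) if loo_err(alg_a) <= loo_err(alg_b) else alg_b(labels)

def avg_ots_error(alg):
    # Average OTS error over every binary target on the 5-point input space.
    targets = list(product([0, 1], repeat=5))
    total = 0.0
    for f in targets:
        guess = alg([f[i] for i in TRAIN])
        total += sum(guess != f[i] for i in TEST) / len(TEST)
    return total / len(targets)

print(avg_ots_error(alg_a), avg_ots_error(alg_b), avg_ots_error(alg_cv))
# All three print 0.5: under the uniform average over targets,
# CV-based selection buys nothing off the training set.
```

The stronger point - conditioning on training sets where CV "is sure" - can be checked the same way, by filtering the targets on a training-set property before averaging: the OTS labels remain uniform in the filtered average, so the error stays at 0.5.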
And even if we're dealing with real valued spaces, if there's input space noise, NFL simply changes to be a statement concerning test set elements that are sufficiently far (on the scale of the input space noise) from the elements of the training set. The input space noise makes the math more messy, but doesn't change the underlying phenomenon. (Readers interested in previous work on the relationship between local (!) regularization, smoothness, and input noise should see Bishop's Neural Computation article of about 6 months ago.) >>> Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >>> Sigh. 1) I am a Bayesian whenever feasible. (In fact, I've been taken to task for being "too Bayesian".) But situations without obvious priors - or where eliciting the priors is not trivial and you don't have the time - are in fact *very* common. A simple example is a project I am currently involved in, detecting phone fraud for MCI. Quick, tell me the prior probability that a fraudulent call arises from area code 617 vs. the prior probability that a non-fraudulent call does... 2) Essentially all of COLT is non-Bayesian. (Although some of it makes assumptions about things like the support of the priors.) You haven't a prayer of really understanding what COLT has to say without keeping in mind the admonitions of NFL. 3) As I've now said until I'm blue in the face, NFL is only the starting point. What it's "good for", beyond proving to people that they must pay attention to their assumptions, be wary of COLT-type claims, etc. is: head-to-head minimax theory, scrambled algorithms theory, hypothesis-averaging theory, etc., etc., etc. READ THE PAPERS. **** Zhu writes: >>> I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use.
The key difference is: Is it legitimate to be vague about the prior? Put it another way: Do you claim the algorithm can pick up whatever prior automatically, instead of its being specified beforehand? My answer is NO, to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without prior, as NFL told us. >>> Yes! Everybody, LISTEN TO ZHU!!!! David Wolpert [1] - Wolpert, D. "The Relationship Between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in "The Mathematics of Generalization", D. Wolpert (Ed.), Addison-Wesley, 1995 From terry at salk.edu Wed Dec 20 20:34:15 1995 From: terry at salk.edu (Terry Sejnowski) Date: Wed, 20 Dec 95 17:34:15 PST Subject: Senior Position at GSU Message-ID: <9512210134.AA16333@salk.edu> Forwarded to Connectionists: Date: Mon, 18 Dec 1995 15:00:23 -0500 (EST) From: Donald Edwards Subject: job Dear friends and colleagues, I am writing to let you know of a senior position in computational neuroscience available here in the Department of Biology at Georgia State University. This person would join neurobiologists, physicists, mathematicians and computer scientists in the newly established Center for Neural Communication and Computation, and would participate in the graduate program in Neurobiology in the Department of Biology. This person would also help guide the construction, equipping and staffing of a Laboratory for Computational Neuroscience for which funds have already been obtained from the Georgia Research Alliance. Georgia State University is located in downtown Atlanta. For more information, please contact me at this address, or call at (404) 651-3148. To apply, please send a letter of intent, c.v., and two letters of reference to Search Committee for Computational Neuroscience, Department of Biology, Georgia State University, Atlanta, GA 30302-4010. FAX: (404) 651-2509. Please share this message with anyone who might be interested.
Thanks for your consideration, Don Edwards From erik at kuifje.bbf.uia.ac.be Thu Dec 21 12:48:50 1995 From: erik at kuifje.bbf.uia.ac.be (Erik De Schutter) Date: Thu, 21 Dec 95 17:48:50 GMT Subject: Crete Course in Computational Neuroscience Message-ID: <9512211748.AA27308@kuifje.bbf.uia.ac.be> CRETE COURSE IN COMPUTATIONAL NEUROSCIENCE AUGUST 25 - SEPTEMBER 21, 1996 CRETE, GREECE DIRECTORS: Erik De Schutter (University of Antwerp, Belgium) Idan Segev (Hebrew University, Jerusalem, Israel) Jim Bower (California Institute of Technology, USA) Adonis Moschovakis (University of Crete, Greece) The Crete Course in Computational Neuroscience introduces students to the practical application of computational methods in neuroscience, in particular how to create biologically realistic models of neurons and networks. The course consists of two complementary parts. A distinguished international faculty gives morning lectures on topics in experimental and computational neuroscience. The rest of the day is spent learning how to use simulation software and how to implement a model of the system the student wishes to study. The first week of the course introduces students to the most important techniques in modeling single cells, networks and neural systems. Students learn how to use the GENESIS, NEURON, XPP and other software packages on their individual unix workstations. During the following three weeks the lectures will be more general, moving from modeling single cells and subcellular processes through the simulation of simple circuits and large neuronal networks and, finally, to system level models of the cortex and the brain. The course ends with a presentation of the student modeling projects. The Crete Course in Computational Neuroscience is designed for advanced graduate students and postdoctoral fellows in a variety of disciplines, including neurobiology, physics, electrical engineering, computer science and psychology.
Students are expected to have a basic background in neurobiology as well as some computer experience. A total of 25 students will be accepted, the majority of whom will be from the European Union and affiliated countries. A tuition fee of 500 ECU ($700) covers travel to Crete, lodging and all course-related expenses for European nationals. We encourage students from the Far East and the USA to also apply to this international course. More information and application forms can be obtained: - WWW access: http://bbf-www.uia.ac.be/CRETE/Crete_index.html - by mail: Prof. E. De Schutter Born-Bunge Foundation University of Antwerp - UIA, Universiteitsplein 1 B2610 Antwerp Belgium - email: crete_course at kuifje.bbf.uia.ac.be APPLICATION DEADLINE: April 10th, 1996. Applicants will be notified of the results of the selection procedures before May 1st. FACULTY: M. Abeles (Hebrew University, Jerusalem, Israel), D.J. Amit (University of Rome, Italy and Hebrew University, Israel), R.E. Burke (NIH, USA), C.E. Carr (University of Maryland, USA), A. Destexhe (Université Laval, Canada), R.J. Douglas (Institute of Neuroinformatics, Zurich, Switzerland), T. Flash (Weizmann Institute, Rehovot, Israel), A. Grinvald (Weizmann Institute, Israel), J.J.B. Jack (Oxford University, England), C. Koch (California Institute of Technology, USA), H. Korn (Institut Pasteur, France), A. Lansner (Royal Institute of Technology, Sweden), R. Llinas (New York University, USA), E. Marder (Brandeis University, USA), M. Nicolelis (Duke University, USA), J.M. Rinzel (NIH, USA), W. Singer (Max-Planck Institute, Frankfurt, Germany), S. Tanaka (RIKEN, Japan), A.M. Thomson (Royal Free Hospital, England), S. Ullman (Weizmann Institute, Israel), Y. Yarom (Hebrew University, Israel). The Crete Course in Computational Neuroscience is supported by the European Commission (4th Framework Training and Mobility of Researchers program) and by The Brain Science Foundation (Tokyo).
Local administrative organization: the Institute of Applied and Computational Mathematics of FORTH (Crete, GR). From udah075 at kcl.ac.uk Thu Dec 21 12:53:21 1995 From: udah075 at kcl.ac.uk (Rasmus Petersen) Date: Thu, 21 Dec 95 17:53:21 GMT Subject: studentships for European students Message-ID: <3027.9512211753@maths1.mth.kcl.ac.uk> ************************************************************** Studentships - For EU Students - Please note new age limit It was agreed by the Human Resources Committee and endorsed by the Executive Board of NEuroNet in Paris that up to 10,000 ECU be allocated for studentships each year. These provide support for registration, accommodation and travel to designated workshops and conferences with a significant tutorial component. (The studentships are of a fixed value.) Up to 22 studentships of 450 ECU each will be available for the NEuroFuzzy '96 workshop and tutorials in Prague from 16th-18th April 1996. Applications for these studentships must be received in the NEuroNet Office before 31st December 1995. Successful applicants will be notified in January 1996. Up to 20 studentships of 500 ECU each will be available for the ICANN '96 conference in Bochum, Germany from 16th-19th July 1996. Applications for these studentships must be received in the NEuroNet Office before 3rd March 1996. Successful applicants will be notified in April 1996. Applicants for studentships are limited to full-time students, who are EU nationals, and aged 30 years or less. (Priority will be given to applicants under 25 years of age). All applications should be accompanied by a letter of support from the applicant's Head of Department and should contain verification of the applicant's age, status as a student and nationality. All applications will be reviewed by the Human Resources Committee of NEuroNet.
Please apply in writing to the NEuroNet Administrator: Ms Terhi Garner NEuroNet Department of Electronic and Electrical Engineering King's College London Strand, London WC2R 2LS, UK Fax: +44 (0) 171 873 2559 *********************************************************************** From dhw at santafe.edu Fri Dec 29 19:54:42 1995 From: dhw at santafe.edu (dhw@santafe.edu) Date: Fri, 29 Dec 95 17:54:42 MST Subject: Postdoc opening Message-ID: <9512300054.AA17781@yaqui> The Santa Fe Institute is soliciting applications for a TXN postdoctoral fellow. The fellow is expected to perform research in Machine Learning, Artificial Intelligence, or related areas of statistics. Information about the SFI can be found at http://www.santafe.edu/. Candidates should have a Ph.D. (or expect to receive one soon) and should have backgrounds in computer science, mathematics, statistics, or related fields. Applicants should submit a curriculum vitae, list of publications, statement of research interests, and three letters of recommendation. Please submit your materials in one complete package. Incomplete applications will not be considered. All application materials must be received by March 1, 1996. Decisions will be made by April 1996. Send complete application packages only, preferably hard copy, to: TXN Postdoctoral Committee Attention: David Wolpert Santa Fe Institute 1399 Hyde Park Road Santa Fe, New Mexico 87501 Include your e-mail address and/or fax number. The SFI is an equal opportunity employer. Women and minorities are encouraged to apply. From bozinovs at delusion.cs.umass.edu Sun Dec 31 17:55:53 1995 From: bozinovs at delusion.cs.umass.edu (bozinovs@delusion.cs.umass.edu) Date: Sun, 31 Dec 1995 17:55:53 -0500 Subject: New Book Message-ID: <9512312255.AA25407@delusion.cs.umass.edu> Dear Connectionists, Happy New Year to everybody! At the end of the year I have the pleasure of announcing a new book in the field.
Advertisement: ********************************************************************* New Book! New Book! New Book! New Book! New Book! New Book! --------------------------------------------------------------------- CONSEQUENCE DRIVEN SYSTEMS CONSEQUENCE DRIVEN SYSTEMS CONSEQUENCE DRIVEN SYSTEMS by Stevo Bozinovski *201 pages *79 figures *27 algorithm descriptions *8 tables Among its special features, the book: --------------------------------------- ** provides a unified theory of response-sensitive teaching and learning ** as a result of that theory describes a generic architecture of a neuro-genetic agent capable of performing in 1) consequence sensitive teaching, 2) reinforcement learning, and 3) self-reinforcement learning paradigms ** describes the Crossbar Adaptive Array (CAA) architecture, a 1981 neural network developed within the Adaptive Networks Group, as an example of a neuro-genetic agent ** explains how the CAA architecture was the first neural network that solved a delayed reinforcement learning task, the Dungeons-and-Dragons task, in 1981 ** explains how the 1981 learning method (shown on the cover of the book) is actually the well-known Q-learning method, rediscovered in 1989 ** introduces the Benefit-Cost CAA (B-C CAA) as an extension of the 1981 Benefit-only CAA architecture ** introduces the at-subgoal-go-back algorithm as a modification of the 1981 at-goal-go-back CAA algorithm ** introduces a new type of neuron, denoted as Provoking Adaptive Unit, for dealing with tasks of Distributed Consequence Programming ** illustrates the usage of those neurons as routers in a routing-in-networks-with-faults task ** uses parallel programming techniques in describing the algorithms throughout the book ----------------------------------------- Ordering information ISBN 9989-684-06-5, Gocmar Press, 1995 price: $15, paperback For further information contact the author: bozinovs at cs.umass.edu **********************************************************************
CONTENTS: 1. INTRODUCTION 1.1. The framework 1.2. Agents and architectures 1.3. Neural architectures 1.3.1. Greedy policy neural architectures 1.3.2. Recurrent architectures 1.3.3. Crossbar architectures 1.3.4. Subsumption architecture adaptive arrays 1.4. Problems. Emotional Graphs 1.5. Games. Emotional Petri Nets 1.6. Parallel programming 1.7. Bibliographical and other notes 2. CONSEQUENCE LEARNING AGENTS: A STRUCTURAL THEORY 2.1. The agent-environment interface 2.2. A taxonomy of learning paradigms 2.3. Classes of consequence learning agents 2.4. A generic consequence learning architecture 2.5. Learning rules and routines 2.6. Bibliographical and other notes 3. CONSEQUENCE DRIVEN TEACHING 3.1. Class T agents 3.2. Learners 3.3. Teachers 3.3.1. Toward a theory of teaching systems 3.3.2. Teaching strategies 3.4. Curriculums 3.4.1. Curriculum grammars and languages 3.4.2. Curriculum space approach 3.5. Pattern classification teaching as integer programming 3.6. Pattern classification teaching as dynamic programming 3.7. Bibliographical and other notes 4. EXTERNAL REINFORCEMENT LEARNING 4.1. Reinforcement learning NG agents 4.2. Associative Search Network (ASN) 4.2.1. Basic ASN 4.2.2. Reinforcement predictive ASN 4.3. Actor-Critic architecture 4.4. Bibliographical and other notes 5. SELF-REINFORCEMENT LEARNING 5.1. Conceptual framework 5.2. Self-reinforcement learning and the NG agents 5.3. The Crossbar Adaptive Array architecture 5.4. How it works 5.4.1. Defining primary goals from the genetic environment 5.4.2. Secondary reinforcement mechanism 5.4.3. The CAA learning method 5.5. Example of a CAA architecture 5.6. Solving problems with a CAA architecture 5.6.1. Learning in emotional graphs: Maze running 5.6.2. Learning in loosely defined emotional graphs: Pole balancing 5.7. Another example of a CAA architecture 5.8. Using entropy in Markov Decision Processes 5.9. Issues on the genetic environment 5.9.1. CAA architecture as an optimization architecture 5.9.2.
Complementarity with the Genetic Algorithms 5.9.3. Self-reinforcement: Genetic environment approach 5.10. Bibliographical and other notes 6. CONSEQUENCE PROGRAMMING 6.1. Dynamic Programming and Markov Decision Problems 6.2. Introducing cost in the CAA architecture 6.3. Q-learning 6.4. A taxonomy of the CAA-method based learning algorithms 6.5. Producing optimal solutions in a stochastic environment 6.6. Distributed Consequence Programming: A neural theory 6.6.1. Provoking units: Axon provoked neurons 6.6.2. An illustration: Routing in client-server networks with faults 6.7. Bibliographical and other notes 7. SUMMARY 8. REFERENCES 9. INDEX ********************************************************************* From dhw at santafe.edu Fri Dec 1 11:18:19 1995 From: dhw at santafe.edu (David Wolpert) Date: Fri, 1 Dec 95 09:18:19 MST Subject: Correcting misunderstandings about NFL Message-ID: <9512011618.AA27395@sfi.santafe.edu> This posting is to correct some misunderstandings that were recently posted concerning the NFL theorems. I also draw attention to some of the incorrect interpretations commonly ascribed to certain COLT results. *** Joerg Lemm writes: >>> 1.) If there is no relation between the function values on the test and training set (i.e. P(f(x_j)=y|Data) equal to the unconditional P(f(x_j)=y) ), then, having only training examples y_i = f(x_i) (=data) from a given function, it is clear that I cannot learn anything about values of the function at different arguments, (i.e. for f(x_j), with x_j not equal to any x_i = nonoverlapping test set). >>> Well put. Now here's the tough question: Vapnik *proves* that it is unlikely (for large enough training sets and small enough VC dimension generalizers) for error on the training set and full "generalization error" to be greatly different. Regardless of the target. Using this, Baum and Haussler even wrote a paper "What size net gives valid generalization?"
in which no assumptions whatsoever are made about the target, and yet the authors are able to provide a response to the question of their title. HOW IS THAT POSSIBLE GIVEN WHAT YOU JUST WROTE???? NFL is "obvious". And so are VC bounds on generalization error (well, maybe not "obvious"). And so is the PAC "proof" of Occam's razor. And yet the latter two bound generalization error (for those cases where training set error is small enough) without making any assumptions about the target. What gives? The answer: The math of those works is correct. But far more care must be exercised in the interpretation of that math than you will find in those works. The care involves paying attention to what goes on the right-hand side of the conditioning bars in one's probabilities, and the implications of what goes there. Unfortunately, such conditioning bars are completely absent in those works... (In fact, the sum-total of the difference between Bayesian and COLT approaches to supervised batch learning lies in what's on the right-hand side of those bars, but that's another story. See [2].) As an example, it is widely realized that VC bounds suffer from being worst-case. However, there is another hugely important caveat to those bounds. The community as a whole simply is not aware of that caveat, because the caveat concerns what goes on the right-hand side of the conditioning bar, and this is NEVER made explicit. This caveat is the fact that VC bounds do NOT concern Pr(IID generalization error | observed error on the training set, training set size, VC dimension of the generalizer). But you wouldn't know that to read the claims made on behalf of those bounds ... To give one simple example of the ramifications of this: Let's say you have a favorite low-VC generalizer. And in the course of your career you parse through learning problems, either explicitly or (far more commonly) without even thinking about it.
When you come across one with a large training set on which your generalizer has small generalization error, you want to invoke Vapnik to say you have assurances about full generalization error. Well, sorry. You don't and you can't. You simply can't escape Bayes by using confidence intervals. Confidence intervals in general (not just in VC work) have the annoying property that as soon as you try to use them, very often you contradict the underlying statistical assumptions behind them. Details are in [1] and in the discussion of "We-Learn-It Inc." in [2]. >>> 2.) We are considering two of those (influence) relations P(f(x_j)=y|Data): one, named A, for the true nature (=target) and one, named B, for our model under study (=generalizer). Let P(A and B) be the joint probability distribution for the influence relations for target and generalizer. 3.) Of course, we do not know P(A and B), but in good old Bayesian tradition, we can construct a (hyper-)prior P(C) over the family of probability distributions of the joint distributions C = P(A and B). 4.) NFL now uses the very special prior assumption P(A and B) = P(A)P(B) >>> If I understand you correctly, I would have to disagree. NFL also holds with your P(C) being any prior assumption - more formally, averaging over all priors, you get NFL. So the set of priors for which your favorite algorithm does *worse than random* is just as large as the set for which it does better. (In this sense, the uniform prior is a typical prior, not a pathological one, out on the edge of the space. It is certainly not a "very special prior".) In fact, that's one of the major points of NFL - it's not to see what life would be like if this or that were uniform, but to use such uniformity as a mathematical tool, to get a handle on the underlying geometry of inference, the size of the various spaces (e.g., the size of the space of priors for which you lose to random), etc.
The math *starts* with NFL, and then goes on to many other things (see [1]). It's only the beginning chapter of the textbook.

>>>
I say that it is rational to believe (and David does so too, I think) that in real life cross-validation works better in more cases than anti-cross-validation.
>>>

Oh, most definitely. There are several issues here: 1) what gives with all the "prior-free" general proofs of COLT, given NFL, 2) purely theoretical issues (e.g., as mentioned before, characterizing the relationship between target and generalizers needed for xval. to beat anti-xval.) and 3) perhaps most provocatively of all, seeing if NFL (and the associated mathematical structure) can help you generalize in the real world (e.g., with head-to-head minimax distinctions between generalizers).

***

Finally, Eric Baum weighs in:

>>>
Barak Pearlmutter remarked that saying

We have *no* a priori reason to believe that targets with "low Kolmogorov complexity" (or anything else) are/not likely to occur in the real world.

(which I gather was a quote from David Wolpert?) is akin to saying we have no a priori reason to believe there is non-random structure in the world, which is not true, since we make great predictions about the world.
>>>

Well, let's get a bit formal here. Take all the problems we've ever tried to make "great predictions" on. Let's even say that these problems were randomly chosen from those in the real world (i.e., no selection effects of people simply not reporting when their predictions were not so great). And let's for simplicity say that all the predictions were generated by the same generalizer - the algorithm in the brain of Eric Baum will do as a straw man.

Okay. Now take all those problems together and view them as one huge training set. Better still, add in all the problems that Eric's ancestors addressed, so that the success of his DNA is also taken into account. That's still one training set.
It's a huge one, but it's tiny in comparison to the full spaces it lives in. Saying we (Eric) make "great predictions" simply means that the xvalidation error of our generalizer (Eric) on that training set is small. (You train on part of the data, and predict on the rest.) Formally (!!!!!), this gives no assurances whatsoever about any behavior off-training-set. As I've stated before, without assumptions, you cannot conclude that low xvalidation error leads to low off-training-set generalization error. And of course, each passing second, each new scene you view, is "off-training-set".

The fallacy in Eric's claim was noted all the way back by Hume. Success at inductive inference cannot formally establish the utility of using inductive inference. To claim that it can, you have to invoke inductive inference, and that, as any second grader can tell you, is circular reasoning.

Practically speaking of course, none of this is a concern in the real world. We are all (me included) quite willing to conclude there is structure in the real world. But as was noted above, what we do in practice is not the issue. The issue is one of theory.

***

It's very similar to high-energy physics. There are a bunch of physical constants that, if only slightly varied, would (seem to) make life impossible. Why do they have the values they have? Some invoke the anthropic principle to answer this - we wouldn't be around if they had other values. QED. But many find this a bit of a cop-out, and search for something more fundamental. After all, you could have stopped the progress of physics at any point in the past if you had simply gotten everyone to buy into the anthropic principle at that point in time.

Similarly with inductive inference. You could just cop out and say "anthropic principle" - if inference were not possible, we wouldn't be having this debate. But that's hardly a satisfying answer.
***

Eric goes on:

>>>
Consider the problem of learning to predict the pressure of a gas from its temperature. Wolpert's theorem, and his faith in our lack of prior about the world, predict that any learning algorithm whatever is as likely to be good as any other. This is not correct.
>>>

To give two examples from just the past month, I'm sure MCI and Coca-Cola would be astonished to know that the algorithms they're so pleased with were designed for them by someone having "faith in our lack of prior about the world".

Less glibly, let me address this claim about my "faith" with two quotes from the NFL for supervised learning paper. The first is in the introduction, and the second in a section entitled "On uniform averaging". So neither is exactly hidden...

1) "It cannot be emphasized enough that no claim is being made ... that all algorithms are equivalent in the real world."

2) "The uniform sums over targets ... weren't chosen because there is strong reason to believe that all targets are equally likely to arise in practice. Indeed, in many respects it is absurd to ascribe such a uniformity over possible targets to the real world. Rather the uniform sums were chosen because such sums are a useful theoretical tool with which to analyze supervised learning."

Finally, given that I'm mixing it up with Eric on NFL, I can't help but quote the following from his "What size net gives valid generalization" paper: "We have given bounds (independent of the target) on the training set size vs. neural net size needed such that valid generalization can be expected." (Parenthetical comment added - and true.)

Nowhere in the paper is there any discussion whatsoever of the apparent contradiction between this statement and NFL-type concerns. Indeed, as mentioned above, with only the conditioning-bar-free mathematics in Eric's paper, there is no way to resolve the contradiction. In this particular sense, that paper is extremely misleading.
(See discussion above on misinterpretations of Vapnik's results.)

>>>
Creatures evolving in this "play world" would exploit this structure and understand their world in terms of it. There are other things they would find hard to predict. In fact, it may be mathematically valid to say that one could mathematically construct equally many functions on which these creatures would fail to make good predictions. But so what? So would their competition. This is not relevant to looking for one's key, which is best done under the lamppost, where one has a hope of finding it. In fact, it doesn't seem that the play world creatures would care about all these other functions at all.
>>>

I'm not sure I quite follow this. In particular, the comment about the "competition" seems to be wrong. Let me just carry Eric's metaphor further, though, and point out that it makes a hell of a lot more sense to pull out a flashlight and explore the surrounding territory for your key than it does to spend all your time with your head down, banging into the lamppost. And NFL is such a flashlight.

David Wolpert

[1] The current versions of the NFL for supervised learning papers, nfl.ps.1.Z and nfl.ps.2.Z, at ftp.santafe.edu, in pub/dhw_ftp.

[2] "The Relationship between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in *The Mathematics of Generalization*, D. Wolpert, Ed., Addison-Wesley, 1995.

From marco at McCulloch.Ing.UniFI.IT Fri Dec 1 12:21:43 1995
From: marco at McCulloch.Ing.UniFI.IT (Marco Gori)
Date: Fri, 01 Dec 1995 18:21:43 +0100
Subject: Italian Neural Network Society
Message-ID: <9512011721.AA09634@McCulloch.Ing.UniFI.IT>

==============================================================

This is to announce a new web page describing the aims and the activities of the Italian Neural Network Society.
The page is hosted at the DSI Web server of the Dipartimento di Sistemi e Informatica (Universita' di Firenze) at the following address:

http://www-dsi.ing.unifi.it/neural/siren

-- marco gori.

===============================================================

From schmidhu at informatik.tu-muenchen.de Sun Dec 3 06:40:25 1995
From: schmidhu at informatik.tu-muenchen.de (Juergen Schmidhuber)
Date: Sun, 3 Dec 1995 12:40:25 +0100
Subject: compressibility and generalization
Message-ID: <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de>

Eric Baum wrote:

>>>
(1) While it may be that in classical lattice gas models, a gas does not have high Kolmogorov complexity, this is not the origin of the predictability exploited by physicists. Statistical mechanics follows simply from the assumption that the gas is in a random one of the accessible states, i.e. the states with a given amount of energy. So *define* a *theoretical* gas as follows: Every time you observe it, it is in a random accessible state. Then its Kolmogorov complexity is huge (there are many accessible states) but its macroscopic behavior is predictable. (Actually this is an excellent description of a real gas, given quantum mechanics.)
<<<

(1) The key expression here is ``the assumption that the gas is in a random one of the *accessible* states''. Since the accessible states are defined to be those with equal energy, this greatly restricts the number of possible states. By definition, it is trivial to make a macro-level prediction like ``the total energy will remain constant''. In turn, there are relatively short descriptions of a given history of such a gas. With a truly random gas, however, there are no invariants eliminating most of the possible states. This makes its history incompressible.

(2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''.
Loosely speaking, in a universe whose history is compressible, we may expect to generalize well. A simple, old counting argument shows: most computable universes are incompressible. Therefore, in most computable universes you won't generalize well (this is related to what has been (re)discovered in NFL).

(3) Hence, the best we may hope for is a learning technique with good expected generalization performance in *arbitrary* compressible universes. Actually, another restriction is necessary: the time required for compression and decompression should be ``tolerable''. To formalize the expression ``tolerable'' is the subject of ongoing research.

Juergen Schmidhuber
IDSIA
juergen at idsia.ch

From hicks at cs.titech.ac.jp Sun Dec 3 00:32:43 1995
From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp)
Date: Sun, 3 Dec 1995 14:32:43 +0900
Subject: Is the universe finite?
Message-ID: <199512030532.OAA02207@euclid.cs.titech.ac.jp>

I would like to make 2 points. One concerns a clarification of David Wolpert's definition of the universe. The second is a thought problem meant to illustrate the inevitability of structure.

Point 1: David Wolpert writes:

(1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>ancestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some sense.
"(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well. On the other hand, "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average perform the same. What is the universe about which we are talking? Since mathematical models exist in our minds and on paper in this universe, are they included? I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes.

Point 2: Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=1/2. Apparently there is no structure in this universe. Now let us consider a universe which is a binary-valued N by M matrix random variable AA whose elements are also iid with p(0)=p(1)=1/2. Let us draw a random instance A from AA. Now we define an M-dimensional integer random variable Y depending on X by p(y=Ax) = p(Ax), where x and y are instances of X and Y respectively.

If A happens to be chosen such that y is merely a subset of the elements of x, then the prior p(y), like the prior p(x), will be uniform. But for most choices of A, p(y) will not be uniform at all. So, out of all the possible universes Y, most of them have structure. This happens even though Y and AA have no structure. The structure that Y will have is drawn from a uniform distribution (over AA), but we are only concerned with whether there will be structure or not.

Of course, this proves nothing. And now I am going to make a giant leap of analogy. The following statements are not contradictory:

(a) In a universe drawn at random from the ensemble of all possible universes, we cannot expect to see any particular structure to be more likely than any other structure.

(b) In any given universe, we can expect structure to be present.
Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable?

Craig Hicks

Craig Hicks hicks at cs.titech.ac.jp          | Hisakata no, hikari nodokeki
Ogawa Laboratory, Dept. of Computer Science | Haru no hi ni, Shizu kokoro naku
Tokyo Institute of Technology, Tokyo, Japan | Hana no chiruran
lab: 03-5734-2187 home: 03-3785-1974        | Spring smiles with sun beams
fax (from abroad):                          | sifting down through cloudy dreams
+81(3)5734-2905 OGAWA LAB                   | towards the anxious hearts
03-5734-2905 OGAWA LAB (from Japan)         | beating pitter pat
[ Poem from Hyaku-nin i-syuu ->             | while flower petals scatter.

From arbib at pollux.usc.edu Sun Dec 3 14:28:26 1995
From: arbib at pollux.usc.edu (Michael A. Arbib)
Date: Sun, 3 Dec 1995 11:28:26 -0800 (PST)
Subject: VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES
Message-ID: <199512031928.LAA10890@pollux.usc.edu>

PRELIMINARY CALL FOR PAPERS

Workshop on VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES

Sedona, Arizona, November 22-24, 1996

Co-Directors: Kiisa Nishikawa (Northern Arizona University, Flagstaff) and Michael Arbib (University of Southern California, Los Angeles).

Program Committee: Kiisa Nishikawa (Chair), Michael Arbib, Emilio Bizzi, Chris Comer, Peter Ewert, Simon Giszter, Mel Goodale, Ananda Weerasuriya, Walt Wilczynski, and Phil Zeigler.

Local Arrangements Chair: Kiisa Nishikawa.

This workshop is the sequel to four earlier workshops on the general theme of "Visuomotor Coordination in Frog and Toad: Models and Experiments". The first two were organized by Rolando Lara and Michael Arbib at the University of Massachusetts, Amherst (1981) and Mexico City (1982). The next two were organized by Peter Ewert and Arbib in Kassel and Los Angeles, respectively, with the Proceedings published as follows:

Ewert, J.-P. and Arbib, M.A., Eds., 1989, Visuomotor Coordination: Amphibians, Comparisons, Models and Robots, New York: Plenum Press.

Arbib, M.A. and J.-P.
Ewert, Eds., 1991, Visual Structures and Integrated Functions, Research Notes in Neural Computing 3, Heidelberg, New York: Springer-Verlag.

The time is ripe for a fifth Workshop on this theme, with the more generic title "Visuomotor Coordination: Amphibians, Models, and Comparative Studies". The Workshop will be held in Sedona - a beautiful small resort town set in dramatic red hills in Arizona - straight after the Society for Neuroscience meeting in 1996. Next year, Neuroscience ends on Thursday, November 21, 1996, in Washington, DC, so people can fly to Phoenix that evening, meet Friday, Saturday, and Sunday, and fly home Monday, November 25th (so that US types not going to Neuroscience get the Saturday stopover that they could not get if we met before Neuroscience).

The aim is to study the neural mechanisms of visuomotor coordination in frog and toad both for their intrinsic interest and as a target for developments in computational neuroscience, and also as a basis for comparative and evolutionary studies. The list of subsidiary themes given below is meant to be representative of this comparative dimension, but is not intended to be exhaustive. In each case, the emphasis (but not the exclusive emphasis) will be on papers which contribute to the development of both modeling and experimentation.

Central Theme: Visuomotor Coordination in Frog and Toad

Subsidiary Themes:
Visuomotor Coordination: Comparative and Evolutionary Perspectives
Reaching and Grasping in Frog, Pigeon, and Primate
Cognitive Maps
Auditory Communication (with emphasis on spatial behavior and sensory integration)
Sensory Control of Motor Pattern Generators

Formal registration information will be available in March of 1996.
Scientists who wish to present papers are asked to send three copies of extended abstracts no later than March 31st, 1996 to:

Kiisa Nishikawa
Department of Biological Sciences
Northern Arizona University
Flagstaff, AZ 86011-5640

Notification of the Program Committee's decision will be sent out no later than May 31st, 1996. A decision as to whether or not to publish a proceedings is still pending.

From theresa at umiacs.UMD.EDU Mon Dec 4 10:13:47 1995
From: theresa at umiacs.UMD.EDU (Theresa)
Date: Mon, 04 Dec 1995 10:13:47 -0500
Subject: Postdoc Position in Neural Modeling
Message-ID: <199512041513.KAA05125@skippy.umiacs.UMD.EDU>

The University of Maryland Institute for Advanced Computer Studies (UMIACS) invites applications for postdoctoral positions, beginning summer/fall '96, in the following areas: Real-time Video Indexing, Natural Language Processing, and Neural Modeling. Exceptionally strong candidates from other areas will also be considered.

UMIACS, a state-supported research unit, has been the focal point for interdisciplinary and applications-oriented research activities in computing on the College Park campus. The Institute's 40 faculty members conduct research in high performance computing, software engineering, artificial intelligence, systems, combinatorial algorithms, scientific computing, and computer vision.

Qualified applicants should send a 1-page statement of research interests, curriculum vitae, and the names and addresses of 3 references to:

Prof. Joseph Ja'Ja'
UMIACS
A.V. Williams Building
University of Maryland
College Park, MD 20742

by April 1. UMIACS strongly encourages applications from minorities and women. EOE/AA

From howse at eece.unm.edu Mon Dec 4 11:12:34 1995
From: howse at eece.unm.edu (James W.
Howse)
Date: Mon, 04 Dec 1995 09:12:34 -0700
Subject: Dissertation Available
Message-ID: <9512041612.AA27407@opus.eece.unm.edu>

The following PhD dissertation is available by FTP:

Gradient and Hamiltonian Dynamics: Some Applications to Neural Network Analysis and System Identification

James W. Howse

Abstract

The work in this dissertation is based on decomposing system dynamics into the sum of dissipative (e.g., convergent) and conservative (e.g., periodic) components. Intuitively, this can be viewed as decomposing the dynamics into a component normal to some surface and components tangent to other surfaces. First, this decomposition was applied to existing neural network architectures to analyze their dynamic behavior. Second, this formalism was employed to create models which learn to emulate the behavior of actual systems. The premise of this approach is that the process of system identification can be considered in two stages: model selection and parameter estimation. In this dissertation a technique is presented for constructing dynamical systems with desired qualitative properties. Thus, the model selection stage consists of choosing the dissipative and conservative portions appropriately so that a certain behavior is obtainable. By choosing the parametrization of the models properly, a learning algorithm has been devised and proven to always converge to a set of parameters for which the error between the output of the actual system and the model vanishes. So these models and the associated learning algorithm are guaranteed to solve certain types of nonlinear identification problems.

Retrieval:
ftp ftp.eece.unm.edu
login as anonymous
cd howse
get dissertation.ps.Z

This is a PostScript file compressed with compress. The dissertation is 133 pages long and formatted to print single-sided. If there are any retrieval or printing problems, please let me know. I would welcome any comments or suggestions regarding the dissertation. No hardcopies are available.
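The dissipative-plus-conservative split described in the abstract can be illustrated for the simplest (linear) case. The sketch below is a hypothetical toy example of mine, not code from the dissertation: any linear field A x decomposes uniquely into a symmetric part (gradient-like, dissipative) and an antisymmetric part (Hamiltonian-like, conservative).

```python
# Toy example (hypothetical): a damped rotation dx/dt = A x.
# A splits uniquely as A = S + W, with
#   S = (A + A^T)/2  symmetric      -> gradient-like, dissipative part
#   W = (A - A^T)/2  antisymmetric  -> Hamiltonian-like, conservative part

A = [[-0.5, 1.0],
     [-1.0, -0.5]]

S = [[(A[i][j] + A[j][i]) / 2 for j in range(2)] for i in range(2)]
W = [[(A[i][j] - A[j][i]) / 2 for j in range((2))] for i in range(2)]

# The parts recombine to A exactly, and only S changes the "energy" x.x,
# since for antisymmetric W, x.(W x) = 0 along any trajectory.
assert all(abs(A[i][j] - (S[i][j] + W[i][j])) < 1e-12
           for i in range(2) for j in range(2))
assert W[0][0] == 0 and W[1][1] == 0 and W[0][1] == -W[1][0]

print(S)  # [[-0.5, 0.0], [0.0, -0.5]]  pure damping
print(W)  # [[0.0, 1.0], [-1.0, 0.0]]   pure rotation
```

Here the decay of the system is carried entirely by S, and the oscillation entirely by W, which is the intuition behind analyzing the two components separately.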
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
James Howse - howse at eece.unm.edu
University of New Mexico
Department of EECE, 224D
Albuquerque, NM 87131-1356
Telephone: (505) 277-0805
FAX: (505) 277-1413 or (505) 277-1439
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From zhuh at helios.ASTON.ac.uk Mon Dec 4 15:33:50 1995
From: zhuh at helios.ASTON.ac.uk (zhuh)
Date: Mon, 4 Dec 1995 20:33:50 +0000
Subject: compressibility and generalization
Message-ID: <28443.9512042033@sun.aston.ac.uk>

On the implications of the No Free Lunch Theorem(s) by David Wolpert:

> From: Juergen Schmidhuber
>
> (3) Hence, the best we may hope for is a learning technique with
> good expected generalization performance in *arbitrary* compressible
> universes. Actually, another restriction is necessary: the time
> required for compression and decompression should be ``tolerable''.
> To formalize the expression ``tolerable'' is the subject of ongoing
> research.

However, the deeper NFL Theorem states that this is still impossible:

1. The *non-existence* of structure guarantees that any algorithm will neither win nor lose, compared with the "random algorithm", in the long run. If this were all there is to it, then NFL would be just a tautology.

2. The *mere existence* of structure guarantees that a (not uniformly-random) algorithm is as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good.

3. This is by far one of the most important implications of NFL, yet my sample from Connectionists shows that it is safe to make the posterior prediction that if someone criticises NFL as irrelevant, then he has not got this far yet.
In conclusion: "for an arbitrary environment there is an optimal algorithm" is drastically different from "there is an optimal algorithm for an arbitrary environment", whatever restrictions you place on the word "arbitrary".

--
Huaiyu Zhu, PhD                   email: H.Zhu at aston.ac.uk
Neural Computing Research Group   http://neural-server.aston.ac.uk/People/zhuh
Dept of Computer Science          ftp://cs.aston.ac.uk/neural/zhuh
and Applied Mathematics           tel: +44 121 359 3611 x 5427
Aston University,                 fax: +44 121 333 6215
Birmingham B4 7ET, UK

From dhw at santafe.edu Mon Dec 4 19:49:47 1995
From: dhw at santafe.edu (David Wolpert)
Date: Mon, 4 Dec 95 17:49:47 MST
Subject: Non-randomness is no panacea
Message-ID: <9512050049.AA16646@sfi.santafe.edu>

Craig Hicks writes:

>>>
(1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>ancestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some sense.
>>>

Not at all. The second statement is concerned with theoretical issues, whereas the first one is concerned with practical issues. The distinction is ubiquitous in science and engineering. Even in the little corner of academia known as supervised learning, most people are content to distinguish the concerns of COLT (theory) from those of what-works-in-practice.

>>>
"(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well.
>>>

Were you referring to (2) instead? Neither statement says anything like "we should not concern ourselves with the real universe".

>>>
On the other hand "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average perform the same.
>>>

Again, I would have hoped that nothing I have said could be construed as saying something like that. It may or may not be true, but you said it, not me. :-) I am sorry if you were somehow given the wrong impression.

>>>
I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes.
>>>

This is a time-worn concern. Read up on the past two centuries' worth of battles between Bayesians and non-Bayesians...

>>>
Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=1/2. Apparently there is no structure in this universe.
>>>

NO!!! Forgive my ... passion, but as I've said many times now, even in a purely random universe, there are many very deep distinctions between the behavior of different learning algorithms (and in this sense there is plenty of "structure"). Like head-to-head minimax distinctions. (Or uniform convergence theory a la Vapnik.) Please read the relevant papers! ftp.santafe.edu, pub/dhw_ftp, nfl.ps.1.Z and nfl.ps.2.Z.

>>>
(b) In any given universe, we can expect structure to be present.

Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable?
>>>

Nope. The structure can just as easily negate the usefulness of xvalidation as establish it. And in fact, the version of NFL in which one fixes the target and then averages over generalizers says that the state of the universe is (in a certain precise sense), by itself, irrelevant. Structure or not; that fact alone cannot determine the utility of xvalidation.
***

Although I think it is at best tangential to further discuss Kolmogorov complexity, Juergen Schmidhuber's recent comment deserves a response. He writes:

>>>
(2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''. Loosely speaking, in a universe whose history is compressible, we may expect to generalize well.
>>>

How could this be true? Nothing has been specified in Juergen's statement about the loss function, how test sets are generated (IID vs. off-training-set vs. who knows what), the generalizer used, how it is related (if at all) to the prior over targets (a prior which, I take it, Juergen wishes to be "compressible"), the noise process, whether there is noise in the inputs as well as the outputs, etc., etc. Yet all of those factors are crucial in determining the efficacy of the generalizer.

Obviously if your generalizer *knows* the "compression scheme of the universe", knows the noise process, etc., then it will generalize well. Is that what you're saying, Juergen? It reduces to saying that if you know the prior, you can perform Bayes-optimally. There is certainly no disputing that statement.

It is worth bearing in mind though that NFL can be cast in terms of averages over priors. In that guise, it says that there are just as many priors - just as many ways of having a universe be "compressible", loosely speaking - for which your favorite algorithm dies as there are for which it shines. In fact, it's not hard to show that an average over only those priors that are more than a certain distance from the uniform prior results in NFL - under such an average, for OTS error, etc., all algorithms have the same expected performance. The simple fact of having a non-uniform prior does not mean that better-than-random generalization arises.

***

Structure, compressibility, whatever you want to call it; it can hurt just as readily as it can help.
The simple claim that there is non-randomness in the universe does not establish that any particular algorithm performs better than randomly. To all those who dispute this, I ask that they present a theorem relating generalization error to "compressibility". (To do this of course, they will have to specify the loss function, noise, etc.) Not words, but math, and not just math concerning Kolmogorov complexity considered in isolation. Math presenting a formal relationship between generalization error and "compressibility". (A relationship that doesn't reduce to the statement that if you have information concerning the prior, you can exploit it to generalize well - no rediscovery of the wheel please.)

David Wolpert

From hicks at cs.titech.ac.jp Mon Dec 4 20:40:08 1995
From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp)
Date: Tue, 5 Dec 1995 10:40:08 +0900
Subject: compressibility and generalization
In-Reply-To: Juergen Schmidhuber's message of Sun, 3 Dec 1995 12:40:25 +0100 <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de>
Message-ID: <199512050140.KAA05180@euclid.cs.titech.ac.jp>

On Sun, 3 Dec 1995 12:40:25, Juergen Schmidhuber wrote:

>(2) Back to: what does this have to do with machine learning? As a
>first step, we may simply apply Solomonoff's theory of inductive
>inference to a dynamic system or ``universe''. Loosely speaking,
>in a universe whose history is compressible, we may expect to
>generalize well. A simple, old counting argument shows: most
>computable universes are incompressible. Therefore, in most
>computable universes you won't generalize well (this is related
>to what has been (re)discovered in NFL).

In an earlier communication I hypothesized that a typical universe would have structure that could be exploited by cross-validation.
This communication from Juergen Schmidhuber contradicts my hypothesis, I think, because of the existence of the "simple, old counting argument" showing that "most computable universes are incompressible". I stand corrected.

The point I really wanted clarified was what was meant by the assertion that in a typical universe

(A) cross-validation works as well as anti-cross-validation.

I will just talk about the problem of (deterministic or stochastic) function estimation. I can accept that for any set of model functions, there will be an infinity of problems where cross-validation will be of no assistance, because that model does not have the capacity to predict future input/output relations from any finite set of examples from the past. This could be either because the true function is pure noise, or because it looks like pure noise from the perspective of any function from the set of candidate model functions. In this case there will be no correlation between predictions and samples, and cross-validation will do its job of telling us that the generalization error is not decreasing.

However, I interpret the assertion that anti-cross-validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst-case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should.
I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). Does assertion (A) mean that this principle fails in alternative universes? Respectfully Yours, Craig Hicks Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan From juergen at idsia.ch Tue Dec 5 12:50:01 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Tue, 5 Dec 95 18:50:01 +0100 Subject: Compressibility and Generalization Message-ID: <9512051750.AA00953@fava.idsia.ch> Shahab Mohaghegh requested a definition of ``compressibility of the history of a universe''. Let S(t) denote the state of a computable universe at discrete time step t. Let's suppose S(t) can be described by n bits. The history of the universe between time step 1 (big bang) and time step t is compressible if it can be computed by an algorithm whose size is clearly less than tn bits. Given a particular computing device, most histories are incompressible: there are 2^tn possible histories, but there are less than (1/2)^c * 2^tn = 2^(tn-c) algorithms with less than tn-c bits (c is a small positive constant). With most possible universes, the mutual algorithmic information between past and future is zero, and previous experience won't help to generalize well in the future. There are a few compressible or ``regular'' universes, however. To use ML terminology, some of them allow for ``generalization by analogy''. Some of them allow for ``generalization by chunking''. Some of them allow for ``generalization by exploiting invariants''. Etc. It would be nice to have a method that can generalize well in *arbitrary* regular universes. Juergen Schmidhuber IDSIA
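The counting argument above is easy to check numerically. A hedged sketch (the function name and numbers are illustrative, not from the posting): since each program computes at most one history, fewer than 2^(tn-c) of the 2^tn possible histories can be computed by a program shorter than tn-c bits, so at least a fraction 1 - 2^(-c) of histories are incompressible by c bits.

```python
# Illustrative sketch: a lower bound on the fraction of tn-bit histories
# that NO program shorter than tn - c bits can compute.
def incompressible_fraction(tn: int, c: int) -> float:
    histories = 2 ** tn                 # all possible tn-bit histories
    short_programs = 2 ** (tn - c) - 1  # bitstrings of length < tn - c
    # each short program computes at most one history, so at most
    # short_programs histories are compressible by c or more bits:
    return 1.0 - short_programs / histories

# For c = 10, at least ~99.9% of 100-bit histories are incompressible
# by 10 or more bits, whatever the computing device.
print(incompressible_fraction(100, 10))
```

Note the bound is independent of tn: making the universe bigger does not make compressible histories any less rare.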
From gluck at pavlov.rutgers.edu Tue Dec 5 16:52:15 1995 From: gluck at pavlov.rutgers.edu (Mark Gluck) Date: Tue, 5 Dec 1995 16:52:15 -0500 Subject: Faculty Openings at Rutgers-Newark for Connectionist Modelers Interested in Cog Sci/Cog Neuro Message-ID: <199512052152.QAA16557@pavlov.rutgers.edu> The following junior faculty openings at Rutgers-Newark may be of interest to connectionist modelers working in the area of Cognitive Psychology and Cognitive Neuroscience. Although a purely theoretical researcher would be considered, someone who combines both theoretical/computational modeling and experimental research would be preferred: - Mark Gluck CENTER FOR MOLECULAR AND BEHAVIORAL NEUROSCIENCE COGNITIVE NEUROSCIENCE One faculty position in human cognitive neuroscience is available at the assistant to full professor level. Scientists with a research focus on the neurobiological basis of higher cortical function in humans, who would be stimulated by the integrative focus and collaborative research environment of the Center for Molecular and Behavioral Neuroscience, are encouraged to apply. Research areas include (but are not limited to) human experimental neuropsychology, neuropsychiatry, brain imaging and neuroplasticity, cognitive neuroscience, neurolinguistics, development, human electrophysiology, computational neuroscience, neural basis of speech, attention, memory, perception, emotion, psychophysics and behavioral genetics. State-of-the-art laboratories and equipment for human research, and a doctoral program in Behavioral and Neural Science, are available in the Center. Additional information on our program, research facilities, and faculty can be obtained over the internet at: http://www.cmbn.rutgers.edu/bns-home.html. Neuroscientists interested in brain/behavior relationships in normal and/or clinical populations should send a CV, names of three references and a brief letter of research goals and philosophy to: Dr.
Paula Tallal, Center for Molecular and Behavioral Neuroscience, Rutgers University, 197 University Avenue, Newark, New Jersey, 07102. Phone: (201) 648-1080 x3200. Fax: (201) 648-1272. Email: tallal at axon.rutgers.edu. COGNITIVE PSYCHOLOGY, ASSISTANT PROFESSOR (TWO POSITIONS) The Department of Psychology at the Newark Campus of Rutgers University invites Ph.D. applications for one tenure-track and one term (non-tenure-track) Assistant Professor position to expand its program in Cognitive Experimental Psychology. One position is in the area of Attention and the second is in Social Cognition or Cognitive Development. The positions call for candidates with an active research program who are effective teachers at both the graduate and undergraduate levels. Candidates must be prepared to teach a variety of undergraduate courses. Send a CV and three letters of recommendation to Professor Harold I. Siegel, Acting Chair, Department of Psychology-Cognitive Search, Rutgers University, Newark, NJ 07102. ----- End Included Message ----- From juergen at idsia.ch Wed Dec 6 04:39:11 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Wed, 6 Dec 95 10:39:11 +0100 Subject: Non-randomness is no panacea. Message-ID: <9512060939.AA02202@fava.idsia.ch> In response to David's response dated Mon, 4 Dec 95: I wrote ``Loosely speaking, in a universe whose history is compressible, we may expect to generalize well.''. To make this more precise, let us consider a very simple 1-bit universe --- suppose the problem is to extrapolate a sequence of symbols (bits, without loss of generality). We have already observed a bitstring s and would like to predict the next bit. Let si denote the event ``s is followed by symbol i'' for i in {0,1}. David is absolutely right to remind us that we need a prior before applying Bayes. And he is right to point out that only if we have information concerning the prior can we exploit it to generalize well.
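One way to make this concrete in the 1-bit universe just described: prefer the continuation of s with the shorter compressed description, i.e. a prior biased toward regularity. The sketch below is my own illustration, not from the posting, and uses zlib output length as a crude (and imperfect) stand-in for Kolmogorov complexity.

```python
import zlib

# Editorial sketch: predict the continuation si of bitstring s whose
# description is shorter, i.e. the lower-"complexity" continuation.
# zlib compressed length is only a rough proxy for Kolmogorov complexity
# K(si); it merely captures the bias toward compressible histories.
def predict_next_bit(s: str) -> str:
    def cost(t: str) -> int:
        return len(zlib.compress(t.encode(), 9))
    # ties are broken toward "0"; with a true universal prior we would
    # compare the a priori probabilities P(s0) and P(s1) instead
    return "0" if cost(s + "0") <= cost(s + "1") else "1"

# regular histories are extrapolated by their regularity:
print(predict_next_bit("0" * 2000))    # a constant universe
print(predict_next_bit("01" * 1000))   # a periodic universe
```

For an incompressible (noisy) s, both continuations cost about the same and the prediction carries essentially no information, matching the point that the prior only helps in regular universes.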
In the context of the present discussion, however, an interesting point is: there is a special prior that is biased towards *arbitrary* compressibility/structure/regularity. Following Solomonoff/Levin/Chaitin/Li&Vitanyi, define P(s), the a priori probability of a bitstring s, as the probability of guessing a (halting) program that computes s on a universal Turing machine U. Here, the way of guessing is defined by the following procedure: initially, the input tape consists of a single square. Whenever the scanning head of the input tape shifts to the right, do: (1) Append a new square. (2) With probability 1/2 fill it with a 0; with probability 1/2 fill it with a 1. Bayes tells us P(s0|s) = P(s|s0)P(s0)/P(s) = P(s0)/P(s), since P(s|s0) = 1; likewise P(s1|s) = P(s1)/P(s). We are going to predict ``the next bit will be 0'' if P(s0) > P(s1), and vice versa. Due to the coding theorem (Levin 74, Chaitin 75), P(si) = O((1/2)^K(si)) for i in {0,1}, where K(x) denotes the Kolmogorov complexity of x; so the continuation with lower Kolmogorov complexity will (in general) be more likely. If s is ``noisy'' then this will be reflected by its relatively high Kolmogorov complexity. I am not saying anything new here. I'd just like to point out that if you know nothing about your universe except that it is regular in some way, then P is of interest. Sadly, most possible universes are completely irregular and incompressible. But for the few (but infinitely many) that are not, P is a prior to consider (at least if we don't care about computing time and constant factors). Perhaps there are too many threads in the current discussion. I'll shut up for a while. Juergen Schmidhuber IDSIA From goldfarb at unb.ca Wed Dec 6 15:54:00 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Wed, 6 Dec 1995 16:54:00 -0400 (AST) Subject: Compressibility and Generalization In-Reply-To: <9512051750.AA00953@fava.idsia.ch> Message-ID: On Tue, 5 Dec 1995, Juergen Schmidhuber wrote: > ``compressibility of the history of a universe''.
> > There are a few compressible or ``regular'' universes, > however. To use ML terminology, some of them allow for > ``generalization by analogy''. Some of them allow for > ``generalization by chunking''. Some of them allow for > ``generalization by exploiting invariants''. Etc. It > would be nice to have a method that can generalize well > in *arbitrary* regular universes. For a proposal on how to formally capture the concept of an "arbitrary regular universe" for the purposes of inductive learning (and generalization), i.e. the concept of a "combinative" representation in a universe, see the two references below as well as the original two papers published in Pattern Recognition (and mentioned in each of the two references). The structure of objects in the universe was discussed on the INDUCTIVE list. It appears that the concept of a "symbolic" representation has to be formalized first (via the concept of a transformation system), and the fundamentally new concept of *inductive class structure*, not present in other ML models, becomes of critical importance. The issue of dynamic object representation, so conspicuously (and not surprisingly) absent from the ongoing (classical) "statistical" discussion of inductive learning, is also brought to the fore. 1. L. Goldfarb and S. Nigam, The unified learning paradigm: A foundation for AI, in V. Honavar and L. Uhr, eds., Artificial Intelligence and Neural Networks: Steps toward Principled Integration, Academic Press, 1994. 2. L. Goldfarb, J. Abela, V.C. Bhavsar, V.N. Kamat, Can a vector space based learning model discover inductive class generalization in a symbolic environment? Pattern Recognition Letters 16, 719-726, 1995.
-- Lev Goldfarb From N.Sharkey at dcs.shef.ac.uk Thu Dec 7 07:24:09 1995 From: N.Sharkey at dcs.shef.ac.uk (N.Sharkey@dcs.shef.ac.uk) Date: Thu, 7 Dec 95 12:24:09 GMT Subject: CALL FOR ROBOTICS PAPERS Message-ID: <9512071224.AA11298@entropy.dcs.shef.ac.uk> CALL FOR PAPERS ** LEARNING IN ROBOTS AND ANIMALS ** An AISB-96 two-day workshop University of Sussex, Brighton, UK: April 1st & 2nd, 1996 Co-Sponsored by IEE Professional Group C4 (Artificial Intelligence) WORKSHOP ORGANISERS: Noel Sharkey (chair), University of Sheffield, UK. Gillian Hayes, University of Edinburgh, UK. Jan Heemskerk, University of Sheffield, UK. Tony Prescott, University of Sheffield, UK. PROGRAMME COMMITTEE: Dave Cliff, UK. Marco Dorigo, Italy. Frans Groen, Netherlands. John Hallam, UK. John Mayhew, UK. Martin Nillson, Sweden. Claude Touzet, France. Barbara Webb, UK. Uwe Zimmer, Germany. Maja Mataric, USA. For Registration Information: alisonw at cogs.susx.ac.uk In the last five years there has been an explosion of research on Neural Networks and Robotics from both a self-learning and an evolutionary perspective. Within this movement there is also a growing interest in natural adaptive systems as a source of ideas for the design of robots, while robots are beginning to be seen as an effective means of evaluating theories of animal learning and behaviour. A fascinating interchange of ideas has begun between a number of hitherto disparate areas of research, and a shared science of adaptive autonomous agents is emerging. This two-day workshop proposes to bring together an international group both to present papers on their most recent research and to discuss the direction of this emerging field. WORKSHOP FORMAT: The workshop will consist of half-hour presentations with at least 15 minutes being allowed for discussion at the end of each presentation. Short videos of mobile robot systems may be included in presentations. Proposals for robot demonstrations are also welcome.
Please contact the workshop organisers if you are considering bringing a robot, as some local assistance can be arranged. The workshop format may change once the number of accepted papers is known; in particular, there may be some poster presentations. WORKSHOP CONTRIBUTIONS: Contributions are sought from researchers in any field with an interest in the issues outlined above. Areas of particular interest include the following: * Reinforcement, supervised, and imitation learning methods for autonomous robots * Evolutionary methods for robotics * The development of modular architectures and reusable representations * Computational models of animal learning with relevance to robots, robot control systems modelled on animal behaviour * Reviews or position papers on learning in autonomous agents Papers will ideally emphasise real-world problems, robot implementations, or show clear relevance to the understanding of learning in both natural and artificial systems. Papers should not exceed 5000 words in length. Please submit four hard copies to the Workshop Chair (address below) by 30th January, 1996. All papers will be refereed by the Workshop Committee and other specialists. Authors of accepted papers will be notified by 24th February. Final versions of accepted papers must be submitted by 10th March, 1996. A collated set of workshop papers will be distributed to workshop attendees. We are currently negotiating to publish the workshop proceedings as a book. SUBMISSIONS TO: Noel Sharkey Department of Computer Science Regent Court University of Sheffield S1 4DP, Sheffield, UK email: n.sharkey at dcs.sheffield.ac.uk For further information about AISB96 ftp ftp.cogs.susx.ac.uk, login as anonymous, cd pub/aisb/aisb96 From mkearns at research.att.com Thu Dec 7 13:39:00 1995 From: mkearns at research.att.com (Michael J.
Kearns) Date: Thu, 7 Dec 95 13:39 EST Subject: COLT 96 Call for Papers, ASCII Message-ID: ______________________________________________________________________ CALL FOR PAPERS---COLT '96 Ninth Conference on Computational Learning Theory Desenzano del Garda, Italy June 28 -- July 1, 1996 ______________________________________________________________________ The Ninth Conference on Computational Learning Theory (COLT '96) will be held in the town of Desenzano del Garda, Italy, from Friday, June 28, through Monday, July 1, 1996. COLT '96 is sponsored by the Universita` degli Studi di Milano. We invite papers in all areas that relate directly to the analysis of learning algorithms and the theory of machine learning, including neural networks, statistics, statistical physics, Bayesian/MDL estimation, reinforcement learning, inductive inference, knowledge discovery in databases, robotics, and pattern recognition. We also encourage the submission of papers describing experimental results that are supported by theoretical analysis. ABSTRACT SUBMISSION. Authors should submit fifteen copies (preferably two-sided) of an extended abstract to: Michael Kearns --- COLT '96 AT&T Bell Laboratories, Room 2A-423 600 Mountain Avenue Murray Hill, New Jersey 07974-0636 Telephone(for overnight mail): (908) 582-4017 Abstracts must be RECEIVED by FRIDAY JANUARY 12, 1996. This deadline is firm. We are also allowing electronic submissions as an alternative to submitting hardcopy. Instructions for how to submit papers electronically can be obtained by sending email to colt96 at cs.cmu.edu with subject "help", or from our web site: http://www.cs.cmu.edu/~avrim/colt96.html which will also be used to provide other program-related information. Authors will be notified of acceptance or rejection on or before Friday, March 15, 1996. Final camera-ready papers will be due by Friday, April 5. 
Papers that have appeared in journals or other conferences, or that are being submitted to other conferences, are not appropriate for submission to COLT. An exception to this policy is that COLT and STOC have agreed that a paper can be submitted to both conferences, with the understanding that a paper will be automatically withdrawn from COLT if accepted to STOC. ABSTRACT FORMAT. The extended abstract should include a clear definition of the theoretical model used and a clear description of the results, as well as a discussion of their significance, including comparison to other work. Proofs or proof sketches should be included. If the abstract exceeds 10 pages, only the first 10 pages may be examined. A cover letter specifying the contact author and his or her email address should accompany the abstract. PROGRAM FORMAT. At the discretion of the program committee, the program may consist of both long and short talks, corresponding to longer and shorter papers in the proceedings. The short talks will also be coupled with a poster presentation. PROGRAM CHAIRS. Avrim Blum (Carnegie Mellon University) and Michael Kearns (AT&T Bell Laboratories). CONFERENCE AND LOCAL ARRANGEMENTS CHAIRS. Nicolo` Cesa-Bianchi (Universita` di Milano) and Giancarlo Mauri (Universita` di Milano). PROGRAM COMMITTEE. Martin Anthony (London School of Economics), Avrim Blum (Carnegie Mellon University), Bill Gasarch (University of Maryland), Lisa Hellerstein (Northwestern University), Robert Holte (University of Ottawa), Sanjay Jain (National University of Singapore), Michael Kearns (AT&T Bell Laboratories), Nick Littlestone (NEC Research Institute), Yishay Mansour (Tel Aviv University), Steve Omohundro (NEC Research Institute), Manfred Opper (University of Wuerzburg), Lenny Pitt (University of Illinois), Dana Ron (Massachusetts Institute of Technology), Rich Sutton (University of Massachusetts) COLT, ML, AND EUROCOLT. 
The Thirteenth International Conference on Machine Learning (ML '96) will be held right after COLT '96, on July 3--7 in Bari, Italy. In cooperation with COLT, the EuroCOLT conference will not be held in 1996. STUDENT TRAVEL. We anticipate some funds will be available to partially support travel by student authors. Details will be distributed as they become available. From hicks at cs.titech.ac.jp Thu Dec 7 19:49:53 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Fri, 8 Dec 1995 09:49:53 +0900 Subject: compressibility and generalization In-Reply-To: William Finnoff's message of Thu, 7 Dec 95 15:55:52 MST <9512072255.AA25329@predict.com> Message-ID: <199512080049.JAA10560@euclid.cs.titech.ac.jp> finnoff at predict.com (William Finnoff) wrote: >Reading some of the recent postings concerning NFL theorems, it appears >that there are still some misunderstandings about what they refer to in >the versions dealing with statistical inference. For example, Craig >Hicks writes: >> (paraphrase: I want to clarify the meaning of the following assertion) >> (A) cross-validation works as well as anti-cross >> validation (paraphrase: on average) finnoff at predict.com (William Finnoff) continued: >An example of this >would be the case of a two by two contingency table >where the inputs are, say, 0=patient received treatment A, >1=patient received treatment B, and values of the dependent variable >are 0=patient died within three months, or 1=patient still alive >after three months. ... Using the example given above, this corresponds >to cases where the training data contains no examples >of a patient receiving one of the treatments (for example, where >the training data only contains examples of patients >that have received treatment A). Since there is no data for treatment B, how can we use cross-validation? In this case statement (A) above is not wrong, but it is implicitly occurring within a context where there is no data to use for cross-validation.
If so, isn't it rather a trivial statement? Possibly misleading? finnoff at predict.com (William Finnoff) continued: >The NFL theorems state that in this case, unless there is some other prior >information available about the performance of treatment B in keeping patients >alive, all predictions are equivalent in their average expected performance. I certainly wouldn't expect cross-validation to work when it can't even be used. And I think it would work just as well as anti-cross validation, whatever that is, when anti-cross validation is also not being used. In fact, both would score `0', not only on average, but every time, since they are not being used. ---- After further study and reading postings to this list, my current understanding is that (A) merely means that for any problem (cross-validation >= 0), in the sense that it will never be deceptive (never < 0), taking the average across the ensemble of samplings. However, by taking a straight average over a certain infinite (and arguably universal) ensemble of problems, we can obtain Expectation[cross-validation] = 0, because in this ensemble the positive-scoring problems are an infinitely small proportion. This is exciting, because in our universe at the present time evidently Expectation[cross-validation] > 0, which implies a non-uniform prior over the ensemble of problems. Or are we just choosing our problems unfairly? And if so, what algorithm are we using (or is using us) to choose them? Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan PS. I do not claim to be clear on all the issues, or to be free from misunderstandings by any means. PPS. What is anti-cross validation? From WALTSCH at vms.cis.pitt.edu Thu Dec 7 22:27:49 1995 From: WALTSCH at vms.cis.pitt.edu (WALTSCH@vms.cis.pitt.edu) Date: Thu, 07 Dec 1995 23:27:49 -0400 (EDT) Subject: Faculty position in Cognitive Neuroscience Univ.
of Pittsburgh Message-ID: <01HYJKVPQW36AM35MW@vms.cis.pitt.edu> ********Faculty Opening in Cognitive Neuroscience************* The Department of Psychology at the University of Pittsburgh seeks a faculty member at the assistant professor level who studies human cognitive neuroscience. The faculty member must have a strong empirical background, a program of research that brings together neuroscience and behavioral techniques, and an interest in graduate and undergraduate teaching in this area. Candidates are likely to become affiliated with the Center for the Neural Basis of Cognition between the University of Pittsburgh and Carnegie Mellon University. For additional information, see http://neurocog.lrdc.pitt.edu/search Applications should be sent to: Cognitive Neuroscience Search 455 Langley Hall Psychology Department University of Pittsburgh PGH PA 15260. Applications should include: 1. a statement of research and teaching interests 2. a CV 3. copies of selected publications 4. three letters of reference. Initial consideration will begin January 15, 1996, though applications arriving after that date may be considered. The University of Pittsburgh is an Equal Opportunity/Affirmative Action Employer. Women and minority candidates are especially encouraged to apply. From esann at dice.ucl.ac.be Fri Dec 8 12:39:48 1995 From: esann at dice.ucl.ac.be (esann@dice.ucl.ac.be) Date: Fri, 8 Dec 1995 18:39:48 +0100 Subject: ESANN extended deadline Message-ID: <199512081737.SAA18067@ns1.dice.ucl.ac.be> Dear Colleagues, The deadline to submit papers to the ESANN'96 conference (the 4th European Symposium on Artificial Neural Networks, which will be held in Bruges, Belgium, on April 24-26, 1996) was December 8th, 1995 (today!), as announced in the call for papers.
However, as you know, there are important strikes in France and in other countries, and many of you have had problems meeting this deadline because of the post office strike (it is even worse because of the airport strike in Belgium...). So we are pleased to announce that we will accept submission of papers until Friday, December 15th, 1995 (so next Friday!). Please however ensure that the printed copies (no e-mail or fax please) will reach the conference secretariat (see address below), together with the required information (as described in the call for papers), before this date. Please use private mail delivery services if necessary, and don't forget that in most countries Chronopost is NOT a private mail service (for example, because of the strike, the French Chronopost service was not working this week...), while DHL, TNT Mailfast and other companies are private services, and so could be more efficient in the next few days... If you still have problems meeting the new deadline, please contact me personally at the following e-mail address: esann at dice.ucl.ac.be and we will try to arrange another way to transfer your paper. Please feel free to contact me if you need any other information about the submission of papers.
Sincerely yours, Michel Verleysen _____________________________ D facto publications - conference services 45 rue Masui 1210 Brussels Belgium tel: +32 2 245 43 63 fax: +32 2 245 46 94 _____________________________ From giles at research.nj.nec.com Fri Dec 8 14:18:39 1995 From: giles at research.nj.nec.com (Lee Giles) Date: Fri, 8 Dec 95 14:18:39 EST Subject: reprint available Message-ID: <9512081918.AA20599@alta> The following conference paper, published in the 2nd International IEEE Conference on "Massively Parallel Processing Using Optical Interconnections," October, 1995, is now available via the NEC Research Institute archive: ____________________________________________________________________________________ "Predictive Control of Opto-Electronic Reconfigurable Interconnection Networks Using Neural Networks" Majd F. Sakr[1,2], Steven P. Levitan[2], C. Lee Giles[1,3], Bill G. Horne[1], Marco Maggini[4], Donald M. Chiarulli[5] [1] NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 [2] Electrical Engineering Department, U. of Pittsburgh, Pittsburgh, PA 15261 [3] UMIACS, U. of Maryland, College Park, MD 20742 [4] Universita` di Firenze, Dipartimento di Sistemi e Informatica, 50139 Firenze, Italy [5] Computer Science Department, U. of Pittsburgh, Pittsburgh, PA 15260 Abstract Opto-electronic reconfigurable interconnection networks are limited by significant control latency when used in large multiprocessor systems. This latency is the time required to analyze the current traffic and reconfigure the network to establish the required paths. The goal of latency hiding is to minimize the effect of this control overhead. In this paper, we introduce a technique that performs latency hiding by learning the patterns of communication traffic and using that information to anticipate the need for communication paths. Hence, the network provides the required communication paths before a request for a path is made.
In this study, the communication patterns (memory accesses) of a parallel program are used as input to a time delay neural network (TDNN) to perform on-line training and prediction. These predicted communication patterns are used by the interconnection network controller that provides routes for the memory requests. Based on our experiments, the neural network was able to learn highly repetitive communication patterns, and was thus able to predict the allocation of communication paths, resulting in a reduction of communication latency. ------------------------------------------------------------------------------ http://www.neci.nj.nec.com/homepages/giles.html ftp://external.nj.nec.com/pub/giles/papers/MPPOI.95.ps.Z ------------------------------------------------------------------------------ -- C. Lee Giles / Computer Sciences / NEC Research Institute / 4 Independence Way / Princeton, NJ 08540, USA / 609-951-2642 / Fax 2482 http://www.neci.nj.nec.com/homepages/giles.html == From mablume at sdcc10.ucsd.edu Fri Dec 8 17:03:18 1995 From: mablume at sdcc10.ucsd.edu (Matthias Blume) Date: Fri, 8 Dec 1995 14:03:18 -0800 (PST) Subject: Fuzzy ART architecture papers online Message-ID: <199512082203.OAA06153@e3329-4.ucsd.edu> Dear Connectionists, Two papers describing a simple and efficient architecture for Fuzzy ART and Fuzzy ARTMAP are now available online. (Sorry, hardcopies are not available.) ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, An efficient mapping of Fuzzy ART onto a neural architecture (5 pages), submitted to Neural Networks. A novel mapping of the Fuzzy ART algorithm onto a neural network architecture is described. The architecture does not utilize bi-directional synapses, weight transport, or weight duplication, and requires one fewer layer of processing elements than the architecture originally proposed by Carpenter, Grossberg, & Rosen (1991). 
In the new architecture, execution of the algorithm takes constant time per input vector regardless of the relationship between the input and existing templates, and several control signals are eliminated. This mapping facilitates hardware implementation of Fuzzy ART and furthermore serves as a tool for envisioning and understanding the algorithm. Keywords: Fuzzy ART, Fuzzy ARTMAP, parallel hardware, neural architecture. ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.fam_arch.ps.Z http://icse1.ucsd.edu/~mablume/nnletter.ps ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, Optoelectronic Fuzzy ARTMAP processor, Optical Computing, Vol. 10, 1995 OSA Technical Digest Series (Optical Society of America, Washington, DC, 1995), p. 213-215, March 1995. The Fuzzy ARTMAP algorithm can perform well even with weights truncated to 4 bits during training. Furthermore, only the weights corresponding to one processing element are updated after each training sample. Finally, it converges rapidly and relatively uniformly with little dependence on the particular choice of adjustable parameter values and initial state. These characteristics are particularly advantageous for parallel optoelectronic implementations. We map Fuzzy ARTMAP onto an architecture which satisfies the constraints of the hardware, and suggest an implementation which is an appropriate combination of optical and electronic technology. The proposed mapping of the algorithm onto a neural architecture is efficient, requiring only an input layer and one processing layer per fuzzy ART module, and requiring neither weight transport nor multiple copies of weights. The proposed optoelectronic system is simple, yet versatile, and relies on proven components. Keywords: Parallel optoelectronic hardware, Fuzzy ART, neural architecture. 
ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.oe_fam.ps.Z http://icse1.ucsd.edu/~mablume/OSA95.ps ------------------------------------------------------------------------------ - Matthias Blume ECE department, UCSD matthias at ucsd.edu http://icse1.ucsd.edu/~mablume From mpp at watson.ibm.com Fri Dec 8 19:27:29 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Fri, 8 Dec 1995 19:27:29 -0500 (EST) Subject: NFL Summary Message-ID: <9512090027.AA26165@austen.watson.ibm.com> Hi Everyone, There has been a lot of confusion regarding the "No Free Lunch" theorems. Below, I try to summarize what I feel to be the key points. NFL in a Nutshell: ------------------ If you make no assumptions about the target function, then on average all learning algorithms will have the same generalization performance. Apparent Contradiction and Resolution: -------------------------------------- Contradiction: Lots of theoretical results regarding generalization claim to make no assumptions about the target function. Resolution: These theoretical results DO make assumptions (which may or may not be explicit) regarding the target. Importance of NFL: ------------------ The NFL result in and of itself is not terribly interesting, because its assumption (that we make no assumptions) is NEVER true. What makes NFL important is that it emphasizes in a very striking way that it is the ASSUMPTIONS that we make about our learning domains that MAKE ALL THE DIFFERENCE. Therefore, I see NFL *NOT* as a criticism of theoretical generalization results, but rather as a call to examine the assumptions underlying these results, because it is there that we can potentially learn the most about machine learning. Examples of Unstated Assumptions: --------------------------------- In practice, there are numerous assumptions that we as a community usually make when we attempt to learn a task using our favorite algorithm. Below, I list just a few obvious ones. 1) The training and testing data are IID.
2) The data distribution is "smooth" (i.e. "near" data points are in general more similar than "far" data points). This can also be interpreted as some differentiability conditions. 3) NN's approximate real-world functions reasonably well. 4) Starting with small initial weights is good. 5) Overfitting is bad - early stopping is good. 6) Gaussian error models are the best thing since machine sliced bread. REALLY INTERESTING STUFF: ------------------------- I think that the NFL results point towards what I feel are extremely interesting research topics: Exactly what are the assumptions that certain theoretical results require? Exactly how do these assumptions affect generalization? Which assumptions are necessary/sufficient? How do different assumptions compare? Can we identify a set of assumptions that are equivalent to the assumption that CV model selection improves generalization? Can we do the same for early stopping? Bagging? (You can be damn sure I can do this for averaging... :-) Etc, etc, ... Caveat: ------- All of the above is conditioned on the assumption that David Wolpert did his math correctly when deriving the NFL theorems... :-) I hope all of this helps clear things up. Comments? Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From jlm at crab.psy.cmu.edu Sat Dec 9 17:35:01 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Sat, 9 Dec 95 17:35:01 EST Subject: TR Announcement Message-ID: <9512092235.AA21814@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available both electronically from our own FTP server and in hard copy form. Instructions for obtaining copies may be found at the end of this post.
======================================================================== Stochastic Interactive Processing, Channel Separability, and Optimal Perceptual Inference: An Examination of Morton's Law Javier R. Movellan & James L. McClelland Technical Report PDP.CNS.95.4 December 1995 In this paper we examine a regularity found in human perception, called Morton's law, in which stimulus and context have independent influences on perception. This regularity has been used in the past to argue that perception is a feed-forward, non-interactive process. Building on earlier work by McClelland (Cognitive Psychology, 1991) we illustrate how Morton's law may emerge from stochastic interactions between simple processing units. To this end we consider the properties of interactive diffusion networks, the continuous stochastic limit of standard artificial neural models. If, as we believe, human information processing involves using noisy processing elements to process potentially noisy inputs, such models may ultimately serve as foundations for a theory of human information processing. We show that Morton's law emerges in recurrent diffusion networks when the units are organized into separable channels; feed-forward processing is not a necessary condition for Morton's law to hold. Failures to exhibit Morton's law provide evidence that the information channels are not separable. This result can be used to analyze cognitive models as well as actual brain structures. Finally, we illustrate how diffusion networks can be organized to implement optimal Bayesian perceptual inference. ======================================================================= Retrieval information for pdp.cns TRs: unix> ftp 128.2.248.152 # hydra.psy.cmu.edu Name: anonymous Password: ftp> cd pub/pdp.cns ftp> binary ftp> get pdp.cns.95.4.ps.Z # gets this tr ftp> quit unix> zcat pdp.cns.95.4.ps.Z | lpr # or however you print postscript NOTE: The compressed file is 255910 bytes long.
Uncompressed, the file is 727359 bytes long. The printed version is 66 total pages long. For those who do not have FTP access, physical copies can be requested from Barbara Dorney. For a list of available PDP.CNS Technical Reports: > get README For the titles and abstracts: > get ABSTRACTS From hicks at cs.titech.ac.jp Sun Dec 10 09:24:29 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Sun, 10 Dec 1995 23:24:29 +0900 Subject: NFL Summary In-Reply-To: Michael Perrone's message of Fri, 8 Dec 1995 19:27:29 -0500 (EST) <9512090027.AA26165@austen.watson.ibm.com> Message-ID: <199512101424.XAA13664@euclid.cs.titech.ac.jp> Michael Perrone writes: > I think that the NFL results point towards what I feel are extremely > interesting research topics: > ... > Can we identify a set of assumptions that are equivalent to the > assumption that CV model selection improves generalization? CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B. This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. (A quick note: Sometimes it is advocated that a complexity parameter be set by splitting the data set into training and testing, and using CV. Then with the complexity parameter fixed the whole data set can be used to train the other parameters. Behind this is an ASSUMPTION about the independence of the complexity from the other parameters. Of course it often works in practice, but it violates the principle in the above paragraph, so I do not count this as real CV here.) Two prerequisites exist to obtain a GAIN with CV: 1) The objective function must be "compressible". I.e., it cannot be noise.
2) We must have a model which can recognize the structure in the data. This structure might be quite hard to see, as in chaotic signals. I think NFL says that on average CV will not obtain GAINful results, because the chance that a randomly selected problem and a randomly selected algorithm will hit it off is vanishingly small. (Or even any fixed problem and a randomly selected algorithm.) But I think it tells us something more important as well. It tells us that not using CV means we are always implicitly trusting our a priori knowledge. Any reasonable learning algorithm can always predict the training data, or a "smoothed" version of it. But because of the NFL theorem, this, over the ensemble of all algorithms and problems, means nothing. On average there will be no improvement in the off training set error. Fortunately, CV will report this fact by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set data sets; CV may be deceptive for a single off training set data set.) Thus, we shouldn't think we can do away with CV unless we admit to having great faith in our prior. Going back to NFL, I think it poses another very interesting problem: Supposing we have "a foot in the door". That is, an algorithm which makes some sense of the data by showing some degree of prediction capability. Can we always use this prediction ability to gain better prediction ability? Is there some kind of ability to perform something like steepest descent over the space of algorithms, ONCE we are started on a slope? Is there a provable snowball effect? I think NFL reminds us that we are already rolling down the hill, and we shouldn't think otherwise.
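The averaging claim above can be checked with a toy computation. This is only a sketch: the 4-point domain, the training/test split, and the two toy generalizers ("majority" and its perverse twin) are hypothetical choices for illustration, not anything from the NFL papers themselves.

```python
from itertools import product

# Average off-training-set (OTS) error over ALL binary targets on a
# tiny 4-point domain.  Training inputs are {0,1}, test inputs {2,3}.
X = [0, 1, 2, 3]
train_x, test_x = [0, 1], [2, 3]

def majority(train_y):
    """Predict the most common training label at every new point."""
    return 1 if sum(train_y) * 2 >= len(train_y) else 0

def anti_majority(train_y):
    """Deliberately predict the opposite of the majority label."""
    return 1 - majority(train_y)

def avg_ots_error(learner):
    errs = []
    for f in product([0, 1], repeat=len(X)):   # all 16 targets f: X -> {0,1}
        train_y = [f[x] for x in train_x]      # what the learner sees
        guess = learner(train_y)               # its constant OTS guess
        errs.append(sum(guess != f[x] for x in test_x) / len(test_x))
    return sum(errs) / len(errs)

# Averaged over every possible target, the "sensible" and the
# "perverse" generalizer are indistinguishable: both score 0.5.
print(avg_ots_error(majority), avg_ots_error(anti_majority))  # -> 0.5 0.5
```

The same tie holds for any deterministic learner plugged in here, which is the uniform-prior NFL averaging argument in miniature.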
Craig Hicks Tokyo Institute of Technology From goldfarb at unb.ca Sun Dec 10 10:52:29 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Sun, 10 Dec 1995 11:52:29 -0400 (AST) Subject: NFL Summary In-Reply-To: <9512090027.AA26165@austen.watson.ibm.com> Message-ID: On Fri, 8 Dec 1995, Michael Perrone wrote: > NFL in a Nutshell: > ------------------ > If you make no assumptions about the target function [specifically, about the axiomatic structure of the sample space and the inductive generalization, i.e. which ones are the most general for the purpose] Strange as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. Objects in the universe do have structure, especially when they have to be represented, as we have learned from the data types in computer science: TO REPRESENT AN OBJECT IS TO MAKE SOME ASSUMPTIONS ABOUT THE OPERATIONS RELATED TO ITS MANIPULATION. Cheers, Lev Goldfarb From XIAODONG at rivendell.otago.ac.nz Sun Dec 10 20:46:21 1995 From: XIAODONG at rivendell.otago.ac.nz (Xiaodong Li, Otago University, New Zealand) Date: Mon, 11 Dec 1995 14:46:21 +1300 Subject: Paper available "Connectionist Model Based on an Optical Thin-Film Model" Message-ID: <01HYONVDU5GYLBVSXM@rivendell.otago.ac.nz> FTP-host: archive.cis.ohio-state.edu FTP-filename:/pub/neuroprose/xli.thinfilm.ps.Z The file xli.thinfilm.ps.Z is now available for ftp from the Neuroprose repository. Connectionist Learning Using an Optical Thin-Film Model (4 pages) Martin Purvis and Xiaodong Li Computer and Information Science University of Otago Dunedin, New Zealand ABSTRACT: An alternative connectionist architecture to the one based on the neuroanatomy of biological organisms is described. The proposed architecture is based on an optical thin-film multilayer model, with the thicknesses of thin-film layers serving as adjustable 'weights' for the computation.
Inputs are encoded into the corresponding refractive indices of individual thin-film layers, while the outputs are typically measured by the overall reflection coefficients off the thin-film layers, at different wavelengths. The nature of the model and some example calculations (a pattern recognition task and classification of the iris data set) that exhibit behaviour typical of conventional connectionist architectures are described. This model has also been used in solving the XOR and 16 four-bit parity problems, and it has demonstrated comparable performance to that of a conventional feed-forward neural network model using Back-propagation learning. This paper is also available in the proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems (ANNES'95), IEEE Computer Society Press, Los Alamitos, California, 1995, pp. 63-66. Comments are greatly appreciated. -- Xiaodong Li Email: Xiaodong at otago.ac.nz Http: http://divcom.otago.ac.nz:800/COM/INFOSCI/SECML/xdli/xiao.htm (Postscript file of this paper is also available here at my homepage) From prechelt at ira.uka.de Mon Dec 11 07:11:32 1995 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Mon, 11 Dec 1995 13:11:32 +0100 Subject: NN Benchmarking WWW homepage Message-ID: <"iraun1.ira.487:11.12.95.12.12.22"@ira.uka.de> The homepage of the very successful NIPS*95 workshop on benchmarking has now been converted into a repository for information about benchmarking issues: Status quo, methodology, facilities, and related info. I kindly ask everybody who has additional information that should be on the page (in particular sources or potential sources of learning data of all kinds) to submit that information to me. Other comments are also welcome. The URL is http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html The page is also still reachable via the benchmarking workshop link on the NIPS*95 homepage. Below is a textual version of the page.
Lutz Lutz Prechelt (http://wwwipd.ira.uka.de/~prechelt/) | Whenever you Institut f. Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get (Phone: +49/721/608-4068, FAX: +49/721/694092) | less simple. =============================================== Benchmarking of learning algorithms information repository page Abstract: Proper benchmarking of (neural network and other) learning architectures is a prerequisite for orderly progress in this field. In many published papers deficiencies can be observed in the benchmarking that is performed. A workshop about NN benchmarking at NIPS*95 addressed the status quo of benchmarking, common errors and how to avoid them, currently existing benchmark collections, and, most prominently, a new benchmarking facility including a results database. This page contains pointers to written versions or slides of most of the talks given at the workshop plus some related material. The page is intended to be a repository for such information to be used as a reference by researchers in the field. Note that most links lead to Postscript documents. Please send any additions or corrections you might have to Lutz Prechelt (prechelt at ira.uka.de). Workshop Chairs: Thomas G. Dietterich, Geoffrey Hinton, Wolfgang Maass, Lutz Prechelt [communicating chair], Terry Sejnowski Assessment of the status quo: * Lutz Prechelt. A quantitative study of current benchmarking practices. A quantitative survey of 400 journal articles from 1993 and 1994 on NN algorithms. Most articles used far too few problems during benchmarking. * Arthur Flexer. Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice. Argues that what is reported about the benchmarks, and how it is reported, is insufficient. Methodology: * Tom Dietterich.
Experimental Methodology Benchmarking types, correct statistical testing, synthetic versus real-world data, understanding via algorithm mutation or data mutation, data generators. * Lutz Prechelt. Some notes on neural learning algorithm benchmarking. A few general remarks about volume, validity, reproducibility, and comparability of benchmarking; DOs and DON'Ts. * Brian Ripley. What can we learn from the study of the design of experiments? (Only two slides, though). * Brian Ripley. Statistical Ideas for Selecting Network Architectures. (Also somewhat related to benchmarking.) Benchmarking facilities: * Previously available NN benchmarking data collections CMU nnbench, UCI machine learning databases archive, Proben1, StatLog data, ELENA data. Advantages of these: UCI is large and growing and popular, Statlog has the largest and most orderly collection of results available (in a book, though), and Proben1 is the easiest to use and best supports reproducible experiments. Elena and nnbench have no particular advantages. Disadvantages: UCI and Proben1 have too few and too unstructured results available, Proben1 is also inflexible and small, Statlog is partially confidential and neither data nor results collection are growing. * Carl Rasmussen and Geoffrey Hinton. DELVE: A thoroughly designed benchmark collection A proposal of data, terminology, and procedures and a facility for the collection of benchmarking results. This is the newly proposed standard for benchmarking NN (and other) learning algorithms. DELVE is currently still under construction at the University of Toronto. Other sources of data: (Thanks to Nici Schraudolph) There is a large amount of game data about the board game Go available on the net. One starting point is here. Others are the Go game database project, and the Go game server. The database holds several hundred thousand games of Go and could for instance be used for advanced reinforcement learning projects.
Last correction: 1995/12/11 Please send additions and corrections to Lutz Prechelt, prechelt at ira.uka.de. To NIPS homepage. To original homepage of this workshop. From mpp at watson.ibm.com Mon Dec 11 08:42:59 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Mon, 11 Dec 1995 08:42:59 -0500 (EST) Subject: compressibility and generalization In-Reply-To: <199512080049.JAA10560@euclid.cs.titech.ac.jp> from "hicks@cs.titech.ac.jp" at Dec 8, 95 09:49:53 am Message-ID: <9512111342.AA25646@austen.watson.ibm.com> [hicks at cs.titech.ac.jp wrote:] > PSS. What is anti-cross validation? Suppose we are given a set of functions and a crossvalidation data set. The CV and Anti-CV algorithms are as follows: CV: Choose the function with the best performance on the CV set. Anti-CV: Choose the function with the worst performance on the CV set. (And for this year's NIPS motif: Anti-EM: Dorothy? Dorothy? :-) Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From hicks at cs.titech.ac.jp Mon Dec 11 20:01:05 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 12 Dec 1995 10:01:05 +0900 Subject: compressibility and generalization In-Reply-To: "Michael Perrone"'s message of Mon, 11 Dec 1995 08:42:59 -0500 (EST) <9512111342.AA25646@austen.watson.ibm.com> Message-ID: <199512120101.KAA16136@euclid.cs.titech.ac.jp> "Michael Perrone" wrote: >[hicks at cs.titech.ac.jp wrote:] >> PSS. What is anti-cross validation? >Suppose we are given a set of functions and a crossvalidation data set. >The CV and Anti-CV algorithms are as follows: > CV: Choose the function with the best performance on the CV set. 
>Anti-CV: Choose the function with the worst performance on the CV set. case 1: * Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* In this case both Anti-CV and CV provide (ON AVERAGE) equal improvement in prediction ability: none. For that matter so will ANY method of selection. Moreover, if we plot a graph of the number of data used for training vs. the estimated error (using the residual data), we will (ON AVERAGE) see no decrease in estimated error. Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". case 2: * The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function. This is different from the ensemble and distribution of target functions, which is a much bigger question. We already know much about the ensemble of samples from a fixed target function. I am not avoiding the issue of the ensemble or distribution of target functions, but merely showing that we have 2 general cases, and that in both of them CV is never WORSE than anti-CV. It follows that whatever the distribution of targets is, CV is never worse (ON AVERAGE) than anti-CV. I don't believe this contradicts NFL in any way. It just clarifies the role that CV can play. Learning and monitoring prediction error go hand in hand. This is even more true for cases when the underlying function may be changing and the data has the form of an infinite stream.
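Case 2 can be illustrated with a small simulation. Everything in it is a hypothetical choice made for the sketch — the fixed linear target, the four candidate slopes, the noise level, and the sample sizes — but it shows CV and anti-CV selection averaged over many random validation samples of one fixed, structured target.

```python
import random

random.seed(0)

# A FIXED structured target, known only through noisy samples.
def target(x):
    return 2.0 * x + 1.0

# Four candidate models of varying quality (slope a = 2.0 is exact).
candidates = [lambda x, a=a: a * x + 1.0 for a in (0.5, 1.5, 2.0, 3.0)]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def trial(select_worst=False):
    # One noisy "random sampling of prediction ability" (a CV set).
    xs = [random.uniform(-1.0, 1.0) for _ in range(10)]
    ys = [target(x) + random.gauss(0.0, 0.5) for x in xs]
    scores = [mse(m, xs, ys) for m in candidates]
    pick = scores.index(max(scores) if select_worst else min(scores))
    # True generalization error of the chosen model on a noise-free grid.
    grid = [i / 50.0 - 1.0 for i in range(101)]
    return mse(candidates[pick], grid, [target(x) for x in grid])

cv_err = sum(trial() for _ in range(2000)) / 2000
anti_err = sum(trial(select_worst=True) for _ in range(2000)) / 2000
print(cv_err, anti_err)   # CV's average true error is far lower
```

With a structured target and correlated candidates, CV almost always selects the near-correct slope while anti-CV selects one of the extreme ones, so the averages separate cleanly; replacing `target` with pure noise collapses the two averages together, matching case 1.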
Craig Hicks Tokyo Institute of Technology From GIOIELLO at cres.it Mon Dec 11 19:13:43 1995 From: GIOIELLO at cres.it (GIOIELLO) Date: Tue, 12 Dec 1995 01:13:43 +0100 Subject: A neural net based OCR demo for both Windows/DOS and Mac OS is available Message-ID: <01HYP9T0BSPU934ROD@cres.it> Dear Netters, An OCR demo for Mac OS is available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-demo.cpt.hqx A Windows and DOS version is also available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-Win.zip This latter version also offers a richer set of capabilities. The OCR is based on a three-layer MLP. Conjugate gradient descent techniques were used to train the net. Training and test sets were those of NIST. The related papers can be found at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/papers/handwritten Several VLSI architectures to implement the OCR device using a digital implementation of the proposed MLP are also described in the papers. An overview of the activities we carry on can be found at the following URL: http://wwwcsai.diepa.unipa.it/research/projects/vlsinn/handcare/handcare.html Best Regards, Giuseppe A. M. Gioiello E-Mail: gioiello at diepa.unipa.it URL: http://wwwcsai.diepa.unipa.it/people/doctors/gioiello/gioiello.html From ernst at kuk.klab.caltech.edu Tue Dec 12 12:02:22 1995 From: ernst at kuk.klab.caltech.edu (Ernst Niebur) Date: 12 Dec 1995 17:02:22 GMT Subject: Training opportunities in Computational Neuroscience at Johns Hopkins University Message-ID: The Zanvyl Krieger Mind/Brain Institute at Johns Hopkins University is an interdisciplinary research center devoted to the investigation of the neural mechanisms of mental function and particularly to the mechanisms of perception: How is complex information represented and processed in the brain, how is it stored and retrieved, and which brain centers are critical for these operations?
The Institute intends to significantly enhance its research program in Computational Neuroscience and encourages students with interest in this domain to apply for the graduate program in the Neuroscience department. Research opportunities exist in all of the laboratories of the Institute. Interdisciplinary projects, involving the student in more than one laboratory, are particularly encouraged. At present, MBI faculty include (listed with primary field of interest and methodology used): C. Ed Connor, PhD: Visual selective attention (electrophysiology in the awake behaving monkey). Stewart Hendry, PhD: Organization and plasticity of mammalian cerebral cortex (primate neuroanatomy). Steve S. Hsiao, PhD: Neurophysiology of tactile perception (electrophysiology in the awake behaving monkey). Kenneth O. Johnson, PhD: Neurophysiology of the somatosensory system (electrophysiology in the awake behaving monkey). Guy McKhann, MD (Director of MBI): Cognitive and neurologic outcomes after cardiac surgery; immunologic attack on peripheral motor axonal membranes in the human and experimental animal (neurology). Ernst Niebur, PhD: Theoretical Neuroscience (computational and mathematical modeling). Gian F Poggio, PhD: Analysis of Stereopsis and Texture (electrophysiology in the awake behaving monkey). Michael A. Steinmetz, PhD: Neurophysiological mechanisms in visual-spatial perception (electrophysiology in the awake behaving monkey). Ruediger von der Heydt, PhD: Neural mechanisms of visual perception (electrophysiology in the awake behaving monkey). Additional research opportunities exist in collaborative work with faculty in the Psychology Department (located next door to the Mind/Brain Institute), in particular with Drs. Howard Egeth (attention, perception, cognition), Michael Rudd (computational vision, psychophysics), Trisha Van Zandt (mathematical modelling, neural networks and memory), and Steven Yantis (visual perception, attention, mathematical modeling). 
All students accepted to the PhD program of the Neuroscience department receive full tuition remission plus a stipend at or above the National Institutes of Health predoctoral level. The Mind/Brain Institute is located on the very attractive Homewood campus in Northern Baltimore. Applicants should have a B.S. or B.A. with a major in any of the biological or physical sciences. Applicants are required to take the Graduate Record Examination (GRE), both the aptitude tests and an advanced test, or the Medical College Admission Test. Further information on the admission procedure can be obtained from the Department of Neuroscience: Director of Graduate Studies Neuroscience Training Program Department of Neuroscience The Johns Hopkins University School of Medicine 725 Wolfe Street Baltimore, MD 21205 Completed applications (including three letters of recommendation and either GRE scores or Medical College Admission Test scores) must be _received_ by January 1, 1996 at the above address. Candidates for whom this is impossible, or those who need additional information, should immediately contact Prof. Ernst Niebur The Zanvyl Krieger Mind/Brain Institute Johns Hopkins University 3400 N. Charles Street Baltimore, MD 21218 niebur at jhu.edu -- Ernst Niebur Krieger Mind/Brain Institute Asst. Prof. of Neuroscience Johns Hopkins University niebur at jhu.edu 3400 N. Charles Street (410)516-8643, -8640 (secr), -8648 (fax) Baltimore, MD 21218 From dhw at santafe.edu Tue Dec 12 17:25:06 1995 From: dhw at santafe.edu (David Wolpert) Date: Tue, 12 Dec 95 15:25:06 MST Subject: The last of a dying thread Message-ID: <9512122225.AA00709@sfi.santafe.edu> Some comments on the NFL thread. Huaiyu Zhu writes >>> 2. The *mere existence* of structure guarantees a (not uniformly-random) algorithm as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good. >>> This is a crucial point. 
It also seems to be one lost on many of the contributors to this thread, even those subsequent to Zhu's posting. Please note in particular that the knowledge that "the universe is highly compressible" can NOT, by itself, be used to circumvent NFL. I can only plead again: Those who are interested in this issue should look at the papers directly, so they have at least passing familiarity with the subject before discussing it. :-) ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z. Craig Hicks then writes: >>> However, I interpret the assertion that anti-cross validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should. >>> That's part of the power of the NFL theorems - they prove that Hicks' intuition, an intuition many people share, is in fact wrong. >>> I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). >>> Nope. The central limit theorem is not directly germane. See all the previous discussion on NFL and Vapnik. >>>> CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B.
This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. Fortunately, CV will report this (failure to generalize) by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set data sets; CV may be deceptive for a single off training set data set.) >>> This is wrong (or at best misleading). Please read the NFL papers. In fact, if the head-to-head minimax hypothesis concerning xvalidation presented in those papers is correct, xvalidation is wrong more often than it is right. In which case CV is "deceptive" more often (!!!) than not. Lev Goldfarb wrote >>> Strange as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. >>> It may help if people read some of the many papers (Cox, de Finetti, Erickson and Smith, etc., etc.) that prove that the only consistent way of dealing with uncertainty is via probability theory. In other words, there is nothing *but* statistics, in the real world. (Perhaps occurring in prior knowledge that you're looking for a group, but statistics nonetheless.) David Wolpert From lemm at LORENTZ.UNI-MUENSTER.DE Wed Dec 13 09:46:52 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Wed, 13 Dec 1995 15:46:52 +0100 Subject: NFL and practice Message-ID: <9512131446.AA13879@xtp141.uni-muenster.de> Some remarks on Craig Hicks' arguments on cross-validation and NFL in general, from my point of view: One may discuss NFL for theoretical reasons, but the conditions under which NFL-Theorems hold are not those which are normally met in practice. 1.) In short, NFL assumes that data, i.e.
information of the form y_i=f(x_i), do not contain information about function values on a non-overlapping test set. This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... (with respect to Craig's two cases this average would include a third case: target and model are anticorrelated, so anti-cross-validation works better) and "vertical" likelihoods. So, in an NFL setting data never say anything about function values for new arguments. This seems rather trivial under this assumption, and one has to ask how natural such an NFL situation is. 2.) Information of the form y_i=f(x_i) is rather special and not what we normally have. There is much information which is not of this "single sharp data" type. (For examples, see below.) There is absolutely no reason why information which depends on more than one f(x_i) should not be incorporated. (This can be done using nonuniform priors or in a way more symmetrical to "sharp data".) NFL just describes the situation in which we don't have any such information but much of the (then quite useless) "sharp data". But these sharp data are no less (and maybe more) obscure than other forms of information. Information which is not of this "single sharp data" form but includes many or all f(x_i) to produce one answer normally induces correlations between target and generalizer if included into the generalizer. At the same time there is no real off training set anymore! Examples: 3) Information such as symmetries (even if only approximate), maxima, Fourier components (and much, much more ...) involves more than one f(x_i). Fourier components, for example, can be seen as sharp data but for different basis vectors, i.e. asking for momentum instead of location. This shows again that the definition of "sharp data" corresponds to choosing a "basis of questions" and is not a natural entity!!! 4) Real measurements (especially of continuous variables) normally also do NOT have the form y_i=f(x_i) !
They mostly perform some averaging over f(x_i) or at least they have some noise on the x_i (as small as you like, but present). In the latter case of "sharp" noise, posing the same question several times also gives you an average of several (nearby) y with different x_i of the underlying true function. In both cases the averaging is equivalent to regularization for the "effective" function which we can observe!!! This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. There is no need to see it just as a subjective prior! (The same could be said on a quantum-mechanical level, but that's another story.) It follows that NFL results do NOT hold for the "effective" functions in such situations, even if assuming NFL for the underlying true functions. 5.) NFL again: Averaging or noise in the input space of the x_i requires a probability distribution in that space which can be defined independently from a specific function. Noise means that x_i is a random variable dependent on an actual question z_i, i.e. p(actual argument = x_i | question=z_i), and it is f(z_i) which we can observe. If you don't accept a given p(x_i|z_i), I am sure you can average over "all possible" such relations with unrestricted "uniform" priors to find that it is impossible to obtain any information about any function without assuming a priori that you know something about what you are asking. This could be seen as another NFL-Theorem for questions: You do not even get information about a single function value if you don't know (assume, define) a priori what you are asking! 6.) With respect to the underlying "true" function, off-training-set error itself, an important concept for NFL, is in general no longer a measurable quantity if input noise or averaging is present!! (For simplicity let's assume that noise or averaging includes all questions x_i.
Then in the case of noise you only have a probability for the x_i to belong to the "true" training set and averaging includes all questions x_i.) So for the "true" functions there remains nothing NFL can say anything about, and for the "effective" functions NFL is not valid! To conclude: In many interesting cases "effective" function values contain information about other function values and NFL does not hold! The very special handling of "sharp data" in comparison to other information must be discussed in many more learning theories. Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From wray at ptolemy-ethernet.arc.nasa.gov Wed Dec 13 17:06:42 1995 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Wed, 13 Dec 95 14:06:42 PST Subject: one revised paper and NIPS slides by Buntine Message-ID: <9512132206.AA08307@ptolemy.arc.nasa.gov> Dear Connectionists, Please note the following two WWW resources. One, a forthcoming journal paper, and the other, slides from a NIPS'95 Workshop presentation. Also, please note my new address, email, and company. I am no longer at Heuristicrats. Wray Buntine Thinkbank, Inc. +1 (510) 540-6080 [voice] 1678 Shattuck Avenue, Suite 320 +1 (510) 540-6627 [fax] Berkeley, CA 94709 wray at Thinkbank.COM ============ Article URL: http://www.thinkbank.com/wray/graphbib.ps.Z (about 240Kb compressed) TITLE: A guide to the literature on learning probabilistic networks from data AUTHOR: Wray Buntine, Thinkbank JOURNAL: Accepted for IEEE Trans. on Knowledge and Data Eng., Final draft submitted. ABSTRACT: This literature review discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks. Connections are drawn between the statistical, neural network, and uncertainty communities, and between the different methodological communities, such as Bayesian, description length, and classical statistics.
Basic concepts for learning and Bayesian networks are introduced and methods are then reviewed. Methods are discussed for learning parameters of a probabilistic network, for learning the structure, and for learning hidden variables. The presentation avoids formal definitions and theorems, as these are plentiful in the literature, and instead illustrates key concepts with simplified examples. KEYWORDS: Bayesian networks, graphical models, hidden variables, learning, learning structure, probabilistic networks, knowledge discovery =========== Talk URL: http://www.thinkbank.com/wray/refs.html (and look under Talks for NIPS) TITLE: Compiling Probabilistic Networks and Some Questions this Poses. AUTHOR: Wray Buntine WORKSHOP: NIPS'95 Workshop on Learning Graphical Models ABSTRACT: Probabilistic networks (or similar) provide a high-level language that can be used as the input to a compiler for generating a learning or inference algorithm. Example compilers are BUGS (inputs a Bayes net with plates) by Gilks, Spiegelhalter, et al., and MultiClass (inputs a dataflow graph) by Roy. This talk will cover three parts: (1) an outline of the arguments for such compilers for probabilistic networks, (2) an introduction to some compilation techniques, and (3) the presentation of some theoretical challenges that compilation poses. High-level language compilers are usually justified as a rapid prototyping tool. In learning, rapid prototyping arises for the following reasons: good priors for complex networks are not obvious and experimentation can be required to understand them; several algorithms may suggest themselves and experimentation is required for comparative evaluation. These and other justifications will be described in the context of some current research on learning probabilistic networks, and past research on learning classification trees and feed-forward neural networks. 
Techniques for compilation include the data flow graph, automatic differentiation, Markov chain Monte Carlo samplers of various kinds, and the generation of C code for certain exact inference tasks. With this background, I will then pose a number of research questions to the audience. =========== From bernabe at cnm.us.es Tue Dec 12 07:39:41 1995 From: bernabe at cnm.us.es (Bernabe Linares B.) Date: Tue, 12 Dec 95 13:39:41 +0100 Subject: two papers in neuroprose Message-ID: <9512121239.AA17985@cnm1.cnm.us.es> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/bernabe.art1-nn.ps.Z (30 pages, 257846 bytes) pub/neuroprose/bernabe.art1-vlsi.ps.Z (26 pages, 311686 bytes) The files "bernabe.art1-nn.ps.Z" and "bernabe.art1-vlsi.ps.Z" are now available for copying from the Neuroprose repository. They contain two papers which have been accepted for publication in the following journals: PAPER1: Journal: IEEE Transactions on VLSI Systems Title: "A Real-Time Clustering Microchip Neural Engine" File: bernabe.art1-vlsi.ps.Z PAPER2: Journal: Neural Networks Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" File: bernabe.art1-nn.ps.Z Authors: Teresa Serrano-Gotarredona and Bernabe Linares-Barranco Affiliation: National Microelectronics Center (CNM), Sevilla, SPAIN. Sorry, no hardcopies available. Brief description of papers follows: -------------------------------------------------------------------- PAPER1: ------- File: bernabe.art1-vlsi.ps.Z, 26 pages, 311686 bytes. Title: "A Real-Time Clustering Microchip Neural Engine" Abstract This paper presents an analog current-mode VLSI implementation of an unsupervised clustering algorithm. The clustering algorithm is based on the popular ART1 algorithm [1], but has been modified, resulting in a more VLSI-friendly algorithm [2], [3] that allows a more efficient hardware implementation with simple circuit operators, low memory requirements, modular chip assembly capability, and higher speed figures. 
The chip described in this paper implements a network that can cluster input patterns of 100 binary pixels into up to 18 different categories. Modular expansibility of the system is directly possible by assembling an NxM array of chips without any extra interfacing circuitry, so that the maximum number of clusters is 18xM and the maximum number of bits of the input pattern is Nx100. Pattern classification and learning are performed in 1.8us, which corresponds to an equivalent computing power of 4.4x10^9 connections per second plus connection-updates per second. The chip has been fabricated in a standard low-cost 1.6um double-metal single-poly CMOS process, has a die area of 1cm^2, and is mounted in a 120-pin PGA package. Although internally the chip is analog in nature, it interfaces to the outside world through digital signals, and thus has a true asynchronous digital behavior. Experimental chip test results are available, obtained through digital chip test equipment. Fault tolerance at the system level is demonstrated through the experimental testing of faulty chips. -------------------------------------------------------------------- PAPER2: ------- File: bernabe.art1-nn.ps.Z, 30 pages, 257846 bytes. Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" Abstract This paper presents a modification to the original ART1 algorithm [Carpenter, 1987a] that is conceptually similar, can be implemented in hardware with less sophisticated building blocks, and maintains the computational capabilities of the originally proposed algorithm. This modified ART1 algorithm (which we will call here ART1m) is the result of hardware-motivated simplifications investigated during the design of an actual ART1 chip [Serrano, 1994, 1996]. The purpose of this paper is simply to justify theoretically that the modified algorithm preserves the computational properties of the original one and to study the difference in behavior between the two approaches. 
-------------------------------------------------------------------- ftp instructions are: % ftp archive.cis.ohio-state.edu Name : anonymous Password: ftp> cd pub/neuroprose ftp> binary ftp> get bernabe.art1-nn.ps.Z ftp> get bernabe.art1-vlsi.ps.Z ftp> quit % uncompress bernabe.art1-nn.ps.Z % uncompress bernabe.art1-vlsi.ps.Z % lpr bernabe.art1-nn.ps % lpr bernabe.art1-vlsi.ps These files are also available from the node "ftp.cnm.us.es", user "anonymous", directory /pub/bernabe/publications, files: "NN_art1theory_96.ps.Z" and "TVLSI_art1chip_96.ps.Z". Any feedback will be appreciated. Thanks, Dr. Bernabe Linares-Barranco National Microelectronics Center (CNM) Dept. of Analog Design Ed. CICA, Av. Reina Mercedes s/n, 41012 Sevilla, SPAIN. Phone: 34-5-4239923, Fax: 34-5-4624506, E-mail: bernabe at cnm.us.es From bishopc at helios.aston.ac.uk Wed Dec 13 14:52:48 1995 From: bishopc at helios.aston.ac.uk (Prof. Chris Bishop) Date: Wed, 13 Dec 1995 19:52:48 +0000 Subject: New Book: Neural Networks for Pattern Recognition Message-ID: <1400.9512131952@sun.aston.ac.uk> -------------------------------------------------------------------- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -------------------------------------------------------------------- "Neural Networks for Pattern Recognition" ----------------------------------------- Christopher M. Bishop (Oxford University Press) Full details at: http://neural-server.aston.ac.uk/NNPR/ This book provides the first comprehensive treatment of neural networks from the perspective of statistical pattern recognition. * 504 pages * 160 figures * 129 graded exercises * a self-contained introduction to statistical pattern recognition * an extensive treatment of Bayesian methods * paperback and hardback editions * 300 references Contents: --------- 1. Statistical Pattern Recognition 2. Probability Density Estimation 3. Single-layer Networks 4. The Multi-layer Perceptron 5. Radial Basis Functions 6. 
Error Functions 7. Parameter Optimization Algorithms 8. Pre-processing and Feature Extraction 9. Learning and Generalization 10. Bayesian Techniques ***** Instructors wishing to use this text as the basis for a course may request a complimentary examination copy from the publishers. (USA: fax request to 212-726-6442 with brief description of the course) ***** Ordering information: --------------------- ISBN 0-19-853864-2 paperback 0-19-853849-9 hardback USA: 45 dollars paperback ---- 98 dollars hardback Credit card orders: Tel: 1-800-451-7556 (toll free) By post, send payment to: Order Dept. Oxford University Press 2001 Evans Road Cary, NC 27513 USA (3 dollars shipping for first copy, 1 dollar each thereafter) Canada: Tel: 1-800-387-8020 (toll free) ------- UK: 25 pounds paperback --- 55 pounds hardback Tel: 01536 454 534 (from the UK) Tel: +44 1536 454 534 (from abroad) By post, send payment to: CWO Department Oxford University Press Saxon Way West, Corby Northants NN18 9ES, UK (3.53 pounds postage) By fax: 01536 746 337 (from the UK) +44 1536 746 337 (from abroad) ---------------------------------------------------------------------- Prof. Christopher M. Bishop Tel. +44 (0)121 333 4631 Neural Computing Research Group Fax. +44 (0)121 333 4586 Dept. of Computer Science c.m.bishop at aston.ac.uk & Applied Mathematics http://neural-server.aston.ac.uk/ Aston University Birmingham B4 7ET, UK ---------------------------------------------------------------------- From zhuh at helios.aston.ac.uk Thu Dec 14 13:12:43 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Thu, 14 Dec 1995 18:12:43 +0000 Subject: No free lunch for Cross Validation! 
Message-ID: <2237.9512141812@sun.aston.ac.uk> Dear Colleagues, A little while ago someone claimed that cross validation will benefit from the presence of any structure, and if there is no structure it does no harm; yet NFL explicitly states that a structure can be equally good or bad for any given method, depending on how they match each other; and yet it was further claimed that they do not conflict with each other. I was quite curious and did the following five-minute experiment to find out which is correct. Suppose we have a Gaussian variable x, with mean mu and unit variance. We have the following three estimators for estimating mu from a sample of size n. A: The sample mean. It is optimal both in the sense of Maximum Likelihood and Least Mean Squares. B: The maximum of the sample. It is a bad estimator in any reasonable sense. C: Cross validation to choose between A and B, with one extra data point. The numerical result, with n=16 and averaged over 10000 samples, gives mean squared errors: A: 0.0627 B: 3.4418 C: 0.5646 This clearly shows that cross validation IS harmful in this case, despite the fact that it is based on a larger sample. NFL still wins! Many of you might jump on me at this point: But this is a very artificial example, which is not what normally occurs in practice. To this I have two answers, short and long. The short answer is one of principle. Any counter-example, however artificial, clearly demolishes the hope that cross validation is a "universally beneficial method". The longer answer is divided into several parts, which hopefully will answer any potential criticism from any aspect: 1. The cross validation is performed on extra data points. We are not requiring it to perform as well as the mean on 17 data points. If it cannot extract more information from the one extra data point, a minimum requirement is that it keeps the information in the original 16 points. But it can't even do this. 2. The maximum of a sample is the 100th percentile. 
The median is the 50th percentile, which is in fact a quite reasonable estimator. Let us use a larger cross validation set (of size k), and replace B with a different percentile. The result is that, for the median, CV needs k>2 to work. For the 70th percentile CV needs k>16. The required k increases dramatically with the percentile. 3. It is not true that we have set up a case in which cross validation can't win. There is indeed a small probability that a sample can be so bad that the sample maximum is even a better estimate than the sample mean. However, to utilise such rare chances to good effect, k must be at least several hundred (maybe exponential) while n=16. We know such k exists since k=infinity certainly helps. Yet to adopt such a method is clearly absurd. 4. Although we have chosen estimator A to be the known optimal estimator in this case, it can be replaced by something else. For example, both A and B can be some reasonable averages over percentiles, so that without detailed analysis it may appear that doing cross validation might give a C which is better than both A and B. Such beliefs can be defeated by similar counter-examples. 5. The above scheme of cross validation may appear different from what is familiar, but here is a "practical example" which shows that it is indeed what people normally do. Suppose we have a random variable which is either Gaussian or Cauchy. Consider the following three estimators: A: Sample mean: It has 100% efficiency for Gaussian, and 0% efficiency for Cauchy. B: Sample median: It is 2/pi=63.66% efficient for Gaussian and 8/pi^2=81.06% efficient for Cauchy. C: Cross validation on an additional sample of size k, to choose between A and B. Intuitively it appears quite reasonable to expect cross validation to pick out the correct one most of the time, so that, if averaged over all samples, C ought to be superior to both A and B. But no!! This will depend on the PRIOR mixing probability of these two sub-models. 
If the variable is in fact always Gaussian, then we have just seen that if n=16, CV will be worse unless k>2. The same is even more true in the reverse case, since the mean is an essentially useless estimator for Cauchy. 6. In any of the above cases, "anti cross validation" would be even more disastrous. If you are not convinced by these arguments, or if you want to know more about efficiency, then maybe the following reference can help: Fisher, R.A.: Theory of statistical estimation, Proc. Camb. Phil. Soc., Vol. 22, pp. 700-725, 1925. If you are more or less convinced, I have the following speculation: Two centuries ago, the French Academy of Sciences (or is it the Royal Society?) made a decision that it would no longer examine inventions of "perpetual motion machines", on the ground that the Law of Energy Conservation was so reliable that it would defeat any such attempt. History proved that this was a wise decision, which assisted the effort of designing machines which utilise energy in fuel. Should we expect the same fate for "the universally beneficial methods" in the face of NFL? Should we put more effort into designing methods which use prior information? posterior information <= prior information + data information. 
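Zhu's five-minute experiment is easy to reproduce. The sketch below is my own reconstruction, under one stated assumption: estimator C picks whichever of A and B lies closer, in squared error, to the single held-out point, which is one natural reading of "cross validation with one extra data point".

```python
import random
import statistics

def simulate(n=16, trials=10000, seed=0):
    """Monte Carlo estimate of the mean squared error of the three estimators."""
    rng = random.Random(seed)
    mu = 0.0                               # true mean of the unit-variance Gaussian
    se = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        a = statistics.fmean(sample)       # A: the sample mean
        b = max(sample)                    # B: the sample maximum
        v = rng.gauss(mu, 1.0)             # the one extra (validation) data point
        c = a if (a - v) ** 2 <= (b - v) ** 2 else b   # C: cross validation
        se["A"] += (a - mu) ** 2
        se["B"] += (b - mu) ** 2
        se["C"] += (c - mu) ** 2
    return {k: s / trials for k, s in se.items()}

mse = simulate()
print(mse)
```

With n=16 this lands close to the figures quoted above: roughly 0.06 for A, around 3.4 for B, and an intermediate value for C that is still far worse than A, because the validation point occasionally sits near the sample maximum and fools CV into choosing B.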
-- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From C.Campbell at bristol.ac.uk Thu Dec 14 11:21:26 1995 From: C.Campbell at bristol.ac.uk (I C G Campbell) Date: Thu, 14 Dec 1995 16:21:26 +0000 (GMT) Subject: New Web Page (Bristol University, UK) Message-ID: <199512141621.QAA11250@zeus.bris.ac.uk> The Neural Computing Research Group at Bristol University, UK has recently set up a WWW page describing their interests at: http://www.fen.bris.ac.uk/engmaths/research/neural/neural.html Our interests cover three main areas: theory of neural computation, modelling simple neurobiological systems and applications of neural computing in engineering. Collectively we have produced in excess of 100 publications related to neural computing in these topic areas. Further details about these publications, current research interests and research grants may be found on the above page. Merry Xmas Colin Campbell University of Bristol From robert at fit.qut.edu.au Thu Dec 14 19:24:04 1995 From: robert at fit.qut.edu.au (Robert Andrews) Date: Fri, 15 Dec 1995 10:24:04 +1000 Subject: Rule Extraction Mailing List Message-ID: <199512150024.KAA15975@ocean.fit.qut.edu.au> =-=-=-=-= RULE EXTRACTION FROM ARTIFICIAL NEURAL NETWORKS =-=-=-=-=-=-=-=- ANNOUNCEMENT OF MAILING LIST Rule Extraction from Artificial Neural Networks and the related field of Rule Refinement are topics of increasing interest and importance. This is to announce the formation of a moderated mailing list for researchers and students interested in these areas. 
If you are interested in becoming a subscriber to this list, please send the following information by return mail: Name: Organisation/Institution: E-mail Address: =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Mr Robert Andrews School of Information Systems robert at fit.qut.edu.au Faculty of Information Technology R.Andrews at qut.edu.au Queensland University of Technology +61 7 864 1656 (voice) GPO Box 2434 _--_|\ +61 7 864 1969 (fax) Brisbane Q 4001 / QUT Australia \_.--._/ http://www.fit.qut.edu.au/staff/~robert v =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From l.s.smith at cs.stir.ac.uk Fri Dec 15 05:12:09 1995 From: l.s.smith at cs.stir.ac.uk (Dr L S Smith (Staff)) Date: Fri, 15 Dec 1995 10:12:09 GMT Subject: TR on generalization available Message-ID: <19951215T101209Z.KAA27913@katrine.cs.stir.ac.uk> Dear all: We have a new TR available by ftp from here: CCCN Technical report CCCN-21, December 1995. A Theoretical Study of the Generalization Ability of Feed-Forward Neural Networks. M J Roberts. By making assumptions about the probability distribution of the potentials in a feed-forward neural network, we have derived lower bounds for the generalization ability of the network in terms of the number of training patterns. The results are consistent with simulations carried out on a simple geometrical function. The URL is ftp://ftp.cs.stir.ac.uk/pub/tr/cccn/TR21.ps.Z If you really can't access this, hard copies are available, but only as a last resort. Dr Leslie S. 
Smith Dept of Computing and Mathematics, Univ of Stirling Stirling FK9 4LA Scotland lss at cs.stir.ac.uk (NeXTmail welcome) Tel (44) 1786 467435 Fax (44) 1786 464551 www http://www.cs.stir.ac.uk/~lss/ From bastiane at irit.fr Fri Dec 15 09:07:57 1995 From: bastiane at irit.fr (bastiane@irit.fr) Date: Fri, 15 Dec 1995 15:07:57 +0100 Subject: Call for papers for DYNN'96 Message-ID: <199512151407.PAA05193@irit.irit.fr> CALL FOR PAPERS FOR DYNN'96 International workshop on NEURAL NETWORKS DYNAMICS AND PATTERN RECOGNITION. Toulouse - France 12 and 13 of March 1996 Organized by ONERA-CERT Sponsored by DRET of French MOD, US Air Force Scientific Research and Pole Universitaire Europeen de Toulouse. Organizers: Manuel SAMUELIDES (ONERA-CERT), Bernard DOYON (INSERM), Gregory TARR (US AF), Simon THORPE (CNRS). Practical Information: Emmanuel DAUCE (dauce at cert.fr) *********************** OBJECTIVES OF THE WORKSHOP. *************************** This workshop is designed to allow information exchange and discussion between theoretical scientists working on models of neuronal dynamics and engineers who are looking for efficient devices to process sensor information. Continuous activation state units as well as Integrate and Fire neurons or oscillators are elementary components of Dynamical Neural Networks. Attractor neural networks as well as transitory data-driven dynamics will be considered. The common feature of these models is the conversion of spatial information into spatio-temporal data flow which allows specific processing. Mathematical models involved use dynamical systems and stochastic processes. They will be compared to the results of numerical simulations and the latest neuro-physiological data concerning the dynamics of biological neural nets. The main aim of the workshop is to encourage significant advances concerning the dynamics of biologically plausible neural networks and their applications to pattern recognition. 
*********************** ORGANIZATION OF THE WORKSHOP. ***************************** Scheduled talks will take place on the 12th and the 13th of March. There will be invited talks as well as submitted contributions. About 24 talks of 30 minutes will be scheduled with time for discussion and panels. Informal discussion and collective work may be scheduled on the 14th. Extended abstracts (one or two pages) of submitted contributions should be sent for acceptance by e-mail to dauce at cert.fr or by post to Manuel Samuelides, DERI ONERA-CERT, BP 4025, 31055 Toulouse CEDEX, FRANCE. Provisional list of invited lecturers: J.P.AUBIN, M.COTTRELL, J.DEMONGEOT, J.DAYHOFF, G.DREYFUS, M.HIRSCH, J.TAYLOR. (This list will be completed.) The number of attendees at the workshop is limited to 40 in order to allow lively exchange and real discussion. Copies of abstracts and slides will be provided to participants. The registration fees amount to FF 1,200, including 2 nights with American breakfast (11th and 12th) at a first-class hotel in downtown Toulouse (Holiday Inn Crowne Plaza), two lunches on the site of the workshop, the workshop banquet, transportation to and from CERT, coffee breaks, and the general costs of the workshop facilities and equipment. Payment should be made either by check payable to "AGENT COMPTABLE DU CERT ONERA" in French francs only or by bank transfer to "AGENT COMPTABLE DU CERT ONERA" Bank: Societe Generale Ramonville Saint Agne Account No. 30003 /02117/ 00037291008/93 Please state the workshop reference: DYNN'96 on all transactions. *********************** IMPORTANT DATES: **************** 15th of January: Deadline for contributions and declarations of interest. 31st of January: Notification of accepted contributions and distribution of the final programme of the workshop. 15th of February: Deadline for registration for the workshop. 
To avoid postage delay, e-mail will be accepted as a usual communication. If you want to attend DYNN'96 please use your computer to reply at once -------------------------------------------------------------------------------- Name Organization Address e-mail ( ) wishes the information about the final program ( ) wishes to attend DYNN'96 ( ) will submit a contribution entitled: ----------------------------------------------------------------------------- Please send your reply to the following e-mail dauce at cert.fr or to xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x Professor Manuel SAMUELIDES x x DERI ONERA-CERT x x BP 4025 x x 31055 Toulouse CEDEX x x FRANCE x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Manuel SAMUELIDES ----------------------------------------------------------------- for research: Researcher at ONERA-CERT samuelid at cert.fr for teaching: Professor at ENSAE Manuel.Samuelides at supaero.fr Tel: (33) 62 17 81 06 Fax: (33) 62 17 83 30 From lemm at LORENTZ.UNI-MUENSTER.DE Fri Dec 15 09:28:49 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Fri, 15 Dec 1995 15:28:49 +0100 Subject: NFL and practice Message-ID: <9512151428.AA24811@xtp141.uni-muenster.de> Huaiyu Zhu responded to >> One may discuss NFL for theoretical reasons, but >> the conditions under which NFL-Theorems hold >> are not those which are normally met in practice. and wrote >Exactly the opposite. The theory behind NFL is trivial (in some sense). >The power of NFL is that it deals directly with what is routinely >practiced in the neural network community today. That depends on how you understand practice. E.g. in nearly all cases functions are somewhat smooth. This is a prior which exists in reality (for example because of input noise in the measuring process). And the situation would be hopeless if we did not use this fact in practice. (That is just what NFL also says.) But, if Huaiyu means that it is necessary to think about the priors in "practice" explicitly, then I fully agree! 
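Lemm's claim that input noise manufactures smoothness can be illustrated in a few lines. The toy below is my own construction (the step function as the rough "true" function and the Gaussian noise scale are arbitrary choices): it estimates the "effective" function g(x) = E[f(x + eps)] by Monte Carlo averaging and compares how sharply the two functions can jump between neighbouring grid points.

```python
import math
import random

def f(x):
    """A deliberately rough 'true' function: a +/-1 step function."""
    return 1.0 if math.sin(5.0 * x) > 0 else -1.0

# The "effective" function is what a measurement with input noise actually
# exposes: g(x) = E[f(x + eps)], eps ~ N(0, sigma^2), estimated by averaging.
rng = random.Random(0)
sigma, m = 0.2, 2000
xs = [i / 100.0 for i in range(200)]
rough = [f(x) for x in xs]
smooth = [sum(f(x + rng.gauss(0.0, sigma)) for _ in range(m)) / m for x in xs]

def max_jump(ys):
    """Largest change between neighbouring grid points: a crude smoothness proxy."""
    return max(abs(ys[i + 1] - ys[i]) for i in range(len(ys) - 1))

print(max_jump(rough), max_jump(smooth))
```

The rough function jumps by the full 2.0 wherever it steps, while the effective function, which is all the noisy measurement ever lets us see, changes only gradually: exactly the regularization-by-measurement effect Lemm describes, obtained without ever postulating a subjective smoothness prior.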
But what I wanted to say is: WE DO HAVE "PRIORS" (BETTER SAY CORRELATIONS BETWEEN ANSWERS TO DIFFERENT QUESTIONS) IN MOST CASES and they are NOT obscure, but very often at least as well MEASURABLE as "normal" sharp data y_i=f(x_i). Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >Joerg continued with examples of various priors of practical concern, >including smoothness, symmetry, positive correlation, iid samples, etc. >These are indeed very important priors which match the real world, >and they are the implicit assumptions behind most algorithms. > >What NFL tells us is: If your algorithm is designed for such a prior, >then say so explicitly so that a user can decide whether to use it. >You can't expect it to be also good for any other prior which you have >not considered. In fact, in a sense, you should expect it to perform >worse than a purely random algorithm on those other priors. Maybe the problem is that Huaiyu Zhu uses the word "PRIOR" for all information which is not of the sharp data form y_i=f(x_i). It suggests that we know something before starting our generalizer. NO, that is not the normal case!!! I mentioned many examples (like measurement with input noise) where "priors" are just normal information which should be used DURING learning like sharp data! (Sharp data might even not be available at all!) And of course using wrong "priors" is similar to using wrong sharp data. But I fully agree that most algorithms use "prior" information only implicitly and that there is a lot of theoretical work to do. In response to >> In many interesting cases "effective" function values contain information >> about other function values and NFL does not hold! 
Huaiyu Zhu continues >This is like saying "In many interesting cases we do have energy sources, >and we can make a machine running forever, so the natural laws against >`perpetual motion machines' do not hold." Indeed, it is a little bit like that, but a system without energy sources is a much better approximation for some real-world systems than a world without "priors" (i.e. without correlated answers over different questions)! So the energy law is useful, but models for worlds without correlated information are NOT, except maybe that they tell us to include the correlation properly! Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From shastri at ICSI.Berkeley.EDU Fri Dec 15 16:34:24 1995 From: shastri at ICSI.Berkeley.EDU (Lokendra Shastri) Date: Fri, 15 Dec 1995 13:34:24 PST Subject: Technical report --- negated knowledge and inconsistency Message-ID: <199512152134.NAA06683@kulfi.ICSI.Berkeley.EDU> Dealing with negated knowledge and inconsistency in a neurally motivated model of memory and reflexive reasoning. Lokendra Shastri and Dean J. Grannes TR-95-041 ICSI August 1995 Recently, SHRUTI has been proposed as a connectionist model of rapid reasoning. It demonstrates how a network of simple neuron-like elements can encode a large number of specific facts as well as systematic knowledge (rules) involving n-ary relations, quantification and concept hierarchies, and perform a class of reasoning with extreme efficiency. The model, however, does not deal with negated facts and rules involving negated antecedents and consequents. We describe an extension of SHRUTI that can encode positive as well as negated knowledge and use such knowledge during reflexive reasoning. 
The extended model explains how an agent can hold inconsistent knowledge in its long-term memory without being ``aware'' that its beliefs are inconsistent, but detect a contradiction whenever inconsistent beliefs that are within a certain inferential distance of each other become co-active during an episode of reasoning. Thus the model is not logically omniscient, but detects contradictions whenever it tries to use inconsistent knowledge. The extended model also explains how limited attentional focus or action under time pressure can lead an agent to produce an erroneous response. A biologically significant feature of the model is that it uses only local inhibition to encode negated knowledge. Like the basic model, the extended model encodes and propagates dynamic bindings using temporal synchrony. Key Words: long-term memory; rapid reasoning; dynamic bindings; synchrony; knowledge representation; neural oscillations; short-term memory; negation; inconsistent knowledge. ftp-server: ftp.icsi.berkeley.edu (128.32.201.55) ftp-file: /pub/techreports/1995/tr-95-041.ps.Z Lokendra Shastri International Computer Science Institute 1947 Center Street, Suite 600 Berkeley, CA 94704 http://www.icsi.berkeley.edu/~shastri ========================== Detailed instructions for retrieving the report: unix% ftp ftp.icsi.berkeley.edu Name (ftp.icsi.berkeley.edu:): anonymous Password: your_name at your_machine ftp> cd /pub/techreports/1995 ftp> binary ftp> get tr-95-041.ps.Z ftp> quit unix% uncompress tr-95-041.ps.Z unix% lpr tr-95-041.ps If your name server does not know about ftp.icsi.berkeley.edu, use 128.32.201.55 instead. All files in this archive can also be obtained through an e-mail interface in case direct ftp is not available. To obtain instructions, send mail containing the line `send help' to: ftpmail at ICSI.Berkeley.EDU As a last resort, hardcopies may be ordered for a small fee. Send mail to info at ICSI.Berkeley.EDU for more information. 
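The contradiction-detection behaviour described in the abstract can be caricatured in a few lines. The sketch below is a deliberate simplification of my own, not the SHRUTI architecture (no temporal synchrony, no collector/enabler nodes, no n-ary binding): each fact carries an explicit sign, long-term memory accepts inconsistent facts without complaint, and a clash is noticed only when both signs of the same proposition become co-active within a bounded number of inference steps.

```python
from collections import deque

class ToyMemory:
    """A cartoon of negation-aware memory: facts are signed triples like
    ("+", "flies", "tweety"); storage never checks global consistency."""

    def __init__(self):
        self.facts = set()
        self.rules = []                      # list of (antecedent, consequent)

    def assert_fact(self, fact):
        self.facts.add(fact)                 # the agent is not "aware" of clashes

    def add_rule(self, ante, cons):
        self.rules.append((ante, cons))

    def query(self, fact, depth=3):
        """Spread activation for at most `depth` steps; report whether `fact`
        became active and which propositions had both signs co-active."""
        active = set(self.facts)
        frontier = deque((f, 0) for f in self.facts)
        while frontier:
            f, d = frontier.popleft()
            if d >= depth:
                continue
            for ante, cons in self.rules:
                if ante == f and cons not in active:
                    active.add(cons)
                    frontier.append((cons, d + 1))
        clashes = {(p, a) for (s, p, a) in active
                   if (("-" if s == "+" else "+"), p, a) in active}
        return fact in active, clashes

m = ToyMemory()
m.assert_fact(("+", "bird", "tweety"))
m.assert_fact(("+", "penguin", "tweety"))
m.add_rule(("+", "bird", "tweety"), ("+", "flies", "tweety"))
m.add_rule(("+", "penguin", "tweety"), ("-", "flies", "tweety"))
answer, clashes = m.query(("+", "flies", "tweety"))
print(answer, clashes)
```

The memory happily stores rules that derive both "flies(tweety)" and its negation; the clash surfaces only during an episode of reasoning that activates both within the depth bound, loosely mirroring the bounded "inferential distance" of the report.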
From cherkaue at cs.wisc.edu Fri Dec 15 19:03:15 1995 From: cherkaue at cs.wisc.edu (cherkaue@cs.wisc.edu) Date: Fri, 15 Dec 1995 18:03:15 -0600 Subject: No free lunch for Cross Validation! Message-ID: <199512160003.SAA03324@mozzarella.cs.wisc.edu> In reply to Huaiyu Zhu's message > ... > >A little while ago someone claimed that > Cross validation will benefit from the presence of any structure, > and if there is no structure it does no harm; > > ... > >Suppose we have a Gaussian variable x, with mean mu and unit variance. >We have the following three estimators for estimating mu from a >sample of size n. > A: The sample mean. It is optimal both in the sense of Maximum >Likelihood and Least Mean Squares. > B: The maximum of sample. It is a bad estimator in any reasonable sense. > C: Cross validation to choose between A and B, with one extra data point. > >The numerical result with n=16 and averaged over 10000 samples, gives >mean squared error: > A: 0.0627 B: 3.4418 C: 0.5646 >This clearly shows that cross validation IS harmful in this case, >despite the fact it is based on a larger sample. NFL still wins! You forgot D: Anti-cross validation to choose between A and B, with one extra data point. I don't understand your claim that "cross validation IS harmful in this case." You seem to equate "harmful" with "suboptimal." Cross validation is a technique we use to guess the answer when we don't already know the answer. You give technique A the benefit of your prior knowledge of the true answer, but C must operate without this knowledge. A fair comparison would pit C against D, not C against A. As you say: >6. In any of the above cases, "anti cross validation" would be even >more disastrous. Kevin Cherkauer Computer Sciences Dept. 
University of Wisconsin-Madison cherkauer at cs.wisc.edu From pkso at castle.ed.ac.uk Sat Dec 16 10:06:41 1995 From: pkso at castle.ed.ac.uk (P Sollich) Date: Sat, 16 Dec 95 15:06:41 GMT Subject: Thesis on Query Learning available Message-ID: <9512161506.aa29855@uk.ac.ed.castle> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/sollich.thesis.tar.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: ASKING INTELLIGENT QUESTIONS --- THE STATISTICAL MECHANICS OF QUERY LEARNING Peter Sollich Department of Physics University of Edinburgh, U.K. Abstract: This thesis analyses the capabilities and limitations of query learning by using the tools of statistical mechanics to study learning in feed-forward neural networks. In supervised learning, one of the central questions is the issue of generalization: Given a set of training examples in the form of input-output pairs produced by an unknown teacher rule, how can one generate a student which generalizes, i.e., which correctly predicts the outputs corresponding to inputs not contained in the training set? The traditional paradigm has been to study learning from random examples, where training inputs are sampled randomly from some given distribution. However, random examples contain redundant information, and generalization performance can thus be improved by query learning, where training inputs are chosen such that each new training example will be maximally 'useful' as measured by a given objective function. We examine two common kinds of queries, chosen to optimize the objective functions, generalization error and entropy (or information), respectively. 
Within an extended Bayesian framework, we use the techniques of statistical mechanics to analyse the average case generalization performance achieved by such queries in a range of learning scenarios, in which the functional forms of student and teacher are inspired by models of neural networks. In particular, we study how the efficacy of query learning depends on the form of teacher and student, on the training algorithm used to generate students, and on the objective function used to select queries. The learning scenarios considered are simple but sufficiently generic to allow general conclusions to be drawn. We first study perfectly learnable problems, where the student can reproduce the teacher exactly. From an analysis of two simple model systems, the high-low game and the linear perceptron, we conclude that query learning is much less effective for rules with continuous outputs -- provided they are `invertible' in the sense that they can essentially be learned from a finite number of training examples -- than for rules with discrete outputs. Queries chosen to minimize the entropy generally achieve generalization performance close to the theoretical optimum afforded by minimum generalization error queries, but can perform worse than random examples in scenarios where the training algorithm is under-regularized, i.e., has too much `confidence' in corrupted training data. For imperfectly learnable problems, we first consider linear students learning from nonlinear perceptron teachers and show that in this case the structure of the student space determines the efficacy of queries chosen to minimize the entropy in {\em student} space. Minimum {\em teacher} space queries, on the other hand, perform worse than random examples due to lack of feedback about the progress of the student. 
For students with discrete outputs, we find that in the absence of information about the teacher space, query learning can lead to self-confirming hypotheses far from the truth, misleading the student to such an extent that it will not approximate the teacher optimally even for an infinite number of training examples. We investigate how this problem depends on the nature of the noise process corrupting the training data, and demonstrate that it can be alleviated by combining query learning with Bayesian techniques of model selection. Finally, we assess which of our conclusions carry over to more realistic neural networks, by calculating finite size corrections to the thermodynamic limit results and by analysing query learning in a simple two-layer neural network. The results suggest that the statistical mechanics analysis is often relevant to real-world learning problems, and that the potentially significant improvements in generalization performance achieved by query learning can be made available, in a computationally cheap manner, for realistic multi-layer neural networks. Criticism, comments and suggestions are welcome. Merry Christmas everyone! Peter Sollich -------------------------------------------------------------------------- Peter Sollich Department of Physics University of Edinburgh e-mail: P.Sollich at ed.ac.uk Kings Buildings phone: +44 - (0)131 - 650 5236 Mayfield Road Edinburgh EH9 3JZ, U.K. -------------------------------------------------------------------------- RETRIEVAL INSTRUCTIONS: Get `sollich.thesis.tar.Z' from the `Thesis' subdirectory of the neuroprose archive. Uncompress, and unpack the resulting tar file (on UNIX: uncompress sollich.thesis.tar.Z; tar xf - < sollich.thesis.tar). This will yield the postscript files listed below. Contact me if there are any problems with retrieval and or printing. QUICK GUIDE for busy readers: For a first look, see sollich_title.ps (has abstract and table of contents). 
File sollich_chapter1.ps contains a general introduction to query learning and an overview of the literature. Finally, for a summary of the main results and open questions, see sollich_chapter9.ps.

LIST OF FILES:
------------------------------------------------------------------------------
Filename              Pages  KB (compressed/uncompressed)  Contents
------------------------------------------------------------------------------
sollich_title.ps         8     37/75     Title, Declaration, Acknowledgements, Publications, Abstract, Table of contents
sollich_chapter1.ps      8     48/98     Introduction
sollich_chapter2.ps     10     48/101    A probabilistic framework for query selection
sollich_chapter3.ps     21    128/376    Perfectly learnable problems: Two simple examples
sollich_chapter4.ps     19    135/337    Imperfectly learnable problems: Linear students
sollich_chapter5.ps     40    228/565    Query learning assuming the inference model is correct
sollich_chapter6.ps     12    244/1050   Combining query learning and model selection
sollich_chapter7.ps     20    217/558    Towards realistic neural networks I: Finite size effects
sollich_chapter8.ps     24    136/299    Towards realistic neural networks II: Multi-layer networks
sollich_chapter9.ps      5     31/59     Summary and Outlook
sollich_bib.ps           8     37/68     Bibliography
------------------------------------------------------------------------------
From zhuh at helios.aston.ac.uk Mon Dec 18 08:11:50 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Mon, 18 Dec 1995 13:11:50 +0000 Subject: NFL and practice Message-ID: <4332.9512181311@sun.aston.ac.uk> I accidentally sent my reply to Joerg Lemm instead of Connectionists. Since he replied on Connectionists, I'll reply here as well, and include my original posting at the end. I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use. The key difference is: Is it legitimate to be vague about the prior? Put it another way: Do you claim the algorithm can pick up whatever prior automatically, instead of having it specified beforehand? My answer is NO, to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without a prior, as NFL tells us. For purely cosmetic reasons, it might be helpful to translate the useless "No Free Lunch Theorem" :-) Without specifying a particular prior, any algorithm is as good as random guessing, into the equivalent, but infinitely more useful, "You Have To Pay For Lunch Theorem" :-) For an algorithm to perform better than random guessing, a particular prior must be specified. On a more practical level, > E.g. in nearly all cases functions are somewhat smooth. Do you specify the scale on which it is smooth? > This is a prior which exists in reality (for example because > of input noise in the measuring process). If you average smoothness over all scales, in a certain uniform way, you get a prior which contains no smoothness at all. If you average them in a non-uniform way, you actually specify a non-uniform prior, which is the crucial piece of information for any algorithm to work at all.
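[Zhu's point that "somewhat smooth" is not yet a prior can be made concrete. The sketch below is my own illustration, not from the posting: smoothness enters as the length-scale of a Gaussian covariance, and the two priors it produces are genuinely different, so an algorithm must commit to one.]

```python
# Illustrative sketch (my construction): "the function is smooth" only
# becomes a usable prior once a scale is specified.  Here the scale is the
# length-scale of an RBF covariance; samples from the two priors below
# behave very differently even though both are "smooth".
import numpy as np

def rbf_prior_samples(xs, length_scale, n_samples=3, seed=0):
    """Draw sample functions from a zero-mean Gaussian prior whose
    smoothness is fixed by `length_scale`."""
    rng = np.random.default_rng(seed)
    d = xs[:, None] - xs[None, :]
    cov = np.exp(-0.5 * (d / length_scale) ** 2)
    cov += 1e-8 * np.eye(len(xs))          # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(xs)), cov, size=n_samples)

xs = np.linspace(0.0, 1.0, 50)
wiggly = rbf_prior_samples(xs, length_scale=0.02)  # "smooth" on a tiny scale
gentle = rbf_prior_samples(xs, length_scale=0.5)   # "smooth" on a large scale

# Mean-square increment between neighbouring grid points: large for the
# short-scale prior, tiny for the long-scale one.
print(np.mean(np.diff(wiggly) ** 2), np.mean(np.diff(gentle) ** 2))
```

Averaging such priors uniformly over all length-scales would wash the smoothness out again, which is exactly Zhu's point.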
> And the situation would be hopeless > if we would not use this fact in practice. It would still be hopeless if we only used the fact of "somewhat smooth", instead of specifying how smooth. See the following for theory and examples: Zhu, H. and Rohwer, R.: Bayesian regression filters and the issue of priors, 1995. To appear in Neural Computing and Applications. ftp://cs.aston.ac.uk/neural/zhuh/reg_fil_prior.ps.Z My original posting is enclosed as the following: ----- Begin Included Message ----- From imlm at tuck.cs.fit.edu Mon Dec 18 16:39:40 1995 From: imlm at tuck.cs.fit.edu (IMLM Workshop (pkc)) Date: Mon, 18 Dec 1995 16:39:40 -0500 Subject: CFP: AAAI-96 Workshop on Integrating Multiple Learned Models Message-ID: <199512182139.QAA10740@tuck.cs.fit.edu> CALL FOR PAPERS/PARTICIPATION INTEGRATING MULTIPLE LEARNED MODELS FOR IMPROVING AND SCALING MACHINE LEARNING ALGORITHMS to be held in conjunction with AAAI 1996 Portland, Oregon August 1996 Most modern machine learning research uses a single model or learning algorithm at a time, or at most selects one model from a set of candidate models. Recently, however, there has been considerable interest in techniques that integrate the collective predictions of a set of models in some principled fashion. With such techniques, the predictive accuracy and/or training efficiency of the overall system can often be improved, since one can "mix and match" among the relative strengths of the models being combined. The goal of this workshop is to gather researchers actively working in the area of integrating multiple learned models, to exchange ideas and foster collaborations and new research directions. In particular, we seek to bring together researchers interested in this topic from the fields of Machine Learning, Knowledge Discovery in Databases, and Statistics. Any aspect of integrating multiple models is appropriate for the workshop.
However, we intend the focus of the workshop to be improving prediction accuracies, and improving training performance in the context of large training databases. More precisely, submissions are sought in, but not limited to, the following topics:

1) Techniques that generate and/or integrate multiple learned models. In particular, techniques that do so by:
   * using different training data distributions (in particular by training over different partitions of the data)
   * using different output classification schemes (for example using output codes)
   * using different hyperparameters or training heuristics (primarily as a tool for generating multiple models)

2) Systems and architectures to implement such strategies. In particular:
   * parallel and distributed multiple learning systems
   * multi-agent learning over inherently distributed data

A paper need not be submitted to participate in the workshop, but space may be limited, so contact the organizers as early as possible if you wish to participate. The workshop format is planned to encompass a full day of half-hour presentations with discussion periods, ending with a brief period for summary and discussion of future activities. Notes or proceedings for the workshop may be provided, depending on the submissions received.

Submission requirements:
i) A short paper of not more than 2000 words detailing recent research results must be received by March 18, 1996.
ii) The paper should include an abstract of not more than 150 words, and a list of keywords. Please include the name(s), email address(es), address(es), and phone number(s) of the author(s) on the first page. The first author will be the primary contact unless otherwise stated.
iii) Electronic submissions in postscript or ASCII via email are preferred. Three printed copies (preferably double-sided) of your submission are also accepted.
iv) Please also send the title, name(s) and email address(es) of the author(s), abstract, and keywords in ASCII via email.
Submission address: imlm at cs.fit.edu Philip Chan IMLM Workshop Computer Science Florida Institute of Technology 150 W. University Blvd. Melbourne, FL 32901-6988 407-768-8000 x7280 (x8062) 407-984-8461 (fax) Important Dates: Paper submission deadline: March 18, 1996 Notification of acceptance: April 15, 1996 Final copy: May 13, 1996 Chairs: Salvatore Stolfo, Columbia University sal at cs.columbia.edu David Wolpert, Santa Fe Institute dhw at santafe.edu Philip Chan, Florida Institute of Technology pkc at cs.fit.edu General Inquiries: Please address general inquiries to one of the co-chairs or send them to: imlm at cs.fit.edu Up-to-date workshop information is maintained on WWW at: http://cs.fit.edu/~imlm/ or http://www.cs.fit.edu/~imlm/ From ces at negi.riken.go.jp Mon Dec 18 20:36:45 1995 From: ces at negi.riken.go.jp (ces@negi.riken.go.jp) Date: Tue, 19 Dec 95 10:36:45 +0900 Subject: PhD Thesis Announcement : nonlinear filters Message-ID: <9512190136.AA21982@negi.riken.go.jp>  FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: (Sorry, no hardcopies available.) - ----------------------------------------------------------------------- Applications of nonlinear filters with the linear-in-the-parameter structure Eng-Siong CHNG Department of Electrical Engineering University of Edinburgh, U.K. Abstract: The subject of this thesis is the application of nonlinear filters, with the linear-in-the-parameter structure, to time series prediction and channel equalisation problems. In particular, the Volterra and the radial basis function (RBF) expansion techniques are considered to implement the nonlinear filter structures. These approaches, however, will generate filters with very large numbers of parameters. 
As large filter models require significant implementation complexity, they are undesirable for practical implementations. To reduce the size of the filter, the orthogonal least squares (OLS) algorithm is considered to perform model selection. Simulations were conducted to study the effectiveness of subset models found using this algorithm, and the results indicate that this selection technique is adequate for many practical applications. The other aspect of the OLS algorithm studied is its implementation requirements. Although the OLS algorithm is very efficient, the required computational complexity is still substantial. To reduce the processing requirement, some fast OLS methods are examined. Two major applications of nonlinear filters are considered in this thesis. The first involves the use of nonlinear filters to predict time series which possess nonlinear dynamics. To study the performance of the nonlinear predictors, simulations were conducted to compare the performance of these predictors with conventional linear predictors. The simulation results confirm that nonlinear predictors normally perform better than linear predictors. Within this study, the application of RBF predictors to time series that exhibit homogeneous nonstationarity is also considered. This type of time series possesses the same characteristic throughout the time sequence apart from local variations of mean and trend. The second application involves the use of filters for symbol-decision channel equalisation. The decision function of the optimal symbol-decision equaliser is first derived to show that it is nonlinear, and that it may be realised explicitly using a RBF filter. Analysis is then carried out to illustrate the difference between the optimum equaliser's performance and that of the conventional linear equaliser. In particular, the effects of delay order on the equaliser's decision boundaries and bit error rate (BER) performance are studied. 
The minimum mean square error (MMSE) optimisation criterion for training the linear equaliser is also examined to illustrate the sub-optimum nature of such a criterion. To improve the linear equaliser's performance, a method which adapts the equaliser by minimising the BER is proposed. Our results indicate that the linear equaliser's performance is normally improved by using the minimum BER criterion. The decision feedback equaliser (DFE) is also examined. We propose a transformation using the feedback inputs to change the DFE problem to a feedforward equaliser problem. This unifies the treatment of the equaliser structures with and without decision feedback. ----------------------------------------------------------- Criticism, comments and suggestions are welcome. Merry Christmas everyone! Eng Siong - -------------------------------------------------------------------------- Eng Siong CHNG Lab. for ABS, Frontier Research Programme, RIKEN, email : ces at negi.riken.go.jp 2-1 Hirosawa, Wako-Shi, Saitama 351-01, JAPAN. - -------------------------------------------------------------------------- RETRIEVAL INSTRUCTIONS: FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z File size : 1715073 bytes Number of pages : 165 pages

unix> ftp archive.cis.ohio-state.edu
Connected to archive.cis.ohio-state.edu.
220 archive.cis.ohio-state.edu FTP server ready.
Name: anonymous
331 Guest login ok, send ident as password.
Password: neuron
230 Guest login ok, access restrictions apply.
ftp> binary
200 Type set to I.
ftp> cd pub/neuroprose/Thesis
250 CWD command successful.
ftp> get chng.thesis.ps.Z
200 PORT command successful.
150 Opening BINARY mode data connection for chng.thesis.ps.Z
226 Transfer complete.
ftp> quit
221 Goodbye.
unix> uncompress chng.thesis.ps.Z
unix> lpr chng.thesis.ps    (postscript printer)

Contact me if there are any problems with retrieval and/or printing.
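[As a concrete illustration of the linear-in-the-parameter structure described in the abstract above — a minimal sketch of my own, not code from the thesis, with all model sizes and widths chosen for illustration: a radial basis function filter is nonlinear in its inputs but linear in its weights, so fitting reduces to ordinary least squares.]

```python
# Minimal sketch (illustrative parameters, not from the thesis): an RBF
# filter for one-step-ahead time-series prediction.  The filter output is
# a weighted sum of Gaussian basis functions of the previous sample, so
# the weights are found by a single linear least-squares solve.
import numpy as np

def rbf_design_matrix(x, centres, width):
    """Each column is one Gaussian basis function evaluated on inputs x."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

# A noisy nonlinear series: y[t] = sin(2.5 * y[t-1]) + small noise.
rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = np.sin(2.5 * y[t - 1]) + 0.05 * rng.standard_normal()

past, target = y[:-1], y[1:]
centres = np.linspace(past.min(), past.max(), 12)
Phi = rbf_design_matrix(past, centres, width=0.3)
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # linear-in-the-parameters fit

pred = Phi @ w
print("RBF predictor MSE:", np.mean((pred - target) ** 2))
```

Model selection (e.g. by OLS, as in the thesis) would then prune the basis functions; the point here is only that the fit itself is linear.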
------- End of Forwarded Message From hag at santafe.edu Mon Dec 18 21:22:57 1995 From: hag at santafe.edu (Howard A. Gutowitz) Date: Mon, 18 Dec 1995 19:22:57 -0700 (MST) Subject: Exploring the Space of CA Message-ID: <9512190222.AA29140@sfi.santafe.edu> Announcing: "Exploring the Space of Cellular Automata" Cellular automata can be thought of as a restricted kind of neural net, in which the cells take on only a finite set of values, and connections are local and regular. This is a set of interactive web pages designed to help you learn about CA, and the use of the lambda parameter to find critical regions in the space of CA. Credits: Concept: Chris Langton. CA simulation program: Patrick Hayden. cgi interface: Eric Carr. Text: Chris Langton, Howard Gutowitz, and Eric Carr. Available from: http://alife.santafe.edu/alife/topics/ca/caweb -- Howard Gutowitz | hag at neurones.espci.fr ESPCI | http://www.santafe.edu/~hag Laboratoire d'Electronique | home: (331) 4707-3843 10 rue Vauquelin | office: (331) 4079-4697 75005 Paris, France | fax: (331) 4079-4425 From hicks at cs.titech.ac.jp Mon Dec 18 23:58:07 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 19 Dec 1995 13:58:07 +0900 Subject: NFL, practice, and CV Message-ID: <199512190458.NAA28669@euclid.cs.titech.ac.jp> Huaiyu Zhu wrote: >You can't make every term positive in your balance sheet, if the grand >total is bound to be zero. There ARE functions which are always non-negative, but which under an appropriate measure integrate to 0. It only requires that 1) the support of the non-negative values is vanishingly small, 2) the non-negative values are bounded So the above statement by Dr. Zhu is not true. In fact I think this ability for pointwise positive values to disappear under integration is key to the "zero-sum" aspect of the NFL theorem holding true, despite the fact that we obviously see so many examples of working algorithms.
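[A textbook measure-theory example of the kind of function Hicks describes, added here for concreteness (it is not from the posting): the indicator of the rationals is bounded and non-negative, strictly positive on a dense set of arguments, and still has Lebesgue integral zero, because that set has measure zero.]

```latex
% Added illustration (standard measure-theory example, not from the posting):
% a bounded, non-negative function, strictly positive on a dense set, whose
% integral nevertheless vanishes -- matching conditions 1) and 2) above.
f(x) \;=\;
\begin{cases}
  1, & x \in \mathbb{Q} \cap [0,1],\\
  0, & \text{otherwise},
\end{cases}
\qquad\text{yet}\qquad
\int_0^1 f(x)\,dx \;=\; 0 .
```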
My key point: A zero-sum (infinite) universe doesn't require negative values. ---- There is another important issue which needs to be clarified, and that is the definition of CV and the kinds of problems to which it can be applied. Now anybody can make whatever definition they want, and then come to some conclusions based upon that definition, and that conclusion may be correct given that definition. However, there are also advantages to sharing a common intellectual currency. I quote below from "An Introduction to the Bootstrap" by Efron and Tibshirani, 1993, Chapter 17.1. It describes well what I meant when I talked about monitoring prediction error in a previous posting, and describes CV as a method for doing that. ================================================== In our discussion so far we have focused on a number of measures of statistical accuracy: standard errors, biases, and confidence intervals. All of these are measures of accuracy for parameters of a model. Prediction error is a different quantity that measures how well a model predicts the response value of a future observation. It is often used for model selection, since it is sensible to choose a model that has the lowest prediction error among a set of candidates. Cross-validation is a standard tool for estimating prediction error. It is an old idea (predating the bootstrap) that has enjoyed a comeback in recent years with the increase in available computing power and speed. In this chapter we discuss cross-validation, the bootstrap, and some other closely related techniques for estimation of prediction error. In regression models, prediction error refers to the expected squared difference between a future response and its prediction from the model: PE = E(y - \hat{y})^2. The expectation refers to repeated sampling from the true population. Prediction error also arises in the classification problem, where the response falls into one of k unordered classes.
For example, the possible responses might be Republican, Democrat, or Independent in a political survey. In classification problems prediction error is commonly defined as the probability of an incorrect classification PE = Prob(\hat{y} \neq y), also called the misclassification rate. The methods described in this chapter apply to both definitions of prediction error, and also to others. ================================================== Craig Hicks Tokyo Institute of Technology From zhuh at helios.aston.ac.uk Tue Dec 19 10:14:20 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Tue, 19 Dec 1995 15:14:20 +0000 Subject: NFL, practice, and CV Message-ID: <8208.9512191514@sun.aston.ac.uk> This is in reply to the criticism by Craig Hicks and Kevin Cherkauer, and will be my last posting in this thread. Craig Hicks thought that my statement (A) > >You can't make every term positive in your balance sheet, if the grand > >total is bound to be zero. is contradictory to his statements (B) > There ARE functions which are always non-negative, but which under > an appropriate measure integrate to 0. > It only requires that > > 1) the support of the non-negative values is vanishingly small, > 2) the non-negative values are bounded But they are actually talking about different things. There is a big difference between positive and non-negative. For all practical purposes, the functions described by (B) can be regarded as identically zero. Translating back to the original topic, statement (B) becomes (C) There are algorithms which are always no worse than random guessing, on any prior, provided that 1) The priors on which it performs better than random guessing have zero probability of occurring in practice. 2) It cannot be infinitely better on these priors. It is true that something improbable may still be possible, but this is only of academic interest.
In most modern treatments of function spaces, functions are only identified up to a set of measure zero, so that phrases like "almost everywhere" or "almost surely" are redundant. I suspect that due to the way the NFL theorems are proved, even (C) is impossible, but this does not matter anyway, because (C) itself is of no practical interest whatsoever. > ---- Considering cross validation, Craig wrote > > There is another important issue which needs to be clarified, and that is the > definition of CV and the kinds of problems to which it can be applied. Now > anybody can make whatever definition they want, and then come to some > conclusions based upon that definition, and that conclusion may be correct > given that definition. However, there are also advantages to sharing a common > intellectual currency. Risking a little over-simplification, I would like to summarise the two usages of CV as follows: (CV1) A method for evaluating estimates, (CV2) A method for evaluating estimators. The key difference is that in (CV1) a decision is made for each sample, while in (CV2) a decision is made for all samples. If (CV1) is applied to two algorithms A and B, then we can always define a third algorithm C, by always choosing the estimate given by either A or B which is favoured by (CV1). But my previous counter-example shows that, averaging over all samples, C can be worse than A. One may seek refuge in statements like "optimal decision for each sample does not mean optimal decision for all samples". Well, such incoherent inference is the defining characteristic of non-Bayesian statistics. In Bayesian decision theory it is well known that a method is optimal iff it is optimal on almost all samples (excluding various measure-zero anomalies). The case of (CV2) is quite different. It is of a higher level than algorithms like A and B.
It is in fact a statistical estimator mapping (D,A,f) to a real number r, where D is a finite data set, A is a given algorithm, f is an objective function, and r is the predicted average performance. It should therefore be compared with other such methods. This appears not to be a topic considered in this discussion. -------------- Kevin Cherkauer wrote > > You forgot > > D: Anti-cross validation to choose between A and B, with one extra data > point. Well, I did not forget that, as you have quoted below, point 6. > > I don't understand your claim that "cross validation IS harmful in this case." > You seem to equate "harmful" with "suboptimal." See my original answer, points 1. and 4. > Cross validation is a technique > we use to guess the answer when we don't already know the answer. This is true for any statistical estimator. > You give > technique A the benefit of your prior knowledge of the true answer, but C must > operate without this knowledge. The prior knowledge is that the distribution is a unit Gaussian with unspecified mean; the true answer is its mean. No, they are not the same thing. C also operates with the knowledge that the distribution is a unit Gaussian, but it refuses to use this knowledge (which implies A is better than B). Instead, it insists on evaluating A and B on a cross-validation set. That's why it performs miserably. > A fair comparison would pit C against D, not C > against A. As you say: > > >6. In any of the above cases, "anti cross validation" would be even > >more disastrous. If the definition were that "An algorithm is good if it is no worse than the worst algorithm", then I would have no objection. Well, almost any algorithm would be good in this sense. However, if the phrase "in any of the above cases" is dropped without putting a prior restriction in place as a remedy, then it is also true that every algorithm is as bad as the worst algorithm. Huaiyu PS.
I think I have already talked enough about this subject so I'll shut up from now on, unless there's anything new to say. More systematic treatment of these subjects instead of counter-examples can be found in the ftp site below. -- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From minton at ISI.EDU Tue Dec 19 14:53:27 1995 From: minton at ISI.EDU (minton@ISI.EDU) Date: Tue, 19 Dec 95 11:53:27 PST Subject: JAIR article Message-ID: <9512191953.AA11913@sungod.isi.edu> Readers of this mailing list may be interested in the following JAIR article, which was just published: Weiss, S.M. and Indurkhya, N. (1995) "Rule-based Machine Learning Methods for Functional Prediction", Volume 3, pages 383-403. PostScript: volume3/weiss95a.ps (527K) compressed, volume3/weiss95a.ps.Z (166K) Abstract: We describe a machine learning method for predicting the value of a real-valued function, given the values of multiple input variables. The method induces solutions from samples in the form of ordered disjunctive normal form (DNF) decision rules. A central objective of the method and representation is the induction of compact, easily interpretable solutions. This rule-based decision model can be extended to search efficiently for similar cases prior to approximating function values. Experimental results on real-world data demonstrate that the new techniques are competitive with existing machine learning and statistical methods and can sometimes yield superior regression performance. 
The PostScript file is available via: -- comp.ai.jair.papers -- World Wide Web: The URL for our World Wide Web server is http://www.cs.washington.edu/research/jair/home.html -- Anonymous FTP from either of the two sites below: CMU: p.gp.cs.cmu.edu directory: /usr/jair/pub/volume3 Genoa: ftp.mrg.dist.unige.it directory: pub/jair/pub/volume3 -- automated email. Send mail to jair at cs.cmu.edu or jair at ftp.mrg.dist.unige.it with the subject AUTORESPOND, and the body GET VOLUME3/FILE-NM (e.g., GET VOLUME3/MOONEY95A.PS) Note: Your mailer might find our files too large to handle. Also, note that compressed files cannot be emailed, since they are binary files. -- JAIR Gopher server: At p.gp.cs.cmu.edu, port 70. For more information about JAIR, check out our WWW or FTP sites, or send electronic mail to jair at cs.cmu.edu with the subject AUTORESPOND and the message body HELP, or contact jair-ed at ptolemy.arc.nasa.gov. From lucas at scr.siemens.com Tue Dec 19 12:26:15 1995 From: lucas at scr.siemens.com (Lucas Parra) Date: Tue, 19 Dec 1995 12:26:15 -0500 (EST) Subject: Preprint: Symplectic Nonlinear Component Analysis Message-ID: <199512191726.MAA04146@owl.scr.siemens.com> Dear fellow connectionists, a preprint of the following NIPS*95 paper is available at: ftp://archive.cis.ohio-state.edu/pub/neuroprose/parra.nips95.ps.Z Symplectic Nonlinear Component Analysis Lucas C. Parra Siemens Corporate Research lucas at scr.siemens.com Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linear correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linear statistical dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis finally should find a factorial representation for nonlinear statistical dependent distributed signals. 
This paper proposes for this task a novel feed-forward, information conserving, nonlinear map - the explicit symplectic transformations. It also solves the problem of non-Gaussian output distributions by considering single coordinate higher order statistics. From jlm at crab.psy.cmu.edu Wed Dec 20 18:16:31 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Wed, 20 Dec 95 18:16:31 EST Subject: Technical Report Available Message-ID: <9512202316.AA19275@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available electronically from our FTP server or in hard copy form. Instructions for obtaining copies may be found at the end of this post. ======================================================================== On the Time Course of Perceptual Choice: A Model Based on Principles of Neural Computation Marius Usher & James L. McClelland Carnegie Mellon University and the Center for the Neural Basis of Cognition Technical Report PDP.CNS.95.5 December 1995 The time course of information processing is discussed in a model based on leaky, stochastic, non-linear accumulation of activation in mutually inhibitory processing units. The model addresses data from choice tasks using both time-controlled (e.g., deadline or response signal) and standard reaction time paradigms, and accounts simultaneously for aspects of data from both paradigms. In special cases, the model becomes equivalent to a classical diffusion process, but in general a more complex type of diffusion occurs. Mutual inhibition counteracts the effects of information leakage, allows flexible choice behavior regardless of the number of alternatives, and contributes to accounts of additional data from tasks requiring choice with conflict stimuli and word identification tasks. 
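[The leaky, stochastic, mutually inhibitory accumulation described in the abstract can be sketched in a few lines. This is my own illustration of the general idea; all parameter names and values are assumptions, not taken from the technical report.]

```python
# Illustrative sketch (parameters invented): leaky, stochastic accumulators
# with mutual inhibition race to a response threshold; the winner is the
# choice and the crossing time is the decision time.
import numpy as np

def race(inputs, leak=0.2, inhibition=0.3, noise=0.1,
         threshold=1.0, dt=0.01, max_steps=10000, seed=2):
    rng = np.random.default_rng(seed)
    x = np.zeros(len(inputs))
    for step in range(max_steps):
        lateral = inhibition * (x.sum() - x)       # inhibition from the others
        dx = (np.asarray(inputs) - leak * x - lateral) * dt \
             + noise * np.sqrt(dt) * rng.standard_normal(len(x))
        x = np.maximum(x + dx, 0.0)                # activations stay non-negative
        if x.max() >= threshold:
            return int(np.argmax(x)), (step + 1) * dt   # (choice, decision time)
    return int(np.argmax(x)), max_steps * dt           # deadline reached

choice, rt = race([1.0, 0.6])   # unit 0 receives the stronger evidence
print(choice, rt)
```

With inhibition set to zero and two units, the difference of the accumulators behaves like a classical diffusion, which is the special case the abstract mentions.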
======================================================================

Retrieval information for pdp.cns TRs:

unix> ftp 128.2.248.152          # hydra.psy.cmu.edu
Name: anonymous
Password:
ftp> cd pub/pdp.cns
ftp> binary
ftp> get pdp.cns.95.5.ps.Z       # gets this tr
ftp> quit
unix> zcat pdp.cns.95.5.ps.Z | lpr   # or however you print postscript

NOTE: The compressed file is 567,075 bytes long. Uncompressed, the file is 1,768,398 bytes long. The printed version is 53 total pages long.

For those who do not have FTP access, physical copies can be requested from Barbara Dorney.

For a list of available PDP.CNS Technical Reports: > get README
For the titles and abstracts: > get ABSTRACTS

From dhw at santafe.edu Wed Dec 20 20:00:48 1995
From: dhw at santafe.edu (David Wolpert)
Date: Wed, 20 Dec 95 18:00:48 MST
Subject: NFL once again, I'm afraid
Message-ID: <9512210100.AA06007@sfi.santafe.edu>

First and foremost, I would like to request that this NFL thread fade out. It is only sowing confusion - people should read the papers on NFL to understand NFL.

[[ Moderator's note: I concur. We've had enough "No Free Lunch" discussion for a while; people are starting to protest. Future discussion should be done in email. -- Dave Touretzky, CONNECTIONISTS moderator ]]

Full stop. *After* that, after there is common grounding, we can all debate. There is much else that connectionists is more appropriate for in the meantime. (To repeat: ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z.) Please, I'm on my knees, use the time that would have been spent thrashing at connectionists in a more fruitful fashion. Like by reading the NFL papers.
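For readers who want the core uniform-prior claim in concrete form before opening the papers, here is a toy enumeration. It is our own illustration, not an example from the NFL papers; the two learners ("majority" and its deliberately perverse "anti-majority" twin) are made up for the demonstration. Averaged uniformly over every Boolean target on a five-point input space, both have exactly 50% off-training-set error.

```python
from itertools import product

X = range(5)                 # tiny input space
train = [0, 1, 2]            # training inputs; test on the rest (off-training-set)
test = [x for x in X if x not in train]

def majority(labels):        # predict the most common training label everywhere
    return int(sum(labels) * 2 >= len(labels))

def anti_majority(labels):   # the deliberately "bad" learner
    return 1 - majority(labels)

def mean_ots_error(learner):
    """Zero-one off-training-set error, averaged uniformly over
    every possible target f: X -> {0, 1}."""
    errs = []
    for f in product([0, 1], repeat=len(X)):   # all 2^5 targets
        guess = learner([f[x] for x in train])
        errs.append(sum(guess != f[x] for x in test) / len(test))
    return sum(errs) / len(errs)

# Both come out to exactly 0.5: off the training set, under the uniform
# average over targets, the training labels carry no usable information.
```

The same enumeration with any other pair of learners (including one driven by cross-validation over candidate algorithms) gives the same tie, which is the point of the uniform-prior NFL theorem.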
:-)

***

Hicks writes:

>>> case 1: *Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". >>>

This is wrong. Construct the following algorithm: "If CV says one of the algorithms under consideration has particularly low error in comparison to the others, use that algorithm. Otherwise, choose randomly among the algorithms." Averaged over all targets, this will do exactly as well as the algorithm that always guesses randomly among the algorithms. (For zero-one loss, either OTS error or IID error with a big input space, etc.) So you cannot rely on CV's error estimate *at all* (unless you impose a prior over targets or some such).

Alternatively, keep in mind the following simple argument: In its uniform-prior(targets) formulation, NFL holds even for error distributions conditioned on *any* property of the training set. So in particular, you can condition on having a training set for which CV says "yep, I'm sure; choose that one". And NFL still holds. So even in those cases where CV "is sure", by following CV you'll die as often as not.

>>> case 2: *The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). >>>

This is wrong. As has been mentioned many times, having structure in the target, by itself, gains you nothing. And as has also been mentioned, if "the candidate functions are positively correlated with the target function", then in fact *anti-CV wins*. READ THE PAPERS.

>>> By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function.
This is different from the ensemble and distribution of target functions, which is a much bigger question. >>>

This distinction is irrelevant. There are versions of NFL that address both of these cases (as well as many others). READ THE PAPERS.

*****

Lemm writes:

>>> 1.) In short, NFL assumes that data, i.e. information of the form y_i = f(x_i), do not contain information about function values on a non-overlapping test set. >>>

This is wrong. See all the previous discussion about how NFL holds even if you restrict yourself to targets with a lot of structure. The problem is that the structure can hurt just as easily as help. There is no need for the data set to contain no information about the test set - simply that the limited types of information can "confuse" the learning algorithm at hand. READ THE PAPERS.

>>> This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... >>>

This is wrong. There is (obviously) a version of NFL that holds for uniform priors. And there is another version in which one averages over all priors - so the uniform prior has measure 0. But one can also restrict oneself to averaging only over those priors "with a lot of structure", and again get NFL. And there are many other versions of NFL in which there is *no* prior, because things are conditioned on a fixed target. Exactly as in (non-Bayesian) sampling-theory statistics. Some of those alternative NFL results involve saying "if you're conditioning on a target, there are as many such targets where you die as where you do well". Other NFL results never vary the target *in any sense*, even to compare different targets. Rather they vary something concerning the generalizer. This is the case with the more sophisticated cross-validation results, for example. READ THE PAPERS.

>>> There is much information which is not of this "single sharp data" type. (See examples below.)
>>>

*Obviously* if you have extra information and/or knowledge beyond that in the training set, you can (often) do better than random. That's what Bayesian analysis is all about. More generally, as I have proven in [1], the probability of error can be written as a non-Euclidean inner product between the learning algorithm and the posterior. So obviously if your posterior is structured in an appropriate manner, that can be exploited by the algorithm. This was never the issue, however. The issue had to do with "blind" supervised learning, in which one has no such additional information. Like in COLT, for example. You're arguing apples and oranges here.

>>> 4) Real measurements (especially of continuous variables) normally do also NOT have the form y_i = f(x_i)! They mostly perform some averaging over f(x_i), or at least they have some noise on the x_i (as small as you like, but present). >>>

Again, this is obvious. And stated explicitly in the papers, moreover. And completely irrelevant to the current discussion. The issue at hand has *always* been "sharp" data. And if you look at what's done in the neural net community, or in COLT, 95% of it assumes "sharp data". Indeed, there are many other assumptions almost always made and almost never true that Lemm has missed. Like making a "weak filtering assumption": assuming the target and the distribution over inputs are independent. But again, just like in COLT, we're starting simple here, with such assumptions intact. READ THE PAPERS.

>>> This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. >>>

To give one simple example, what about categorical data, where there is not even a partial ordering over the inputs? What does "locally smooth" even mean then?
And even if we're dealing with real-valued spaces, if there's input-space noise, NFL simply changes to a statement concerning test-set elements that are sufficiently far (on the scale of the input-space noise) from the elements of the training set. The input-space noise makes the math messier, but doesn't change the underlying phenomenon. (Readers interested in previous work on the relationship between local (!) regularization, smoothness, and input noise should see Bishop's Neural Computation article of about 6 months ago.)

>>> Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >>>

Sigh.

1) I am a Bayesian whenever feasible. (In fact, I've been taken to task for being "too Bayesian".) But situations without obvious priors - or where eliciting the priors is not trivial and you don't have the time - are in fact *very* common. A simple example is a project I am currently involved in for detecting phone fraud for MCI. Quick, tell me the prior probability that a fraudulent call arises from area code 617 vs. the prior probability that a non-fraudulent call does...

2) Essentially all of COLT is non-Bayesian. (Although some of it makes assumptions about things like the support of the priors.) You haven't a prayer of really understanding what COLT has to say without keeping in mind the admonitions of NFL.

3) As I've now said until I'm blue in the face, NFL is only the starting point. What it's "good for", beyond proving to people that they must pay attention to their assumptions, be wary of COLT-type claims, etc., is: head-to-head minimax theory, scrambled-algorithms theory, hypothesis-averaging theory, etc., etc., etc. READ THE PAPERS.

****

Zhu writes:

>>> I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use.
The key difference is: Is it legitimate to be vague about the prior? Put it another way: Do you claim the algorithm can pick up whatever prior automatically, instead of its being specified beforehand? My answer is NO, to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without a prior, as NFL told us. >>>

Yes! Everybody, LISTEN TO ZHU!!!!

David Wolpert

[1] Wolpert, D. "The Relationship Between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in "The Mathematics of Generalization", D. Wolpert (Ed.), Addison-Wesley, 1995.

From terry at salk.edu Wed Dec 20 20:34:15 1995
From: terry at salk.edu (Terry Sejnowski)
Date: Wed, 20 Dec 95 17:34:15 PST
Subject: Senior Position at GSU
Message-ID: <9512210134.AA16333@salk.edu>

Forwarded to Connectionists:

Date: Mon, 18 Dec 1995 15:00:23 -0500 (EST)
From: Donald Edwards
Subject: job

Dear friends and colleagues,

I am writing to let you know of a senior position in computational neuroscience available here in the Department of Biology at Georgia State University. This person would join neurobiologists, physicists, mathematicians and computer scientists in the newly established Center for Neural Communication and Computation, and would participate in the graduate program in Neurobiology in the Department of Biology. This person would also help guide the construction, equipping and staffing of a Laboratory for Computational Neuroscience, for which funds have already been obtained from the Georgia Research Alliance.

Georgia State University is located in downtown Atlanta. For more information, please contact me at this address, or call (404) 651-3148. To apply, please send a letter of intent, c.v., and two letters of reference to: Search Committee for Computational Neuroscience, Department of Biology, Georgia State University, Atlanta, GA 30302-4010. FAX: (404) 651-2509. Please share this message with anyone who might be interested.
Thanks for your consideration,
Don Edwards

From erik at kuifje.bbf.uia.ac.be Thu Dec 21 12:48:50 1995
From: erik at kuifje.bbf.uia.ac.be (Erik De Schutter)
Date: Thu, 21 Dec 95 17:48:50 GMT
Subject: Crete Course in Computational Neuroscience
Message-ID: <9512211748.AA27308@kuifje.bbf.uia.ac.be>

CRETE COURSE IN COMPUTATIONAL NEUROSCIENCE
AUGUST 25 - SEPTEMBER 21, 1996
CRETE, GREECE

DIRECTORS: Erik De Schutter (University of Antwerp, Belgium), Idan Segev (Hebrew University, Jerusalem, Israel), Jim Bower (California Institute of Technology, USA), Adonis Moschovakis (University of Crete, Greece)

The Crete Course in Computational Neuroscience introduces students to the practical application of computational methods in neuroscience, in particular how to create biologically realistic models of neurons and networks. The course consists of two complementary parts. A distinguished international faculty gives morning lectures on topics in experimental and computational neuroscience. The rest of the day is spent learning how to use simulation software and how to implement a model of the system the student wishes to study.

The first week of the course introduces students to the most important techniques in modeling single cells, networks and neural systems. Students learn how to use the GENESIS, NEURON, XPP and other software packages on their individual Unix workstations. During the following three weeks the lectures will be more general, moving from modeling single cells and subcellular processes through the simulation of simple circuits and large neuronal networks and, finally, to system-level models of the cortex and the brain. The course ends with a presentation of the student modeling projects.

The Crete Course in Computational Neuroscience is designed for advanced graduate students and postdoctoral fellows in a variety of disciplines, including neurobiology, physics, electrical engineering, computer science and psychology.
Students are expected to have a basic background in neurobiology as well as some computer experience. A total of 25 students will be accepted, the majority of whom will be from the European Union and affiliated countries. A tuition fee of 500 ECU ($700) covers travel to Crete, lodging and all course-related expenses for European nationals. We encourage students from the Far East and the USA to also apply to this international course.

More information and application forms can be obtained:
- WWW access: http://bbf-www.uia.ac.be/CRETE/Crete_index.html
- by mail: Prof. E. De Schutter, Born-Bunge Foundation, University of Antwerp - UIA, Universiteitsplein 1, B2610 Antwerp, Belgium
- email: crete_course at kuifje.bbf.uia.ac.be

APPLICATION DEADLINE: April 10th, 1996. Applicants will be notified of the results of the selection procedures before May 1st.

FACULTY: M. Abeles (Hebrew University, Jerusalem, Israel), D.J. Amit (University of Rome, Italy and Hebrew University, Israel), R.E. Burke (NIH, USA), C.E. Carr (University of Maryland, USA), A. Destexhe (Université Laval, Canada), R.J. Douglas (Institute of Neuroinformatics, Zurich, Switzerland), T. Flash (Weizmann Institute, Rehovot, Israel), A. Grinvald (Weizmann Institute, Israel), J.J.B. Jack (Oxford University, England), C. Koch (California Institute of Technology, USA), H. Korn (Institut Pasteur, France), A. Lansner (Royal Institute of Technology, Sweden), R. Llinas (New York University, USA), E. Marder (Brandeis University, USA), M. Nicolelis (Duke University, USA), J.M. Rinzel (NIH, USA), W. Singer (Max-Planck Institute, Frankfurt, Germany), S. Tanaka (RIKEN, Japan), A.M. Thomson (Royal Free Hospital, England), S. Ullman (Weizmann Institute, Israel), Y. Yarom (Hebrew University, Israel).

The Crete Course in Computational Neuroscience is supported by the European Commission (4th Framework Training and Mobility of Researchers program) and by The Brain Science Foundation (Tokyo).
Local administrative organization: the Institute of Applied and Computational Mathematics of FORTH (Crete, GR).

From udah075 at kcl.ac.uk Thu Dec 21 12:53:21 1995
From: udah075 at kcl.ac.uk (Rasmus Petersen)
Date: Thu, 21 Dec 95 17:53:21 GMT
Subject: studentships for European students
Message-ID: <3027.9512211753@maths1.mth.kcl.ac.uk>

**************************************************************

Studentships - For EU Students - Please note new age limit

It was agreed by the Human Resources Committee and endorsed by the Executive Board of NEuroNet in Paris that up to 10,000 ECU be allocated for studentships each year. These provide support for registration, accommodation and travel to designated workshops and conferences with a significant tutorial component. (The studentships are a fixed value.)

Up to 22 studentships of 450 ECU each will be available for the NEuroFuzzy '96 workshop and tutorials in Prague from 16th-18th April 1996. Applications for these studentships must be received in the NEuroNet Office before 31st December 1995. Successful applicants will be notified in January 1996.

Up to 20 studentships of 500 ECU each will be available for the ICANN '96 conference in Bochum, Germany from 16th-19th July 1996. Applications for these studentships must be received in the NEuroNet Office before 3rd March 1996. Successful applicants will be notified in April 1996.

Applicants for studentships are limited to full-time students who are EU nationals and aged 30 years or less. (Priority will be given to applicants aged under 25.) All applications should be accompanied by a letter of support from the applicant's Head of Department and should contain verification of the applicant's age, status as a student, and nationality. All applications will be reviewed by the Human Resources Committee of NEuroNet.
Please apply in writing to the NEuroNet Administrator:

Ms Terhi Garner
NEuroNet
Department of Electronic and Electrical Engineering
King's College London
Strand, London WC2R 2LS, UK
Fax: +44 (0) 171 873 2559

***********************************************************************

From dhw at santafe.edu Fri Dec 29 19:54:42 1995
From: dhw at santafe.edu (dhw@santafe.edu)
Date: Fri, 29 Dec 95 17:54:42 MST
Subject: Postdoc opening
Message-ID: <9512300054.AA17781@yaqui>

The Santa Fe Institute is soliciting applications for a TXN postdoctoral fellow. The fellow is expected to perform research in Machine Learning, Artificial Intelligence, or related areas of statistics. Information about the SFI can be found at http://www.santafe.edu/.

Candidates should have a Ph.D. (or expect to receive one soon) and should have backgrounds in computer science, mathematics, statistics, or related fields. Applicants should submit a curriculum vitae, list of publications, statement of research interests, and three letters of recommendation. Please submit your materials in one complete package. Incomplete applications will not be considered. All application materials must be received by March 1, 1996. Decisions will be made by April 1996.

Send complete application packages only, preferably hard copy, to:

TXN Postdoctoral Committee
Attention: David Wolpert
Santa Fe Institute
1399 Hyde Park Road
Santa Fe, New Mexico 87501

Include your e-mail address and/or fax number. The SFI is an equal opportunity employer. Women and minorities are encouraged to apply.

From bozinovs at delusion.cs.umass.edu Sun Dec 31 17:55:53 1995
From: bozinovs at delusion.cs.umass.edu (bozinovs@delusion.cs.umass.edu)
Date: Sun, 31 Dec 1995 17:55:53 -0500
Subject: New Book
Message-ID: <9512312255.AA25407@delusion.cs.umass.edu>

Dear Connectionists,

Happy New Year to everybody! At the end of the year it is my pleasure to announce a new book in the field.
Advertisement:

*********************************************************************
New Book! New Book! New Book! New Book! New Book! New Book!
---------------------------------------------------------------------

CONSEQUENCE DRIVEN SYSTEMS
by Stevo Bozinovski

*201 pages *79 figures *27 algorithm descriptions *8 tables

Among its special features, the book:
---------------------------------------
** provides a unified theory of response-sensitive teaching and learning
** as a result of that theory, describes a generic architecture of a neuro-genetic agent capable of performing in 1) consequence-sensitive teaching, 2) reinforcement learning, and 3) self-reinforcement learning paradigms
** describes the Crossbar Adaptive Array (CAA) architecture, a 1981 neural network developed within the Adaptive Networks Group, as an example of a neuro-genetic agent
** explains how the CAA architecture was the first neural network that solved a delayed reinforcement learning task, the Dungeons-and-Dragons task, in 1981
** explains how the 1981 learning method (shown on the cover of the book) is actually the well-known Q-learning method, rediscovered in 1989
** introduces the Benefit-Cost CAA (B-C CAA) as an extension of the 1981 Benefit-only CAA architecture
** introduces the at-subgoal-go-back algorithm as a modification of the 1981 at-goal-go-back CAA algorithm
** introduces a new type of neuron, denoted the Provoking Adaptive Unit, for dealing with tasks of Distributed Consequence Programming
** illustrates the usage of those neurons as routers in a routing-in-networks-with-faults task
** uses parallel programming techniques in describing the algorithms throughout the book
-----------------------------------------

Ordering information:
ISBN 9989-684-06-5, Gocmar Press, 1995
price: $15, paperback

For further information contact the author: bozinovs at cs.umass.edu
**********************************************************************
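For readers unfamiliar with the Q-learning method the announcement refers to, it is the one-step tabular update Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). The following sketch of that update is our own illustration on a made-up four-state corridor task, not an example from the book; all names and parameter values are assumptions.

```python
import random

# A 4-state corridor: actions 0 (left) / 1 (right); reward 1.0 on reaching state 3.
def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(3, state + 1)
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

def q_learn(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(4)]          # Q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s2, r, done = step(s, a)
            # the one-step Q-learning update (terminal states bootstrap to 0)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) * (not done) - Q[s][a])
            s = s2
    return Q

Q = q_learn()
```

After training, the greedy policy is "go right" in every non-terminal state, with Q-values discounted by gamma per step of distance from the goal.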
CONTENTS:

1. INTRODUCTION
   1.1. The framework
   1.2. Agents and architectures
   1.3. Neural architectures
        1.3.1. Greedy policy neural architectures
        1.3.2. Recurrent architectures
        1.3.3. Crossbar architectures
        1.3.4. Subsumption architecture adaptive arrays
   1.4. Problems. Emotional Graphs
   1.5. Games. Emotional Petri Nets
   1.6. Parallel programming
   1.7. Bibliographical and other notes
2. CONSEQUENCE LEARNING AGENTS: A STRUCTURAL THEORY
   2.1. The agent-environment interface
   2.2. A taxonomy of learning paradigms
   2.3. Classes of consequence learning agents
   2.4. A generic consequence learning architecture
   2.5. Learning rules and routines
   2.6. Bibliographical and other notes
3. CONSEQUENCE DRIVEN TEACHING
   3.1. Class T agents
   3.2. Learners
   3.3. Teachers
        3.3.1. Toward a theory of teaching systems
        3.3.2. Teaching strategies
   3.4. Curriculums
        3.4.1. Curriculum grammars and languages
        3.4.2. Curriculum space approach
   3.5. Pattern classification teaching as integer programming
   3.6. Pattern classification teaching as dynamic programming
   3.7. Bibliographical and other notes
4. EXTERNAL REINFORCEMENT LEARNING
   4.1. Reinforcement learning NG agents
   4.2. Associative Search Network (ASN)
        4.2.1. Basic ASN
        4.2.2. Reinforcement predictive ASN
   4.3. Actor-Critic architecture
   4.4. Bibliographical and other notes
5. SELF-REINFORCEMENT LEARNING
   5.1. Conceptual framework
   5.2. Self-reinforcement learning and the NG agents
   5.3. The Crossbar Adaptive Array architecture
   5.4. How it works
        5.4.1. Defining primary goals from the genetic environment
        5.4.2. Secondary reinforcement mechanism
        5.4.3. The CAA learning method
   5.5. Example of a CAA architecture
   5.6. Solving problems with a CAA architecture
        5.6.1. Learning in emotional graphs: Maze running
        5.6.2. Learning in loosely defined emotional graphs: Pole balancing
   5.7. Another example of a CAA architecture
   5.8. Using entropy in Markov Decision Processes
   5.9. Issues on the genetic environment
        5.9.1. CAA architecture as an optimization architecture
        5.9.2.
Complementarity with the Genetic Algorithms
        5.9.3. Self-reinforcement: Genetic environment approach
   5.10. Bibliographical and other notes
6. CONSEQUENCE PROGRAMMING
   6.1. Dynamic Programming and Markov Decision Problems
   6.2. Introducing cost in the CAA architecture
   6.3. Q-learning
   6.4. A taxonomy of the CAA-method based learning algorithms
   6.5. Producing optimal solutions in a stochastic environment
   6.6. Distributed Consequence Programming: A neural theory
        6.6.1. Provoking units: Axon provoked neurons
        6.6.2. An illustration: Routing in client-server networks with faults
   6.7. Bibliographical and other notes
7. SUMMARY
8. REFERENCES
9. INDEX
*********************************************************************