From dhw at santafe.edu Fri Dec 1 11:18:19 1995
From: dhw at santafe.edu (David Wolpert)
Date: Fri, 1 Dec 95 09:18:19 MST
Subject: Correcting misunderstandings about NFL
Message-ID: <9512011618.AA27395@sfi.santafe.edu>

This posting is to correct some misunderstandings that were recently posted concerning the NFL theorems. I also draw attention to some of the incorrect interpretations commonly ascribed to certain COLT results.

***

Joerg Lemm writes:

>>> 1.) If there is no relation between the function values on the test and training set (i.e. P(f(x_j)=y|Data) equal to the unconditional P(f(x_j)=y) ), then, having only training examples y_i = f(x_i) (=data) from a given function, it is clear that I cannot learn anything about values of the function at different arguments, (i.e. for f(x_j), with x_j not equal to any x_i = nonoverlapping test set). >>>

Well put. Now here's the tough question: Vapnik *proves* that it is unlikely (for large enough training sets and small enough VC dimension generalizers) for error on the training set and full "generalization error" to be greatly different. Regardless of the target. Using this, Baum and Haussler even wrote a paper "What size net gives valid generalization?" in which no assumptions whatsoever are made about the target, and yet the authors are able to provide a response to the question of their title.

HOW IS THAT POSSIBLE GIVEN WHAT YOU JUST WROTE????

NFL is "obvious". And so are VC bounds on generalization error (well, maybe not "obvious"). And so is the PAC "proof" of Occam's razor. And yet the latter two bound generalization error (for those cases where training set error is small enough) without making any assumptions about the target. What gives?

The answer: The math of those works is correct. But far more care must be exercised in the interpretation of that math than you will find in those works.
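A toy preview of the kind of care at issue (my own construction, not from the papers under discussion; it uses a Hoeffding bound as a simpler stand-in for a VC bound, and all the numbers are arbitrary). The bound is valid over random draws of the data, yet the moment you condition on having noticed a low empirical error, its guarantee evaporates:

```python
import math
import random

random.seed(0)

m, eps, true_err = 20, 0.3, 0.5
# Hoeffding: over random size-m samples, P(|emp - true| > eps) <= 2*exp(-2*m*eps^2)
bound = 2 * math.exp(-2 * m * eps * eps)

trials = 20000
bad = 0                # unconditional failures of the bound's event
picked = 0             # trials where we "noticed" a low empirical error
bad_given_picked = 0   # failures among the picked trials
for _ in range(trials):
    emp = sum(random.random() < true_err for _ in range(m)) / m
    if abs(emp - true_err) > eps:
        bad += 1
    if emp <= 0.15:    # selection: only invoke the bound when error looks small
        picked += 1
        if abs(emp - true_err) > eps:
            bad_given_picked += 1

# The unconditional guarantee holds comfortably...
print(bad / trials, "<=", bound)
# ...but conditioned on the selection, the "unlikely" event happens every time,
# because emp <= 0.15 already forces |emp - 0.5| > 0.3.
if picked:
    print(bad_given_picked / picked)
```

The selection step is the "We-Learn-It Inc." trap in miniature: the probability the bound controls has the random training set on the left of the conditioning bar, not the observed training error on the right.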
The care involves paying attention to what goes on the right-hand side of the conditioning bars in one's probabilities, and the implications of what goes there. Unfortunately, such conditioning bars are completely absent in those works... (In fact, the sum-total of the difference between Bayesian and COLT approaches to supervised batch learning lies in what's on the right-hand side of those bars, but that's another story. See [2].)

As an example, it is widely realized that VC bounds suffer from being worst-case. However there is another hugely important caveat to those bounds. The community as a whole simply is not aware of that caveat, because the caveat concerns what goes on the right-hand side of the conditioning bar, and this is NEVER made explicit. This caveat is the fact that VC bounds do NOT concern Pr(IID generalization error | observed error on the training set, training set size, VC dimension of the generalizer). But you wouldn't know that to read the claims made on behalf of those bounds ...

To give one simple example of the ramifications of this: Let's say you have a favorite low-VC generalizer. And in the course of your career you parse through learning problems, either explicitly or (far more commonly) without even thinking about it. When you come across one with a large training set on which your generalizer has small error on the training set, you want to invoke Vapnik to say you have assurances about full generalization error. Well, sorry. You don't and you can't. You simply can't escape Bayes by using confidence intervals. Confidence intervals in general (not just in VC work) have the annoying property that as soon as you try to use them, very often you contradict the underlying statistical assumptions behind them. Details are in [1] and in the discussion of "We-Learn-It Inc." in [2].

>>> 2.)
We are considering two of those (influence) relations P(f(x_j)=y|Data): one, named A, for the true nature (=target) and one, named B, for our model under study (=generalizer). Let P(A and B) be the joint probability distribution for the influence relations for target and generalizer.

3.) Of course, we do not know P(A and B), but in good old Bayesian tradition, we can construct a (hyper-)prior P(C) over the family of probability distributions of the joint distributions C = P(A and B).

4.) NFL now uses the very special prior assumption P(A and B) = P(A)P(B) >>>

If I understand you correctly, I would have to disagree. NFL also holds with your P(C) being any prior assumption - more formally, averaging over all priors, you get NFL. So the set of priors for which your favorite algorithm does *worse than random* is just as large as the set for which it does better. (In this sense, the uniform prior is a typical prior, not a pathological one, out on the edge of the space. It is certainly not a "very special prior".)

In fact, that's one of the major points of NFL - it's not to see what life would be like if this or that were uniform, but to use such uniformity as a mathematical tool, to get a handle on the underlying geometry of inference, the size of the various spaces (e.g., the size of the space of priors for which you lose to random), etc. The math *starts* with NFL, and then goes on to many other things (see [1]). It's only the beginning chapter of the textbook.

>>> I say that it is rational to believe (and David does so too, I think) that in real life cross-validation works better in more cases than anti-cross-validation. >>>

Oh, most definitely. There are several issues here: 1) what gives with all the "prior-free" general proofs of COLT, given NFL, 2) purely theoretical issues (e.g., as mentioned before, characterizing the relationship between target and generalizers needed for xval. to beat anti-xval.)
and 3) perhaps most provocatively of all, seeing if NFL (and the associated mathematical structure) can help you generalize in the real world (e.g., with head-to-head minimax distinctions between generalizers).

***

Finally, Eric Baum weighs in:

>>> Barak Pearlmutter remarked that saying We have *no* a priori reason to believe that targets with "low Kolmogorov complexity" (or anything else) are/not likely to occur in the real world. (which I gather was a quote from David Wolpert?) is akin to saying we have no a priori reason to believe there is non-random structure in the world, which is not true, since we make great predictions about the world. >>>

Well, let's get a bit formal here. Take all the problems we've ever tried to make "great predictions" on. Let's even say that these problems were randomly chosen from those in the real world (i.e., no selection effects of people simply not reporting when their predictions were not so great). And let's for simplicity say that all the predictions were generated by the same generalizer - the algorithm in the brain of Eric Baum will do as a straw man.

Okay. Now take all those problems together and view them as one huge training set. Better still, add in all the problems that Eric's ancestors addressed, so that the success of his DNA is also taken into account. That's still one training set. It's a huge one, but it's tiny in comparison to the full spaces it lives in.

Saying that we (Eric) make "great predictions" simply means that the xvalidation error of our generalizer (Eric) on that training set is small. (You train on part of the data, and predict on the rest.) Formally (!!!!!), this gives no assurances whatsoever about any behavior off-training-set. As I've stated before, without assumptions, you cannot conclude that low xvalidation error leads to low off-training-set generalization error. And of course, each passing second, each new scene you view, is "off-training-set".
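In a small enough setting this can be checked exhaustively (a toy sketch of my own, not the formal framework of the NFL papers): take a four-point input space, all sixteen Boolean targets, and a learner together with its deliberately perverse "anti" counterpart. Averaged uniformly over targets, their off-training-set errors are identical:

```python
from itertools import product

X = [0, 1, 2, 3]    # tiny input space
train_x = [0, 1]    # training inputs
test_x = [2, 3]     # off-training-set inputs

def majority(train_y):
    # predict the commonest training label at every off-training-set point
    return 1 if 2 * sum(train_y) >= len(train_y) else 0

def anti_majority(train_y):
    # the deliberately perverse counterpart
    return 1 - majority(train_y)

def avg_ots_error(alg):
    # zero-one off-training-set error, averaged uniformly over all 2**4 targets
    targets = list(product([0, 1], repeat=len(X)))
    total = 0.0
    for f in targets:
        pred = alg([f[x] for x in train_x])
        total += sum(pred != f[x] for x in test_x) / len(test_x)
    return total / len(targets)

print(avg_ots_error(majority), avg_ots_error(anti_majority))  # -> 0.5 0.5
```

Off the training set, the uniform average over targets washes out everything the training data could tell you, so both learners land at exactly chance.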
The fallacy in Eric's claim was noted all the way back by Hume. Success at inductive inference cannot formally establish the utility of using inductive inference. To claim that it can, you have to invoke inductive inference, and that, as any second grader can tell you, is circular reasoning.

Practically speaking of course, none of this is a concern in the real world. We are all (me included) quite willing to conclude there is structure in the real world. But as was noted above, what we do in practice is not the issue. The issue is one of theory.

***

It's very similar to high-energy physics. There are a bunch of physical constants that, if only slightly varied, would (seem to) make life impossible. Why do they have the values they have? Some invoke the anthropic principle to answer this - we wouldn't be around if they had other values. QED. But many find this a bit of a cop-out, and search for something more fundamental. After all, you could have stopped the progress of physics at any point in the past if you had simply gotten everyone to buy into the anthropic principle at that point in time.

Similarly with inductive inference. You could just cop out and say "anthropic principle" - if inference were not possible, we wouldn't be having this debate. But that's hardly a satisfying answer.

***

Eric goes on:

>>> Consider the problem of learning to predict the pressure of a gas from its temperature. Wolpert's theorem, and his faith in our lack of prior about the world, predict, that any learning algorithm whatever is as likely to be good as any other. This is not correct. >>>

To give two examples from just the past month, I'm sure MCI and Coca-Cola would be astonished to know that the algorithms they're so pleased with were designed for them by someone having "faith in our lack of prior about the world". Less glibly, let me address this claim about my "faith" with two quotes from the NFL for supervised learning paper.
The first is in the introduction, and the second in a section entitled "On uniform averaging". So neither is exactly hidden ...

1) "It cannot be emphasized enough that no claim is being made .. that all algorithms are equivalent in the real world."

2) "The uniform sums over targets ... weren't chosen because there is strong reason to believe that all targets are equally likely to arise in practice. Indeed, in many respects it is absurd to ascribe such a uniformity over possible targets to the real world. Rather the uniform sums were chosen because such sums are a useful theoretical tool with which to analyze supervised learning."

Finally, given that I'm mixing it up with Eric on NFL, I can't help but quote the following from his "What size net gives valid generalization" paper: "We have given bounds (independent of the target) on the training set size vs. neural net size need such that valid generalization can be expected." (Parenthetical comment added - and true.)

Nowhere in the paper is there any discussion whatsoever of the apparent contradiction between this statement and NFL-type concerns. Indeed, as mentioned above, with only the conditioning-bar-free mathematics in Eric's paper, there is no way to resolve the contradiction. In this particular sense, that paper is extremely misleading. (See discussion above on misinterpretations of Vapnik's results.)

>>>> Creatures evolving in this "play world" would exploit this structure and understand their world in terms of it. There are other things they would find hard to predict. In fact, it may be mathematically valid to say that one could mathematically construct equally many functions on which these creatures would fail to make good predictions. But so what? So would their competition. This is not relevant to looking for one's key, which is best done under the lamppost, where one has a hope of finding it. In fact, it doesn't seem that the play world creatures would care about all these other functions at all.
>>>

I'm not sure I quite follow this. In particular, the comment about the "competition" seems to be wrong. Let me just carry Eric's metaphor further, though, and point out that it makes a hell of a lot more sense to pull out a flashlight and explore into the surrounding territory for your key than it does to spend all your time with your head down, banging into the lamppost. And NFL is such a flashlight.

David Wolpert

[1] The current versions of the NFL for supervised learning papers, nfl.ps.1.Z and nfl.ps.2.Z, at ftp.santafe.edu, in pub/dhw_ftp.

[2] "The Relationship between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in *The Mathematics of Generalization*, D. Wolpert Ed., Addison-Wesley, 1995.

From marco at McCulloch.Ing.UniFI.IT Fri Dec 1 12:21:43 1995
From: marco at McCulloch.Ing.UniFI.IT (Marco Gori)
Date: Fri, 01 Dec 1995 18:21:43 +0100
Subject: Italian Neural Network Society
Message-ID: <9512011721.AA09634@McCulloch.Ing.UniFI.IT>

==============================================================
This is to announce a new web page describing the aims and the activities of the Italian Neural Network Society. The page is hosted at the DSI Web server of the Dipartimento di Sistemi e Informatica, Universita' di Firenze, at the following address:

http://www-dsi.ing.unifi.it/neural/siren

-- marco gori.
===============================================================

From schmidhu at informatik.tu-muenchen.de Sun Dec 3 06:40:25 1995
From: schmidhu at informatik.tu-muenchen.de (Juergen Schmidhuber)
Date: Sun, 3 Dec 1995 12:40:25 +0100
Subject: compressibility and generalization
Message-ID: <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de>

Eric Baum wrote:

>>> (1) While it may be that in classical Lattice gas models, a gas does not have high Kolmogorov complexity, this is not the origin of the predictability exploited by physicists.
Statistical mechanics follows simply from the assumption that the gas is in a random one of the accessible states, i.e. the states with a given amount of energy. So *define* a *theoretical* gas as follows: Every time you observe it, it is in a random accessible state. Then its Kolmogorov complexity is huge (there are many accessible states) but its macroscopic behavior is predictable. (Actually this is an excellent description of a real gas, given quantum mechanics.) <<<

(1) The key expression here is ``the assumption that the gas is in a random one of the *accessible* states''. Since the accessible states are defined to be those with equal energy, this greatly restricts the number of possible states. By definition, it is trivial to make a macro-level prediction like ``the total energy will remain constant''. In turn, there are relatively short descriptions of a given history of such a gas. With a truly random gas, however, there are no invariants eliminating most of the possible states. This makes its history incompressible.

(2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''. Loosely speaking, in a universe whose history is compressible, we may expect to generalize well. A simple, old counting argument shows: most computable universes are incompressible. Therefore, in most computable universes you won't generalize well (this is related to what has been (re)discovered in NFL).

(3) Hence, the best we may hope for is a learning technique with good expected generalization performance in *arbitrary* compressible universes. Actually, another restriction is necessary: the time required for compression and decompression should be ``tolerable''. To formalize the expression ``tolerable'' is the subject of ongoing research.
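The counting argument in (2) is short enough to state in code (a sketch of my own; the program count is the standard bound, not anything specific to the setting above). There are 2^n binary histories of length n, but fewer than 2^(n-c) binary programs shorter than n-c bits, so strictly less than a 2^-c fraction of histories can be compressed by even c bits:

```python
from fractions import Fraction

def fraction_compressible(n, c):
    # histories of length n: 2**n of them;
    # descriptions (programs) shorter than n-c bits: 2**(n-c) - 1 of them
    strings = 2 ** n
    short_programs = 2 ** (n - c) - 1
    return Fraction(short_programs, strings)   # exact, no float rounding

for c in (1, 10, 20):
    frac = fraction_compressible(100, c)
    print(c, float(frac))
    assert frac < Fraction(1, 2 ** c)   # strictly below 2**-c
```

Even a 10-bit saving is available to less than one string in a thousand, and the fraction halves with every further bit demanded.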
Juergen Schmidhuber
IDSIA
juergen at idsia.ch

From hicks at cs.titech.ac.jp Sun Dec 3 00:32:43 1995
From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp)
Date: Sun, 3 Dec 1995 14:32:43 +0900
Subject: Is the universe finite?
Message-ID: <199512030532.OAA02207@euclid.cs.titech.ac.jp>

I would like to make 2 points. One concerns a clarification of David Wolpert's definition of the universe. The second one is a thought problem meant to be an illustration of the inevitability of structure.

Point 1: David Wolpert writes:

(1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>anscestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some meaning. "(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well. On the other hand "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average be the same. What is the universe about which we are talking? Since mathematical models exist in our minds and on paper in this universe, are they included? I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes.

Point 2: Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=(1/2).
Apparently there is no structure in this universe. Now let us consider a universe which is a binary valued N by M matrix random variable AA whose elements are also iid with p(0)=p(1)=(1/2). Let us draw a random instance A from AA. Now we define an M-dimensional integer random variable Y = AX (the ordinary matrix-vector product over the integers), where x and y are instances of X and Y respectively. If A happens to be chosen such that y is merely a subset of the elements of x, then the prior p(y), like the prior p(x), will be uniform. But for most choices of A, p(y) will not be uniform at all. So, out of all the possible universes Y, most of them have structure. This happens even though Y and AA have no structure. The structure that Y will have is drawn from a uniform distribution (over AA), but we are only concerned with whether there will be structure or not.

Of course, this proves nothing. And now I am going to make a giant leap of analogy. The following statements are not contradictory.

(a) In a universe drawn at random from the ensemble of all possible universes, we cannot expect to see any particular structure to be more likely than any other structure.

(b) In any given universe, we can expect structure to be present.

Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable?

Craig Hicks

Craig Hicks hicks at cs.titech.ac.jp       | Hisakata no, hikari nodokeki
Ogawa Laboratory, Dept. of Computer Science | Haru no hi ni, Shizu kokoro naku
Tokyo Institute of Technology, Tokyo, Japan | Hana no chiruran
lab:03-5734-2187 home:03-3785-1974          | Spring smiles with sun beams
fax (from abroad):                          | sifting down through cloudy dreams
+81(3)5734-2905 OGAWA LAB                   | towards the anxious hearts
03-5734-2905 OGAWA LAB (from Japan)         | beating pitter pat
[ Poem from Hyaku-nin i-syuu ->             | while flower petals scatter.

From arbib at pollux.usc.edu Sun Dec 3 14:28:26 1995
From: arbib at pollux.usc.edu (Michael A.
Arbib)
Date: Sun, 3 Dec 1995 11:28:26 -0800 (PST)
Subject: VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES
Message-ID: <199512031928.LAA10890@pollux.usc.edu>

PRELIMINARY CALL FOR PAPERS

Workshop on VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES

Sedona, Arizona, November 22-24, 1996

Co-Directors: Kiisa Nishikawa (Northern Arizona University, Flagstaff) and Michael Arbib (University of Southern California, Los Angeles).

Program Committee: Kiisa Nishikawa (Chair), Michael Arbib, Emilio Bizzi, Chris Comer, Peter Ewert, Simon Giszter, Mel Goodale, Ananda Weerasuriya, Walt Wilczynski, and Phil Zeigler.

Local Arrangements Chair: Kiisa Nishikawa.

This workshop is the sequel to four earlier workshops on the general theme of "Visuomotor Coordination in Frog and Toad: Models and Experiments". The first two were organized by Rolando Lara and Michael Arbib at the University of Massachusetts, Amherst (1981) and Mexico City (1982). The next two were organized by Peter Ewert and Arbib in Kassel and Los Angeles, respectively, with the Proceedings published as follows:

Ewert, J.-P. and Arbib, M.A., Eds., 1989, Visuomotor Coordination: Amphibians, Comparisons, Models and Robots, New York: Plenum Press.

Arbib, M.A. and J.-P. Ewert, Eds., 1991, Visual Structures and Integrated Functions, Research Notes in Neural Computing 3, Heidelberg, New York: Springer-Verlag.

The time is ripe for a fifth Workshop on this theme, with the more generic title "Visuomotor Coordination: Amphibians, Models, and Comparative Studies". The Workshop will be held in Sedona - a beautiful small resort town set in dramatic red hills in Arizona - straight after the Society for Neuroscience meeting in 1996.
Next year, Neuroscience ends on Thursday, November 21, 1996, in Washington, DC, so people can fly to Phoenix that evening, meet Friday, Saturday, and Sunday, and fly home Monday November 25th (so that US types not going to Neuroscience get the Saturday stopover that they could not get if we met before Neuroscience).

The aim is to study the neural mechanisms of visuomotor coordination in frog and toad both for their intrinsic interest and as a target for developments in computational neuroscience, and also as a basis for comparative and evolutionary studies. The list of subsidiary themes given below is meant to be representative of this comparative dimension, but is not intended to be exhaustive. In each case, the emphasis (but not the exclusive emphasis) will be on papers which contribute to the development of both modeling and experimentation.

Central Theme: Visuomotor Coordination in Frog and Toad

Subsidiary Themes:
Visuomotor Coordination: Comparative and Evolutionary Perspectives
Reaching and Grasping in Frog, Pigeon, and Primate
Cognitive Maps
Auditory Communication (with emphasis on spatial behavior and sensory integration)
Sensory Control of Motor Pattern Generators

Formal registration information will be available in March of 1996. Scientists who wish to present papers are asked to send three copies of extended abstracts no later than March 31st, 1996 to:

Kiisa Nishikawa
Department of Biological Sciences
Northern Arizona University
Flagstaff, AZ 86011-5640

Notification of the Program Committee's decision will be sent out no later than May 31st, 1996. A decision as to whether or not to publish a proceedings is still pending.
From theresa at umiacs.UMD.EDU Mon Dec 4 10:13:47 1995
From: theresa at umiacs.UMD.EDU (Theresa)
Date: Mon, 04 Dec 1995 10:13:47 -0500
Subject: Postdoc Position in Neural Modeling
Message-ID: <199512041513.KAA05125@skippy.umiacs.UMD.EDU>

The University of Maryland Institute for Advanced Computer Studies (UMIACS) invites applications for postdoctoral positions, beginning summer/fall '96, in the following areas: Real-time Video Indexing, Natural Language Processing, and Neural Modeling. Exceptionally strong candidates from other areas will also be considered.

UMIACS, a state-supported research unit, has been the focal point for interdisciplinary and applications-oriented research activities in computing on the College Park campus. The Institute's 40 faculty members conduct research in high performance computing, software engineering, artificial intelligence, systems, combinatorial algorithms, scientific computing, and computer vision.

Qualified applicants should send a 1-page statement of research interests, curriculum vitae, and the names and addresses of 3 references to:

Prof. Joseph Ja'Ja'
UMIACS
A.V. Williams Building
University of Maryland
College Park, MD 20742

by April 1. UMIACS strongly encourages applications from minorities and women. EOE/AA

From howse at eece.unm.edu Mon Dec 4 11:12:34 1995
From: howse at eece.unm.edu (James W. Howse)
Date: Mon, 04 Dec 1995 09:12:34 -0700
Subject: Dissertation Available
Message-ID: <9512041612.AA27407@opus.eece.unm.edu>

The following PhD dissertation is available by FTP:

Gradient and Hamiltonian Dynamics: Some Applications to Neural Network Analysis and System Identification

James W. Howse

Abstract

The work in this dissertation is based on decomposing system dynamics into the sum of dissipative (e.g., convergent) and conservative (e.g., periodic) components. Intuitively, this can be viewed as decomposing the dynamics into a component normal to some surface and components tangent to other surfaces.
First, this decomposition was applied to existing neural network architectures to analyze their dynamic behavior. Second, this formalism was employed to create models which learn to emulate the behavior of actual systems. The premise of this approach is that the process of system identification can be considered in two stages: model selection and parameter estimation. In this dissertation a technique is presented for constructing dynamical systems with desired qualitative properties. Thus, the model selection stage consists of choosing the dissipative and conservative portions appropriately so that a certain behavior is obtainable. By choosing the parametrization of the models properly, a learning algorithm has been devised and proven to always converge to a set of parameters for which the error between the output of the actual system and the model vanishes. So these models and the associated learning algorithm are guaranteed to solve certain types of nonlinear identification problems.

Retrieval:

ftp ftp.eece.unm.edu
login as anonymous
cd howse
get dissertation.ps.Z

This is a PostScript file compressed with compress. The dissertation is 133 pages long and formatted to print single-sided. If there are any retrieval or printing problems please let me know. I would welcome any comments or suggestions regarding the dissertation. No hardcopies are available.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
James Howse - howse at eece.unm.edu
__ __ __ __ _ _ /\ \/\ \/\ \/\ \/\ `\_/ `\ University of New Mexico \ \ \ \ \ \ `\\ \ \ \ Department of EECE, 224D \ \ \ \ \ , ` \ \ `\_/\ \ Albuquerque, NM 87131-1356 \ \ \_\ \ \ \`\ \ \ \_',\ \ Telephone: (505) 277-0805 \ \_____\ \_\ \_\ \_\ \ \_\ FAX: (505) 277-1413 or (505) 277-1439 \/_____/\/_/\/_/\/_/ \/_/
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From zhuh at helios.ASTON.ac.uk Mon Dec 4 15:33:50 1995
From: zhuh at helios.ASTON.ac.uk (zhuh)
Date: Mon, 4 Dec 1995 20:33:50 +0000
Subject: compressibility and generalization
Message-ID: <28443.9512042033@sun.aston.ac.uk>

On the implications of the No Free Lunch Theorem(s) by David Wolpert,

> From: Juergen Schmidhuber
>
> (3) Hence, the best we may hope for is a learning technique with
> good expected generalization performance in *arbitrary* compressible
> universes. Actually, another restriction is necessary: the time
> required for compression and decompression should be ``tolerable''.
> To formalize the expression ``tolerable'' is subject of ongoing
> research.

However, the deeper NFL Theorem states that this is still impossible:

1. The *non-existence* of structure guarantees any algorithm will neither win nor lose, compared with the "random algorithm", in the long run. If this were all that is there, then NFL would be just a tautology.

2. The *mere existence* of structure guarantees a (not uniformly-random) algorithm is as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good.

3. This is by far one of the most important implications of NFL, yet my sample from Connectionists shows that it is safe to make the posterior prediction that if someone criticises NFL as irrelevant, then he has not got this far yet.
In conclusion: "for arbitrary environment there is an optimal algorithm" is drastically different from "there is an optimal algorithm for arbitrary environment", whatever restrictions you make on the word "arbitrary".

--
Huaiyu Zhu, PhD                  email: H.Zhu at aston.ac.uk
Neural Computing Research Group  http://neural-server.aston.ac.uk/People/zhuh
Dept of Computer Science         ftp://cs.aston.ac.uk/neural/zhuh
and Applied Mathematics          tel: +44 121 359 3611 x 5427
Aston University,                fax: +44 121 333 6215
Birmingham B4 7ET, UK

From dhw at santafe.edu Mon Dec 4 19:49:47 1995
From: dhw at santafe.edu (David Wolpert)
Date: Mon, 4 Dec 95 17:49:47 MST
Subject: Non-randomness is no panacea
Message-ID: <9512050049.AA16646@sfi.santafe.edu>

Craig Hicks writes:

>>>>> (1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>anscestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some meaning. >>>>

Not at all. The second statement is concerned with theoretical issues, whereas the first one is concerned with practical issues. The distinction is ubiquitous in science and engineering. Even in the little corner of academia known as supervised learning, most people are content to distinguish the concerns of COLT (theory) from those of what-works-in-practice.

>>> "(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well.
>>>

Were you referring to (2) instead? Neither statement says anything like "we should not concern ourselves with the real universe".

>>> On the other hand "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average be the same. >>>

Again, I would have hoped that nothing I have said could be construed as saying something like that. It may or may not be true, but you said it, not me. :-) I am sorry if you were somehow given the wrong impression.

>>>> I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes. >>>>

This is a time-worn concern. Read up on the past two centuries' worth of battles between Bayesians and non-Bayesians...

>>>> Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=(1/2). Apparently there is no structure in this universe. >>>>

NO!!! Forgive my ... passion, but as I've said many times now, even in a purely random universe, there are many very deep distinctions between the behavior of different learning algorithms (and in this sense there is plenty of "structure"). Like head-to-head minimax distinctions. (Or uniform convergence theory a la Vapnik.) Please read the relevant papers! ftp.santafe.edu, pub/dhw_ftp, nfl.ps.1.Z and nfl.ps.2.Z.

>>>> (b) In any given universe, we can expect structure to be present. Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable? >>>>

Nope. The structure can just as easily negate the usefulness of xvalidation as establish it. And in fact, the version of NFL in which one fixes the target and then averages over generalizers says that the state of the universe is (in a certain precise sense), by itself, irrelevant. Structure or not; that fact alone cannot determine the utility of xvalidation.
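The point that structure can cut either way admits a concrete toy construction (entirely my own; the models and targets are arbitrary choices). Two targets agree on the training inputs, so leave-one-out cross-validation prefers the same model for both; on one target that preference wins off-training-set, and on the other it is exactly wrong:

```python
def f(x):          # a benign target
    return x * x

def g(x):          # a deceptive target: agrees with f on the integer
    return x * x if x == int(x) else 0.0   # training grid, 0 elsewhere

train_x = list(range(10))
test_x = [x + 0.5 for x in range(10)]      # off-training-set points

def nn(data, x):   # 1-nearest-neighbour predictor
    return min(data, key=lambda p: abs(p[0] - x))[1]

def zero(data, x): # a trivial always-0 predictor
    return 0.0

def loo(model, target):   # leave-one-out cross-validation error
    data = [(x, target(x)) for x in train_x]
    return sum((model(data[:i] + data[i + 1:], x) - y) ** 2
               for i, (x, y) in enumerate(data)) / len(data)

def ots(model, target):   # off-training-set error
    data = [(x, target(x)) for x in train_x]
    return sum((model(data, x) - target(x)) ** 2 for x in test_x) / len(test_x)

print(loo(nn, g) < loo(zero, g))   # True: xval prefers 1-NN on g just as on f
print(ots(nn, f) < ots(zero, f))   # True: on f that preference pays off
print(ots(nn, g) > ots(zero, g))   # True: on g the same preference loses badly
```

Both g and f are perfectly "structured" (each has a short description); the training data alone cannot tell you which one you are in.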
***

Although I think it is at best tangential to further discuss Kolmogorov complexity, Juergen Schmidhuber's recent comment deserves a response. He writes:

>>>>> (2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''. Loosely speaking, in a universe whose history is compressible, we may expect to generalize well. >>>>

How could this be true? Nothing has been specified in Juergen's statement about the loss function, how test sets are generated (IID vs. off-training-set vs. who knows what), the generalizer used, how it is related (if at all) to the prior over targets (a prior which, I take it, Juergen wishes to be "compressible"), the noise process, whether there is noise in the inputs as well as the outputs, etc., etc. Yet all of those factors are crucial in determining the efficacy of the generalizer.

Obviously if your generalizer *knows* the "compression scheme of the universe", knows the noise process, etc., then it will generalize well. Is that what you're saying, Juergen? It reduces to saying that if you know the prior, you can perform Bayes-optimally. There is certainly no disputing that statement.

It is worth bearing in mind though that NFL can be cast in terms of averages over priors. In that guise, it says that there are just as many priors - just as many ways of having a universe be "compressible", loosely speaking - for which your favorite algorithm dies as there are for which it shines. In fact, it's not hard to show that an average over only those priors that are more than a certain distance from the uniform prior results in NFL - under such an average, for OTS error, etc., all algorithms have the same expected performance. The simple fact of having a non-uniform prior does not mean that better-than-random generalization arises.

***

Structure, compressibility, whatever you want to call it; it can hurt just as readily as it can help.
The simple claim that there is non-randomness in the universe does not establish that any particular algorithm performs better than random guessing. To all those who dispute this, I ask that they present a theorem relating generalization error to "compressibility". (To do this, of course, they will have to specify the loss function, noise, etc.) Not words, but math, and not just math concerning Kolmogorov complexity considered in isolation. Math presenting a formal relationship between generalization error and "compressibility". (A relationship that doesn't reduce to the statement that if you have information concerning the prior, you can exploit it to generalize well - no rediscovery of the wheel, please.) David Wolpert From hicks at cs.titech.ac.jp Mon Dec 4 20:40:08 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 5 Dec 1995 10:40:08 +0900 Subject: compressibility and generalization In-Reply-To: Juergen Schmidhuber's message of Sun, 3 Dec 1995 12:40:25 +0100 <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de> Message-ID: <199512050140.KAA05180@euclid.cs.titech.ac.jp> On Sun, 3 Dec 1995 12:40:25, Juergen Schmidhuber wrote: >(2) Back to: what does this have to do with machine learning? As a >first step, we may simply apply Solomonoff's theory of inductive >inference to a dynamic system or ``universe''. Loosely speaking, >in a universe whose history is compressible, we may expect to >generalize well. A simple, old counting argument shows: most >computable universes are incompressible. Therefore, in most >computable universes you won't generalize well (this is related >to what has been (re)discovered in NFL). In an earlier communication I hypothesized that a typical universe would have structure that could be exploited by cross-validation.
This communication from Juergen Schmidhuber contradicts my hypothesis, I think, because of the existence of the "simple, old counting argument" showing that "most computable universes are incompressible". I stand corrected. The point I really wanted clarified was what was meant by the assertion that in a typical universe (A) cross-validation works as well as anti-cross-validation. I will just talk about the problem of (deterministic or stochastic) function estimation. I can accept that for any set of model functions, there will be an infinity of problems where cross-validation will be of no assistance, because that model does not have the capacity to predict future input/output relations from any finite set of examples from the past. This could be either because the true function is pure noise, or because it looks like pure noise from the perspective of any function from the set of candidate model functions. In this case there will be no correlation between predictions and samples, and cross-validation will do its job of telling us that the generalization error is not decreasing. However, I interpret the assertion that anti-cross-validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst-case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should.
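[Editorial note: assertion (A) can be made concrete in a deliberately tiny setting. The sketch below is a construction of mine, not from the thread; the two constant hypotheses and all names are illustrative. It selects between two models by cross-validation or by anti-cross-validation, then averages off-training-set error uniformly over every target on four points. The two averages coincide, which is all that (A) asserts; it says nothing about any single randomly sampled function.]

```python
from itertools import product

train_x = [0, 1, 2]
test_x = 3
models = {0: lambda x: 0, 1: lambda x: 1}   # two constant hypotheses

def cv_score(m, labels):
    # for a constant model, the leave-one-out score equals plain training error
    return sum(int(models[m](x) != y) for x, y in zip(train_x, labels))

def select(labels, anti=False):
    best = min(models, key=lambda m: cv_score(m, labels))  # cross-validation pick
    return (1 - best) if anti else best                    # anti-CV picks the other

def avg_ots_error(anti):
    # uniform average over all 2^4 targets f: {0,1,2,3} -> {0,1}
    errors = []
    for f in product([0, 1], repeat=4):
        m = select([f[x] for x in train_x], anti)
        errors.append(int(models[m](test_x) != f[test_x]))
    return sum(errors) / len(errors)

print(avg_ots_error(anti=False), avg_ots_error(anti=True))  # 0.5 0.5
```

The equality holds because, averaged uniformly over targets, the test label is independent of the training labels that drove the selection.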
I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). Does assertion (A) mean that this principle fails in alternative universes? Respectfully Yours, Craig Hicks Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan From juergen at idsia.ch Tue Dec 5 12:50:01 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Tue, 5 Dec 95 18:50:01 +0100 Subject: Compressibility and Generalization Message-ID: <9512051750.AA00953@fava.idsia.ch> Shahab Mohaghegh requested a definition of ``compressibility of the history of a universe''. Let S(t) denote the state of a computable universe at discrete time step t. Let's suppose S(t) can be described by n bits. The history of the universe between time step 1 (big bang) and time step t is compressible if it can be computed by an algorithm whose size is clearly less than tn bits. Given a particular computing device, most histories are incompressible: there are 2^tn possible histories, but there are fewer than (1/2)^c * 2^tn = 2^(tn-c) algorithms with fewer than tn-c bits (c is a small positive constant). In most possible universes, the mutual algorithmic information between past and future is zero, and previous experience won't help to generalize well in the future. There are a few compressible or ``regular'' universes, however. To use ML terminology, some of them allow for ``generalization by analogy''. Some of them allow for ``generalization by chunking''. Some of them allow for ``generalization by exploiting invariants''. Etc. It would be nice to have a method that can generalize well in *arbitrary* regular universes. Juergen Schmidhuber IDSIA
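[Editorial note: the counting argument in the post above is easy to verify numerically. In this sketch of mine (the function names are invented), a history of tn = n bits counts as compressible by c bits if some program of fewer than n - c bits computes it; since each program computes at most one history, the compressible fraction is strictly below 2^-c.]

```python
def num_programs_shorter_than(m):
    # binary programs with fewer than m bits: lengths 0 .. m-1
    return sum(2 ** k for k in range(m))  # = 2^m - 1

def compressible_fraction_bound(n, c):
    # each program computes at most one history, so at most
    # num_programs_shorter_than(n - c) of the 2^n histories of
    # length n can be compressed by c or more bits
    return num_programs_shorter_than(n - c) / 2 ** n

for c in (1, 5, 10):
    print(c, compressible_fraction_bound(20, c))  # each value is below 2 ** -c
```

Already at c = 10, fewer than one history in a thousand is compressible by ten bits, which is the sense in which "most computable universes are incompressible".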
From gluck at pavlov.rutgers.edu Tue Dec 5 16:52:15 1995 From: gluck at pavlov.rutgers.edu (Mark Gluck) Date: Tue, 5 Dec 1995 16:52:15 -0500 Subject: Faculty Openings at Rutgers-Newark for Connectionist Modelers Interested in Cog Sci/Cog Neuro Message-ID: <199512052152.QAA16557@pavlov.rutgers.edu> The following junior faculty openings at Rutgers-Newark may be of interest to connectionist modelers working in the area of Cognitive Psychology and Cognitive Neuroscience. Although a purely theoretical researcher would be considered, someone who combines both theoretical/computational modeling and experimental research would be preferred: - Mark Gluck CENTER FOR MOLECULAR AND BEHAVIORAL NEUROSCIENCE COGNITIVE NEUROSCIENCE One faculty position in human cognitive neuroscience is available at the assistant to full professor level. Scientists with a research focus on the neurobiological basis of higher cortical function in humans, who would be stimulated by the integrative focus and collaborative research environment of the Center for Molecular and Behavioral Neuroscience, are encouraged to apply. Research areas include (but are not limited to) human experimental neuropsychology, neuropsychiatry, brain imaging and neuroplasticity, cognitive neuroscience, neurolinguistics, development, human electrophysiology, computational neuroscience, neural basis of speech, attention, memory, perception, emotion, psychophysics and behavioral genetics. State-of-the-art laboratories and equipment for human research, and a doctoral program in Behavioral and Neural Science, are available in the Center. Additional information on our program, research facilities, and faculty can be obtained over the internet at: http://www.cmbn.rutgers.edu/bns-home.html. Neuroscientists interested in brain/behavior relationships in normal and/or clinical populations should send CV, names of three references and a brief letter of research goals and philosophy to: Dr.
Paula Tallal, Center for Molecular and Behavioral Neuroscience, Rutgers University, 197 University Avenue, Newark, New Jersey, 07102. Phone: (201) 648-1080 x3200. Fax: (201) 648-1272. Email: tallal at axon.rutgers.edu. COGNITIVE PSYCHOLOGY, ASSISTANT PROFESSOR (TWO POSITIONS) The Department of Psychology at the Newark Campus of Rutgers University invites Ph.D. applications for one tenure track and one term (non-tenure track) Assistant Professor position to expand its program in Cognitive Experimental Psychology. One position is in the area of Attention and the second is in Social Cognition or Cognitive Development. The positions call for candidates with active research programs who are effective teachers at both the graduate and undergraduate levels. Candidates must be prepared to teach a variety of undergraduate courses. Send CV and three letters of recommendation to Professor Harold I. Siegel, Acting Chair, Department of Psychology-Cognitive Search, Rutgers University, Newark, NJ 07102. ----- End Included Message ----- From juergen at idsia.ch Wed Dec 6 04:39:11 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Wed, 6 Dec 95 10:39:11 +0100 Subject: Non-randomness is no panacea. Message-ID: <9512060939.AA02202@fava.idsia.ch> In response to David's response dated Mon, 4 Dec 95: I wrote ``Loosely speaking, in a universe whose history is compressible, we may expect to generalize well.''. To make this more precise, let us consider a very simple 1-bit universe --- suppose the problem is to extrapolate a sequence of symbols (bits, without loss of generality). We have already observed a bitstring s and would like to predict the next bit. Let si denote the event ``s is followed by symbol i'' for i in {0,1}. David is absolutely right to remind us that we need a prior before applying Bayes. And he is right to point out that only if we have information concerning the prior can we exploit it to generalize well.
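[Editorial note: in this 1-bit universe, the value of knowing the prior is easy to see with a toy two-hypothesis prior. The construction below is entirely my own illustration, not Juergen's: either the universe emits all zeros, or it emits fair coin flips. After observing k zeros, Bayes' rule gives the predictive probability that the next bit is zero.]

```python
def p_next_zero(k, p_regular=0.5):
    # prior: with probability p_regular the universe is the all-zeros
    # sequence; otherwise every bit is an independent fair coin flip.
    p_data_given_regular = 1.0       # all-zeros universe always emits 0^k
    p_data_given_random = 0.5 ** k   # coin-flip universe emits 0^k w.p. 2^-k
    evidence = (p_regular * p_data_given_regular
                + (1 - p_regular) * p_data_given_random)
    posterior_regular = p_regular * p_data_given_regular / evidence
    # predictive probability that the next bit is zero, mixing the
    # two hypotheses by their posterior weights
    return posterior_regular * 1.0 + (1 - posterior_regular) * 0.5

for k in (1, 5, 20):
    print(k, p_next_zero(k))
```

The predictive probability climbs toward 1 as the observed history grows more "regular" relative to the coin-flip alternative; with a prior that put no extra weight on the regular universe, no such climb would occur.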
In the context of the present discussion, however, an interesting point is: there is a special prior that is biased towards *arbitrary* compressibility/structure/regularity. Following Solomonoff/Levin/Chaitin/Li&Vitanyi, define P(s), the a priori probability of a bitstring s, as the probability of guessing a (halting) program that computes s on a universal Turing machine U. Here, the way of guessing is defined by the following procedure: initially, the input tape consists of a single square. Whenever the scanning head of the input tape shifts to the right, do: (1) Append a new square. (2) With probability 1/2 fill it with a 0; with probability 1/2 fill it with a 1. Bayes tells us P(s0|s) = P(s|s0)P(s0)/P(s) = P(s0)/P(s), since P(s|s0) = 1 (s0 extends s); likewise P(s1|s) = P(s1)/P(s). We are going to predict ``the next bit will be 0'' if P(s0) > P(s1), and vice versa. By the coding theorem (Levin 74, Chaitin 75), P(si) = O((1/2)^K(si)) for i in {0,1} (K(x) denotes the Kolmogorov complexity of x), so the continuation with lower Kolmogorov complexity will (in general) be more likely. If s is ``noisy'' then this will be reflected by its relatively high Kolmogorov complexity. I am not saying anything new here. I'd just like to point out that if you know nothing about your universe except that it is regular in some way, then P is of interest. Sadly, most possible universes are completely irregular and incompressible. But for the few (but infinitely many) that are not, P is a prior to consider (at least if we don't care about computing time and constant factors). Perhaps there are too many threads in the current discussion. I'll shut up for a while. Juergen Schmidhuber IDSIA From goldfarb at unb.ca Wed Dec 6 15:54:00 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Wed, 6 Dec 1995 16:54:00 -0400 (AST) Subject: Compressibility and Generalization In-Reply-To: <9512051750.AA00953@fava.idsia.ch> Message-ID: On Tue, 5 Dec 1995, Juergen Schmidhuber wrote: > ``compressibility of the history of a universe''.
> > There are a few compressible or ``regular'' universes, > however. To use ML terminology, some of them allow for > ``generalization by analogy''. Some of them allow for > ``generalization by chunking''. Some of them allow for > ``generalization by exploiting invariants''. Etc. It > would be nice to have a method that can generalize well > in *arbitrary* regular universes. For a proposal on how to formally capture the concept of an "arbitrary regular universe" for the purposes of inductive learning (and generalization), i.e. the concept of a "combinative" representation in a universe, see the two references below as well as the original two papers published in Pattern Recognition (and mentioned in each of the two references). The structure of objects in the universe was discussed on the INDUCTIVE list. It appears that the concept of a "symbolic" representation has to be formalized first (via the concept of a transformation system), and the fundamentally new concept of *inductive class structure*, not present in other ML models, becomes of critical importance. The issue of dynamic object representation, so conspicuously (and not surprisingly) absent from the ongoing (classical) "statistical" discussion of inductive learning, is also brought to the fore. 1. L. Goldfarb and S. Nigam, The unified learning paradigm: A foundation for AI, in V. Honavar and L. Uhr, eds., Artificial Intelligence and Neural Networks: Steps toward Principled Integration, Academic Press, 1994. 2. L. Goldfarb, J. Abela, V.C. Bhavsar, V.N. Kamat, Can a vector space based learning model discover inductive class generalization in a symbolic environment? Pattern Recognition Letters 16, 719-726, 1995.
-- Lev Goldfarb From N.Sharkey at dcs.shef.ac.uk Thu Dec 7 07:24:09 1995 From: N.Sharkey at dcs.shef.ac.uk (N.Sharkey@dcs.shef.ac.uk) Date: Thu, 7 Dec 95 12:24:09 GMT Subject: CALL FOR ROBOTICS PAPERS Message-ID: <9512071224.AA11298@entropy.dcs.shef.ac.uk> CALL FOR PAPERS ** LEARNING IN ROBOTS AND ANIMALS ** An AISB-96 two-day workshop University of Sussex, Brighton, UK: April, 1st & 2nd, 1996 Co-Sponsored by IEE Professional Group C4 (Artificial Intelligence) WORKSHOP ORGANISERS: Noel Sharkey (chair), University of Sheffield, UK. Gillian Hayes, University of Edinburgh, UK. Jan Heemskerk, University of Sheffield, UK. Tony Prescott, University of Sheffield, UK. PROGRAMME COMMITTEE: Dave Cliff, UK. Marco Dorigo, Italy. Frans Groen, Netherlands. John Hallam, UK. John Mayhew, UK. Martin Nillson, Sweden Claude Touzet, France Barbara Webb, UK. Uwe Zimmer, Germany. Maja Mataric, USA. For Registration Information: alisonw at cogs.susx.ac.uk In the last five years there has been an explosion of research on Neural Networks and Robotics from both a self-learning and an evolutionary perspective. Within this movement there is also a growing interest in natural adaptive systems as a source of ideas for the design of robots, while robots are beginning to be seen as an effective means of evaluating theories of animal learning and behaviour. A fascinating interchange of ideas has begun between a number of hitherto disparate areas of research and a shared science of adaptive autonomous agents is emerging. This two-day workshop proposes to bring together an international group to both present papers of their most recent research, and to discuss the direction of this emerging field. WORKSHOP FORMAT: The workshop will consist of half-hour presentations with at least 15 minutes being allowed for discussion at the end of each presentation. Short videos of mobile robot systems may be included in presentations. Proposals for robot demonstrations are also welcome. 
Please contact the workshop organisers if you are considering bringing a robot, as some local assistance can be arranged. The workshop format may change once the number of accepted papers is known; in particular, there may be some poster presentations. WORKSHOP CONTRIBUTIONS: Contributions are sought from researchers in any field with an interest in the issues outlined above. Areas of particular interest include the following: * Reinforcement, supervised, and imitation learning methods for autonomous robots * Evolutionary methods for robotics * The development of modular architectures and reusable representations * Computational models of animal learning with relevance to robots, robot control systems modelled on animal behaviour * Reviews or position papers on learning in autonomous agents Papers will ideally emphasise real-world problems, robot implementations, or show clear relevance to the understanding of learning in both natural and artificial systems. Papers should not exceed 5000 words in length. Please submit four hard copies to the Workshop Chair (address below) by 30th January, 1996. All papers will be refereed by the Workshop Committee and other specialists. Authors of accepted papers will be notified by 24th February, 1996. Final versions of accepted papers must be submitted by 10th March, 1996. A collated set of workshop papers will be distributed to workshop attendees. We are currently negotiating to publish the workshop proceedings as a book. SUBMISSIONS TO: Noel Sharkey Department of Computer Science Regent Court University of Sheffield S1 4DP, Sheffield, UK email: n.sharkey at dcs.sheffield.ac.uk For further information about AISB96 ftp ftp.cogs.susx.ac.uk login as Password: cd pub/aisb/aisb96 From mkearns at research.att.com Thu Dec 7 13:39:00 1995 From: mkearns at research.att.com (Michael J.
Kearns) Date: Thu, 7 Dec 95 13:39 EST Subject: COLT 96 Call for Papers, ASCII Message-ID: ______________________________________________________________________ CALL FOR PAPERS---COLT '96 Ninth Conference on Computational Learning Theory Desenzano del Garda, Italy June 28 -- July 1, 1996 ______________________________________________________________________ The Ninth Conference on Computational Learning Theory (COLT '96) will be held in the town of Desenzano del Garda, Italy, from Friday, June 28, through Monday, July 1, 1996. COLT '96 is sponsored by the Universita` degli Studi di Milano. We invite papers in all areas that relate directly to the analysis of learning algorithms and the theory of machine learning, including neural networks, statistics, statistical physics, Bayesian/MDL estimation, reinforcement learning, inductive inference, knowledge discovery in databases, robotics, and pattern recognition. We also encourage the submission of papers describing experimental results that are supported by theoretical analysis. ABSTRACT SUBMISSION. Authors should submit fifteen copies (preferably two-sided) of an extended abstract to: Michael Kearns --- COLT '96 AT&T Bell Laboratories, Room 2A-423 600 Mountain Avenue Murray Hill, New Jersey 07974-0636 Telephone(for overnight mail): (908) 582-4017 Abstracts must be RECEIVED by FRIDAY JANUARY 12, 1996. This deadline is firm. We are also allowing electronic submissions as an alternative to submitting hardcopy. Instructions for how to submit papers electronically can be obtained by sending email to colt96 at cs.cmu.edu with subject "help", or from our web site: http://www.cs.cmu.edu/~avrim/colt96.html which will also be used to provide other program-related information. Authors will be notified of acceptance or rejection on or before Friday, March 15, 1996. Final camera-ready papers will be due by Friday, April 5. 
Papers that have appeared in journals or other conferences, or that are being submitted to other conferences, are not appropriate for submission to COLT. An exception to this policy is that COLT and STOC have agreed that a paper can be submitted to both conferences, with the understanding that a paper will be automatically withdrawn from COLT if accepted to STOC. ABSTRACT FORMAT. The extended abstract should include a clear definition of the theoretical model used and a clear description of the results, as well as a discussion of their significance, including comparison to other work. Proofs or proof sketches should be included. If the abstract exceeds 10 pages, only the first 10 pages may be examined. A cover letter specifying the contact author and his or her email address should accompany the abstract. PROGRAM FORMAT. At the discretion of the program committee, the program may consist of both long and short talks, corresponding to longer and shorter papers in the proceedings. The short talks will also be coupled with a poster presentation. PROGRAM CHAIRS. Avrim Blum (Carnegie Mellon University) and Michael Kearns (AT&T Bell Laboratories). CONFERENCE AND LOCAL ARRANGEMENTS CHAIRS. Nicolo` Cesa-Bianchi (Universita` di Milano) and Giancarlo Mauri (Universita` di Milano). PROGRAM COMMITTEE. Martin Anthony (London School of Economics), Avrim Blum (Carnegie Mellon University), Bill Gasarch (University of Maryland), Lisa Hellerstein (Northwestern University), Robert Holte (University of Ottawa), Sanjay Jain (National University of Singapore), Michael Kearns (AT&T Bell Laboratories), Nick Littlestone (NEC Research Institute), Yishay Mansour (Tel Aviv University), Steve Omohundro (NEC Research Institute), Manfred Opper (University of Wuerzburg), Lenny Pitt (University of Illinois), Dana Ron (Massachusetts Institute of Technology), Rich Sutton (University of Massachusetts) COLT, ML, AND EUROCOLT. 
The Thirteenth International Conference on Machine Learning (ML '96) will be held right after COLT '96, on July 3--7 in Bari, Italy. In cooperation with COLT, the EuroCOLT conference will not be held in 1996. STUDENT TRAVEL. We anticipate some funds will be available to partially support travel by student authors. Details will be distributed as they become available. From hicks at cs.titech.ac.jp Thu Dec 7 19:49:53 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Fri, 8 Dec 1995 09:49:53 +0900 Subject: compressibility and generalization In-Reply-To: William Finnoff's message of Thu, 7 Dec 95 15:55:52 MST <9512072255.AA25329@predict.com> Message-ID: <199512080049.JAA10560@euclid.cs.titech.ac.jp> finnoff at predict.com (William Finnoff) wrote: >Reading some of the recent postings concerning NFL theorems, it appears >that there are still some misunderstandings about what they refer to in >the versions dealing with statistical inference. For example, Craig >Hicks writes: >> (paraphrase: I want to clarify the meaning of the following assertion) >> (A) cross-validation works as well as anti-cross >> validation (paraphrase: on average) finnoff at predict.com (William Finnoff) continued: >An example of this >would be the case of a two-by-two contingency table >where the inputs are, say, 0=patient received treatment A, >1=patient received treatment B, and values of the dependent variable >are 0=patient died within three months, or 1=patient still alive >after three months. ... Using the example given above, this corresponds >to cases where the training data contains no examples >of a patient receiving one of the treatments (for example, where >the training data only contains examples of patients >that have received treatment A). Since there is no data for treatment B, how can we use cross-validation? In this case statement (A) above is not wrong, but it is implicitly occurring within a context where there is no data to use for cross-validation.
If so, isn't it rather a trivial statement? Possibly misleading? finnoff at predict.com (William Finnoff) continued: >The NFL theorems state that in this case, unless there is some other prior >information available about the performance of treatment B in keeping patients >alive, all predictions are equivalent in their average expected performance. I certainly wouldn't expect cross-validation to work when it can't even be used. And I think it would work just as well as anti-cross-validation, whatever that is, where anti-cross-validation is also not being used. In fact, both would score `0', not only on average, but every time, since they are not being used. ---- After further study and reading postings to this list, my current understanding is that (A) merely means that for any single problem, cross-validation scores >= 0, in the sense that it will never be deceptive (never < 0) when taking the average across the ensemble of samplings. However, by taking a straight average over a certain infinite (and arguably universal) ensemble of problems we can obtain Expectation[cross validation] = 0, because in this ensemble the positive-scoring problems are an infinitely small proportion. This is exciting, because in our universe at the present time evidently Expectation[cross validation] > 0, which implies a non-uniform prior over the ensemble of problems. Or are we just choosing our problems unfairly? And if so, what algorithm are we using (or is using us) to choose them? Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan P.S. I do not claim to be clear on all the issues, or to be free from misunderstandings by any means. P.P.S. What is anti-cross-validation? From WALTSCH at vms.cis.pitt.edu Thu Dec 7 22:27:49 1995 From: WALTSCH at vms.cis.pitt.edu (WALTSCH@vms.cis.pitt.edu) Date: Thu, 07 Dec 1995 23:27:49 -0400 (EDT) Subject: Faculty position in Cognitive Neuroscience Univ.
of Pittsburgh Message-ID: <01HYJKVPQW36AM35MW@vms.cis.pitt.edu> ********Faculty Opening in Cognitive Neuroscience************* The Department of Psychology at the University of Pittsburgh seeks a faculty member at the assistant professor level who studies human cognitive neuroscience. The faculty member must have a strong empirical background, a program of research that brings together neuroscience and behavioral techniques, and an interest in graduate and undergraduate teaching in this area. Candidates are likely to become affiliated with the Center for the Neural Basis of Cognition, shared between the University of Pittsburgh and Carnegie Mellon University. For additional information, see WWW http://neurocog.lrdc.pitt.edu/search Applications should be sent to: Cognitive Neuroscience Search 455 Langley Hall Psychology Department University of Pittsburgh PGH PA 15260. Applications should include: 1. a statement of research and teaching interest 2. a CV 3. copies of selected publications 4. three letters of reference. Initial consideration will begin January 15, 1996, though applications arriving after that date may be considered. The University of Pittsburgh is an Equal Opportunity/Affirmative Action Employer. Women and minority candidates are especially encouraged to apply. From esann at dice.ucl.ac.be Fri Dec 8 12:39:48 1995 From: esann at dice.ucl.ac.be (esann@dice.ucl.ac.be) Date: Fri, 8 Dec 1995 18:39:48 +0100 Subject: ESANN extended deadline Message-ID: <199512081737.SAA18067@ns1.dice.ucl.ac.be> Dear Colleagues, The deadline to submit papers to the ESANN'96 conference (the 4th European Symposium on Artificial Neural Networks, which will be held in Bruges, Belgium, on April 24-26, 1996) was December 8th, 1995 (today!) as announced in the call for papers.
However, as you know, there are important strikes in France and in other countries, and many of you have had problems meeting this deadline because of the post office strike (it is even worse because of the airport strike in Belgium...). So we are pleased to announce that we will accept submission of papers until Friday, December 15th, 1995 (so next Friday!). Please however ensure that the printed copies (no e-mail or fax please) will reach the conference secretariat (see address below), together with the required information (as described in the call for papers), before this date. Please use private mail delivery services if necessary, and don't forget that in most countries Chronopost is NOT a private mail service (for example, because of the strike, the French Chronopost service was not working this week...), while DHL, TNT Mailfast and other companies are private services, and so could be more efficient in the next few days... If you still have problems meeting the new deadline, please contact me personally at the following e-mail address: esann at dice.ucl.ac.be and we will try to arrange another way to transfer your paper. Please feel free to contact me if you need any other information about the submission of papers.
Sincerely yours, Michel Verleysen _____________________________ D facto publications - conference services 45 rue Masui 1210 Brussels Belgium tel: +32 2 245 43 63 fax: +32 2 245 46 94 _____________________________ From giles at research.nj.nec.com Fri Dec 8 14:18:39 1995 From: giles at research.nj.nec.com (Lee Giles) Date: Fri, 8 Dec 95 14:18:39 EST Subject: reprint available Message-ID: <9512081918.AA20599@alta> The following conference paper, published in the 2nd International IEEE Conference on "Massively Parallel Processing Using Optical Interconnections," October 1995, is now available via the NEC Research Institute archive: ____________________________________________________________________________________ "Predictive Control of Opto-Electronic Reconfigurable Interconnection Networks Using Neural Networks" Majd F. Sakr[1,2], Steven P. Levitan[2], C. Lee Giles[1,3], Bill G. Horne[1], Marco Maggini[4], Donald M. Chiarulli[5] [1] NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 [2] Electrical Engineering Department, U. of Pittsburgh, Pittsburgh, PA 15261 [3] UMIACS, U. of Maryland, College Park, MD 20742 [4] Universita` di Firenze, Dipartimento di Sistemi e Informatica, 50139 Firenze, Italy [5] Computer Science Department, U. of Pittsburgh, Pittsburgh, PA 15260 Abstract Opto-electronic reconfigurable interconnection networks are limited by significant control latency when used in large multiprocessor systems. This latency is the time required to analyze the current traffic and reconfigure the network to establish the required paths. The goal of latency hiding is to minimize the effect of this control overhead. In this paper, we introduce a technique that performs latency hiding by learning the patterns of communication traffic and using that information to anticipate the need for communication paths. Hence, the network provides the required communication paths before a request for a path is made.
In this study, the communication patterns (memory accesses) of a parallel program are used as input to a time delay neural network (TDNN) to perform on-line training and prediction. These predicted communication patterns are used by the interconnection network controller that provides routes for the memory requests. Based on our experiments, the neural network was able to learn highly repetitive communication patterns, and was thus able to predict the allocation of communication paths, resulting in a reduction of communication latency. ------------------------------------------------------------------------------ http://www.neci.nj.nec.com/homepages/giles.html ftp://external.nj.nec.com/pub/giles/papers/MPPOI.95.ps.Z ------------------------------------------------------------------------------ -- C. Lee Giles / Computer Sciences / NEC Research Institute / 4 Independence Way / Princeton, NJ 08540, USA / 609-951-2642 / Fax 2482 http://www.neci.nj.nec.com/homepages/giles.html == From mablume at sdcc10.ucsd.edu Fri Dec 8 17:03:18 1995 From: mablume at sdcc10.ucsd.edu (Matthias Blume) Date: Fri, 8 Dec 1995 14:03:18 -0800 (PST) Subject: Fuzzy ART architecture papers online Message-ID: <199512082203.OAA06153@e3329-4.ucsd.edu> Dear Connectionists, Two papers describing a simple and efficient architecture for Fuzzy ART and Fuzzy ARTMAP are now available online. (Sorry, hardcopies are not available.) ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, An efficient mapping of Fuzzy ART onto a neural architecture (5 pages), submitted to Neural Networks. A novel mapping of the Fuzzy ART algorithm onto a neural network architecture is described. The architecture does not utilize bi-directional synapses, weight transport, or weight duplication, and requires one fewer layer of processing elements than the architecture originally proposed by Carpenter, Grossberg, & Rosen (1991). 
In the new architecture, execution of the algorithm takes constant time per input vector regardless of the relationship between the input and existing templates, and several control signals are eliminated. This mapping facilitates hardware implementation of Fuzzy ART and furthermore serves as a tool for envisioning and understanding the algorithm. Keywords: Fuzzy ART, Fuzzy ARTMAP, parallel hardware, neural architecture. ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.fam_arch.ps.Z http://icse1.ucsd.edu/~mablume/nnletter.ps ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, Optoelectronic Fuzzy ARTMAP processor, Optical Computing, Vol. 10, 1995 OSA Technical Digest Series (Optical Society of America, Washington, DC, 1995), p. 213-215, March 1995. The Fuzzy ARTMAP algorithm can perform well even with weights truncated to 4 bits during training. Furthermore, only the weights corresponding to one processing element are updated after each training sample. Finally, it converges rapidly and relatively uniformly with little dependence on the particular choice of adjustable parameter values and initial state. These characteristics are particularly advantageous for parallel optoelectronic implementations. We map Fuzzy ARTMAP onto an architecture which satisfies the constraints of the hardware, and suggest an implementation which is an appropriate combination of optical and electronic technology. The proposed mapping of the algorithm onto a neural architecture is efficient, requiring only an input layer and one processing layer per fuzzy ART module, and requiring neither weight transport nor multiple copies of weights. The proposed optoelectronic system is simple, yet versatile, and relies on proven components. Keywords: Parallel optoelectronic hardware, Fuzzy ART, neural architecture. 
ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.oe_fam.ps.Z http://icse1.ucsd.edu/~mablume/OSA95.ps ------------------------------------------------------------------------------ - Matthias Blume ECE department, UCSD matthias at ucsd.edu http://icse1.ucsd.edu/~mablume From mpp at watson.ibm.com Fri Dec 8 19:27:29 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Fri, 8 Dec 1995 19:27:29 -0500 (EST) Subject: NFL Summary Message-ID: <9512090027.AA26165@austen.watson.ibm.com> Hi Everyone, There has been a lot of confusion regarding the "No Free Lunch" theorems. Below, I try to summarize what I feel to be the key points. NFL in a Nutshell: ------------------ If you make no assumptions about the target function then on average, all learning algorithms will have the same generalization performance. Apparent Contradiction and Resolution: -------------------------------------- Contradiction: Lots of theoretical results regarding generalization claim to make no assumptions about the target function. Resolution: These theoretical results DO make assumptions (which may or may not be explicit) regarding the target. Importance of NFL: ------------------ The NFL result in and of itself is not terribly interesting because its assumption (that we make no assumptions) is NEVER true. What makes NFL important is that it emphasizes in a very striking way that it is the ASSUMPTIONS that we make about our learning domains that MAKE ALL THE DIFFERENCE. Therefore, I see NFL *NOT* as a criticism of theoretical generalization results, but rather as a call to examine the assumptions underlying these results, because it is there that we can potentially learn the most about machine learning. Examples of Unstated Assumptions: --------------------------------- In practice, there are numerous assumptions that we as a community usually make when we attempt to learn a task using our favorite algorithm. Below, I list just a few obvious ones. 1) The training and testing data are IID.
2) The data distribution is "smooth" (i.e. "near" data points are in general more similar than "far" data points). This can also be interpreted as some differentiability conditions. 3) NN's approximate real-world functions reasonably well. 4) Starting with small initial weights is good. 5) Overfitting is bad - early stopping is good. 6) Gaussian error models are the best thing since machine sliced bread. REALLY INTERESTING STUFF: ------------------------- I think that the NFL results point towards what I feel are extremely interesting research topics: Exactly what are the assumptions that certain theoretical results require? Exactly how do these assumptions affect generalization? Which assumptions are necessary/sufficient? How do different assumptions compare? Can we identify a set of assumptions that are equivalent to the assumption that CV model selection improves generalization? Can we do the same for early stopping? Bagging? (You can be damn sure I can do this for averaging... :-) Etc, etc, ... Caveat: ------- All of the above is conditioned on the assumption that David Wolpert did his math correctly when deriving the NFL theorems... :-) I hope all of this helps clear things up. Comments? Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From jlm at crab.psy.cmu.edu Sat Dec 9 17:35:01 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Sat, 9 Dec 95 17:35:01 EST Subject: TR Announcement Message-ID: <9512092235.AA21814@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available both electronically from our own FTP server and in hard copy form. Instructions for obtaining copies may be found at the end of this post.
======================================================================== Stochastic Interactive Processing, Channel Separability, and Optimal Perceptual Inference: An Examination of Morton's Law Javier R. Movellan & James L. McClelland Technical Report PDP.CNS.95.4 December 1995 In this paper we examine a regularity found in human perception, called Morton's law, in which stimulus and context have independent influences on perception. This regularity has been used in the past to argue that perception is a feed-forward, non-interactive process. Building on earlier work by McClelland (Cognitive Psychology, 1991) we illustrate how Morton's law may emerge from stochastic interactions between simple processing units. To this end we consider the properties of interactive diffusion networks, the continuous stochastic limit of standard artificial neural models. If, as we believe, human information processing involves using noisy processing elements to process potentially noisy inputs, such models may ultimately serve as foundations for a theory of human information processing. We show that Morton's law emerges in recurrent diffusion networks when the units are organized into separable channels; feed-forward processing is not a necessary condition for Morton's law to hold. Failures to exhibit Morton's law provide evidence that the information channels are not separable. This result can be used to analyze cognitive models as well as actual brain structures. Finally, we illustrate how diffusion networks can be organized to implement optimal Bayesian perceptual inference. ======================================================================= Retrieval information for pdp.cns TRs: unix> ftp 128.2.248.152 # hydra.psy.cmu.edu Name: anonymous Password: ftp> cd pub/pdp.cns ftp> binary ftp> get pdp.cns.95.4.ps.Z # gets this tr ftp> quit unix> zcat pdp.cns.95.4.ps.Z | lpr # or however you print postscript NOTE: The compressed file is 255910 bytes long.
Uncompressed, the file is 727359 bytes long. The printed version is 66 total pages long. For those who do not have FTP access, physical copies can be requested from Barbara Dorney . For a list of available PDP.CNS Technical Reports: > get README For the titles and abstracts: > get ABSTRACTS From hicks at cs.titech.ac.jp Sun Dec 10 09:24:29 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Sun, 10 Dec 1995 23:24:29 +0900 Subject: NFL Summary In-Reply-To: Michael Perrone's message of Fri, 8 Dec 1995 19:27:29 -0500 (EST) <9512090027.AA26165@austen.watson.ibm.com> Message-ID: <199512101424.XAA13664@euclid.cs.titech.ac.jp> Michael Perrone writes: > I think that the NFL results point towards what I feel are extremely > interesting research topics: > ... > Can we identify a set of assumptions that are equivalent to the > assumption that CV model selection improves generalization? CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B. This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. (A quick note: Sometimes it is advocated that a complexity parameter be set by splitting the data set into training and testing, and using CV. Then with the complexity parameter fixed the whole data set can be used to train the other parameters. Behind this is an ASSUMPTION about the independence of the complexity from the other parameters. Of course it often works in practice, but it violates the principle in the above paragraph, so I do not count this as real CV here.) Two prerequisites exist to obtain a GAIN with CV: 1) The objective function must be "compressible". I.e., it cannot be noise.
2) We must have a model which can recognize the structure in the data. This structure might be quite hard to see, as in chaotic signals. I think NFL says that on average CV will not obtain GAINful results, because the chance that a randomly selected problem and a randomly selected algorithm will hit it off is vanishingly small. (Or even any fixed problem and a randomly selected algorithm.) But I think it tells us something more important as well. It tells us that not using CV means we are always implicitly trusting our a priori knowledge. Any reasonable learning algorithm can always predict the training data, or a "smoothed" version of it. But because of the NFL theorem, this, over the ensemble of all algorithms and problems, means nothing. On average there will be no improvement in the off training set error. Fortunately, CV will report this fact by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set datas; CV may be deceptive for a single off training set data.) Thus, we shouldn't think we can do away with CV unless we admit to having great faith in our prior. Going back to NFL, I think it poses another very interesting problem: Supposing we have "a foot in the door". That is, an algorithm which makes some sense of the data by showing some degree of prediction capability. Can we always use this prediction ability to gain better prediction ability? Is there some kind of ability to perform something like steepest descent over the space of algorithms, ONCE we are started on a slope? Is there a provable snowball effect? I think NFL reminds us that we are already rolling down the hill, and we shouldn't think otherwise. 
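[Editorial note: the "on average over all algorithms and problems" claim above can be checked directly on a toy domain. The following sketch is an illustration added here, not part of the original post; the domain size, training set, and learner names are arbitrary choices. Enumerating every boolean target function on a five-point input space shows that a fixed learner and its "anti" counterpart both score exactly chance off the training set.]

```python
from itertools import product

# Toy NFL check: averaged over ALL boolean targets on a small domain,
# any fixed learner's off-training-set accuracy is exactly chance.

domain = range(5)
train_x = [0, 1, 2]                                 # training inputs
test_x = [x for x in domain if x not in train_x]    # off-training-set inputs

def majority_learner(data):
    # predict the majority training label at every off-training-set point
    ones = sum(y for _, y in data)
    return 1 if 2 * ones >= len(data) else 0

def anti_majority_learner(data):
    # deliberately perverse learner: predict the opposite label
    return 1 - majority_learner(data)

def avg_ots_accuracy(learner):
    # average off-training-set accuracy over all 2^5 boolean targets
    total, count = 0.0, 0
    for target in product([0, 1], repeat=len(domain)):
        data = [(x, target[x]) for x in train_x]
        pred = learner(data)
        correct = sum(pred == target[x] for x in test_x)
        total += correct / len(test_x)
        count += 1
    return total / count

print(avg_ots_accuracy(majority_learner))       # 0.5
print(avg_ots_accuracy(anti_majority_learner))  # 0.5
```

Both learners land at exactly 0.5: with a uniform prior over targets, the off-training-set labels are independent of the training data, so no selection of learner can help — which is the NFL situation Hicks describes.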
Craig Hicks Tokyo Institute of Technology From goldfarb at unb.ca Sun Dec 10 10:52:29 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Sun, 10 Dec 1995 11:52:29 -0400 (AST) Subject: NFL Summary In-Reply-To: <9512090027.AA26165@austen.watson.ibm.com> Message-ID: On Fri, 8 Dec 1995, Michael Perrone wrote: > NFL in a Nutshell: > ------------------ > If you make no assumptions about the target function [specifically, about the axiomatic structure of the sample space and the inductive generalization, i.e. which ones are the most general for the purpose] Strangely as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. Objects in the universe do have structure, especially when they have to be represented, as we have learned from the data types in computer science: TO REPRESENT AN OBJECT IS TO MAKE SOME ASSUMPTIONS ABOUT THE OPERATIONS RELATED TO ITS MANIPULATION. Cheers, Lev Goldfarb From XIAODONG at rivendell.otago.ac.nz Sun Dec 10 20:46:21 1995 From: XIAODONG at rivendell.otago.ac.nz (Xiaodong Li, Otago University, New Zealand) Date: Mon, 11 Dec 1995 14:46:21 +1300 Subject: Paper available "Connectionist Model Based on an Optical Thin-Film Model" Message-ID: <01HYONVDU5GYLBVSXM@rivendell.otago.ac.nz> FTP-host: archive.cis.ohio-state.edu FTP-filename:/pub/neuroprose/xli.thinfilm.ps.Z The file xli.thinfilm.ps.Z is now available for ftp from Neuroprose repository. Connectionist Learning Using an Optical Thin-Film Model (4 pages) Martin Purvis and Xiaodong Li Computer and Information Science University of Otago Dunedin, New Zealand ABSTRACT: An alternative connectionist architecture to the one based on the neuroanatomy of biological organisms is described. The proposed architecture is based on an optical thin-film multilayer model, with the thicknesses of thin-film layers serving as adjustable 'weights' for the computation. 
Inputs are encoded into the corresponding refractive indices of individual thin-film layers, while the outputs are typically measured by the overall reflection coefficients off the thin-film layers, at different wavelengths. The nature of the model and some example calculations (a pattern recognition task and classification of the iris data set) that exhibit behaviour typical of conventional connectionist architectures are described. This model has also been used in solving the XOR and 16 four-bit parity problems, and it has demonstrated comparable performance to that of a conventional feed-forward neural network model using Back-propagation learning. This paper is also available in the proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems (ANNES'95), IEEE Computer Society Press, Los Alamitos, California, 1995, pp. 63-66. Comments are greatly appreciated. -- Xiaodong Li Email: Xiaodong at otago.ac.nz Http: http://divcom.otago.ac.nz:800/COM/INFOSCI/SECML/xdli/xiao.htm (Postscript file of this paper is also available here at my homepage) From prechelt at ira.uka.de Mon Dec 11 07:11:32 1995 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Mon, 11 Dec 1995 13:11:32 +0100 Subject: NN Benchmarking WWW homepage Message-ID: <"iraun1.ira.487:11.12.95.12.12.22"@ira.uka.de> The homepage of the very successful NIPS*95 workshop on benchmarking has now been converted into a repository for information about benchmarking issues: Status quo, methodology, facilities, and related info. I kindly ask everybody who has additional information that should be on the page (in particular sources or potential sources of learning data of all kinds) to submit that information to me. Other comments are also welcome. The URL is http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html The page is also still reachable over the benchmarking workshop link on the NIPS*95 homepage. Below is a textual version of the page.
Lutz Lutz Prechelt (http://wwwipd.ira.uka.de/~prechelt/) | Whenever you Institut f. Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get (Phone: +49/721/608-4068, FAX: +49/721/694092) | less simple. =============================================== Benchmarking of learning algorithms information repository page Abstract: Proper benchmarking of (neural network and other) learning architectures is a prerequisite for orderly progress in this field. In many published papers deficiencies can be observed in the benchmarking that is performed. A workshop about NN benchmarking at NIPS*95 addressed the status quo of benchmarking, common errors and how to avoid them, currently existing benchmark collections, and, most prominently, a new benchmarking facility including a results database. This page contains pointers to written versions or slides of most of the talks given at the workshop plus some related material. The page is intended to be a repository for such information to be used as a reference by researchers in the field. Note that most links lead to Postscript documents. Please send any additions or corrections you might have to Lutz Prechelt (prechelt at ira.uka.de). Workshop Chairs: Thomas G. Dietterich, Geoffrey Hinton, Wolfgang Maass, Lutz Prechelt [communicating chair], Terry Sejnowski Assessment of the status quo: * Lutz Prechelt. A quantitative study of current benchmarking practices. A quantitative survey of 400 journal articles of 1993 and 1994 on NN algorithms. Most articles used far too few problems during benchmarking. * Arthur Flexer. Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice. Argues that what is reported about the benchmarks, and how it is reported, is insufficient. Methodology: * Tom Dietterich.
Experimental Methodology Benchmarking types, correct statistical testing, synthetic versus real-world data, understanding via algorithm mutation or data mutation, data generators. * Lutz Prechelt. Some notes on neural learning algorithm benchmarking. A few general remarks about volume, validity, reproducibility, and comparability of benchmarking; DOs and DON'Ts. * Brian Ripley. What can we learn from the study of the design of experiments? (Only two slides, though). * Brian Ripley. Statistical Ideas for Selecting Network Architectures. (Also somewhat related to benchmarking.) Benchmarking facilities: * Previously available NN benchmarking data collections CMU nnbench, UCI machine learning databases archive, Proben1, StatLog data, ELENA data. Advantages of these: UCI is large and growing and popular, StatLog has the largest and most orderly collection of results available (in a book, though), and Proben1 is easiest to use and best supports reproducible experiments. ELENA and nnbench have no particular advantages. Disadvantages: UCI and Proben1 have too few and too unstructured results available, Proben1 is also inflexible and small, StatLog is partially confidential and neither data nor results collection are growing. * Carl Rasmussen and Geoffrey Hinton. DELVE: A thoroughly designed benchmark collection A proposal of data, terminology, and procedures and a facility for the collection of benchmarking results. This is the newly proposed standard for benchmarking NN (and other) learning algorithms. DELVE is currently still under construction at the University of Toronto. Other sources of data: (Thanks to Nici Schraudolph) There is a large amount of game data about the board game Go available on the net. One starting point is here. Others are the Go game database project, and the Go game server. The database holds several hundred thousand games of Go and could for instance be used for advanced reinforcement learning projects.
Last correction: 1995/12/11 Please send additions and corrections to Lutz Prechelt, prechelt at ira.uka.de. To NIPS homepage. To original homepage of this workshop. From mpp at watson.ibm.com Mon Dec 11 08:42:59 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Mon, 11 Dec 1995 08:42:59 -0500 (EST) Subject: compressibility and generalization In-Reply-To: <199512080049.JAA10560@euclid.cs.titech.ac.jp> from "hicks@cs.titech.ac.jp" at Dec 8, 95 09:49:53 am Message-ID: <9512111342.AA25646@austen.watson.ibm.com> [hicks at cs.titech.ac.jp wrote:] > PSS. What is anti-cross validation? Suppose we are given a set of functions and a crossvalidation data set. The CV and Anti-CV algorithms are as follows: CV: Choose the function with the best performance on the CV set. Anti-CV: Choose the function with the worst performance on the CV set. (And for this year's NIPS motif: Anti-EM: Dorothy? Dorothy? :-) Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From hicks at cs.titech.ac.jp Mon Dec 11 20:01:05 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 12 Dec 1995 10:01:05 +0900 Subject: compressibility and generalization In-Reply-To: "Michael Perrone"'s message of Mon, 11 Dec 1995 08:42:59 -0500 (EST) <9512111342.AA25646@austen.watson.ibm.com> Message-ID: <199512120101.KAA16136@euclid.cs.titech.ac.jp> "Michael Perrone" wrote: >[hicks at cs.titech.ac.jp wrote:] >> PSS. What is anti-cross validation? >Suppose we are given a set of functions and a crossvalidation data set. >The CV and Anti-CV algorithms are as follows: > CV: Choose the function with the best performance on the CV set. 
>Anti-CV: Choose the function with the worst performance on the CV set. case 1: * Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* In this case both Anti-CV and CV provide (ON AVERAGE) equal improvement in prediction ability: none. For that matter so will ANY method of selection. Moreover, if we plot a graph of the number of data used for training vs. the estimated error (using the residual data), we will (ON AVERAGE) see no decrease in estimated error. Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". case 2: * The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function. This is different from the ensemble and distribution of target functions, which is a much bigger question. We already know much about the ensemble of samples from a fixed target function. I am not avoiding the issue of the ensemble or distribution of target functions, but merely showing that we have 2 general cases, and that in both of them CV is never WORSE than anti-CV. It follows that whatever the distribution of targets is, CV is never worse (ON AVERAGE) than anti-CV. I don't believe this contradicts NFL in any way. It just clarifies the role that CV can play. Learning and monitoring prediction error go hand in hand. This is even more true for cases when the underlying function may be changing and the data has the form of an infinite stream.
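[Editorial note: case 2 above can be illustrated with a small simulation. The sketch below is an editorial addition, not part of the original post; the target, candidate slopes, noise level, and sample sizes are arbitrary choices. For a fixed compressible target, selection by CV beats anti-CV averaged over random samplings of the validation set.]

```python
import random

random.seed(0)

def target(x):
    # fixed, compressible target (Hicks's case 2)
    return 2.0 * x

# candidate models f_a(x) = a*x; a=2 matches the target
candidates = [lambda x, a=a: a * x for a in (0.0, 1.0, 2.0, 3.0)]

def val_error(f, xs):
    # squared error against noisy observations of the target
    return sum((f(x) - (target(x) + random.gauss(0, 0.5))) ** 2
               for x in xs) / len(xs)

def select(anti):
    # one random sampling of prediction ability: CV picks the best
    # validation score, anti-CV the worst
    xs = [random.uniform(-1, 1) for _ in range(10)]
    scored = [(val_error(f, xs), f) for f in candidates]
    return (max if anti else min)(scored, key=lambda t: t[0])[1]

def true_error(f, n=200):
    # noise-free generalization error of the selected model
    xs = [random.uniform(-1, 1) for _ in range(n)]
    return sum((f(x) - target(x)) ** 2 for x in xs) / n

trials = 200
cv_err  = sum(true_error(select(anti=False)) for _ in range(trials)) / trials
acv_err = sum(true_error(select(anti=True))  for _ in range(trials)) / trials
print(cv_err < acv_err)  # CV is not worse on average for this structured target
```

Anti-CV ends up preferring the most anticorrelated slope almost every time, so its average true error is far larger; under case 1 (a pure-noise target) the two selectors would instead tie on average.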
Craig Hicks Tokyo Institute of Technology From GIOIELLO at cres.it Mon Dec 11 19:13:43 1995 From: GIOIELLO at cres.it (GIOIELLO) Date: Tue, 12 Dec 1995 01:13:43 +0100 Subject: A neural net based OCR demo for both Windows/DOS and Mac OS is available Message-ID: <01HYP9T0BSPU934ROD@cres.it> Dear Netters, An OCR demo for Mac OS is available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-demo.cpt.hqx A Windows and DOS version is also available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-Win.zip This latter version also offers a richer set of capabilities. The OCR is based on a three-layer MLP. Conjugate gradient descent was used to train the net. Training and test sets were those of NIST. The related papers can be found at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/papers/handwritten Several VLSI architectures to implement the OCR device using a digital implementation of the proposed MLP are also described in the papers. An overview of the activities we carry on can be found at the following URL: http://wwwcsai.diepa.unipa.it/research/projects/vlsinn/handcare/handcare.html Best Regards, Giuseppe A. M. Gioiello E-Mail: gioiello at diepa.unipa.it URL: http://wwwcsai.diepa.unipa.it/people/doctors/gioiello/gioiello.html From ernst at kuk.klab.caltech.edu Tue Dec 12 12:02:22 1995 From: ernst at kuk.klab.caltech.edu (Ernst Niebur) Date: 12 Dec 1995 17:02:22 GMT Subject: Training opportunities in Computational Neuroscience at Johns Hopkins University Message-ID: The Zanvyl Krieger Mind/Brain Institute at Johns Hopkins University is an interdisciplinary research center devoted to the investigation of the neural mechanisms of mental function and particularly to the mechanisms of perception: How is complex information represented and processed in the brain, how is it stored and retrieved, and which brain centers are critical for these operations?
The Institute intends to significantly enhance its research program in Computational Neuroscience and encourages students with interest in this domain to apply for the graduate program in the Neuroscience department. Research opportunities exist in all of the laboratories of the Institute. Interdisciplinary projects, involving the student in more than one laboratory, are particularly encouraged. At present, MBI faculty include (listed with primary field of interest and methodology used): C. Ed Connor, PhD: Visual selective attention (electrophysiology in the awake behaving monkey). Stewart Hendry, PhD: Organization and plasticity of mammalian cerebral cortex (primate neuroanatomy). Steve S. Hsiao, PhD: Neurophysiology of tactile perception (electrophysiology in the awake behaving monkey). Kenneth O. Johnson, PhD: Neurophysiology of the somatosensory system (electrophysiology in the awake behaving monkey). Guy McKhann, MD (Director of MBI): Cognitive and neurologic outcomes after cardiac surgery; immunologic attack on peripheral motor axonal membranes in the human and experimental animal (neurology). Ernst Niebur, PhD: Theoretical Neuroscience (computational and mathematical modeling). Gian F Poggio, PhD: Analysis of Stereopsis and Texture (electrophysiology in the awake behaving monkey). Michael A. Steinmetz, PhD: Neurophysiological mechanisms in visual-spatial perception (electrophysiology in the awake behaving monkey). Ruediger von der Heydt, PhD: Neural mechanisms of visual perception (electrophysiology in the awake behaving monkey). Additional research opportunities exist in collaborative work with faculty in the Psychology Department (located next door to the Mind/Brain Institute), in particular with Drs. Howard Egeth (attention, perception, cognition), Michael Rudd (computational vision, psychophysics), Trisha Van Zandt (mathematical modelling, neural networks and memory), and Steven Yantis (visual perception, attention, mathematical modeling). 
All students accepted to the PhD program of the Neuroscience department receive full tuition remission plus a stipend at or above the National Institutes of Health predoctoral level. The Mind/Brain Institute is located on the very attractive Homewood campus in Northern Baltimore. Applicants should have a B.S. or B.A. with a major in any of the biological or physical sciences. Applicants are required to take the Graduate Record Examination (GRE), both the aptitude tests and an advanced test, or the Medical College Admission Test. Further information on the admission procedure can be obtained from the Department of Neuroscience: Director of Graduate Studies Neuroscience Training Program Department of Neuroscience The Johns Hopkins University School of Medicine 725 Wolfe Street Baltimore, MD 21205 Completed applications (including three letters of recommendation and either GRE scores or Medical College Admission Test scores) must be _received_ by January 1, 1996 at the above address. Candidates for whom this is impossible, or those who need additional information, should immediately contact Prof. Ernst Niebur The Zanvyl Krieger Mind/Brain Institute Johns Hopkins University 3400 N. Charles Street Baltimore, MD 21218 niebur at jhu.edu -- Ernst Niebur Krieger Mind/Brain Institute Asst. Prof. of Neuroscience Johns Hopkins University niebur at jhu.edu 3400 N. Charles Street (410)516-8643, -8640 (secr), -8648 (fax) Baltimore, MD 21218 From dhw at santafe.edu Tue Dec 12 17:25:06 1995 From: dhw at santafe.edu (David Wolpert) Date: Tue, 12 Dec 95 15:25:06 MST Subject: The last of a dying thread Message-ID: <9512122225.AA00709@sfi.santafe.edu> Some comments on the NFL thread. Huaiyu Zhu writes >>> 2. The *mere existence* of structure guarantees a (not uniformly-random) algorithm as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good. >>> This is a crucial point. 
It also seems to be one lost on many of the contributors to this thread, even those subsequent to Zhu's posting. Please note in particular that the knowledge that "the universe is highly compressible" can NOT, by itself, be used to circumvent NFL. I can only plead again: Those who are interested in this issue should look at the papers directly, so they have at least passing familiarity with the subject before discussing it. :-) ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z. Craig Hicks then writes: >>> However, I interpret the assertion that anti-cross validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should. >>> That's part of the power of the NFL theorems - they prove that Hicks' intuition, an intuition many people share, is in fact wrong. >>> I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). >>> Nope. The central limit theorem is not directly germane. See all the previous discussion on NFL and Vapnik. >>>> CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B.
This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. Fortunately, CV will report this (failure to generalize) by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set datas; CV may be deceptive for a single off training set data.) >>> This is wrong (or at best misleading). Please read the NFL papers. In fact, if the head-to-head minimax hypothesis concerning xvalidation presented in those papers is correct, xvalidation is wrong more often than it is right. In which case CV is "deceptive" more often (!!!) than not. Lev Goldfarb wrote >>> Strangely as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. >>> It may help if people read some of the many papers (Cox, de Finetti, Erickson and Smith, etc., etc.) that prove that the only consistent way of dealing with uncertainty is via probability theory. In other words, there is nothing *but* statistics, in the real world. (Perhaps occurring in prior knowledge that you're looking for a group, but statistics nonetheless.) David Wolpert From lemm at LORENTZ.UNI-MUENSTER.DE Wed Dec 13 09:46:52 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Wed, 13 Dec 1995 15:46:52 +0100 Subject: NFL and practice Message-ID: <9512131446.AA13879@xtp141.uni-muenster.de> Some remarks on Craig Hicks' arguments on crossvalidation and NFL in general from my point of view: One may discuss NFL for theoretical reasons, but the conditions under which NFL-Theorems hold are not those which are normally met in practice. 1.) In short, NFL assumes that data, i.e.
information of the form y_i=f(x_i), do not contain information about function values on a non-overlapping test set. This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... (with respect to Craig's two cases this average would include a third case: target and model are anticorrelated so anticrossvalidation works better) and "vertical" likelihoods. So, in an NFL setting data never say anything about function values for new arguments. This seems rather trivial under this assumption and one has to ask how natural such an NFL situation is. 2.) Information of the form y_i=f(x_i) is rather special and not what we normally have. There is much information which is not of this "single sharp data" type. (See examples below.) There is absolutely no reason why information which depends on more than one f(x_i) should not be incorporated. (This can be done using nonuniform priors or in a way more symmetrical to "sharp data".) NFL just describes the situation in which we don't have any such information but much of the (then quite useless) "sharp data". But these sharp data are no less (maybe more) obscure than other forms of information. Information which is not of this "single sharp data" form but includes many or all f(x_i) to produce one answer normally induces correlations between target and generalizer if incorporated into the generalizer. At the same time there is no real off training set anymore! Examples: 3) Information such as symmetries (even if only approximate), maxima, Fourier components (and much, much more ...) involves more than one f(x_i). Fourier components, for example, can be seen as sharp data but for different basis vectors, i.e. asking for momentum instead of location. This shows again that the definition of "sharp data" corresponds to choosing a "basis of questions" and is not a natural entity!!! 4) Real measurements (especially of continuous variables) normally also do NOT have the form y_i=f(x_i)!
They mostly perform some averaging over f(x_i) or at least they have some noise on the x_i (as small as you like, but present). In the latter case of "sharp" noise, posing the same question several times also gives you an average of several (nearby) y with different x_i of the underlying true function. In both cases the averaging is equivalent to regularization for the "effective" function which we can observe!!! This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. There is no need to see it just as a subjective prior! (The same could be said on a quantum-mechanical level, but that's another story.) It follows that NFL results do NOT hold for the "effective" functions in such situations, even if assuming NFL for the underlying true functions. 5.) NFL again: Averaging or noise in the input space of the x_i requires a probability distribution in that space which can be defined independently of a specific function. Noise means that x_i is a random variable dependent on the actual question z_i, i.e. p(actual argument = x_i | question=z_i), and it is f(z_i) which we can observe. If you don't accept a given p(x_i|z_i), I am sure you can average over "all possible" such relations with unrestricted "uniform" priors to find that it is impossible to obtain any information about any function without assuming a priori that you know something about what you are asking. This could be seen as another NFL-Theorem for questions: You do not even get information about a single function value if you don't know (assume, define) a priori what you are asking! 6.) With respect to the underlying "true" function, off-training set error itself, an important concept for NFL, is in general no longer a measurable quantity if input noise or averaging is present!! (For simplicity let's assume that noise or averaging includes all questions x_i. 
Then in the case of noise you only have a probability for the x_i to belong to the "true" training set and averaging includes all questions x_i.) So for the "true" functions there remains nothing for NFL to say anything about, and for the "effective" functions NFL is not valid! To conclude: In many interesting cases "effective" function values contain information about other function values and NFL does not hold! The very special handling of "sharp data" in comparison to other information must be discussed in many more learning theories. Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From wray at ptolemy-ethernet.arc.nasa.gov Wed Dec 13 17:06:42 1995 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Wed, 13 Dec 95 14:06:42 PST Subject: one revised paper and NIPS slides by Buntine Message-ID: <9512132206.AA08307@ptolemy.arc.nasa.gov> Dear Connectionists, Please note the following two WWW resources. One, a forthcoming journal paper, and the other, slides from a NIPS'95 Workshop presentation. Also, please note my new address, email, and company. I am no longer at Heuristicrats. Wray Buntine Thinkbank, Inc. +1 (510) 540-6080 [voice] 1678 Shattuck Avenue, Suite 320 +1 (510) 540-6627 [fax] Berkeley, CA 94709 wray at Thinkbank.COM ============ Article URL: http://www.thinkbank.com/wray/graphbib.ps.Z (about 240Kb compressed) TITLE: A guide to the literature on learning probabilistic networks from data AUTHOR: Wray Buntine, Thinkbank JOURNAL: Accepted for IEEE Trans. on Knowledge and Data Eng., final draft submitted. ABSTRACT: This literature review discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks. Connections are drawn between the statistical, neural network, and uncertainty communities, and between the different methodological communities, such as Bayesian, description length, and classical statistics. 
Basic concepts for learning and Bayesian networks are introduced and methods are then reviewed. Methods are discussed for learning parameters of a probabilistic network, for learning the structure, and for learning hidden variables. The presentation avoids formal definitions and theorems, as these are plentiful in the literature, and instead illustrates key concepts with simplified examples. KEYWORDS: Bayesian networks, graphical models, hidden variables, learning, learning structure, probabilistic networks, knowledge discovery =========== Talk URL: http://www.thinkbank.com/wray/refs.html (and look under Talks for NIPS) TITLE: Compiling Probabilistic Networks and Some Questions this Poses. AUTHOR: Wray Buntine WORKSHOP: NIPS'95 Workshop on Learning Graphical Models ABSTRACT: Probabilistic networks (or similar) provide a high-level language that can be used as the input to a compiler for generating a learning or inference algorithm. Example compilers are BUGS (inputs a Bayes net with plates) by Gilks, Spiegelhalter, et al., and MultiClass (inputs a dataflow graph) by Roy. This talk will cover three parts: (1) an outline of the arguments for such compilers for probabilistic networks, (2) an introduction to some compilation techniques, and (3) the presentation of some theoretical challenges that compilation poses. High-level language compilers are usually justified as a rapid prototyping tool. In learning, rapid prototyping arises for the following reasons: good priors for complex networks are not obvious and experimentation can be required to understand them; several algorithms may suggest themselves and experimentation is required for comparative evaluation. These and other justifications will be described in the context of some current research on learning probabilistic networks, and past research on learning classification trees and feed-forward neural networks. 
Techniques for compilation include the data flow graph, automatic differentiation, Markov chain Monte Carlo samplers of various kinds, and the generation of C code for certain exact inference tasks. With this background, I will then pose a number of research questions to the audience. =========== From bernabe at cnm.us.es Tue Dec 12 07:39:41 1995 From: bernabe at cnm.us.es (Bernabe Linares B.) Date: Tue, 12 Dec 95 13:39:41 +0100 Subject: two papers in neuroprose Message-ID: <9512121239.AA17985@cnm1.cnm.us.es> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/bernabe.art1-nn.ps.Z (30 pages, 257846 bytes) pub/neuroprose/bernabe.art1-vlsi.ps.Z (26 pages, 311686 bytes) The files "bernabe.art1-nn.ps.Z" and "bernabe.art1-vlsi.ps.Z" are now available for copying from the Neuroprose repository. They contain two papers which have been accepted for publication in the following journals: PAPER1: Journal: IEEE Transactions on VLSI Systems Title: "A Real-Time Clustering Microchip Neural Engine" File: bernabe.art1-vlsi.ps.Z PAPER2: Journal: Neural Networks Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" File: bernabe.art1-nn.ps.Z Authors: Teresa Serrano-Gotarredona and Bernabe Linares-Barranco Affiliation: National Microelectronics Center (CNM), Sevilla, SPAIN. Sorry, no hardcopies available. Brief description of papers follows: -------------------------------------------------------------------- PAPER1: ------- File: bernabe.art1-vlsi.ps.Z, 26 pages, 311686 bytes. Title: "A Real-Time Clustering Microchip Neural Engine" Abstract This paper presents an analog current-mode VLSI implementation of an unsupervised clustering algorithm. The clustering algorithm is based on the popular ART1 algorithm [1], but has been modified, resulting in a more VLSI-friendly algorithm [2], [3] that allows a more efficient hardware implementation with simple circuit operators, low memory requirements, modular chip assembly capability, and higher speed figures. 
The chip described in this paper implements a network that can cluster input patterns of 100 binary pixels into up to 18 different categories. Modular expansibility of the system is directly possible by assembling an NxM array of chips without any extra interfacing circuitry, so that the maximum number of clusters is 18xM and the maximum number of bits of the input pattern is Nx100. Pattern classification and learning are performed in 1.8us, which is equivalent to a computing power of 4.4x10^9 connections per second plus connection-updates per second. The chip has been fabricated in a standard low-cost 1.6um double-metal single-poly CMOS process, has a die area of 1cm^2, and is mounted in a 120-pin PGA package. Although internally the chip is analog in nature, it interfaces to the outside world through digital signals, and thus has a true asynchronous digital behavior. Experimental chip test results are available, obtained through digital chip test equipment. Fault tolerance at the system level is demonstrated through the experimental testing of faulty chips. -------------------------------------------------------------------- PAPER2: ------- File: bernabe.art1-nn.ps.Z, 30 pages, 257846 bytes. Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" Abstract This paper presents a modification to the original ART1 algorithm [Carpenter, 1987a] that is conceptually similar, can be implemented in hardware with less sophisticated building blocks, and maintains the computational capabilities of the originally proposed algorithm. This modified ART1 algorithm (which we will call here ART1m) is the result of hardware-motivated simplifications investigated during the design of an actual ART1 chip [Serrano, 1994, 1996]. The purpose of this paper is simply to justify theoretically that the modified algorithm preserves the computational properties of the original one and to study the difference in behavior between the two approaches. 
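[Editor's illustration] For readers unfamiliar with the algorithm family being modified, here is a minimal fast-learning ART1 sketch in Python. This is plain textbook ART1 (choice parameter beta, vigilance rho, templates updated by intersection), NOT the ART1m variant of the paper; the function name and parameter defaults are the editor's own:

```python
def art1_cluster(patterns, rho=0.7, beta=1.0):
    """Cluster binary patterns (lists of 0/1) with fast-learning ART1.

    rho  : vigilance in (0, 1]; higher rho -> finer categories.
    beta : choice parameter (> 0), breaking ties towards larger templates.
    Returns (labels, templates).
    """
    templates = []  # one binary template (weight vector) per category
    labels = []
    for p in patterns:
        norm_p = sum(p)
        # Rank categories by the choice function T_j = |p AND w_j| / (beta + |w_j|).
        order = sorted(
            range(len(templates)),
            key=lambda j: -sum(pi & wi for pi, wi in zip(p, templates[j]))
            / (beta + sum(templates[j])),
        )
        for j in order:
            overlap = [pi & wi for pi, wi in zip(p, templates[j])]
            # Vigilance test: does the matched template cover enough of p?
            if norm_p and sum(overlap) / norm_p >= rho:
                templates[j] = overlap  # fast learning: w_j <- p AND w_j
                labels.append(j)
                break
        else:
            templates.append(list(p))  # no resonance: commit a new category
            labels.append(len(templates) - 1)
    return labels, templates
```

For example, `art1_cluster([[1,1,0,0],[1,1,0,0],[0,0,1,1]])` assigns the two identical patterns to one category and the disjoint pattern to a second.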
-------------------------------------------------------------------- ftp instructions are: % ftp archive.cis.ohio-state.edu Name : anonymous Password: ftp> cd pub/neuroprose ftp> binary ftp> get bernabe.art1-nn.ps.Z ftp> get bernabe.art1-vlsi.ps.Z ftp> quit % uncompress bernabe.art1-nn.ps.Z % uncompress bernabe.art1-vlsi.ps.Z % lpr -P bernabe.art1-nn.ps % lpr -P bernabe.art1-vlsi.ps These files are also available from the node "ftp.cnm.us.es", user "anonymous", directory /pub/bernabe/publications, files: "NN_art1theory_96.ps.Z" and "TVLSI_art1chip_96.ps.Z". Any feedback will be appreciated. Thanks, Dr. Bernabe Linares-Barranco National Microelectronics Center (CNM) Dept. of Analog Design Ed. CICA, Av. Reina Mercedes s/n, 41012 Sevilla, SPAIN. Phone: 34-5-4239923, Fax: 34-5-4624506, E-mail: bernabe at cnm.us.es From bishopc at helios.aston.ac.uk Wed Dec 13 14:52:48 1995 From: bishopc at helios.aston.ac.uk (Prof. Chris Bishop) Date: Wed, 13 Dec 1995 19:52:48 +0000 Subject: New Book: Neural Networks for Pattern Recognition Message-ID: <1400.9512131952@sun.aston.ac.uk> -------------------------------------------------------------------- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -------------------------------------------------------------------- "Neural Networks for Pattern Recognition" ----------------------------------------- Christopher M. Bishop (Oxford University Press) Full details at: http://neural-server.aston.ac.uk/NNPR/ This book provides the first comprehensive treatment of neural networks from the perspective of statistical pattern recognition. * 504 pages * 160 figures * 129 graded exercises * a self-contained introduction to statistical pattern recognition * an extensive treatment of Bayesian methods * paperback and hardback editions * 300 references Contents: --------- 1. Statistical Pattern Recognition 2. Probability Density Estimation 3. Single-layer Networks 4. The Multi-layer Perceptron 5. Radial Basis Functions 6. 
Error Functions 7. Parameter Optimization Algorithms 8. Pre-processing and Feature Extraction 9. Learning and Generalization 10. Bayesian Techniques ***** Instructors wishing to use this text as the basis for a course may request a complimentary examination copy from the publishers. (USA: fax request to 212-726-6442 with brief description of the course) ***** Ordering information: --------------------- ISBN 0-19-853864-2 paperback 0-19-853849-9 hardback USA: 45 dollars paperback ---- 98 dollars hardback Credit card orders: Tel: 1-800-451-7556 (toll free) By post, send payment to: Order Dept. Oxford University Press 2001 Evans Road Cary, NC 27513 USA (3 dollars shipping for first copy, 1 dollar each thereafter) Canada: Tel: 1-800-387-8020 (toll free) ------- UK: 25 pounds paperback --- 55 pounds hardback Tel: 01536 454 534 (from the UK) Tel: +44 1536 454 534 (from abroad) By post, send payment to: CWO Department Oxford University Press Saxon Way West, Corby Northants NN18 9ES, UK (3.53 pounds postage) By fax: 01536 746 337 (from the UK) +44 1536 746 337 (from abroad) ---------------------------------------------------------------------- Prof. Christopher M. Bishop Tel. +44 (0)121 333 4631 Neural Computing Research Group Fax. +44 (0)121 333 4586 Dept. of Computer Science c.m.bishop at aston.ac.uk & Applied Mathematics http://neural-server.aston.ac.uk/ Aston University Birmingham B4 7ET, UK ---------------------------------------------------------------------- From zhuh at helios.aston.ac.uk Thu Dec 14 13:12:43 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Thu, 14 Dec 1995 18:12:43 +0000 Subject: No free lunch for Cross Validation! 
Message-ID: <2237.9512141812@sun.aston.ac.uk> Dear Colleagues, A little while ago someone claimed that Cross validation will benefit from the presence of any structure, and if there is no structure it does no harm; yet NFL explicitly states that a structure can be equally good or bad for any given method, depending on how they match each other; yet it was further claimed that they do not conflict with each other. I was quite curious and did the following five-minute experiment to find out which is correct. Suppose we have a Gaussian variable x, with mean mu and unit variance. We have the following three estimators for estimating mu from a sample of size n. A: The sample mean. It is optimal both in the sense of Maximum Likelihood and Least Mean Squares. B: The maximum of the sample. It is a bad estimator in any reasonable sense. C: Cross validation to choose between A and B, with one extra data point. The numerical result, with n=16 and averaged over 10000 samples, gives mean squared error: A: 0.0627 B: 3.4418 C: 0.5646 This clearly shows that cross validation IS harmful in this case, despite the fact that it is based on a larger sample. NFL still wins! Many of you might jump on me at this point: But this is a very artificial example, which is not what normally occurs in practice. To this I have two answers, short and long. The short answer is one of principle. Any counter-example, however artificial it is, clearly demolishes the hope that cross validation is a "universally beneficial method". The longer answer is divided into several parts, which hopefully will answer any potential criticism from any aspect: 1. The cross validation is performed on extra data points. We are not requiring it to perform as well as the mean on 17 data points. If it cannot extract more information from the one extra data point, a minimum requirement is that it keeps the information in the original 16 points. But it can't even do this. 2. The maximum of a sample is the 100th percentile. 
The median is the 50th percentile, which is in fact a quite reasonable estimator. Let us use a larger cross validation set (of size k), and replace B with a different percentile. The result is that, for the median, CV needs k>2 to work. For the 70th percentile CV needs k>16. The required k increases dramatically with the percentile. 3. It is not true that we have set up a case in which cross validation can't win. There is indeed a small probability that a sample can be so bad that the sample maximum is even a better estimate than the sample mean. However, to utilise such rare chances to good effect k must be at least several hundred (maybe exponential) while n=16. We know such k exists since k=infinity certainly helps. Yet to adopt such a method is clearly absurd. 4. Although we have chosen estimator A to be the known optimal estimator in this case, it can be replaced by something else. For example, both A and B can be some reasonable averages over percentiles, so that without detailed analysis it may appear that doing cross validation might give a C which is better than both A and B. Such beliefs can be defeated by similar counter-examples. 5. The above scheme of cross validation may appear different from what is familiar, but here is a "practical example" which shows that it is indeed what people normally do. Suppose we have a random variable which is either Gaussian or Cauchy. Consider the following three estimators: A: Sample mean: It has 100% efficiency for Gaussian, and 0% efficiency for Cauchy. B: Sample median: It is 2/pi=63.66% efficient for Gaussian and 8/pi^2=81.06% efficient for Cauchy. C: Cross validation on an additional sample of size k, to choose between A and B. Intuitively it appears quite reasonable to expect cross validation to pick out the correct one most of the time, so that, if averaged over all samples, C ought to be superior to both A and B. But no!! This will depend on the PRIOR mixing probability of these two sub-models. 
If the variable is in fact always Gaussian, then we have just seen that if n=16, CV will be worse unless k>2. The same is even more true in the reverse case, since the mean is an essentially useless estimator for Cauchy. 6. In any of the above cases, "anti cross validation" would be even more disastrous. If you are not convinced by these arguments, or if you want to know more about efficiency, then maybe the following reference can help: Fisher, R.A.: Theory of statistical estimation, Proc. Camb. Phil. Soc., Vol. 22, pp. 700-725, 1925. If you are more or less convinced, I have the following speculation: Several centuries ago, the French Academy of Science (or is it the Royal Society?) made a decision that they would no longer examine inventions of "perpetual motion machines", on the grounds that the Law of Energy Conservation was so reliable that it would defeat any such attempt. History proved that this was a wise decision, which redirected effort towards designing machines that utilise the energy in fuel. Should we expect the same fate for "universally beneficial methods" in the face of NFL? Should we put more effort into designing methods which use prior information? posterior information <= prior information + data information. 
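[Editor's illustration] Zhu's five-minute experiment is easy to reproduce. Below is a minimal Python sketch; the rule "C picks whichever of A and B has the smaller squared error on the one held-out point" is the editor's reading of the post (the exact CV rule is not spelled out there), and the function name is the editor's own:

```python
import random

def mse_of_estimators(n=16, trials=10000, mu=0.0, seed=0):
    """Estimate mu of a unit-variance Gaussian from n points,
    comparing Zhu's three estimators by mean squared error."""
    rng = random.Random(seed)
    sse = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        a = sum(sample) / n  # A: sample mean
        b = max(sample)      # B: sample maximum
        # C: cross-validate on one extra point, then choose A or B.
        held_out = rng.gauss(mu, 1.0)
        c = a if (a - held_out) ** 2 <= (b - held_out) ** 2 else b
        for key, est in (("A", a), ("B", b), ("C", c)):
            sse[key] += (est - mu) ** 2
    return {k: v / trials for k, v in sse.items()}

mse = mse_of_estimators()
print(mse)  # A smallest (near 1/16), B largest, C in between, as in the post
```

With these choices the ordering A < C < B comes out as Zhu reports: CV on a 17th point does worse than simply using the sample mean, though far better than the sample maximum.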
-- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From C.Campbell at bristol.ac.uk Thu Dec 14 11:21:26 1995 From: C.Campbell at bristol.ac.uk (I C G Campbell) Date: Thu, 14 Dec 1995 16:21:26 +0000 (GMT) Subject: New Web Page (Bristol University, UK) Message-ID: <199512141621.QAA11250@zeus.bris.ac.uk> The Neural Computing Research Group at Bristol University, UK has recently set up a WWW page describing their interests at: http://www.fen.bris.ac.uk/engmaths/research/neural/neural.html Our interests cover three main areas: theory of neural computation, modelling simple neurobiological systems and applications of neural computing in engineering. Collectively we have produced in excess of 100 publications related to neural computing in these topic areas. Further details about these publications, current research interests and research grants may be found on the above page. Merry Xmas Colin Campbell University of Bristol From robert at fit.qut.edu.au Thu Dec 14 19:24:04 1995 From: robert at fit.qut.edu.au (Robert Andrews) Date: Fri, 15 Dec 1995 10:24:04 +1000 Subject: Rule Extraction Mailing List Message-ID: <199512150024.KAA15975@ocean.fit.qut.edu.au> =-=-=-=-= RULE EXTRACTION FROM ARTIFICIAL NEURAL NETWORKS =-=-=-=-=-=-=-=- ANNOUNCEMENT OF MAILING LIST Rule Extraction from Artificial Neural Networks and the related field of Rule Refinement are topics of increasing interest and importance. This is to announce the formation of a moderated mailing list for researchers and students interested in these areas. 
If you are interested in becoming a subscriber to this list please send the following information by return mail: Name: Organisation/Institution: E-mail Address: =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Mr Robert Andrews School of Information Systems robert at fit.qut.edu.au Faculty of Information Technology R.Andrews at qut.edu.au Queensland University of Technology +61 7 864 1656 (voice) GPO Box 2434 _--_|\ +61 7 864 1969 (fax) Brisbane Q 4001 / QUT Australia \_.--._/ http://www.fit.qut.edu.au/staff/~robert v =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From l.s.smith at cs.stir.ac.uk Fri Dec 15 05:12:09 1995 From: l.s.smith at cs.stir.ac.uk (Dr L S Smith (Staff)) Date: Fri, 15 Dec 1995 10:12:09 GMT Subject: TR on generalization available Message-ID: <19951215T101209Z.KAA27913@katrine.cs.stir.ac.uk> Dear all: We have a new TR available by ftp from here: CCCN Technical report CCCN-21, December 1995. A Theoretical Study of the Generalization Ability of Feed-Forward Neural Networks. M J Roberts. By making assumptions on the probability distribution of the potentials in a feed-forward neural network we have derived lower bounds for the generalization ability of the network in terms of the number of training patterns. The results are consistent with simulations carried out on a simple geometrical function. The URL is ftp://ftp.cs.stir.ac.uk/pub/tr/cccn/TR21.ps.Z If you really can't access this hard copies are available, but only as a last resort. Dr Leslie S. 
Smith Dept of Computing and Mathematics, Univ of Stirling Stirling FK9 4LA Scotland lss at cs.stir.ac.uk (NeXTmail welcome) Tel (44) 1786 467435 Fax (44) 1786 464551 www http://www.cs.stir.ac.uk/~lss/ From bastiane at irit.fr Fri Dec 15 09:07:57 1995 From: bastiane at irit.fr (bastiane@irit.fr) Date: Fri, 15 Dec 1995 15:07:57 +0100 Subject: Call for papers for DYNN'96 Message-ID: <199512151407.PAA05193@irit.irit.fr> CALL FOR PAPERS FOR DYNN'96 International workshop on NEURAL NETWORKS DYNAMICS AND PATTERN RECOGNITION. Toulouse - France 12 and 13 March 1996 Organized by ONERA-CERT Sponsored by DRET of French MOD, US Air Force Scientific Research and Pole Universitaire Europeen de Toulouse. Organizers: Manuel SAMUELIDES (ONERA-CERT), Bernard DOYON (INSERM), Gregory TARR (US AF), Simon THORPE (CNRS). Practical Information: Emmanuel DAUCE (dauce at cert.fr) *********************** OBJECTIVES OF THE WORKSHOP. *************************** This workshop is designed to allow information exchange and discussion between theoretical scientists working on models of neuronal dynamics and engineers who are looking for efficient devices to process sensor information. Continuous activation state units as well as Integrate and Fire neurons or oscillators are elementary components of Dynamical Neural Networks. Attractor neural networks as well as transitory data-driven dynamics will be considered. The common feature of these models is the conversion of spatial information into a spatio-temporal data flow which allows specific processing. Mathematical models involved use dynamical systems and stochastic processes. They will be compared to the results of numerical simulations and the latest neuro-physiological data concerning the dynamics of biological neural nets. The main aim of the workshop is to encourage significant advances concerning the dynamics of biologically plausible neural networks and their applications to pattern recognition. 
*********************** ORGANIZATION OF THE WORKSHOP. ***************************** Scheduled talks will take place on the 12th and the 13th of March. There will be invited talks as well as submitted contributions. About 24 talks of 30 minutes will be scheduled with time for discussion and panels. Informal discussion and collective work may be scheduled on the 14th. Extended abstracts (one or two pages) of submitted contributions have to be sent for acceptance by e-mail to dauce at cert.fr or by post to Manuel Samuelides, DERI ONERA-CERT, BP 4025, 31055 Toulouse CEDEX, FRANCE. Provisional list of invited lecturers: J.P.AUBIN, M.COTTRELL, J.DEMONGEOT, J.DAYHOFF, G.DREYFUS, M.HIRSCH, J.TAYLOR. (This list will be completed) The number of attendees at the workshop is limited to 40 in order to allow lively exchange and real discussion. Copies of abstracts and slides will be provided to participants. The registration fees amount to FF 1,200 including 2 nights with American breakfast (11th and 12th) at a first-class hotel in downtown Toulouse (Holiday Inn, Crown Plaza), two lunches on the site of the workshop, the workshop banquet, transportation to and from CERT, coffee breaks, and the general costs of the workshop facilities and equipment. Payment should be made either by check payable to "AGENT COMPTABLE DU CERT ONERA" in French francs only or by bank transfer to "AGENT COMPTABLE DU CERT ONERA" Bank: Societe Generale Ramonville Saint Agne Account No. 30003 /02117/ 00037291008/93 Please state the workshop reference: DYNN'96 on all transactions. *********************** IMPORTANT DATES: **************** 15th of January: Deadline for contributions and declarations of interest. 31st of January: Notification of accepted contributions and distribution of the final programme of the workshop 15th of February: Deadline for registration for the workshop. 
To avoid postage delay, e-mail will be accepted as a usual communication. If you want to attend DYNN'96 please use your computer to reply at once -------------------------------------------------------------------------------- Name Organization Address e-mail ( ) wishes the information about the final program ( ) wishes to attend DYNN'96 ( ) will submit a contribution entitled: ----------------------------------------------------------------------------- Please send your reply to the following e-mail dauce at cert.fr or to xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x Professor Manuel SAMUELIDES x x DERI ONERA-CERT x x BP 4025 x x 31055 Toulouse CEDEX x x FRANCE x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Manuel SAMUELIDES ----------------------------------------------------------------- for research: Chercheur a l'ONERA-CERT samuelid at cert.fr for Teaching: Professeur a l'ENSAE Manuel.Samuelides at supaero.fr Tel: (33) 62 17 81 06 Fax: (33) 62 17 83 30 From lemm at LORENTZ.UNI-MUENSTER.DE Fri Dec 15 09:28:49 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Fri, 15 Dec 1995 15:28:49 +0100 Subject: NFL and practice Message-ID: <9512151428.AA24811@xtp141.uni-muenster.de> Huaiyu Zhu responded to >> One may discuss NFL for theoretical reasons, but >> the conditions under which NFL-Theorems hold >> are not those which are normally met in practice. and wrote >Exactly the opposite. The theory behind NFL is trivial (in some sense). >The power of NFL is that it deals directly with what is routinely >practiced in the neural network community today. That depends on how you understand practice. E.g. in nearly all cases functions are somewhat smooth. This is a prior which exists in reality (for example because of input noise in the measuring process). And the situation would be hopeless if we did not use this fact in practice. (That is just what NFL also says.) But, if Huaiyu means that it is necessary to think about the priors in "practice" explicitly, then I fully agree! 
But what I wanted to say is: WE DO HAVE "PRIORS" (BETTER SAID: CORRELATIONS BETWEEN ANSWERS TO DIFFERENT QUESTIONS) IN MOST CASES and they are NOT obscure, but very often at least as MEASURABLE as "normal" sharp data y_i=f(x_i). Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >Joerg continued with examples of various priors of practical concern, >including smoothness, symmetry, positive correlation, iid samples, etc. >These are indeed very important priors which match the real world, >and they are the implicit assumptions behind most algorithms. > >What NFL tells us is: If your algorithm is designed for such a prior, >then say so explicitly so that a user can decide whether to use it. >You can't expect it to be also good for any other prior which you have >not considered. In fact, in a sense, you should expect it to perform >worse than a purely random algorithm on those other priors. Maybe the problem is that Huaiyu Zhu uses the word "PRIOR" for all information which is not of the sharp data form y_i=f(x_i). It suggests that we know something before starting our generalizer. NO, that is not the normal case!!! I mentioned many examples (like measurement with input noise) where "priors" are just normal information which should be used DURING learning, like sharp data! (Sharp data might not be available at all!) And of course using wrong "priors" is similar to using wrong sharp data. But I fully agree that most algorithms use "prior" information only implicitly and that there is a lot of theoretical work to do. In response to >> In many interesting cases "effective" function values contain information >> about other function values and NFL does not hold! 
Huaiyu Zhu continues >This is like saying "In many interesting cases we do have energy sources, >and we can make a machine running forever, so the natural laws against >`perpetual motion machines' do not hold." Indeed, it is a little bit like that, but a system without energy sources is a much better approximation for some real-world systems than a world without "priors" (i.e. without correlated answers over different questions)! So the energy law is useful, but models for worlds without correlated information are NOT, except maybe that they tell us to include the correlation properly! Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From shastri at ICSI.Berkeley.EDU Fri Dec 15 16:34:24 1995 From: shastri at ICSI.Berkeley.EDU (Lokendra Shastri) Date: Fri, 15 Dec 1995 13:34:24 PST Subject: Technical report --- negated knowledge and inconsistency Message-ID: <199512152134.NAA06683@kulfi.ICSI.Berkeley.EDU> Dealing with negated knowledge and inconsistency in a neurally motivated model of memory and reflexive reasoning. Lokendra Shastri and Dean J. Grannes TR-95-041 ICSI August 1995 Recently, SHRUTI has been proposed as a connectionist model of rapid reasoning. It demonstrates how a network of simple neuron-like elements can encode a large number of specific facts as well as systematic knowledge (rules) involving n-ary relations, quantification and concept hierarchies, and perform a class of reasoning with extreme efficiency. The model, however, does not deal with negated facts and rules involving negated antecedents and consequents. We describe an extension of SHRUTI that can encode positive as well as negated knowledge and use such knowledge during reflexive reasoning. 
The extended model explains how an agent can hold inconsistent knowledge in its long-term memory without being ``aware'' that its beliefs are inconsistent, but detect a contradiction whenever inconsistent beliefs that are within a certain inferential distance of each other become co-active during an episode of reasoning. Thus the model is not logically omniscient, but detects contradictions whenever it tries to use inconsistent knowledge. The extended model also explains how limited attentional focus or action under time pressure can lead an agent to produce an erroneous response. A biologically significant feature of the model is that it uses only local inhibition to encode negated knowledge. Like the basic model, the extended model encodes and propagates dynamic bindings using temporal synchrony. Key Words: long-term memory; rapid reasoning; dynamic bindings; synchrony; knowledge representation; neural oscillations; short-term memory; negation; inconsistent knowledge. ftp-server: ftp.icsi.berkeley.edu (128.32.201.55) ftp-file: /pub/techreports/1995/tr-95-041.ps.Z Lokendra Shastri International Computer Science Institute 1947 Center Street, Suite 600 Berkeley, CA 94704 http://www.icsi.berkeley.edu/~shastri ========================== Detailed instructions for retrieving the report: unix% ftp ftp.icsi.berkeley.edu Name (ftp.icsi.berkeley.edu:): anonymous Password: your_name at your_machine ftp> cd /pub/techreports/1995 ftp> binary ftp> get tr-95-041.ps.Z ftp> quit unix% uncompress tr-95-041.ps.Z unix% lpr tr-95-041.ps If your name server does not know about ftp.icsi.berkeley.edu, use 128.32.201.55 instead. All files in this archive can also be obtained through an e-mail interface in case direct ftp is not available. To obtain instructions, send mail containing the line `send help' to: ftpmail at ICSI.Berkeley.EDU As a last resort, hardcopies may be ordered for a small fee. Send mail to info at ICSI.Berkeley.EDU for more information. 
From cherkaue at cs.wisc.edu Fri Dec 15 19:03:15 1995 From: cherkaue at cs.wisc.edu (cherkaue@cs.wisc.edu) Date: Fri, 15 Dec 1995 18:03:15 -0600 Subject: No free lunch for Cross Validation! Message-ID: <199512160003.SAA03324@mozzarella.cs.wisc.edu> In reply to Huaiyu Zhu's message > ... > >A little while ago someone claimed that > Cross validation will benefit from the presence of any structure, > and if there is no structure it does no harm; > > ... > >Suppose we have a Gaussian variable x, with mean mu and unit variance. >We have the following three estimators for estimating mu from a >sample of size n. > A: The sample mean. It is optimal both in the sense of Maximum >Likelihood and Least Mean Squares. > B: The maximum of sample. It is a bad estimator in any reasonable sense. > C: Cross validation to choose between A and B, with one extra data point. > >The numerical result with n=16 and averaged over 10000 samples, gives >mean squared error: > A: 0.0627 B: 3.4418 C: 0.5646 >This clearly shows that cross validation IS harmful in this case, >despite the fact it is based on a larger sample. NFL still wins! You forgot D: Anti-cross validation to choose between A and B, with one extra data point. I don't understand your claim that "cross validation IS harmful in this case." You seem to equate "harmful" with "suboptimal." Cross validation is a technique we use to guess the answer when we don't already know the answer. You give technique A the benefit of your prior knowledge of the true answer, but C must operate without this knowledge. A fair comparison would pit C against D, not C against A. As you say: >6. In any of the above cases, "anti cross validation" would be even >more disastrous. Kevin Cherkauer Computer Sciences Dept. 
University of Wisconsin-Madison cherkauer at cs.wisc.edu From pkso at castle.ed.ac.uk Sat Dec 16 10:06:41 1995 From: pkso at castle.ed.ac.uk (P Sollich) Date: Sat, 16 Dec 95 15:06:41 GMT Subject: Thesis on Query Learning available Message-ID: <9512161506.aa29855@uk.ac.ed.castle> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/sollich.thesis.tar.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: ASKING INTELLIGENT QUESTIONS --- THE STATISTICAL MECHANICS OF QUERY LEARNING Peter Sollich Department of Physics University of Edinburgh, U.K. Abstract: This thesis analyses the capabilities and limitations of query learning by using the tools of statistical mechanics to study learning in feed-forward neural networks. In supervised learning, one of the central questions is the issue of generalization: Given a set of training examples in the form of input-output pairs produced by an unknown {\em teacher} rule, how can one generate a {\em student} which {\em generalizes}, i.e., which correctly predicts the outputs corresponding to inputs not contained in the training set? The traditional paradigm has been to study learning from {\em random examples}, where training inputs are sampled randomly from some given distribution. However, random examples contain redundant information, and generalization performance can thus be improved by {\em query learning}, where training inputs are chosen such that each new training example will be maximally `useful' as measured by a given {\em objective function}. We examine two common kinds of queries, chosen to optimize the objective functions, generalization error and entropy (or information), respectively. 
Within an extended Bayesian framework, we use the techniques of statistical mechanics to analyse the average case generalization performance achieved by such queries in a range of learning scenarios, in which the functional forms of student and teacher are inspired by models of neural networks. In particular, we study how the efficacy of query learning depends on the form of teacher and student, on the training algorithm used to generate students, and on the objective function used to select queries. The learning scenarios considered are simple but sufficiently generic to allow general conclusions to be drawn. We first study perfectly learnable problems, where the student can reproduce the teacher exactly. From an analysis of two simple model systems, the high-low game and the linear perceptron, we conclude that query learning is much less effective for rules with continuous outputs -- provided they are `invertible' in the sense that they can essentially be learned from a finite number of training examples -- than for rules with discrete outputs. Queries chosen to minimize the entropy generally achieve generalization performance close to the theoretical optimum afforded by minimum generalization error queries, but can perform worse than random examples in scenarios where the training algorithm is under-regularized, i.e., has too much `confidence' in corrupted training data. For imperfectly learnable problems, we first consider linear students learning from nonlinear perceptron teachers and show that in this case the structure of the student space determines the efficacy of queries chosen to minimize the entropy in {\em student} space. Minimum {\em teacher} space queries, on the other hand, perform worse than random examples due to lack of feedback about the progress of the student. 
For students with discrete outputs, we find that in the absence of information about the teacher space, query learning can lead to self-confirming hypotheses far from the truth, misleading the student to such an extent that it will not approximate the teacher optimally even for an infinite number of training examples. We investigate how this problem depends on the nature of the noise process corrupting the training data, and demonstrate that it can be alleviated by combining query learning with Bayesian techniques of model selection. Finally, we assess which of our conclusions carry over to more realistic neural networks, by calculating finite size corrections to the thermodynamic limit results and by analysing query learning in a simple two-layer neural network. The results suggest that the statistical mechanics analysis is often relevant to real-world learning problems, and that the potentially significant improvements in generalization performance achieved by query learning can be made available, in a computationally cheap manner, for realistic multi-layer neural networks.

Criticism, comments and suggestions are welcome. Merry Christmas everyone! Peter Sollich -------------------------------------------------------------------------- Peter Sollich Department of Physics University of Edinburgh e-mail: P.Sollich at ed.ac.uk Kings Buildings phone: +44 - (0)131 - 650 5236 Mayfield Road Edinburgh EH9 3JZ, U.K. --------------------------------------------------------------------------

RETRIEVAL INSTRUCTIONS: Get `sollich.thesis.tar.Z' from the `Thesis' subdirectory of the neuroprose archive. Uncompress, and unpack the resulting tar file (on UNIX: uncompress sollich.thesis.tar.Z; tar xf - < sollich.thesis.tar). This will yield the postscript files listed below. Contact me if there are any problems with retrieval and/or printing. QUICK GUIDE for busy readers: For a first look, see sollich_title.ps (has abstract and table of contents).
File sollich_chapter1.ps contains a general introduction to query learning and an overview of the literature. Finally, for a summary of the main results and open questions, see sollich_chapter9.ps.

LIST OF FILES:
------------------------------------------------------------------------------
Filename                Pages   KB (compr./uncompr.)   Contents
------------------------------------------------------------------------------
sollich_title.ps            8    37/  75   Title, Declaration, Acknowledgements,
                                           Publications, Abstract, Table of contents
sollich_chapter1.ps         8    48/  98   Introduction
sollich_chapter2.ps        10    48/ 101   A probabilistic framework for query selection
sollich_chapter3.ps        21   128/ 376   Perfectly learnable problems: Two simple examples
sollich_chapter4.ps        19   135/ 337   Imperfectly learnable problems: Linear students
sollich_chapter5.ps        40   228/ 565   Query learning assuming the inference model is correct
sollich_chapter6.ps        12   244/1050   Combining query learning and model selection
sollich_chapter7.ps        20   217/ 558   Towards realistic neural networks I: Finite size effects
sollich_chapter8.ps        24   136/ 299   Towards realistic neural networks II: Multi-layer networks
sollich_chapter9.ps         5    31/  59   Summary and Outlook
------------------------------------------------------------------------------
sollich_bib.ps              8    37/  68   Bibliography
------------------------------------------------------------------------------

From zhuh at helios.aston.ac.uk Mon Dec 18 08:11:50 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Mon, 18 Dec 1995 13:11:50 +0000 Subject: NFL and practice Message-ID: <4332.9512181311@sun.aston.ac.uk>

I accidentally sent my reply to Joerg Lemm instead of Connectionists. Since he replied to Connectionists, I'll reply here as well, and include my original posting at the end. I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use. The key difference is: Is it legitimate to be vague about the prior? Put another way: Do you claim the algorithm can pick up whatever prior holds automatically, instead of having it specified beforehand? My answer is NO to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without a prior, as NFL tells us. For purely cosmetic reasons, it might be helpful to translate the useless "No Free Lunch Theorem" :-) ("Without specifying a particular prior, any algorithm is as good as random guessing") into the equivalent, but infinitely more useful, "You Have to Pay for Lunch Theorem" :-) ("For an algorithm to perform better than random guessing, a particular prior must be specified"). On a more practical level, > E.g. in nearly all cases functions are somewhat smooth. Do you specify the scale on which it is smooth? > This is a prior which exists in reality (for example because > of input noise in the measuring process). If you average smoothness over all scales in a certain uniform way, you get a prior which contains no smoothness at all. If you average them in a non-uniform way, you actually specify a non-uniform prior, which is the crucial piece of information for any algorithm to work at all.
> And the situation would be hopeless > if we did not use this fact in practice. It would still be hopeless if we only used the fact of "somewhat smooth", instead of specifying how smooth. See the following for theory and examples: Zhu, H. and Rohwer, R.: Bayesian regression filters and the issue of priors, 1995. To appear in Neural Computing and Applications. ftp://cs.aston.ac.uk/neural/zhuh/reg_fil_prior.ps.Z My original posting is enclosed as the following: ----- Begin Included Message -----

From imlm at tuck.cs.fit.edu Mon Dec 18 16:39:40 1995 From: imlm at tuck.cs.fit.edu (IMLM Workshop (pkc)) Date: Mon, 18 Dec 1995 16:39:40 -0500 Subject: CFP: AAAI-96 Workshop on Integrating Multiple Learned Models Message-ID: <199512182139.QAA10740@tuck.cs.fit.edu>

CALL FOR PAPERS/PARTICIPATION INTEGRATING MULTIPLE LEARNED MODELS FOR IMPROVING AND SCALING MACHINE LEARNING ALGORITHMS to be held in conjunction with AAAI 1996 Portland, Oregon August 1996

Most modern machine learning research uses a single model or learning algorithm at a time, or at most selects one model from a set of candidate models. Recently, however, there has been considerable interest in techniques that integrate the collective predictions of a set of models in some principled fashion. With such techniques, the predictive accuracy and/or the training efficiency of the overall system can often be improved, since one can "mix and match" among the relative strengths of the models being combined. The goal of this workshop is to gather researchers actively working in the area of integrating multiple learned models, to exchange ideas and foster collaborations and new research directions. In particular, we seek to bring together researchers interested in this topic from the fields of Machine Learning, Knowledge Discovery in Databases, and Statistics. Any aspect of integrating multiple models is appropriate for the workshop.
However, we intend the focus of the workshop to be improving prediction accuracies, and improving training performance in the context of large training databases. More precisely, submissions are sought in, but not limited to, the following topics: 1) Techniques that generate and/or integrate multiple learned models. In particular, techniques that do so by: * using different training data distributions (in particular by training over different partitions of the data) * using different output classification schemes (for example using output codes) * using different hyperparameters or training heuristics (primarily as a tool for generating multiple models) 2) Systems and architectures to implement such strategies. In particular: * parallel and distributed multiple learning systems * multi-agent learning over inherently distributed data A paper need not be submitted to participate in the workshop, but space may be limited so contact the organizers as early as possible if you wish to participate. The workshop format is planned to encompass a full day of half hour presentations with discussion periods, ending with a brief period for summary and discussion of future activities. Notes or proceedings for the workshop may be provided, depending on the submissions received. Submission requirements: i) A short paper of not more than 2000 words detailing recent research results must be received by March 18, 1996. ii) The paper should include an abstract of not more than 150 words, and a list of keywords. Please include the name(s), email address(es), address(es), and phone number(s) of the author(s) on the first page. The first author will be the primary contact unless otherwise stated. iii) Electronic submissions in postscript or ASCII via email are preferred. Three printed copies (preferably double-sided) of your submission are also accepted. iv) Please also send the title, name(s) and email address(es) of the author(s), abstract, and keywords in ASCII via email.
Submission address: imlm at cs.fit.edu Philip Chan IMLM Workshop Computer Science Florida Institute of Technology 150 W. University Blvd. Melbourne, FL 32901-6988 407-768-8000 x7280 (x8062) 407-984-8461 (fax) Important Dates: Paper submission deadline: March 18, 1996 Notification of acceptance: April 15, 1996 Final copy: May 13, 1996 Chairs: Salvatore Stolfo, Columbia University sal at cs.columbia.edu David Wolpert, Santa Fe Institute dhw at santafe.edu Philip Chan, Florida Institute of Technology pkc at cs.fit.edu General Inquiries: Please address general inquiries to one of the co-chairs or send them to: imlm at cs.fit.edu Up-to-date workshop information is maintained on WWW at: http://cs.fit.edu/~imlm/ or http://www.cs.fit.edu/~imlm/ From ces at negi.riken.go.jp Mon Dec 18 20:36:45 1995 From: ces at negi.riken.go.jp (ces@negi.riken.go.jp) Date: Tue, 19 Dec 95 10:36:45 +0900 Subject: PhD Thesis Announcement : nonlinear filters Message-ID: <9512190136.AA21982@negi.riken.go.jp>  FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: (Sorry, no hardcopies available.) - ----------------------------------------------------------------------- Applications of nonlinear filters with the linear-in-the-parameter structure Eng-Siong CHNG Department of Electrical Engineering University of Edinburgh, U.K. Abstract: The subject of this thesis is the application of nonlinear filters, with the linear-in-the-parameter structure, to time series prediction and channel equalisation problems. In particular, the Volterra and the radial basis function (RBF) expansion techniques are considered to implement the nonlinear filter structures. These approaches, however, will generate filters with very large numbers of parameters. 
As large filter models require significant implementation complexity, they are undesirable for practical implementations. To reduce the size of the filter, the orthogonal least squares (OLS) algorithm is considered to perform model selection. Simulations were conducted to study the effectiveness of subset models found using this algorithm, and the results indicate that this selection technique is adequate for many practical applications. The other aspect of the OLS algorithm studied is its implementation requirements. Although the OLS algorithm is very efficient, the required computational complexity is still substantial. To reduce the processing requirement, some fast OLS methods are examined. Two major applications of nonlinear filters are considered in this thesis. The first involves the use of nonlinear filters to predict time series which possess nonlinear dynamics. To study the performance of the nonlinear predictors, simulations were conducted to compare the performance of these predictors with conventional linear predictors. The simulation results confirm that nonlinear predictors normally perform better than linear predictors. Within this study, the application of RBF predictors to time series that exhibit homogeneous nonstationarity is also considered. This type of time series possesses the same characteristic throughout the time sequence apart from local variations of mean and trend. The second application involves the use of filters for symbol-decision channel equalisation. The decision function of the optimal symbol-decision equaliser is first derived to show that it is nonlinear, and that it may be realised explicitly using a RBF filter. Analysis is then carried out to illustrate the difference between the optimum equaliser's performance and that of the conventional linear equaliser. In particular, the effects of delay order on the equaliser's decision boundaries and bit error rate (BER) performance are studied. 
The minimum mean square error (MMSE) optimisation criterion for training the linear equaliser is also examined to illustrate the sub-optimum nature of such a criterion. To improve the linear equaliser's performance, a method which adapts the equaliser by minimising the BER is proposed. Our results indicate that the linear equaliser's performance is normally improved by using the minimum BER criterion. The decision feedback equaliser (DFE) is also examined. We propose a transformation using the feedback inputs to change the DFE problem to a feedforward equaliser problem. This unifies the treatment of the equaliser structures with and without decision feedback. -----------------------------------------------------------

Criticism, comments and suggestions are welcome. Merry Christmas everyone! Eng Siong - -------------------------------------------------------------------------- Eng Siong CHNG Lab. for ABS, Frontier Research Programme, RIKEN, email : ces at negi.riken.go.jp 2-1 Hirosawa, Wako-Shi, Saitama 351-01, JAPAN. - --------------------------------------------------------------------------

RETRIEVAL INSTRUCTIONS: FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z File size : 1715073 bytes Number of pages : 165 pages unix> ftp archive.cis.ohio-state.edu Connected to archive.cis.ohio-state.edu. 220 archive.cis.ohio-state.edu FTP server ready. Name: anonymous 331 Guest login ok, send ident as password. Password:neuron 230 Guest login ok, access restrictions apply. ftp> binary 200 Type set to I. ftp> cd pub/neuroprose/Thesis 250 CWD command successful. ftp> get chng.thesis.ps.Z 200 PORT command successful. 150 Opening BINARY mode data connection for chng.thesis.ps.Z 226 Transfer complete. ftp> quit 221 Goodbye. unix> uncompress chng.thesis.ps.Z unix> lpr chng.thesis.ps (postscript printer) Contact me if there are any problems with retrieval and/or printing.
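To illustrate the linear-in-the-parameter structure and subset selection that the abstract describes, here is a minimal greedy forward selection over a Gaussian RBF expansion. This is a simplified sketch rather than the thesis's actual OLS algorithm: the centre placement, basis width, and the select-by-residual rule are all assumptions made for the example.

```python
import numpy as np

def rbf_design(x, centres, width=0.5):
    """Design matrix: one Gaussian basis function per candidate centre."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

def forward_select(Phi, y, n_terms):
    """Greedily add the candidate column that most reduces the residual
    sum of squares (a simplified stand-in for OLS forward selection)."""
    chosen = []
    for _ in range(n_terms):
        best_j, best_rss = None, np.inf
        for j in range(Phi.shape[1]):
            if j in chosen:
                continue
            cols = Phi[:, chosen + [j]]
            w, *_ = np.linalg.lstsq(cols, y, rcond=None)
            rss = float(np.sum((y - cols @ w) ** 2))
            if rss < best_rss:
                best_j, best_rss = j, rss
        chosen.append(best_j)
    w, *_ = np.linalg.lstsq(Phi[:, chosen], y, rcond=None)
    return chosen, w

rng = np.random.default_rng(1)
x = np.linspace(-3.0, 3.0, 80)
y = np.sin(x) + 0.05 * rng.standard_normal(80)
centres = np.linspace(-3.0, 3.0, 20)   # 20 candidate RBF centres
Phi = rbf_design(x, centres)
chosen, w = forward_select(Phi, y, n_terms=6)
rss = float(np.sum((y - Phi[:, chosen] @ w) ** 2))
```

The point of the linear-in-the-parameter structure is visible here: once the basis functions are fixed, fitting the weights is an ordinary least-squares problem, and model-size reduction amounts to choosing a small subset of the candidate columns.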
------- End of Forwarded Message

From hag at santafe.edu Mon Dec 18 21:22:57 1995 From: hag at santafe.edu (Howard A. Gutowitz) Date: Mon, 18 Dec 1995 19:22:57 -0700 (MST) Subject: Exploring the Space of CA Message-ID: <9512190222.AA29140@sfi.santafe.edu>

Announcing: "Exploring the Space of Cellular Automata" Cellular automata can be thought of as a restricted kind of neural net, in which the cells take on only a finite set of values, and connections are local and regular. This is a set of interactive web pages designed to help you learn about CA, and the use of the lambda parameter to find critical regions in the space of CA. Credits: Concept: Chris Langton CA simulation program: Patrick Hayden. cgi interface: Eric Carr. Text: Chris Langton, Howard Gutowitz, and Eric Carr. Available from: http://alife.santafe.edu/alife/topics/ca/caweb -- Howard Gutowitz | hag at neurones.espci.fr ESPCI | http://www.santafe.edu/~hag Laboratoire d'Electronique | home: (331) 4707-3843 10 rue Vauquelin | office: (331) 4079-4697 75005 Paris, France | fax: (331) 4079-4425

From hicks at cs.titech.ac.jp Mon Dec 18 23:58:07 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 19 Dec 1995 13:58:07 +0900 Subject: NFL, practice, and CV Message-ID: <199512190458.NAA28669@euclid.cs.titech.ac.jp>

Huaiyu Zhu wrote: >You can't make every term positive in your balance sheet, if the grand >total is bound to be zero. There ARE functions which are always non-negative, but which under an appropriate measure integrate to 0. It only requires that 1) the support of the positive values is vanishingly small, 2) the positive values are bounded So the above statement by Dr. Zhu is not true. In fact I think this ability for pointwise positive values to disappear under integration is key to the "zero-sum" aspect of the NFL theorem holding true, despite the fact that we obviously see so many examples of working algorithms.
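Hicks's two conditions can be made concrete with a standard measure-theory example (an added illustration, not from the thread: the indicator function of the rationals):

```latex
f(x) =
\begin{cases}
  1, & x \in \mathbb{Q} \cap [0,1], \\
  0, & \text{otherwise},
\end{cases}
\qquad \text{yet} \qquad
\int_0^1 f \, d\mu = 0 \quad \text{(Lebesgue)},
```

since $f$ is non-negative and bounded, positive at infinitely many points, but the set on which it is positive has Lebesgue measure zero, so the integral vanishes.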
My key point: A zero-sum (infinite) universe doesn't require negative values. ---- There is another important issue which needs to be clarified, and that is the definition of CV and the kinds of problems to which it can be applied. Now anybody can make whatever definition they want, and then come to some conclusions based upon that definition, and that conclusion may be correct given that definition. However, there are also advantages to sharing a common intellectual currency. I quote below from "An Introduction to the Bootstrap" by Efron and Tibshirani, 1993, Chapter 17.1. It describes well what I meant when I talked about monitoring prediction error in a previous posting, and describes CV as a method for doing that. ================================================== In our discussion so far we have focused on a number of measures of statistical accuracy: standard errors, biases, and confidence intervals. All of these are measures of accuracy for parameters of a model. Prediction error is a different quantity that measures how well a model predicts the response value of a future observation. It is often used for model selection, since it is sensible to choose a model that has the lowest prediction error among a set of candidates. Cross-validation is a standard tool for estimating prediction error. It is an old idea (predating the bootstrap) that has enjoyed a comeback in recent years with the increase in available computing power and speed. In this chapter we discuss cross-validation, the bootstrap, and some other closely related techniques for estimation of prediction error. In regression models, prediction error refers to the expected squared difference between a future response and its prediction from the model: PE = E(y - \hat{y})^2. The expectation refers to repeated sampling from the true population. Prediction error also arises in the classification problem, where the response falls into one of k unordered classes.
For example, the possible responses might be Republican, Democrat, or Independent in a political survey. In classification problems prediction error is commonly defined as the probability of an incorrect classification PE = Prob(\hat{y} \neq y), also called the misclassification rate. The methods described in this chapter apply to both definitions of prediction error, and also to others. ================================================== Craig Hicks Tokyo Institute of Technology

From zhuh at helios.aston.ac.uk Tue Dec 19 10:14:20 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Tue, 19 Dec 1995 15:14:20 +0000 Subject: NFL, practice, and CV Message-ID: <8208.9512191514@sun.aston.ac.uk>

This is in reply to the criticism by Craig Hicks and Kevin Cherkauer, and will be my last posting in this thread. Craig Hicks thought that my statement (A) > >You can't make every term positive in your balance sheet, if the grand > >total is bound to be zero. contradicts his statement (B) > There ARE functions which are always non-negative, but which under > an appropriate measure integrate to 0. > It only requires that > > 1) the support of the positive values is vanishingly small, > 2) the positive values are bounded But they are actually talking about different things. There is a big difference between positive and non-negative. For all practical purposes, the functions described by (B) can be regarded as identically zero. Translating back to the original topic, statement (B) becomes (C) There are algorithms which are always no worse than random guessing, on any prior, provided that 1) the priors on which they perform better than random guessing have zero probability of occurring in practice, and 2) they cannot be infinitely better on these priors. It is true that something improbable may still be possible, but this is only of academic interest.
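Zhu's unit-Gaussian counter-example from earlier in this thread is easy to reproduce numerically. The sketch below is my reconstruction, not Zhu's code, and the exact figures vary with the random seed; it compares A (sample mean), B (sample maximum), and C (cross-validation choosing between A and B on one extra held-out point):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, n, trials = 0.0, 16, 10000
err_a = err_b = err_c = 0.0

for _ in range(trials):
    x = rng.normal(mu, 1.0, size=n)
    extra = rng.normal(mu, 1.0)      # the one extra point used by C
    a, b = x.mean(), x.max()         # estimator A and estimator B
    # C: cross-validate, i.e. keep whichever estimate fits the
    # held-out point better.
    c = a if abs(a - extra) <= abs(b - extra) else b
    err_a += (a - mu) ** 2
    err_b += (b - mu) ** 2
    err_c += (c - mu) ** 2

mse_a, mse_b, mse_c = err_a / trials, err_b / trials, err_c / trials
# mse_a sits near the theoretical 1/16 = 0.0625; mse_b is large;
# mse_c lands in between: worse than A despite the extra data point.
```

This reproduces the ordering Zhu reported (roughly 0.06, 3.4, and 0.56 for A, B, C): cross-validation is far better than always trusting B, but strictly worse than an estimator that simply uses the prior knowledge that the sample mean is optimal.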
In most modern treatments of function spaces, functions are only identified up to a set of measure zero, so that phrases like "almost everywhere" or "almost surely" are redundant. I suspect that, due to the way the NFL theorems are proved, even (C) is impossible, but this does not matter anyway, because (C) itself is of no practical interest whatsoever. > ---- Considering cross validation, Craig wrote > > There is another important issue which needs to be clarified, and that is the > definition of CV and the kinds of problems to which it can be applied. Now > anybody can make whatever definition they want, and then come to some > conclusions based upon that definition, and that conclusion may be correct > given that definition. However, there are also advantages to sharing a common > intellectual currency. Risking a little over-simplification, I would like to summarise the two usages of CV as the following: (CV1) A method for evaluating estimates. (CV2) A method for evaluating estimators. The key difference is that in (CV1) a decision is made for each sample, while in (CV2) a decision is made for all samples. If (CV1) is applied to two algorithms A and B, then we can always define a third algorithm C, by always choosing the estimate given by either A or B which is favoured by (CV1). But my previous counter-example shows that, averaging over all samples, C can be worse than A. One may seek refuge in statements like "optimal decision for each sample does not mean optimal decision for all samples". Well, such incoherent inference is the defining characteristic of non-Bayesian statistics. In Bayesian decision theory it is well known that a method is optimal iff it is optimal on almost all samples (excluding various measure-zero anomalies). The case of (CV2) is quite different. It is of a higher level than algorithms like A and B.
It is in fact a statistical estimator mapping (D,A,f) to a real number r, where D is a finite data set, A is a given algorithm, f is an objective function, and r is the predicted average performance. It should therefore be compared with other such methods. This appears not to be a topic considered in this discussion. -------------- Kevin Cherkauer wrote > > You forgot > > D: Anti-cross validation to choose between A and B, with one extra data > point. Well, I did not forget that, as you have quoted below, point 6. > > I don't understand your claim that "cross validation IS harmful in this case." > You seem to equate "harmful" with "suboptimal." See my original answer, points 1. and 4. > Cross validation is a technique > we use to guess the answer when we don't already know the answer. This is true for any statistical estimator. > You give > technique A the benefit of your prior knowledge of the true answer, but C must > operate without this knowledge. The prior knowledge is that the distribution is a unit Gaussian with unspecified mean; the true answer is its mean. No, they are not the same thing. C also operates with the knowledge that the distribution is a unit Gaussian, but it refuses to use this knowledge (which implies A is better than B). Instead, it insists on evaluating A and B on a cross-validation set. That's why it performs miserably. > A fair comparison would pit C against D, not C > against A. As you say: > > >6. In any of the above cases, "anti cross validation" would be even > >more disastrous. If the definition were that "An algorithm is good if it is no worse than the worst algorithm", then I would have no objection. Well, almost any algorithm would be good in this sense. However, if the phrase "in any of the above cases" is dropped without putting a prior restriction in as a remedy, then it is also true that every algorithm is as bad as the worst algorithm. Huaiyu PS.
I think I have already talked enough about this subject so I'll shut up from now on, unless there's anything new to say. More systematic treatment of these subjects instead of counter-examples can be found in the ftp site below. -- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From minton at ISI.EDU Tue Dec 19 14:53:27 1995 From: minton at ISI.EDU (minton@ISI.EDU) Date: Tue, 19 Dec 95 11:53:27 PST Subject: JAIR article Message-ID: <9512191953.AA11913@sungod.isi.edu> Readers of this mailing list may be interested in the following JAIR article, which was just published: Weiss, S.M. and Indurkhya, N. (1995) "Rule-based Machine Learning Methods for Functional Prediction", Volume 3, pages 383-403. PostScript: volume3/weiss95a.ps (527K) compressed, volume3/weiss95a.ps.Z (166K) Abstract: We describe a machine learning method for predicting the value of a real-valued function, given the values of multiple input variables. The method induces solutions from samples in the form of ordered disjunctive normal form (DNF) decision rules. A central objective of the method and representation is the induction of compact, easily interpretable solutions. This rule-based decision model can be extended to search efficiently for similar cases prior to approximating function values. Experimental results on real-world data demonstrate that the new techniques are competitive with existing machine learning and statistical methods and can sometimes yield superior regression performance. 
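As a toy illustration of the ordered-rule representation the Weiss and Indurkhya abstract describes (a sketch under my own assumptions, not their actual induction method or rule format), an ordered DNF rule list for functional prediction can be evaluated like this:

```python
def predict(rules, default, x):
    """Evaluate an ordered rule list: the first rule whose conjunction
    of interval tests is satisfied supplies the predicted value."""
    for conditions, value in rules:
        if all(low <= x[i] <= high for i, low, high in conditions):
            return value
    return default

# Hypothetical induced rules: (conditions, prediction), where each
# condition is an interval test (feature index, low, high).
rules = [
    ([(0, 0.0, 1.0), (1, 5.0, 10.0)], 3.2),
    ([(0, 1.0, 2.0)], 1.5),
]
first = predict(rules, 0.0, [0.5, 7.0])   # both tests of rule 1 pass
second = predict(rules, 0.0, [1.5, 0.0])  # rule 1 fails, rule 2 fires
none = predict(rules, 0.0, [9.0, 9.0])    # nothing fires, use default
```

The ordering is what makes such rule lists compact and easy to interpret: each rule only needs to describe the cases not already covered by the rules above it.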
The PostScript file is available via: -- comp.ai.jair.papers -- World Wide Web: The URL for our World Wide Web server is http://www.cs.washington.edu/research/jair/home.html -- Anonymous FTP from either of the two sites below: CMU: p.gp.cs.cmu.edu directory: /usr/jair/pub/volume3 Genoa: ftp.mrg.dist.unige.it directory: pub/jair/pub/volume3 -- automated email. Send mail to jair at cs.cmu.edu or jair at ftp.mrg.dist.unige.it with the subject AUTORESPOND, and the body GET VOLUME3/FILE-NM (e.g., GET VOLUME3/MOONEY95A.PS) Note: Your mailer might find our files too large to handle. Also, note that compressed files cannot be emailed, since they are binary files. -- JAIR Gopher server: At p.gp.cs.cmu.edu, port 70. For more information about JAIR, check out our WWW or FTP sites, or send electronic mail to jair at cs.cmu.edu with the subject AUTORESPOND and the message body HELP, or contact jair-ed at ptolemy.arc.nasa.gov. From lucas at scr.siemens.com Tue Dec 19 12:26:15 1995 From: lucas at scr.siemens.com (Lucas Parra) Date: Tue, 19 Dec 1995 12:26:15 -0500 (EST) Subject: Preprint: Symplectic Nonlinear Component Analysis Message-ID: <199512191726.MAA04146@owl.scr.siemens.com> Dear fellow connectionists, a preprint of the following NIPS*95 paper is available at: ftp://archive.cis.ohio-state.edu/pub/neuroprose/parra.nips95.ps.Z Symplectic Nonlinear Component Analysis Lucas C. Parra Siemens Corporate Research lucas at scr.siemens.com Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linear correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linear statistical dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis finally should find a factorial representation for nonlinear statistical dependent distributed signals. 
This paper proposes for this task a novel feed-forward, information-conserving nonlinear map: explicit symplectic transformations. It also solves the problem of non-Gaussian output distributions by considering single-coordinate higher-order statistics. From jlm at crab.psy.cmu.edu Wed Dec 20 18:16:31 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Wed, 20 Dec 95 18:16:31 EST Subject: Technical Report Available Message-ID: <9512202316.AA19275@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available electronically from our FTP server or in hard copy form. Instructions for obtaining copies may be found at the end of this post. ======================================================================== On the Time Course of Perceptual Choice: A Model Based on Principles of Neural Computation Marius Usher & James L. McClelland Carnegie Mellon University and the Center for the Neural Basis of Cognition Technical Report PDP.CNS.95.5 December 1995 The time course of information processing is discussed in a model based on leaky, stochastic, non-linear accumulation of activation in mutually inhibitory processing units. The model addresses data from choice tasks using both time-controlled (e.g., deadline or response signal) and standard reaction time paradigms, and accounts simultaneously for aspects of data from both paradigms. In special cases, the model becomes equivalent to a classical diffusion process, but in general a more complex type of diffusion occurs. Mutual inhibition counteracts the effects of information leakage, allows flexible choice behavior regardless of the number of alternatives, and contributes to accounts of additional data from tasks requiring choice with conflict stimuli and word identification tasks.
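The dynamics the abstract describes - leaky, stochastic accumulation of activation in mutually inhibitory units, raced to a threshold - can be sketched as a small simulation. This is a generic leaky competing accumulator in the spirit of the abstract, not the authors' code; all numerical parameter values here are invented for illustration:

```python
import random
random.seed(1)

def simulate(inputs, leak=0.2, inhibition=0.3, dt=0.1,
             noise=0.3, threshold=1.0, max_steps=10000):
    # One trial of leaky, stochastic, mutually inhibitory accumulation.
    # Returns (index of the winning unit, number of steps taken).
    x = [0.0] * len(inputs)
    for step in range(max_steps):
        total = sum(x)
        x = [max(0.0,  # activations are clipped at zero (the nonlinearity)
                 xi + (rho - leak * xi - inhibition * (total - xi)) * dt
                 + noise * random.gauss(0.0, 1.0) * dt ** 0.5)
             for xi, rho in zip(x, inputs)]
        if max(x) >= threshold:
            return x.index(max(x)), step
    return x.index(max(x)), max_steps

# Unit 0 receives twice the input of unit 1, so it should win most races.
wins = sum(simulate([0.4, 0.2])[0] == 0 for _ in range(500))
print(wins, "of 500 races won by the better-supported unit")
```

With equal inputs the two units win equally often by symmetry; the noise term is what produces a distribution of decision times rather than a fixed one, and the mutual-inhibition term is what makes the race competitive rather than two independent accumulators.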
====================================================================== Retrieval information for pdp.cns TRs: unix> ftp 128.2.248.152 # hydra.psy.cmu.edu Name: anonymous Password: ftp> cd pub/pdp.cns ftp> binary ftp> get pdp.cns.95.5.ps.Z # gets this tr ftp> quit unix> zcat pdp.cns.95.5.ps.Z | lpr # or however you print postscript NOTE: The compressed file is 567,075 bytes long. Uncompressed, the file is 1,768,398 bytes long. The printed version is 53 total pages long. For those who do not have FTP access, physical copies can be requested from Barbara Dorney. For a list of available PDP.CNS Technical Reports: > get README For the titles and abstracts: > get ABSTRACTS From dhw at santafe.edu Wed Dec 20 20:00:48 1995 From: dhw at santafe.edu (David Wolpert) Date: Wed, 20 Dec 95 18:00:48 MST Subject: NFL once again, I'm afraid Message-ID: <9512210100.AA06007@sfi.santafe.edu> First and foremost, I would like to request that this NFL thread fade out. It is only sowing confusion - people should read the papers on NFL to understand NFL. [[ Moderator's note: I concur. We've had enough "No Free Lunch" discussion for a while; people are starting to protest. Future discussion should be done in email. -- Dave Touretzky, CONNECTIONISTS moderator ]] Full stop. *After* that, after there is common grounding, we can all debate. There is much else that connectionist is more appropriate for in the meantime. (To repeat: ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z.) Please, I'm on my knees, use the time that would have been spent thrashing at connectionist in a more fruitful fashion. Like by reading the NFL papers.
:-) *** Hicks writes: >>> case 1: * Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". >>> This is wrong. Construct the following algorithm: "If CV says one of the algorithms under consideration has particularly low error in comparison to the other, use that algorithm. Otherwise, choose randomly among the algorithms." Averaged over all targets, this will do exactly as well as the algorithm that always guesses randomly among the algorithms. (For zero-one loss, either OTS error or IID error with a big input space, etc.) So you cannot rely on CV's error estimate *at all* (unless you impose a prior over targets or some such, etc.). Alternatively, keep in mind the following simple argument: In its uniform prior(targets) formulation, NFL holds even for error distributions conditioned on *any* property of the training set. So in particular, you can condition on having a training set for which CV says "yep, I'm sure; choose that one". And NFL still holds. So even in those cases where CV "is sure", by following CV, you'll die as often as not. >>> case 2: * The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). >>> This is wrong. As has been mentioned many times, having structure in the target, by itself, gains you nothing. And as has also been mentioned, if "the candidate functions are positively correlated with the target function", then in fact *anti-CV wins*. READ THE PAPERS. >>> By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function.
This is different from the ensemble and distribution of target functions, which is a much bigger question. >>> This distinction is irrelevant. There are versions of NFL that address both of these cases (as well as many others). READ THE PAPERS. ***** Lemm writes: >>> 1.) In short, NFL assumes that data, i.e. information of the form y_i=f(x_i), do not contain information about function values on a non-overlapping test set. >>> This is wrong. See all the previous discussion about how NFL holds even if you restrict yourself to targets with a lot of structure. The problem is that the structure can hurt just as easily as help. There is no need for the data set to contain no information about the test set - simply that the limited types of information can "confuse" the learning algorithm at hand. READ THE PAPERS. >>> This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... >>> This is wrong. There is (obviously) a version of NFL that holds for uniform priors. And there is another version in which one averages over all priors - so the uniform prior has measure 0. But one can also restrict oneself to average only over those priors "with a lot of structure", and again get NFL. And there are many other versions of NFL in which there is *no* prior, because things are conditioned on a fixed target. Exactly as in (non-Bayesian) sampling theory statistics. Some of those alternative NFL results involve saying "if you're conditioning on a target, there are as many such targets where you die as where you do well". Other NFL results never vary the target *in any sense*, even to compare different targets. Rather they vary something concerning the generalizer. This is the case with the more sophisticated xvalidation results, for example. READ THE PAPERS. >>> There is much information which is not of this "single sharp data" type. (For examples, see below.)
>>> *Obviously* if you have extra information and/or knowledge beyond that in the training set, you can (often) do better than randomly. That's what Bayesian analysis is all about. More generally, as I have proven in [1], the probability of error can be written as a non-Euclidean inner product between the learning algorithm and the posterior. So obviously if your posterior is structured in an appropriate manner, that can be exploited by the algorithm. This was never the issue, however. The issue had to do with "blind" supervised learning, in which one has no such additional information. Like in COLT, for example. You're arguing apples and oranges here. >>> 4) Real measurements (especially of continuous variables) normally do also NOT have the form y_i=f(x_i) ! They mostly perform some averaging over f(x_i) or at least they have some noise on the x_i (as small as you like, but present). >>> Again, this is obvious. And stated explicitly in the papers, moreover. And completely irrelevant to the current discussion. The issue at hand has *always* been "sharp" data. And if you look at what's done in the neural net community, or in COLT, 95% of it assumes "sharp data". Indeed, there are many other assumptions almost always made and almost never true that Lemm has missed. Like making a "weak filtering assumption": assume the target and the distribution over inputs are independent. But again, just like in COLT, we're starting simple here, with such assumptions intact. READ THE PAPERS. >>> This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. >>> To give one simple example, what about with categorical data, where there is not even a partial ordering over the inputs? What does "locally smooth" even mean then?
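The averaged-over-all-targets claims earlier in this message can be checked by brute force on a toy problem. The following sketch is my construction, not anything from the NFL papers: it enumerates all 32 binary targets on a five-point input space, trains on three fixed points with noise-free "sharp" data, and compares average off-training-set (OTS) error for a majority-vote learner, its "anti" counterpart, and a learner that chooses between them by leave-one-out cross-validation. All three come out at exactly 0.5:

```python
from itertools import product

TRAIN = [0, 1, 2]   # fixed training inputs ("sharp", noise-free data)
TEST = [3, 4]       # off-training-set (OTS) inputs

def majority(bits):
    # Majority vote over a list of 0/1 labels; ties go to 1.
    return int(2 * sum(bits) >= len(bits))

def alg_a(labels):      # predict the training-set majority on every OTS input
    return majority(labels)

def alg_b(labels):      # "anti" learner: predict the training-set minority
    return 1 - majority(labels)

def alg_cv(labels):     # choose between alg_a and alg_b by leave-one-out CV
    def loo_err(alg):
        return sum(alg(labels[:i] + labels[i + 1:]) != labels[i]
                   for i in range(len(labels)))
    return alg_a(labels) if loo_err(alg_a) <= loo_err(alg_b) else alg_b(labels)

def avg_ots_error(alg):
    # Average OTS error over every binary target on the 5-point input space.
    targets = list(product([0, 1], repeat=5))
    total = 0.0
    for f in targets:
        guess = alg([f[i] for i in TRAIN])
        total += sum(guess != f[i] for i in TEST) / len(TEST)
    return total / len(targets)

print(avg_ots_error(alg_a), avg_ots_error(alg_b), avg_ots_error(alg_cv))
# All three print 0.5: under the uniform average over targets,
# CV-based selection buys nothing off the training set.
```

The stronger point - conditioning on training sets where CV "is sure" - can be checked the same way, by filtering the targets on a training-set property before averaging: the OTS labels remain uniform in the filtered average, so the error stays at 0.5.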
And even if we're dealing with real valued spaces, if there's input space noise, NFL simply changes to be a statement concerning test set elements that are sufficiently far (on the scale of the input space noise) from the elements of the training set. The input space noise makes the math more messy, but doesn't change the underlying phenomenon. (Readers interested in previous work on the relationship between local (!) regularization, smoothness, and input noise should see Bishop's Neural Computation article of about 6 months ago.) >>> Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >>> Sigh. 1) I am a Bayesian whenever feasible. (In fact, I've been taken to task for being "too Bayesian".) But situations without obvious priors - or where eliciting the priors is not trivial and you don't have the time - are in fact *very* common. A simple example is a project I am currently involved in, detecting phone fraud for MCI. Quick, tell me the prior probability that a fraudulent call arises from area code 617 vs. the prior probability that a non-fraudulent call does... 2) Essentially all of COLT is non-Bayesian. (Although some of it makes assumptions about things like the support of the priors.) You haven't a prayer of really understanding what COLT has to say without keeping in mind the admonitions of NFL. 3) As I've now said until I'm blue in the face, NFL is only the starting point. What it's "good for", beyond proving to people that they must pay attention to their assumptions, be wary of COLT-type claims, etc. is: head-to-head minimax theory, scrambled algorithms theory, hypothesis-averaging theory, etc., etc., etc. READ THE PAPERS. **** Zhu writes: >>> I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use.
The key difference is: Is it legitimate to be vague about the prior? Put it another way: Do you claim the algorithm can pick up whatever prior automatically, instead of its being specified beforehand? My answer is NO, to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without prior, as NFL told us. >>> Yes! Everybody, LISTEN TO ZHU!!!! David Wolpert [1] - Wolpert, D. "The Relationship Between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in "The Mathematics of Generalization", D. Wolpert (Ed.), Addison-Wesley, 1995 From terry at salk.edu Wed Dec 20 20:34:15 1995 From: terry at salk.edu (Terry Sejnowski) Date: Wed, 20 Dec 95 17:34:15 PST Subject: Senior Position at GSU Message-ID: <9512210134.AA16333@salk.edu> Forwarded to Connectionists: Date: Mon, 18 Dec 1995 15:00:23 -0500 (EST) From: Donald Edwards Subject: job Dear friends and colleagues, I am writing to let you know of a senior position in computational neuroscience available here in the Department of Biology at Georgia State University. This person would join neurobiologists, physicists, mathematicians and computer scientists in the newly established Center for Neural Communication and Computation, and would participate in the graduate program in Neurobiology in the Department of Biology. This person would also help guide the construction, equipping and staffing of a Laboratory for Computational Neuroscience for which funds have already been obtained from the Georgia Research Alliance. Georgia State University is located in downtown Atlanta. For more information, please contact me at this address, or call at (404) 651-3148. To apply, please send a letter of intent, c.v., and two letters of reference to Search Committee for Computational Neuroscience, Department of Biology, Georgia State University, Atlanta, GA 30302-4010. FAX: (404) 651-2509. Please share this message with anyone who might be interested.
Thanks for your consideration, Don Edwards From erik at kuifje.bbf.uia.ac.be Thu Dec 21 12:48:50 1995 From: erik at kuifje.bbf.uia.ac.be (Erik De Schutter) Date: Thu, 21 Dec 95 17:48:50 GMT Subject: Crete Course in Computational Neuroscience Message-ID: <9512211748.AA27308@kuifje.bbf.uia.ac.be> CRETE COURSE IN COMPUTATIONAL NEUROSCIENCE AUGUST 25 - SEPTEMBER 21, 1996 CRETE, GREECE DIRECTORS: Erik De Schutter (University of Antwerp, Belgium) Idan Segev (Hebrew University, Jerusalem, Israel) Jim Bower (California Institute of Technology, USA) Adonis Moschovakis (University of Crete, Greece) The Crete Course in Computational Neuroscience introduces students to the practical application of computational methods in neuroscience, in particular how to create biologically realistic models of neurons and networks. The course consists of two complementary parts. A distinguished international faculty gives morning lectures on topics in experimental and computational neuroscience. The rest of the day is spent learning how to use simulation software and how to implement a model of the system the student wishes to study. The first week of the course introduces students to the most important techniques in modeling single cells, networks and neural systems. Students learn how to use the GENESIS, NEURON, XPP and other software packages on their individual unix workstations. During the following three weeks the lectures will be more general, moving from modeling single cells and subcellular processes through the simulation of simple circuits and large neuronal networks and, finally, to system level models of the cortex and the brain. The course ends with a presentation of the student modeling projects. The Crete Course in Computational Neuroscience is designed for advanced graduate students and postdoctoral fellows in a variety of disciplines, including neurobiology, physics, electrical engineering, computer science and psychology.
Students are expected to have a basic background in neurobiology as well as some computer experience. A total of 25 students will be accepted, the majority of whom will be from the European Union and affiliated countries. A tuition fee of 500 ECU ($700) covers travel to Crete, lodging and all course-related expenses for European nationals. We encourage students from the Far East and the USA to also apply to this international course. More information and application forms can be obtained: - WWW access: http://bbf-www.uia.ac.be/CRETE/Crete_index.html - by mail: Prof. E. De Schutter Born-Bunge Foundation University of Antwerp - UIA, Universiteitsplein 1 B2610 Antwerp Belgium - email: crete_course at kuifje.bbf.uia.ac.be APPLICATION DEADLINE: April 10th, 1996. Applicants will be notified of the results of the selection procedures before May 1st. FACULTY: M. Abeles (Hebrew University, Jerusalem, Israel), D.J. Amit (University of Rome, Italy and Hebrew University, Israel), R.E. Burke (NIH, USA), C.E. Carr (University of Maryland, USA), A. Destexhe (Université Laval, Canada), R.J. Douglas (Institute of Neuroinformatics, Zurich, Switzerland), T. Flash (Weizmann Institute, Rehovot, Israel), A. Grinvald (Weizmann Institute, Israel), J.J.B. Jack (Oxford University, England), C. Koch (California Institute of Technology, USA), H. Korn (Institut Pasteur, France), A. Lansner (Royal Institute of Technology, Sweden), R. Llinas (New York University, USA), E. Marder (Brandeis University, USA), M. Nicolelis (Duke University, USA), J.M. Rinzel (NIH, USA), W. Singer (Max-Planck Institute, Frankfurt, Germany), S. Tanaka (RIKEN, Japan), A.M. Thomson (Royal Free Hospital, England), S. Ullman (Weizmann Institute, Israel), Y. Yarom (Hebrew University, Israel). The Crete Course in Computational Neuroscience is supported by the European Commission (4th Framework Training and Mobility of Researchers program) and by The Brain Science Foundation (Tokyo).
Local administrative organization: the Institute of Applied and Computational Mathematics of FORTH (Crete, GR). From udah075 at kcl.ac.uk Thu Dec 21 12:53:21 1995 From: udah075 at kcl.ac.uk (Rasmus Petersen) Date: Thu, 21 Dec 95 17:53:21 GMT Subject: studentships for European students Message-ID: <3027.9512211753@maths1.mth.kcl.ac.uk> ************************************************************** Studentships - For EU Students - Please note new age limit It was agreed by the Human Resources Committee and endorsed by the Executive Board of NEuroNet in Paris that up to 10,000 ECU be allocated for studentships each year. These provide support for registration, accommodation and travel to designated workshops and conferences with a significant tutorial component. (The studentships are of a fixed value.) Up to 22 studentships of 450 ECU each will be available for the NEuroFuzzy '96 workshop and tutorials in Prague from 16th-18th April 1996. Applications for these studentships must be received in the NEuroNet Office before 31st December 1995. Successful applicants will be notified in January 1996. Up to 20 studentships of 500 ECU each will be available for the ICANN '96 conference in Bochum, Germany from 16th-19th July 1996. Applications for these studentships must be received in the NEuroNet Office before 3rd March 1996. Successful applicants will be notified in April 1996. Applicants for studentships are limited to full-time students, who are EU nationals, and aged 30 years or less. (Priority will be given to applicants under 25 years of age). All applications should be accompanied by a letter of support from the applicant's Head of Department and should contain verification of the applicant's age, status as a student and nationality. All applications will be reviewed by the Human Resources Committee of NEuroNet.
Please apply in writing to the NEuroNet Administrator: Ms Terhi Garner NEuroNet Department of Electronic and Electrical Engineering King's College London Strand, London WC2R 2LS, UK Fax: +44 (0) 171 873 2559 *********************************************************************** From dhw at santafe.edu Fri Dec 29 19:54:42 1995 From: dhw at santafe.edu (dhw@santafe.edu) Date: Fri, 29 Dec 95 17:54:42 MST Subject: Postdoc opening Message-ID: <9512300054.AA17781@yaqui> The Santa Fe Institute is soliciting applications for a TXN postdoctoral fellow. The fellow is expected to perform research in Machine Learning, Artificial Intelligence, or related areas of statistics. Information about the SFI can be found at http://www.santafe.edu/. Candidates should have a Ph.D. (or expect to receive one soon) and should have backgrounds in computer science, mathematics, statistics, or related fields. Applicants should submit a curriculum vitae, list of publications, statement of research interests, and three letters of recommendation. Please submit your materials in one complete package. Incomplete applications will not be considered. All application materials must be received by March 1, 1996. Decisions will be made by April 1996. Send complete application packages only, preferably hard copy, to: TXN Postdoctoral Committee Attention: David Wolpert Santa Fe Institute 1399 Hyde Park Road Santa Fe, New Mexico 87501 Include your e-mail address and/or fax number. The SFI is an equal opportunity employer. Women and minorities are encouraged to apply. From bozinovs at delusion.cs.umass.edu Sun Dec 31 17:55:53 1995 From: bozinovs at delusion.cs.umass.edu (bozinovs@delusion.cs.umass.edu) Date: Sun, 31 Dec 1995 17:55:53 -0500 Subject: New Book Message-ID: <9512312255.AA25407@delusion.cs.umass.edu> Dear Connectionists, Happy New Year to everybody! At the end of the year I have the pleasure of announcing a new book in the field.
Advertisement: ********************************************************************* New Book! New Book! New Book! New Book! New Book! New Book! --------------------------------------------------------------------- CONSEQUENCE DRIVEN SYSTEMS CONSEQUENCE DRIVEN SYSTEMS CONSEQUENCE DRIVEN SYSTEMS by Stevo Bozinovski *201 pages *79 figures *27 algorithm descriptions *8 tables Among its special features, the book: --------------------------------------- ** provides a unified theory of response-sensitive teaching and learning ** as a result of that theory describes a generic architecture of a neuro-genetic agent capable of performing in 1) consequence sensitive teaching, 2) reinforcement learning, and 3) self-reinforcement learning paradigms ** describes the Crossbar Adaptive Array (CAA) architecture, a 1981 neural network developed within the Adaptive Networks Group, as an example of a neuro-genetic agent ** explains how the CAA architecture was the first neural network that solved a delayed reinforcement learning task, the Dungeons-and-Dragons task, in 1981 ** explains how the 1981 learning method (shown on the cover of the book) is actually the well-known Q-learning method, rediscovered in 1989 ** introduces the Benefit-Cost CAA (B-C CAA) as an extension of the 1981 Benefit-only CAA architecture ** introduces the at-subgoal-go-back algorithm as a modification of the 1981 at-goal-go-back CAA algorithm ** introduces a new type of neuron, denoted as Provoking Adaptive Unit, for dealing with tasks of Distributed Consequence Programming ** illustrates the usage of those neurons as routers in a routing-in-networks-with-faults task ** uses parallel programming techniques in describing the algorithms throughout the book ----------------------------------------- Ordering information ISBN 9989-684-06-5, Gocmar Press, 1995 price: $15, paperback For further information contact the author: bozinovs at cs.umass.edu **********************************************************************
CONTENTS: 1. INTRODUCTION 1.1. The framework 1.2. Agents and architectures 1.3. Neural architectures 1.3.1. Greedy policy neural architectures 1.3.2. Recurrent architectures 1.3.3. Crossbar architectures 1.3.4. Subsumption architecture adaptive arrays 1.4. Problems. Emotional Graphs 1.5. Games. Emotional Petri Nets 1.6. Parallel programming 1.7. Bibliographical and other notes 2. CONSEQUENCE LEARNING AGENTS: A STRUCTURAL THEORY 2.1. The agent-environment interface 2.2. A taxonomy of learning paradigms 2.3. Classes of consequence learning agents 2.4. A generic consequence learning architecture 2.5. Learning rules and routines 2.6. Bibliographical and other notes 3. CONSEQUENCE DRIVEN TEACHING 3.1. Class T agents 3.2. Learners 3.3. Teachers 3.3.1. Toward a theory of teaching systems 3.3.2. Teaching strategies 3.4. Curriculums 3.4.1. Curriculum grammars and languages 3.4.2. Curriculum space approach 3.5. Pattern classification teaching as integer programming 3.6. Pattern classification teaching as dynamic programming 3.7. Bibliographical and other notes 4. EXTERNAL REINFORCEMENT LEARNING 4.1. Reinforcement learning NG agents 4.2. Associative Search Network (ASN) 4.2.1. Basic ASN 4.2.2. Reinforcement predictive ASN 4.3. Actor-Critic architecture 4.4. Bibliographical and other notes 5. SELF-REINFORCEMENT LEARNING 5.1. Conceptual framework 5.2. Self-reinforcement learning and the NG agents 5.3. The Crossbar Adaptive Array architecture 5.4. How it works 5.4.1. Defining primary goals from the genetic environment 5.4.2. Secondary reinforcement mechanism 5.4.3. The CAA learning method 5.5. Example of a CAA architecture 5.6. Solving problems with a CAA architecture 5.6.1. Learning in emotional graphs: Maze running 5.6.2. Learning in loosely defined emotional graphs: Pole balancing 5.7. Another example of a CAA architecture 5.8. Using entropy in Markov Decision Processes 5.9. Issues on the genetic environment 5.9.1. CAA architecture as an optimization architecture 5.9.2.
Complementarity with the Genetic Algorithms 5.9.3. Self-reinforcement: Genetic environment approach 5.10. Bibliographical and other notes 6. CONSEQUENCE PROGRAMMING 6.1. Dynamic Programming and Markov Decision Problems 6.2. Introducing cost in the CAA architecture 6.3. Q-learning 6.4. A taxonomy of the CAA-method based learning algorithms 6.5. Producing optimal solutions in a stochastic environment 6.6. Distributed Consequence Programming: A neural theory 6.6.1. Provoking units: Axon provoked neurons 6.6.2. An illustration: Routing in client-server networks with faults 6.7. Bibliographical and other notes 7. SUMMARY 8. REFERENCES 9. INDEX ********************************************************************* From dhw at santafe.edu Fri Dec 1 11:18:19 1995 From: dhw at santafe.edu (David Wolpert) Date: Fri, 1 Dec 95 09:18:19 MST Subject: Correcting misunderstandings about NFL Message-ID: <9512011618.AA27395@sfi.santafe.edu> This posting is to correct some misunderstandings that were recently posted concerning the NFL theorems. I also draw attention to some of the incorrect interpretations commonly ascribed to certain COLT results. *** Joerg Lemm writes: >>> 1.) If there is no relation between the function values on the test and training set (i.e. P(f(x_j)=y|Data) equal to the unconditional P(f(x_j)=y) ), then, having only training examples y_i = f(x_i) (=data) from a given function, it is clear that I cannot learn anything about values of the function at different arguments, (i.e. for f(x_j), with x_j not equal to any x_i = nonoverlapping test set). >>> Well put. Now here's the tough question: Vapnik *proves* that it is unlikely (for large enough training sets and small enough VC dimension generalizers) for error on the training set and full "generalization error" to be greatly different. Regardless of the target. Using this, Baum and Haussler even wrote a paper "What size net gives valid generalization?"
in which no assumptions whatsoever are made about the target, and yet the authors are able to provide a response to the question of their title. HOW IS THAT POSSIBLE GIVEN WHAT YOU JUST WROTE???? NFL is "obvious". And so are VC bounds on generalization error (well, maybe not "obvious"). And so is the PAC "proof" of Occam's razor. And yet the latter two bound generalization error (for those cases where training set error is small enough) without making any assumptions about the target. What gives? The answer: The math of those works is correct. But far more care must be exercised in the interpretation of that math than you will find in those works. The care involves paying attention to what goes on the right-hand side of the conditioning bars in one's probabilities, and the implications of what goes there. Unfortunately, such conditioning bars are completely absent in those works... (In fact, the sum-total of the difference between Bayesian and COLT approaches to supervised batch learning lies in what's on the right-hand side of those bars, but that's another story. See [2].) As an example, it is widely realized that VC bounds suffer from being worst-case. However, there is another hugely important caveat to those bounds. The community as a whole simply is not aware of that caveat, because the caveat concerns what goes on the right-hand side of the conditioning bar, and this is NEVER made explicit. This caveat is the fact that VC bounds do NOT concern Pr(IID generalization error | observed error on the training set, training set size, VC dimension of the generalizer). But you wouldn't know that to read the claims made on behalf of those bounds ... To give one simple example of the ramifications of this: Let's say you have a favorite low-VC generalizer. And in the course of your career you parse through learning problems, either explicitly or (far more commonly) without even thinking about it.
When you come across one with a large training set on which your generalizer has small generalization error, you want to invoke Vapnik to say you have assurances about full generalization error. Well, sorry. You don't and you can't. You simply can't escape Bayes by using confidence intervals. Confidence intervals in general (not just in VC work) have the annoying property that as soon as you try to use them, very often you contradict the underlying statistical assumptions behind them. Details are in [1] and in the discussion of "We-Learn-It Inc." in [2]. >>> 2.) We are considering two of those (influence) relations P(f(x_j)=y|Data): one, named A, for the true nature (=target) and one, named B, for our model under study (=generalizer). Let P(A and B) be the joint probability distribution for the influence relations for target and generalizer. 3.) Of course, we do not know P(A and B), but in good old Bayesian tradition, we can construct a (hyper-)prior P(C) over the family of probability distributions of the joint distributions C = P(A and B). 4.) NFL now uses the very special prior assumption P(A and B) = P(A)P(B) >>> If I understand you correctly, I would have to disagree. NFL also holds with your P(C) being any prior assumption - more formally, averaging over all priors, you get NFL. So the set of priors for which your favorite algorithm does *worse than random* is just as large as the set for which it does better. (In this sense, the uniform prior is a typical prior, not a pathological one, out on the edge of the space. It is certainly not a "very special prior".) In fact, that's one of the major points of NFL - it's not to see what life would be like if this or that were uniform, but to use such uniformity as a mathematical tool, to get a handle on the underlying geometry of inference, the size of the various spaces (e.g., the size of the space of priors for which you lose to random), etc.
The math *starts* with NFL, and then goes on to many other things (see [1]). It's only the beginning chapter of the textbook.

>>>
I say that it is rational to believe (and David does so too, I think) that in real life cross-validation works better in more cases than anti-cross-validation.
>>>

Oh, most definitely. There are several issues here: 1) what gives with all the "prior-free" general proofs of COLT, given NFL, 2) purely theoretical issues (e.g., as mentioned before, characterizing the relationship between target and generalizers needed for xval. to beat anti-xval.) and 3) perhaps most provocatively of all, seeing if NFL (and the associated mathematical structure) can help you generalize in the real world (e.g., with head-to-head minimax distinctions between generalizers).

***

Finally, Eric Baum weighs in:

>>>
Barak Pearlmutter remarked that saying

We have *no* a priori reason to believe that targets with "low Kolmogorov complexity" (or anything else) are/not likely to occur in the real world.

(which I gather was a quote from David Wolpert?) is akin to saying we have no a priori reason to believe there is non-random structure in the world, which is not true, since we make great predictions about the world.
>>>

Well, let's get a bit formal here. Take all the problems we've ever tried to make "great predictions" on. Let's even say that these problems were randomly chosen from those in the real world (i.e., no selection effects of people simply not reporting when their predictions were not so great). And let's for simplicity say that all the predictions were generated by the same generalizer - the algorithm in the brain of Eric Baum will do as a straw man.

Okay. Now take all those problems together and view them as one huge training set. Better still, add in all the problems that Eric's ancestors addressed, so that the success of his DNA is also taken into account. That's still one training set.
It's a huge one, but it's tiny in comparison to the full spaces it lives in. Saying we (Eric) make "great predictions" simply means that the xvalidation error of our generalizer (Eric) on that training set is small. (You train on part of the data, and predict on the rest.) Formally (!!!!!), this gives no assurances whatsoever about any behavior off-training-set. As I've stated before, without assumptions, you cannot conclude that low xvalidation error leads to low off-training-set generalization error. And of course, each passing second, each new scene you view, is "off-training-set".

The fallacy in Eric's claim was noted all the way back by Hume. Success at inductive inference cannot formally establish the utility of using inductive inference. To claim that it can, you have to invoke inductive inference, and that, as any second grader can tell you, is circular reasoning.

Practically speaking of course, none of this is a concern in the real world. We are all (me included) quite willing to conclude there is structure in the real world. But as was noted above, what we do in practice is not the issue. The issue is one of theory.

***

It's very similar to high-energy physics. There are a bunch of physical constants that, if only slightly varied, would (seem to) make life impossible. Why do they have the values they have? Some invoke the anthropic principle to answer this - we wouldn't be around if they had other values. QED. But many find this a bit of a cop-out, and search for something more fundamental. After all, you could have stopped the progress of physics at any point in the past if you had simply gotten everyone to buy into the anthropic principle at that point in time.

Similarly with inductive inference. You could just cop out and say "anthropic principle" - if inference were not possible, we wouldn't be having this debate. But that's hardly a satisfying answer.
***

Eric goes on:

>>>
Consider the problem of learning to predict the pressure of a gas from its temperature. Wolpert's theorem, and his faith in our lack of prior about the world, predict that any learning algorithm whatever is as likely to be good as any other. This is not correct.
>>>

To give two examples from just the past month, I'm sure MCI and Coca-Cola would be astonished to know that the algorithms they're so pleased with were designed for them by someone having "faith in our lack of prior about the world".

Less glibly, let me address this claim about my "faith" with two quotes from the NFL for supervised learning paper. The first is in the introduction, and the second in a section entitled "On uniform averaging". So neither is exactly hidden...

1) "It cannot be emphasized enough that no claim is being made ... that all algorithms are equivalent in the real world."

2) "The uniform sums over targets ... weren't chosen because there is strong reason to believe that all targets are equally likely to arise in practice. Indeed, in many respects it is absurd to ascribe such a uniformity over possible targets to the real world. Rather the uniform sums were chosen because such sums are a useful theoretical tool with which to analyze supervised learning."

Finally, given that I'm mixing it up with Eric on NFL, I can't help but quote the following from his "What size net gives valid generalization" paper: "We have given bounds (independent of the target) on the training set size vs. neural net size needed such that valid generalization can be expected." (Parenthetical comment added - and true.)

Nowhere in the paper is there any discussion whatsoever of the apparent contradiction between this statement and NFL-type concerns. Indeed, as mentioned above, with only the conditioning-bar-free mathematics in Eric's paper, there is no way to resolve the contradiction. In this particular sense, that paper is extremely misleading.
(See discussion above on misinterpretations of Vapnik's results.)

>>>
Creatures evolving in this "play world" would exploit this structure and understand their world in terms of it. There are other things they would find hard to predict. In fact, it may be mathematically valid to say that one could mathematically construct equally many functions on which these creatures would fail to make good predictions. But so what? So would their competition. This is not relevant to looking for one's key, which is best done under the lamppost, where one has a hope of finding it. In fact, it doesn't seem that the play world creatures would care about all these other functions at all.
>>>

I'm not sure I quite follow this. In particular, the comment about the "competition" seems to be wrong. Let me just carry Eric's metaphor further, though, and point out that it makes a hell of a lot more sense to pull out a flashlight and explore the surrounding territory for your key than it does to spend all your time with your head down, banging into the lamppost. And NFL is such a flashlight.

David Wolpert

[1] The current versions of the NFL for supervised learning papers, nfl.ps.1.Z and nfl.ps.2.Z, at ftp.santafe.edu, in pub/dhw_ftp.

[2] "The Relationship between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in *The Mathematics of Generalization*, D. Wolpert, Ed., Addison-Wesley, 1995.

From marco at McCulloch.Ing.UniFI.IT Fri Dec 1 12:21:43 1995
From: marco at McCulloch.Ing.UniFI.IT (Marco Gori)
Date: Fri, 01 Dec 1995 18:21:43 +0100
Subject: Italian Neural Network Society
Message-ID: <9512011721.AA09634@McCulloch.Ing.UniFI.IT>

==============================================================

This is to announce a new web page describing the aims and the activities of the Italian Neural Network Society.
The page is hosted at the DSI Web server of the Dipartimento di Sistemi e Informatica (Universita' di Firenze) at the following address:

http://www-dsi.ing.unifi.it/neural/siren

-- marco gori.

===============================================================

From schmidhu at informatik.tu-muenchen.de Sun Dec 3 06:40:25 1995
From: schmidhu at informatik.tu-muenchen.de (Juergen Schmidhuber)
Date: Sun, 3 Dec 1995 12:40:25 +0100
Subject: compressibility and generalization
Message-ID: <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de>

Eric Baum wrote:

>>>
(1) While it may be that in classical lattice gas models, a gas does not have high Kolmogorov complexity, this is not the origin of the predictability exploited by physicists. Statistical mechanics follows simply from the assumption that the gas is in a random one of the accessible states, i.e. the states with a given amount of energy. So *define* a *theoretical* gas as follows: Every time you observe it, it is in a random accessible state. Then its Kolmogorov complexity is huge (there are many accessible states) but its macroscopic behavior is predictable. (Actually this is an excellent description of a real gas, given quantum mechanics.)
<<<

(1) The key expression here is ``the assumption that the gas is in a random one of the *accessible* states''. Since the accessible states are defined to be those with equal energy, this greatly restricts the number of possible states. By definition, it is trivial to make a macro-level prediction like ``the total energy will remain constant''. In turn, there are relatively short descriptions of a given history of such a gas. With a truly random gas, however, there are no invariants eliminating most of the possible states. This makes its history incompressible.

(2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''.
Loosely speaking, in a universe whose history is compressible, we may expect to generalize well. A simple, old counting argument shows: most computable universes are incompressible. Therefore, in most computable universes you won't generalize well (this is related to what has been (re)discovered in NFL).

(3) Hence, the best we may hope for is a learning technique with good expected generalization performance in *arbitrary* compressible universes. Actually, another restriction is necessary: the time required for compression and decompression should be ``tolerable''. To formalize the expression ``tolerable'' is the subject of ongoing research.

Juergen Schmidhuber
IDSIA
juergen at idsia.ch

From hicks at cs.titech.ac.jp Sun Dec 3 00:32:43 1995
From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp)
Date: Sun, 3 Dec 1995 14:32:43 +0900
Subject: Is the universe finite?
Message-ID: <199512030532.OAA02207@euclid.cs.titech.ac.jp>

I would like to make 2 points. One concerns a clarification of David Wolpert's definition of the universe. The second is a thought problem meant to illustrate the inevitability of structure.

Point 1: David Wolpert writes:

(1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>ancestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some sense.
"(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well. On the other hand, "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average perform the same. What is the universe about which we are talking? Since mathematical models exist in our minds and on paper in this universe, are they included? I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes.

Point 2: Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=1/2. Apparently there is no structure in this universe. Now let us consider a universe which is a binary-valued N by M matrix random variable AA whose elements are also iid with p(0)=p(1)=1/2. Let us draw a random instance A from AA. Now we define an M-dimensional integer random variable Y depending on X by p(y=Ax) = p(Ax), where x and y are instances of X and Y respectively.

If A happens to be chosen such that y is merely a subset of the elements of x, then the prior p(y), like the prior p(x), will be uniform. But for most choices of A, p(y) will not be uniform at all. So, out of all the possible universes Y, most of them have structure. This happens even though Y and AA have no structure. The structure that Y will have is drawn from a uniform distribution (over AA), but we are only concerned with whether there will be structure or not.

Of course, this proves nothing. And now I am going to make a giant leap of analogy. The following statements are not contradictory:

(a) In a universe drawn at random from the ensemble of all possible universes, we cannot expect to see any particular structure to be more likely than any other structure.

(b) In any given universe, we can expect structure to be present.
Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable?

Craig Hicks

Craig Hicks hicks at cs.titech.ac.jp          | Hisakata no, hikari nodokeki
Ogawa Laboratory, Dept. of Computer Science | Haru no hi ni, Shizu kokoro naku
Tokyo Institute of Technology, Tokyo, Japan | Hana no chiruran
lab: 03-5734-2187 home: 03-3785-1974        | Spring smiles with sun beams
fax (from abroad):                          | sifting down through cloudy dreams
+81(3)5734-2905 OGAWA LAB                   | towards the anxious hearts
03-5734-2905 OGAWA LAB (from Japan)         | beating pitter pat
[ Poem from Hyaku-nin i-syuu ->             | while flower petals scatter.

From arbib at pollux.usc.edu Sun Dec 3 14:28:26 1995
From: arbib at pollux.usc.edu (Michael A. Arbib)
Date: Sun, 3 Dec 1995 11:28:26 -0800 (PST)
Subject: VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES
Message-ID: <199512031928.LAA10890@pollux.usc.edu>

PRELIMINARY CALL FOR PAPERS

Workshop on VISUOMOTOR COORDINATION: AMPHIBIANS, MODELS, AND COMPARATIVE STUDIES

Sedona, Arizona, November 22-24, 1996

Co-Directors: Kiisa Nishikawa (Northern Arizona University, Flagstaff) and Michael Arbib (University of Southern California, Los Angeles).

Program Committee: Kiisa Nishikawa (Chair), Michael Arbib, Emilio Bizzi, Chris Comer, Peter Ewert, Simon Giszter, Mel Goodale, Ananda Weerasuriya, Walt Wilczynski, and Phil Zeigler.

Local Arrangements Chair: Kiisa Nishikawa.

This workshop is the sequel to four earlier workshops on the general theme of "Visuomotor Coordination in Frog and Toad: Models and Experiments". The first two were organized by Rolando Lara and Michael Arbib at the University of Massachusetts, Amherst (1981) and Mexico City (1982). The next two were organized by Peter Ewert and Arbib in Kassel and Los Angeles, respectively, with the Proceedings published as follows:

Ewert, J.-P. and Arbib, M.A., Eds., 1989, Visuomotor Coordination: Amphibians, Comparisons, Models and Robots, New York: Plenum Press.

Arbib, M.A. and J.-P.
Ewert, Eds., 1991, Visual Structures and Integrated Functions, Research Notes in Neural Computing 3, Heidelberg, New York: Springer-Verlag.

The time is ripe for a fifth Workshop on this theme, with the more generic title "Visuomotor Coordination: Amphibians, Models, and Comparative Studies". The Workshop will be held in Sedona - a beautiful small resort town set in dramatic red hills in Arizona - straight after the Society for Neuroscience meeting in 1996. Next year, Neuroscience ends on Thursday, November 21, 1996, in Washington, DC, so people can fly to Phoenix that evening, meet Friday, Saturday, and Sunday, and fly home Monday, November 25th (so that US types not going to Neuroscience get the Saturday stopover that they could not get if we met before Neuroscience).

The aim is to study the neural mechanisms of visuomotor coordination in frog and toad both for their intrinsic interest and as a target for developments in computational neuroscience, and also as a basis for comparative and evolutionary studies. The list of subsidiary themes given below is meant to be representative of this comparative dimension, but is not intended to be exhaustive. In each case, the emphasis (but not the exclusive emphasis) will be on papers which contribute to the development of both modeling and experimentation.

Central Theme: Visuomotor Coordination in Frog and Toad

Subsidiary Themes:
Visuomotor Coordination: Comparative and Evolutionary Perspectives
Reaching and Grasping in Frog, Pigeon, and Primate
Cognitive Maps
Auditory Communication (with emphasis on spatial behavior and sensory integration)
Sensory Control of Motor Pattern Generators

Formal registration information will be available in March of 1996.
Scientists who wish to present papers are asked to send three copies of extended abstracts no later than March 31st, 1996 to:

Kiisa Nishikawa
Department of Biological Sciences
Northern Arizona University
Flagstaff, AZ 86011-5640

Notification of the Program Committee's decision will be sent out no later than May 31st, 1996. A decision as to whether or not to publish a proceedings is still pending.

From theresa at umiacs.UMD.EDU Mon Dec 4 10:13:47 1995
From: theresa at umiacs.UMD.EDU (Theresa)
Date: Mon, 04 Dec 1995 10:13:47 -0500
Subject: Postdoc Position in Neural Modeling
Message-ID: <199512041513.KAA05125@skippy.umiacs.UMD.EDU>

The University of Maryland Institute for Advanced Computer Studies (UMIACS) invites applications for postdoctoral positions, beginning summer/fall '96, in the following areas: Real-time Video Indexing, Natural Language Processing, and Neural Modeling. Exceptionally strong candidates from other areas will also be considered.

UMIACS, a state-supported research unit, has been the focal point for interdisciplinary and applications-oriented research activities in computing on the College Park campus. The Institute's 40 faculty members conduct research in high performance computing, software engineering, artificial intelligence, systems, combinatorial algorithms, scientific computing, and computer vision.

Qualified applicants should send a 1-page statement of research interests, curriculum vitae, and the names and addresses of 3 references to:

Prof. Joseph Ja'Ja'
UMIACS
A.V. Williams Building
University of Maryland
College Park, MD 20742

by April 1. UMIACS strongly encourages applications from minorities and women. EOE/AA

From howse at eece.unm.edu Mon Dec 4 11:12:34 1995
From: howse at eece.unm.edu (James W.
Howse)
Date: Mon, 04 Dec 1995 09:12:34 -0700
Subject: Dissertation Available
Message-ID: <9512041612.AA27407@opus.eece.unm.edu>

The following PhD dissertation is available by FTP:

Gradient and Hamiltonian Dynamics: Some Applications to Neural Network Analysis and System Identification

James W. Howse

Abstract

The work in this dissertation is based on decomposing system dynamics into the sum of dissipative (e.g., convergent) and conservative (e.g., periodic) components. Intuitively, this can be viewed as decomposing the dynamics into a component normal to some surface and components tangent to other surfaces. First, this decomposition was applied to existing neural network architectures to analyze their dynamic behavior. Second, this formalism was employed to create models which learn to emulate the behavior of actual systems. The premise of this approach is that the process of system identification can be considered in two stages: model selection and parameter estimation. In this dissertation a technique is presented for constructing dynamical systems with desired qualitative properties. Thus, the model selection stage consists of choosing the dissipative and conservative portions appropriately so that a certain behavior is obtainable. By choosing the parametrization of the models properly, a learning algorithm has been devised and proven to always converge to a set of parameters for which the error between the output of the actual system and the model vanishes. So these models and the associated learning algorithm are guaranteed to solve certain types of nonlinear identification problems.

Retrieval:
ftp ftp.eece.unm.edu
login as anonymous
cd howse
get dissertation.ps.Z

This is a PostScript file compressed with compress. The dissertation is 133 pages long and formatted to print single-sided. If there are any retrieval or printing problems, please let me know. I would welcome any comments or suggestions regarding the dissertation. No hardcopies are available.
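The dissipative-plus-conservative split described in the abstract can be illustrated for the simplest (linear) case. The sketch below is a hypothetical toy example of mine, not code from the dissertation: any linear field A x decomposes uniquely into a symmetric part (gradient-like, dissipative) and an antisymmetric part (Hamiltonian-like, conservative).

```python
# Toy example (hypothetical): a damped rotation dx/dt = A x.
# A splits uniquely as A = S + W, with
#   S = (A + A^T)/2  symmetric      -> gradient-like, dissipative part
#   W = (A - A^T)/2  antisymmetric  -> Hamiltonian-like, conservative part

A = [[-0.5, 1.0],
     [-1.0, -0.5]]

S = [[(A[i][j] + A[j][i]) / 2 for j in range(2)] for i in range(2)]
W = [[(A[i][j] - A[j][i]) / 2 for j in range((2))] for i in range(2)]

# The parts recombine to A exactly, and only S changes the "energy" x.x,
# since for antisymmetric W, x.(W x) = 0 along any trajectory.
assert all(abs(A[i][j] - (S[i][j] + W[i][j])) < 1e-12
           for i in range(2) for j in range(2))
assert W[0][0] == 0 and W[1][1] == 0 and W[0][1] == -W[1][0]

print(S)  # [[-0.5, 0.0], [0.0, -0.5]]  pure damping
print(W)  # [[0.0, 1.0], [-1.0, 0.0]]   pure rotation
```

Here the decay of the system is carried entirely by S, and the oscillation entirely by W, which is the intuition behind analyzing the two components separately.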
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
James Howse - howse at eece.unm.edu
University of New Mexico
Department of EECE, 224D
Albuquerque, NM 87131-1356
Telephone: (505) 277-0805
FAX: (505) 277-1413 or (505) 277-1439
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

From zhuh at helios.ASTON.ac.uk Mon Dec 4 15:33:50 1995
From: zhuh at helios.ASTON.ac.uk (zhuh)
Date: Mon, 4 Dec 1995 20:33:50 +0000
Subject: compressibility and generalization
Message-ID: <28443.9512042033@sun.aston.ac.uk>

On the implications of the No Free Lunch Theorem(s) by David Wolpert:

> From: Juergen Schmidhuber
>
> (3) Hence, the best we may hope for is a learning technique with
> good expected generalization performance in *arbitrary* compressible
> universes. Actually, another restriction is necessary: the time
> required for compression and decompression should be ``tolerable''.
> To formalize the expression ``tolerable'' is the subject of ongoing
> research.

However, the deeper NFL Theorem states that this is still impossible:

1. The *non-existence* of structure guarantees that any algorithm will neither win nor lose, compared with the "random algorithm", in the long run. If this were all there is to it, then NFL would be just a tautology.

2. The *mere existence* of structure guarantees that a (not uniformly-random) algorithm is as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good.

3. This is by far one of the most important implications of NFL, yet my sample from Connectionists shows that it is safe to make the posterior prediction that if someone criticises NFL as irrelevant, then he has not got this far yet.
In conclusion: "for an arbitrary environment there is an optimal algorithm" is drastically different from "there is an optimal algorithm for an arbitrary environment", whatever restrictions you place on the word "arbitrary".

--
Huaiyu Zhu, PhD                   email: H.Zhu at aston.ac.uk
Neural Computing Research Group   http://neural-server.aston.ac.uk/People/zhuh
Dept of Computer Science          ftp://cs.aston.ac.uk/neural/zhuh
and Applied Mathematics           tel: +44 121 359 3611 x 5427
Aston University,                 fax: +44 121 333 6215
Birmingham B4 7ET, UK

From dhw at santafe.edu Mon Dec 4 19:49:47 1995
From: dhw at santafe.edu (David Wolpert)
Date: Mon, 4 Dec 95 17:49:47 MST
Subject: Non-randomness is no panacea
Message-ID: <9512050049.AA16646@sfi.santafe.edu>

Craig Hicks writes:

>>>
(1)
>Practically speaking of course, none of this is a concern in the real
>world. We are all (me included) quite willing to conclude there is
>structure in the real world. But as was noted above, what we do in
>practice is not the issue. The issue is one of theory.

(2)
>Okay. Now take all those problems together and view them as one huge
>training set. Better still, add in all the problems that Eric's
>ancestors addressed, so that the success of his DNA is also taken
>into account. That's still one training set. It's a huge one, but it's
>tiny in comparison to the full spaces it lives in.

The above statements seem to me to be contradictory in some sense.
>>>

Not at all. The second statement is concerned with theoretical issues, whereas the first one is concerned with practical issues. The distinction is ubiquitous in science and engineering. Even in the little corner of academia known as supervised learning, most people are content to distinguish the concerns of COLT (theory) from those of what-works-in-practice.

>>>
"(1)" is saying we should, when discussing generalization, not concern ourselves with the real universe in which we live, but should consider theoretical alternative universes as well.
>>>

Were you referring to (2) instead? Neither statement says anything like "we should not concern ourselves with the real universe".

>>>
On the other hand "(2)" seems to say that the real universe in which we live is itself sufficiently "diverse" that any single approach to generalization must on average perform the same.
>>>

Again, I would have hoped that nothing I have said could be construed as saying something like that. It may or may not be true, but you said it, not me. :-) I am sorry if you were somehow given the wrong impression.

>>>
I feel we ought to distinguish between a single universe (ours for example), and the ensemble of possible universes.
>>>

This is a time-worn concern. Read up on the past two centuries' worth of battles between Bayesians and non-Bayesians...

>>>
Let's suppose a universe which is an N-dimensional binary (0/1) vector random variable X, whose elements are iid with p(0)=p(1)=1/2. Apparently there is no structure in this universe.
>>>

NO!!! Forgive my ... passion, but as I've said many times now, even in a purely random universe, there are many very deep distinctions between the behavior of different learning algorithms (and in this sense there is plenty of "structure"). Like head-to-head minimax distinctions. (Or uniform convergence theory a la Vapnik.) Please read the relevant papers! ftp.santafe.edu, pub/dhw_ftp, nfl.ps.1.Z and nfl.ps.2.Z.

>>>
(b) In any given universe, we can expect structure to be present.

Would I be correct in saying that only (b) needs to be true in order for cross-validation to be profitable?
>>>

Nope. The structure can just as easily negate the usefulness of xvalidation as establish it. And in fact, the version of NFL in which one fixes the target and then averages over generalizers says that the state of the universe is (in a certain precise sense), by itself, irrelevant. Structure or not; that fact alone cannot determine the utility of xvalidation.
***

Although I think it is at best tangential to further discuss Kolmogorov complexity, Juergen Schmidhuber's recent comment deserves a response. He writes:

>>>
(2) Back to: what does this have to do with machine learning? As a first step, we may simply apply Solomonoff's theory of inductive inference to a dynamic system or ``universe''. Loosely speaking, in a universe whose history is compressible, we may expect to generalize well.
>>>

How could this be true? Nothing has been specified in Juergen's statement about the loss function, how test sets are generated (IID vs. off-training-set vs. who knows what), the generalizer used, how it is related (if at all) to the prior over targets (a prior which, I take it, Juergen wishes to be "compressible"), the noise process, whether there is noise in the inputs as well as the outputs, etc., etc. Yet all of those factors are crucial in determining the efficacy of the generalizer.

Obviously if your generalizer *knows* the "compression scheme of the universe", knows the noise process, etc., then it will generalize well. Is that what you're saying, Juergen? It reduces to saying that if you know the prior, you can perform Bayes-optimally. There is certainly no disputing that statement.

It is worth bearing in mind though that NFL can be cast in terms of averages over priors. In that guise, it says that there are just as many priors - just as many ways of having a universe be "compressible", loosely speaking - for which your favorite algorithm dies as there are for which it shines. In fact, it's not hard to show that an average over only those priors that are more than a certain distance from the uniform prior results in NFL - under such an average, for OTS error, etc., all algorithms have the same expected performance. The simple fact of having a non-uniform prior does not mean that better-than-random generalization arises.

***

Structure, compressibility, whatever you want to call it; it can hurt just as readily as it can help.
The simple claim that there is non-randomness in the universe does not establish that any particular algorithm performs better than randomly. To all those who dispute this, I ask that they present a theorem relating generalization error to "compressibility". (To do this of course, they will have to specify the loss function, noise, etc.) Not words, but math, and not just math concerning Kolmogorov complexity considered in isolation. Math presenting a formal relationship between generalization error and "compressibility". (A relationship that doesn't reduce to the statement that if you have information concerning the prior, you can exploit it to generalize well - no rediscovery of the wheel please.)

David Wolpert

From hicks at cs.titech.ac.jp Mon Dec 4 20:40:08 1995
From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp)
Date: Tue, 5 Dec 1995 10:40:08 +0900
Subject: compressibility and generalization
In-Reply-To: Juergen Schmidhuber's message of Sun, 3 Dec 1995 12:40:25 +0100 <95Dec3.124033+0100_met.116308+385@papa.informatik.tu-muenchen.de>
Message-ID: <199512050140.KAA05180@euclid.cs.titech.ac.jp>

On Sun, 3 Dec 1995 12:40:25, Juergen Schmidhuber wrote:

>(2) Back to: what does this have to do with machine learning? As a
>first step, we may simply apply Solomonoff's theory of inductive
>inference to a dynamic system or ``universe''. Loosely speaking,
>in a universe whose history is compressible, we may expect to
>generalize well. A simple, old counting argument shows: most
>computable universes are incompressible. Therefore, in most
>computable universes you won't generalize well (this is related
>to what has been (re)discovered in NFL).

In an earlier communication I hypothesized that a typical universe would have structure that could be exploited by cross-validation.
This communication from Juergen Schmidhuber contradicts my hypothesis, I think, because of the existence of the "simple, old counting argument" showing that "most computable universes are incompressible". I stand corrected.

The point I really wanted clarified was what was meant by the assertion that in a typical universe

(A) cross-validation works as well as anti-cross-validation.

I will just talk about the problem of (deterministic or stochastic) function estimation. I can accept that for any set of model functions, there will be an infinity of problems where cross-validation will be of no assistance, because that model does not have the capacity to predict future input/output relations from any finite set of examples from the past. This could be either because the true function is pure noise, or because it looks like pure noise from the perspective of any function from the set of candidate model functions. In this case there will be no correlation between predictions and samples, and cross-validation will do its job of telling us that the generalization error is not decreasing.

However, I interpret the assertion that anti-cross-validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst-case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should.
I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). Does assertion (A) mean that this principle fails in alternative universes? Respectfully Yours, Craig Hicks Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan From juergen at idsia.ch Tue Dec 5 12:50:01 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Tue, 5 Dec 95 18:50:01 +0100 Subject: Compressibility and Generalization Message-ID: <9512051750.AA00953@fava.idsia.ch> Shahab Mohaghegh requested a definition of ``compressibility of the history of a universe''. Let S(t) denote the state of a computable universe at discrete time step t. Let's suppose S(t) can be described by n bits. The history of the universe between time step 1 (big bang) and time step t is compressible if it can be computed by an algorithm whose size is clearly less than tn bits. Given a particular computing device, most histories are incompressible: there are 2^tn possible histories, but there are less than (1/2)^c * 2^tn = 2^(tn-c) algorithms with less than tn-c bits (c is a small positive constant). With most possible universes, the mutual algorithmic information between past and future is zero, and previous experience won't help to generalize well in the future. There are a few compressible or ``regular'' universes, however. To use ML terminology, some of them allow for ``generalization by analogy''. Some of them allow for ``generalization by chunking''. Some of them allow for ``generalization by exploiting invariants''. Etc. It would be nice to have a method that can generalize well in *arbitrary* regular universes. Juergen Schmidhuber IDSIA
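The counting argument above is easy to check numerically. A hedged sketch (the function name and numbers are illustrative, not from the posting): since each program computes at most one history, fewer than 2^(tn-c) of the 2^tn possible histories can be computed by a program shorter than tn-c bits, so at least a fraction 1 - 2^(-c) of histories are incompressible by c bits.

```python
# Illustrative sketch: a lower bound on the fraction of tn-bit histories
# that NO program shorter than tn - c bits can compute.
def incompressible_fraction(tn: int, c: int) -> float:
    histories = 2 ** tn                 # all possible tn-bit histories
    short_programs = 2 ** (tn - c) - 1  # bitstrings of length < tn - c
    # each short program computes at most one history, so at most
    # short_programs histories are compressible by c or more bits:
    return 1.0 - short_programs / histories

# For c = 10, at least ~99.9% of 100-bit histories are incompressible
# by 10 or more bits, whatever the computing device.
print(incompressible_fraction(100, 10))
```

Note the bound is independent of tn: making the universe bigger does not make compressible histories any less rare.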
From gluck at pavlov.rutgers.edu Tue Dec 5 16:52:15 1995 From: gluck at pavlov.rutgers.edu (Mark Gluck) Date: Tue, 5 Dec 1995 16:52:15 -0500 Subject: Faculty Openings at Rutgers-Newark for Connectionist Modelers Interested in Cog Sci/Cog Neuro Message-ID: <199512052152.QAA16557@pavlov.rutgers.edu> The following junior faculty openings at Rutgers-Newark may be of interest to connectionist modelers working in the area of Cognitive Psychology and Cognitive Neuroscience. Although a purely theoretical researcher would be considered, someone who combines both theoretical/computational modeling and experimental research would be preferred: - Mark Gluck CENTER FOR MOLECULAR AND BEHAVIORAL NEUROSCIENCE COGNITIVE NEUROSCIENCE One faculty position in human cognitive neuroscience is available at the assistant to full professor level. Scientists with a research focus on the neurobiological basis of higher cortical function in humans, who would be stimulated by the integrative focus and collaborative research environment of the Center for Molecular and Behavioral Neuroscience, are encouraged to apply. Research areas include (but are not limited to) human experimental neuropsychology, neuropsychiatry, brain imaging and neuroplasticity, cognitive neuroscience, neurolinguistics, development, human electrophysiology, computational neuroscience, neural basis of speech, attention, memory, perception, emotion, psychophysics and behavioral genetics. State-of-the-art laboratories and equipment for human research, and a doctoral program in Behavioral and Neural Science, are available in the Center. Additional information on our program, research facilities, and faculty can be obtained over the internet at: http://www.cmbn.rutgers.edu/bns-home.html. Neuroscientists interested in brain/behavior relationships in normal and/or clinical populations should send a CV, names of three references and a brief letter of research goals and philosophy to: Dr.
Paula Tallal, Center for Molecular and Behavioral Neuroscience, Rutgers University, 197 University Avenue, Newark, New Jersey, 07102. Phone: (201) 648-1080 x3200. Fax: (201) 648-1272. Email: tallal at axon.rutgers.edu. COGNITIVE PSYCHOLOGY, ASSISTANT PROFESSOR (TWO POSITIONS) The Department of Psychology at the Newark Campus of Rutgers University invites Ph.D. applications for one tenure-track and one term (non-tenure-track) Assistant Professor position to expand its program in Cognitive Experimental Psychology. One position is in the area of Attention and the second is in Social Cognition or Cognitive Development. The positions call for candidates with an active research program who are effective teachers at both the graduate and undergraduate levels. Candidates must be prepared to teach a variety of undergraduate courses. Send a CV and three letters of recommendation to Professor Harold I. Siegel, Acting Chair, Department of Psychology-Cognitive Search, Rutgers University, Newark, NJ 07102. ----- End Included Message ----- From juergen at idsia.ch Wed Dec 6 04:39:11 1995 From: juergen at idsia.ch (Juergen Schmidhuber) Date: Wed, 6 Dec 95 10:39:11 +0100 Subject: Non-randomness is no panacea. Message-ID: <9512060939.AA02202@fava.idsia.ch> In response to David's response dated Mon, 4 Dec 95: I wrote ``Loosely speaking, in a universe whose history is compressible, we may expect to generalize well.''. To make this more precise, let us consider a very simple 1-bit universe --- suppose the problem is to extrapolate a sequence of symbols (bits, without loss of generality). We have already observed a bitstring s and would like to predict the next bit. Let si denote the event ``s is followed by symbol i'' for i in {0,1}. David is absolutely right to remind us that we need a prior before applying Bayes. And he is right to point out that only if we have information concerning the prior can we exploit it to generalize well.
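One way to make this concrete in the 1-bit universe just described: prefer the continuation of s with the shorter compressed description, i.e. a prior biased toward regularity. The sketch below is my own illustration, not from the posting, and uses zlib output length as a crude (and imperfect) stand-in for Kolmogorov complexity.

```python
import zlib

# Editorial sketch: predict the continuation si of bitstring s whose
# description is shorter, i.e. the lower-"complexity" continuation.
# zlib compressed length is only a rough proxy for Kolmogorov complexity
# K(si); it merely captures the bias toward compressible histories.
def predict_next_bit(s: str) -> str:
    def cost(t: str) -> int:
        return len(zlib.compress(t.encode(), 9))
    # ties are broken toward "0"; with a true universal prior we would
    # compare the a priori probabilities P(s0) and P(s1) instead
    return "0" if cost(s + "0") <= cost(s + "1") else "1"

# regular histories are extrapolated by their regularity:
print(predict_next_bit("0" * 2000))    # a constant universe
print(predict_next_bit("01" * 1000))   # a periodic universe
```

For an incompressible (noisy) s, both continuations cost about the same and the prediction carries essentially no information, matching the point that the prior only helps in regular universes.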
In the context of the present discussion, however, an interesting point is: there is a special prior that is biased towards *arbitrary* compressibility/structure/regularity. Following Solomonoff/Levin/Chaitin/Li&Vitanyi, define P(s), the a priori probability of a bitstring s, as the probability of guessing a (halting) program that computes s on a universal Turing machine U. Here, the way of guessing is defined by the following procedure: initially, the input tape consists of a single square. Whenever the scanning head of the input tape shifts to the right, do: (1) Append a new square. (2) With probability 1/2 fill it with a 0; with probability 1/2 fill it with a 1. Bayes tells us P(s0|s) = P(s|s0)P(s0)/P(s) = P(s0)/P(s), since P(s|s0) = 1; likewise P(s1|s) = P(s1)/P(s). We are going to predict ``the next bit will be 0'' if P(s0) > P(s1), and vice versa. Due to the coding theorem (Levin 74, Chaitin 75), P(si) = O((1/2)^K(si)) for i in {0,1}, where K(x) denotes the Kolmogorov complexity of x; so the continuation with lower Kolmogorov complexity will (in general) be more likely. If s is ``noisy'' then this will be reflected by its relatively high Kolmogorov complexity. I am not saying anything new here. I'd just like to point out that if you know nothing about your universe except that it is regular in some way, then P is of interest. Sadly, most possible universes are completely irregular and incompressible. But for the few (but infinitely many) that are not, P is a prior to consider (at least if we don't care about computing time and constant factors). Perhaps there are too many threads in the current discussion. I'll shut up for a while. Juergen Schmidhuber IDSIA From goldfarb at unb.ca Wed Dec 6 15:54:00 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Wed, 6 Dec 1995 16:54:00 -0400 (AST) Subject: Compressibility and Generalization In-Reply-To: <9512051750.AA00953@fava.idsia.ch> Message-ID: On Tue, 5 Dec 1995, Juergen Schmidhuber wrote: > ``compressibility of the history of a universe''.
> > There are a few compressible or ``regular'' universes, > however. To use ML terminology, some of them allow for > ``generalization by analogy''. Some of them allow for > ``generalization by chunking''. Some of them allow for > ``generalization by exploiting invariants''. Etc. It > would be nice to have a method that can generalize well > in *arbitrary* regular universes. For a proposal on how to formally capture the concept of an "arbitrary regular universe" for the purposes of inductive learning (and generalization), i.e. the concept of a "combinative" representation in a universe, see the two references below as well as the original two papers published in Pattern Recognition (and mentioned in each of the two references). The structure of objects in the universe was discussed on the INDUCTIVE list. It appears that the concept of a "symbolic" representation has to be formalized first (via the concept of a transformation system), and the fundamentally new concept of *inductive class structure*, not present in other ML models, becomes of critical importance. The issue of dynamic object representation, so conspicuously (and not surprisingly) absent from the ongoing (classical) "statistical" discussion of inductive learning, is also brought to the fore. 1. L. Goldfarb and S. Nigam, The unified learning paradigm: A foundation for AI, in V. Honavar and L. Uhr, eds., Artificial Intelligence and Neural Networks: Steps toward Principled Integration, Academic Press, 1994. 2. L. Goldfarb, J. Abela, V.C. Bhavsar, V.N. Kamat, Can a vector space based learning model discover inductive class generalization in a symbolic environment? Pattern Recognition Letters 16, 719-726, 1995.
-- Lev Goldfarb From N.Sharkey at dcs.shef.ac.uk Thu Dec 7 07:24:09 1995 From: N.Sharkey at dcs.shef.ac.uk (N.Sharkey@dcs.shef.ac.uk) Date: Thu, 7 Dec 95 12:24:09 GMT Subject: CALL FOR ROBOTICS PAPERS Message-ID: <9512071224.AA11298@entropy.dcs.shef.ac.uk> CALL FOR PAPERS ** LEARNING IN ROBOTS AND ANIMALS ** An AISB-96 two-day workshop University of Sussex, Brighton, UK: April 1st & 2nd, 1996 Co-Sponsored by IEE Professional Group C4 (Artificial Intelligence) WORKSHOP ORGANISERS: Noel Sharkey (chair), University of Sheffield, UK. Gillian Hayes, University of Edinburgh, UK. Jan Heemskerk, University of Sheffield, UK. Tony Prescott, University of Sheffield, UK. PROGRAMME COMMITTEE: Dave Cliff, UK. Marco Dorigo, Italy. Frans Groen, Netherlands. John Hallam, UK. John Mayhew, UK. Martin Nillson, Sweden. Claude Touzet, France. Barbara Webb, UK. Uwe Zimmer, Germany. Maja Mataric, USA. For Registration Information: alisonw at cogs.susx.ac.uk In the last five years there has been an explosion of research on Neural Networks and Robotics from both a self-learning and an evolutionary perspective. Within this movement there is also a growing interest in natural adaptive systems as a source of ideas for the design of robots, while robots are beginning to be seen as an effective means of evaluating theories of animal learning and behaviour. A fascinating interchange of ideas has begun between a number of hitherto disparate areas of research, and a shared science of adaptive autonomous agents is emerging. This two-day workshop proposes to bring together an international group both to present papers on their most recent research and to discuss the direction of this emerging field. WORKSHOP FORMAT: The workshop will consist of half-hour presentations with at least 15 minutes being allowed for discussion at the end of each presentation. Short videos of mobile robot systems may be included in presentations. Proposals for robot demonstrations are also welcome.
Please contact the workshop organisers if you are considering bringing a robot, as some local assistance can be arranged. The workshop format may change once the number of accepted papers is known; in particular, there may be some poster presentations. WORKSHOP CONTRIBUTIONS: Contributions are sought from researchers in any field with an interest in the issues outlined above. Areas of particular interest include the following: * Reinforcement, supervised, and imitation learning methods for autonomous robots * Evolutionary methods for robotics * The development of modular architectures and reusable representations * Computational models of animal learning with relevance to robots, robot control systems modelled on animal behaviour * Reviews or position papers on learning in autonomous agents Papers will ideally emphasise real-world problems, robot implementations, or show clear relevance to the understanding of learning in both natural and artificial systems. Papers should not exceed 5000 words in length. Please submit four hard copies to the Workshop Chair (address below) by 30th January, 1996. All papers will be refereed by the Workshop Committee and other specialists. Authors of accepted papers will be notified by 24th February. Final versions of accepted papers must be submitted by 10th March, 1996. A collated set of workshop papers will be distributed to workshop attendees. We are currently negotiating to publish the workshop proceedings as a book. SUBMISSIONS TO: Noel Sharkey Department of Computer Science Regent Court University of Sheffield S1 4DP, Sheffield, UK email: n.sharkey at dcs.sheffield.ac.uk For further information about AISB96 ftp ftp.cogs.susx.ac.uk, login as anonymous, cd pub/aisb/aisb96 From mkearns at research.att.com Thu Dec 7 13:39:00 1995 From: mkearns at research.att.com (Michael J.
Kearns) Date: Thu, 7 Dec 95 13:39 EST Subject: COLT 96 Call for Papers, ASCII Message-ID: ______________________________________________________________________ CALL FOR PAPERS---COLT '96 Ninth Conference on Computational Learning Theory Desenzano del Garda, Italy June 28 -- July 1, 1996 ______________________________________________________________________ The Ninth Conference on Computational Learning Theory (COLT '96) will be held in the town of Desenzano del Garda, Italy, from Friday, June 28, through Monday, July 1, 1996. COLT '96 is sponsored by the Universita` degli Studi di Milano. We invite papers in all areas that relate directly to the analysis of learning algorithms and the theory of machine learning, including neural networks, statistics, statistical physics, Bayesian/MDL estimation, reinforcement learning, inductive inference, knowledge discovery in databases, robotics, and pattern recognition. We also encourage the submission of papers describing experimental results that are supported by theoretical analysis. ABSTRACT SUBMISSION. Authors should submit fifteen copies (preferably two-sided) of an extended abstract to: Michael Kearns --- COLT '96 AT&T Bell Laboratories, Room 2A-423 600 Mountain Avenue Murray Hill, New Jersey 07974-0636 Telephone(for overnight mail): (908) 582-4017 Abstracts must be RECEIVED by FRIDAY JANUARY 12, 1996. This deadline is firm. We are also allowing electronic submissions as an alternative to submitting hardcopy. Instructions for how to submit papers electronically can be obtained by sending email to colt96 at cs.cmu.edu with subject "help", or from our web site: http://www.cs.cmu.edu/~avrim/colt96.html which will also be used to provide other program-related information. Authors will be notified of acceptance or rejection on or before Friday, March 15, 1996. Final camera-ready papers will be due by Friday, April 5. 
Papers that have appeared in journals or other conferences, or that are being submitted to other conferences, are not appropriate for submission to COLT. An exception to this policy is that COLT and STOC have agreed that a paper can be submitted to both conferences, with the understanding that a paper will be automatically withdrawn from COLT if accepted to STOC. ABSTRACT FORMAT. The extended abstract should include a clear definition of the theoretical model used and a clear description of the results, as well as a discussion of their significance, including comparison to other work. Proofs or proof sketches should be included. If the abstract exceeds 10 pages, only the first 10 pages may be examined. A cover letter specifying the contact author and his or her email address should accompany the abstract. PROGRAM FORMAT. At the discretion of the program committee, the program may consist of both long and short talks, corresponding to longer and shorter papers in the proceedings. The short talks will also be coupled with a poster presentation. PROGRAM CHAIRS. Avrim Blum (Carnegie Mellon University) and Michael Kearns (AT&T Bell Laboratories). CONFERENCE AND LOCAL ARRANGEMENTS CHAIRS. Nicolo` Cesa-Bianchi (Universita` di Milano) and Giancarlo Mauri (Universita` di Milano). PROGRAM COMMITTEE. Martin Anthony (London School of Economics), Avrim Blum (Carnegie Mellon University), Bill Gasarch (University of Maryland), Lisa Hellerstein (Northwestern University), Robert Holte (University of Ottawa), Sanjay Jain (National University of Singapore), Michael Kearns (AT&T Bell Laboratories), Nick Littlestone (NEC Research Institute), Yishay Mansour (Tel Aviv University), Steve Omohundro (NEC Research Institute), Manfred Opper (University of Wuerzburg), Lenny Pitt (University of Illinois), Dana Ron (Massachusetts Institute of Technology), Rich Sutton (University of Massachusetts) COLT, ML, AND EUROCOLT. 
The Thirteenth International Conference on Machine Learning (ML '96) will be held right after COLT '96, on July 3--7 in Bari, Italy. In cooperation with COLT, the EuroCOLT conference will not be held in 1996. STUDENT TRAVEL. We anticipate some funds will be available to partially support travel by student authors. Details will be distributed as they become available. From hicks at cs.titech.ac.jp Thu Dec 7 19:49:53 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Fri, 8 Dec 1995 09:49:53 +0900 Subject: compressibility and generalization In-Reply-To: William Finnoff's message of Thu, 7 Dec 95 15:55:52 MST <9512072255.AA25329@predict.com> Message-ID: <199512080049.JAA10560@euclid.cs.titech.ac.jp> finnoff at predict.com (William Finnoff) wrote: >Reading some of the recent postings concerning NFL theorems, it appears >that there are still some misunderstandings about what they refer to in >the versions dealing with statistical inference. For example, Craig >Hicks writes: >> (paraphrase: I want to clarify the meaning of the following assertion) >> (A) cross-validation works as well as anti-cross >> validation (paraphrase: on average) finnoff at predict.com (William Finnoff) continued: >An example of this >would be the case of a two by two contingency table >where the inputs are, say, 0=patient received treatment A, >1=patient received treatment B, and values of the dependent variable >are 0=patient died within three months, or 1=patient still alive >after three months. ... Using the example given above, this corresponds >to cases where the training data contains no examples >of a patient receiving one of the treatments (for example, where >the training data only contains examples of patients >that have received treatment A). Since there is no data for treatment B, how can we use cross-validation? In this case statement (A) above is not wrong, but it is implicitly occurring within a context where there is no data to use for cross-validation.
If so, isn't it rather a trivial statement? Possibly misleading? finnoff at predict.com (William Finnoff) continued: >The NFL theorems state that in this case, unless there is some other prior >information available about the performance of treatment B in keeping patients >alive, all predictions are equivalent in their average expected performance. I certainly wouldn't expect cross-validation to work when it can't even be used. And I think it would work just as well as anti-cross validation, whatever that is, when anti-cross validation is also not being used. In fact, both would score `0', not only on average, but every time, since they are not being used. ---- After further study and reading postings to this list, my current understanding is that (A) merely means that for any problem (cross-validation >= 0), in the sense that it will never be deceptive (never < 0), taking the average across the ensemble of samplings. However, by taking a straight average over a certain infinite (and arguably universal) ensemble of problems, we can obtain Expectation[cross-validation] = 0, because in this ensemble the positive-scoring problems are an infinitely small proportion. This is exciting, because in our universe at the present time evidently Expectation[cross-validation] > 0, which implies a non-uniform prior over the ensemble of problems. Or are we just choosing our problems unfairly? And if so, what algorithm are we using (or is using us) to choose them? Craig Hicks hicks at cs.titech.ac.jp Ogawa Laboratory, Dept. of Computer Science Tokyo Institute of Technology, Tokyo, Japan PS. I do not claim to be clear on all the issues, or to be free from misunderstandings by any means. PPS. What is anti-cross validation? From WALTSCH at vms.cis.pitt.edu Thu Dec 7 22:27:49 1995 From: WALTSCH at vms.cis.pitt.edu (WALTSCH@vms.cis.pitt.edu) Date: Thu, 07 Dec 1995 23:27:49 -0400 (EDT) Subject: Faculty position in Cognitive Neuroscience Univ.
of Pittsburgh Message-ID: <01HYJKVPQW36AM35MW@vms.cis.pitt.edu> ********Faculty Opening in Cognitive Neuroscience************* The Department of Psychology at the University of Pittsburgh seeks a faculty member at the assistant professor level who studies human cognitive neuroscience. The faculty member must have a strong empirical background, a program of research that brings together neuroscience and behavioral techniques, and an interest in graduate and undergraduate teaching in this area. Candidates are likely to become affiliated with the Center for the Neural Basis of Cognition between the University of Pittsburgh and Carnegie Mellon University. For additional information, see http://neurocog.lrdc.pitt.edu/search Applications should be sent to: Cognitive Neuroscience Search 455 Langley Hall Psychology Department University of Pittsburgh PGH PA 15260. Applications should include: 1. a statement of research and teaching interests 2. a CV 3. copies of selected publications 4. three letters of reference. Initial consideration will begin January 15, 1996, though applications arriving after that date may be considered. The University of Pittsburgh is an Equal Opportunity/Affirmative Action Employer. Women and minority candidates are especially encouraged to apply. From esann at dice.ucl.ac.be Fri Dec 8 12:39:48 1995 From: esann at dice.ucl.ac.be (esann@dice.ucl.ac.be) Date: Fri, 8 Dec 1995 18:39:48 +0100 Subject: ESANN extended deadline Message-ID: <199512081737.SAA18067@ns1.dice.ucl.ac.be> Dear Colleagues, The deadline to submit papers to the ESANN'96 conference (the 4th European Symposium on Artificial Neural Networks, which will be held in Bruges, Belgium, on April 24-26, 1996) was December 8th, 1995 (today!), as announced in the call for papers.
However, as you know, there are important strikes in France and in other countries, and many of you have had problems meeting this deadline because of the post office strike (it is even worse because of the airport strike in Belgium...). So we are pleased to announce that we will accept submission of papers until Friday, December 15th, 1995 (so next Friday!). Please however ensure that the printed copies (no e-mail or fax please) will reach the conference secretariat (see address below), together with the required information (as described in the call for papers), before this date. Please use private mail delivery services if necessary, and don't forget that in most countries Chronopost is NOT a private mail service (for example, because of the strike, the French Chronopost service was not working this week...), while DHL, TNT Mailfast and other companies are private services, and so could be more efficient in the next few days... If you still have problems meeting the new deadline, please contact me personally at the following e-mail address: esann at dice.ucl.ac.be and we will try to arrange another way to transfer your paper. Please feel free to contact me if you need any other information about the submission of papers.
Sincerely yours, Michel Verleysen _____________________________ D facto publications - conference services 45 rue Masui 1210 Brussels Belgium tel: +32 2 245 43 63 fax: +32 2 245 46 94 _____________________________ From giles at research.nj.nec.com Fri Dec 8 14:18:39 1995 From: giles at research.nj.nec.com (Lee Giles) Date: Fri, 8 Dec 95 14:18:39 EST Subject: reprint available Message-ID: <9512081918.AA20599@alta> The following conference paper, published in the 2nd International IEEE Conference on "Massively Parallel Processing Using Optical Interconnections," October, 1995, is now available via the NEC Research Institute archive: ____________________________________________________________________________________ "Predictive Control of Opto-Electronic Reconfigurable Interconnection Networks Using Neural Networks" Majd F. Sakr[1,2], Steven P. Levitan[2], C. Lee Giles[1,3], Bill G. Horne[1], Marco Maggini[4], Donald M. Chiarulli[5] [1] NEC Research Institute, 4 Independence Way, Princeton, NJ 08540 [2] Electrical Engineering Department, U. of Pittsburgh, Pittsburgh, PA 15261 [3] UMIACS, U. of Maryland, College Park, MD 20742 [4] Universita` di Firenze, Dipartimento di Sistemi e Informatica, 50139 Firenze, Italy [5] Computer Science Department, U. of Pittsburgh, Pittsburgh, PA 15260 Abstract Opto-electronic reconfigurable interconnection networks are limited by significant control latency when used in large multiprocessor systems. This latency is the time required to analyze the current traffic and reconfigure the network to establish the required paths. The goal of latency hiding is to minimize the effect of this control overhead. In this paper, we introduce a technique that performs latency hiding by learning the patterns of communication traffic and using that information to anticipate the need for communication paths. Hence, the network provides the required communication paths before a request for a path is made.
In this study, the communication patterns (memory accesses) of a parallel program are used as input to a time delay neural network (TDNN) to perform on-line training and prediction. These predicted communication patterns are used by the interconnection network controller that provides routes for the memory requests. Based on our experiments, the neural network was able to learn highly repetitive communication patterns, and was thus able to predict the allocation of communication paths, resulting in a reduction of communication latency. ------------------------------------------------------------------------------ http://www.neci.nj.nec.com/homepages/giles.html ftp://external.nj.nec.com/pub/giles/papers/MPPOI.95.ps.Z ------------------------------------------------------------------------------ -- C. Lee Giles / Computer Sciences / NEC Research Institute / 4 Independence Way / Princeton, NJ 08540, USA / 609-951-2642 / Fax 2482 http://www.neci.nj.nec.com/homepages/giles.html == From mablume at sdcc10.ucsd.edu Fri Dec 8 17:03:18 1995 From: mablume at sdcc10.ucsd.edu (Matthias Blume) Date: Fri, 8 Dec 1995 14:03:18 -0800 (PST) Subject: Fuzzy ART architecture papers online Message-ID: <199512082203.OAA06153@e3329-4.ucsd.edu> Dear Connectionists, Two papers describing a simple and efficient architecture for Fuzzy ART and Fuzzy ARTMAP are now available online. (Sorry, hardcopies are not available.) ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, An efficient mapping of Fuzzy ART onto a neural architecture (5 pages), submitted to Neural Networks. A novel mapping of the Fuzzy ART algorithm onto a neural network architecture is described. The architecture does not utilize bi-directional synapses, weight transport, or weight duplication, and requires one fewer layer of processing elements than the architecture originally proposed by Carpenter, Grossberg, & Rosen (1991). 
In the new architecture, execution of the algorithm takes constant time per input vector regardless of the relationship between the input and existing templates, and several control signals are eliminated. This mapping facilitates hardware implementation of Fuzzy ART and furthermore serves as a tool for envisioning and understanding the algorithm. Keywords: Fuzzy ART, Fuzzy ARTMAP, parallel hardware, neural architecture. ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.fam_arch.ps.Z http://icse1.ucsd.edu/~mablume/nnletter.ps ------------------------------------------------------------------------------ Matthias Blume and Sadik C. Esener, Optoelectronic Fuzzy ARTMAP processor, Optical Computing, Vol. 10, 1995 OSA Technical Digest Series (Optical Society of America, Washington, DC, 1995), p. 213-215, March 1995. The Fuzzy ARTMAP algorithm can perform well even with weights truncated to 4 bits during training. Furthermore, only the weights corresponding to one processing element are updated after each training sample. Finally, it converges rapidly and relatively uniformly with little dependence on the particular choice of adjustable parameter values and initial state. These characteristics are particularly advantageous for parallel optoelectronic implementations. We map Fuzzy ARTMAP onto an architecture which satisfies the constraints of the hardware, and suggest an implementation which is an appropriate combination of optical and electronic technology. The proposed mapping of the algorithm onto a neural architecture is efficient, requiring only an input layer and one processing layer per fuzzy ART module, and requiring neither weight transport nor multiple copies of weights. The proposed optoelectronic system is simple, yet versatile, and relies on proven components. Keywords: Parallel optoelectronic hardware, Fuzzy ART, neural architecture. 
ftp://archive.cis.ohio-state.edu/pub/neuroprose/blume.oe_fam.ps.Z http://icse1.ucsd.edu/~mablume/OSA95.ps ------------------------------------------------------------------------------ - Matthias Blume ECE department, UCSD matthias at ucsd.edu http://icse1.ucsd.edu/~mablume From mpp at watson.ibm.com Fri Dec 8 19:27:29 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Fri, 8 Dec 1995 19:27:29 -0500 (EST) Subject: NFL Summary Message-ID: <9512090027.AA26165@austen.watson.ibm.com> Hi Everyone, There has been a lot of confusion regarding the "No Free Lunch" theorems. Below, I try to summarize what I feel to be the key points. NFL in a Nutshell: ------------------ If you make no assumptions about the target function, then on average all learning algorithms will have the same generalization performance. Apparent Contradiction and Resolution: -------------------------------------- Contradiction: Lots of theoretical results regarding generalization claim to make no assumptions about the target function. Resolution: These theoretical results DO make assumptions (which may or may not be explicit) regarding the target. Importance of NFL: ------------------ The NFL result in and of itself is not terribly interesting, because its assumption (that we make no assumptions) is NEVER true. What makes NFL important is that it emphasizes in a very striking way that it is the ASSUMPTIONS that we make about our learning domains that MAKE ALL THE DIFFERENCE. Therefore, I see NFL *NOT* as a criticism of theoretical generalization results, but rather as a call to examine the assumptions underlying these results, because it is there that we can potentially learn the most about machine learning. Examples of Unstated Assumptions: --------------------------------- In practice, there are numerous assumptions that we as a community usually make when we attempt to learn a task using our favorite algorithm. Below, I list just a few obvious ones. 1) The training and testing data are IID.
2) The data distribution is "smooth" (i.e. "near" data points are in general more similar than "far" data points). This can also be interpreted as some differentiability conditions. 3) NN's approximate real-world functions reasonably well. 4) Starting with small initial weights is good. 5) Overfitting is bad - early stopping is good. 6) Gaussian error models are the best thing since machine sliced bread. REALLY INTERESTING STUFF: ------------------------- I think that the NFL results point towards what I feel are extremely interesting research topics: Exactly what are the assumptions that certain theoretical results require? Exactly how do these assumptions affect generalization? Which assumptions are necessary/sufficient? How do different assumptions compare? Can we identify a set of assumptions that are equivalent to the assumption that CV model selection improves generalization? Can we do the same for early stopping? Bagging? (You can be damn sure I can do this for averaging... :-) Etc, etc, ... Caveat: ------- All of the above is conditioned on the assumption that David Wolpert did his math correctly when deriving the NFL theorems... :-) I hope all of this helps clear things up. Comments? Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From jlm at crab.psy.cmu.edu Sat Dec 9 17:35:01 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Sat, 9 Dec 95 17:35:01 EST Subject: TR Announcement Message-ID: <9512092235.AA21814@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available both electronically from our own FTP server and in hard copy form. Instructions for obtaining copies may be found at the end of this post.
======================================================================== Stochastic Interactive Processing, Channel Separability, and Optimal Perceptual Inference: An Examination of Morton's Law Javier R. Movellan & James L. McClelland Technical Report PDP.CNS.95.4 December 1995 In this paper we examine a regularity found in human perception, called Morton's law, in which stimulus and context have independent influences on perception. This regularity has been used in the past to argue that perception is a feed-forward, non-interactive process. Building on earlier work by McClelland (Cognitive Psychology, 1991) we illustrate how Morton's law may emerge from stochastic interactions between simple processing units. To this end we consider the properties of interactive diffusion networks, the continuous stochastic limit of standard artificial neural models. If, as we believe, human information processing involves using noisy processing elements to process potentially noisy inputs, such models may ultimately serve as foundations for a theory of human information processing. We show that Morton's law emerges in recurrent diffusion networks when the units are organized into separable channels; feed-forward processing is not a necessary condition for Morton's law to hold. Failures to exhibit Morton's law provide evidence that the information channels are not separable. This result can be used to analyze cognitive models as well as actual brain structures. Finally, we illustrate how diffusion networks can be organized to implement optimal Bayesian perceptual inference. ======================================================================= Retrieval information for pdp.cns TRs: unix> ftp 128.2.248.152 # hydra.psy.cmu.edu Name: anonymous Password: ftp> cd pub/pdp.cns ftp> binary ftp> get pdp.cns.95.4.ps.Z # gets this tr ftp> quit unix> zcat pdp.cns.95.4.ps.Z | lpr # or however you print postscript NOTE: The compressed file is 255910 bytes long.
Uncompressed, the file is 727359 bytes long. The printed version is 66 total pages long. For those who do not have FTP access, physical copies can be requested from Barbara Dorney. For a list of available PDP.CNS Technical Reports: > get README For the titles and abstracts: > get ABSTRACTS From hicks at cs.titech.ac.jp Sun Dec 10 09:24:29 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Sun, 10 Dec 1995 23:24:29 +0900 Subject: NFL Summary In-Reply-To: Michael Perrone's message of Fri, 8 Dec 1995 19:27:29 -0500 (EST) <9512090027.AA26165@austen.watson.ibm.com> Message-ID: <199512101424.XAA13664@euclid.cs.titech.ac.jp> Michael Perrone writes: > I think that the NFL results point towards what I feel are extremely > interesting research topics: > ... > Can we identify a set of assumptions that are equivalent to the > assumption that CV model selection improves generalization? CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B. This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. (A quick note: Sometimes it is advocated that a complexity parameter be set by splitting the data set into training and testing, and using CV. Then with the complexity parameter fixed the whole data set can be used to train the other parameters. Behind this is an ASSUMPTION about the independence of the complexity from the other parameters. Of course it often works in practice, but it violates the principle in the above paragraph, so I do not count this as real CV here.) Two prerequisites exist to obtain a GAIN with CV: 1) The objective function must be "compressible". I.e., it cannot be noise.
2) We must have a model which can recognize the structure in the data. This structure might be quite hard to see, as in chaotic signals. I think NFL says that on average CV will not obtain GAINful results, because the chance that a randomly selected problem and a randomly selected algorithm will hit it off is vanishingly small. (Or even any fixed problem and a randomly selected algorithm.) But I think it tells us something more important as well. It tells us that not using CV means we are always implicitly trusting our a priori knowledge. Any reasonable learning algorithm can always predict the training data, or a "smoothed" version of it. But because of the NFL theorem, this, over the ensemble of all algorithms and problems, means nothing. On average there will be no improvement in the off training set error. Fortunately, CV will report this fact by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set data sets; CV may be deceptive for a single off training set data set.) Thus, we shouldn't think we can do away with CV unless we admit to having great faith in our prior. Going back to NFL, I think it poses another very interesting problem: Supposing we have "a foot in the door". That is, an algorithm which makes some sense of the data by showing some degree of prediction capability. Can we always use this prediction ability to gain better prediction ability? Is there some kind of ability to perform something like steepest descent over the space of algorithms, ONCE we are started on a slope? Is there a provable snowball effect? I think NFL reminds us that we are already rolling down the hill, and we shouldn't think otherwise.
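The averaging claim above can be checked with a toy computation. This is only a sketch: the 4-point domain, the training/test split, and the two toy generalizers ("majority" and its perverse twin) are hypothetical choices for illustration, not anything from the NFL papers themselves.

```python
from itertools import product

# Average off-training-set (OTS) error over ALL binary targets on a
# tiny 4-point domain.  Training inputs are {0,1}, test inputs {2,3}.
X = [0, 1, 2, 3]
train_x, test_x = [0, 1], [2, 3]

def majority(train_y):
    """Predict the most common training label at every new point."""
    return 1 if sum(train_y) * 2 >= len(train_y) else 0

def anti_majority(train_y):
    """Deliberately predict the opposite of the majority label."""
    return 1 - majority(train_y)

def avg_ots_error(learner):
    errs = []
    for f in product([0, 1], repeat=len(X)):   # all 16 targets f: X -> {0,1}
        train_y = [f[x] for x in train_x]      # what the learner sees
        guess = learner(train_y)               # its constant OTS guess
        errs.append(sum(guess != f[x] for x in test_x) / len(test_x))
    return sum(errs) / len(errs)

# Averaged over every possible target, the "sensible" and the
# "perverse" generalizer are indistinguishable: both score 0.5.
print(avg_ots_error(majority), avg_ots_error(anti_majority))  # -> 0.5 0.5
```

The same tie holds for any deterministic learner plugged in here, which is the uniform-prior NFL averaging argument in miniature.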
Craig Hicks Tokyo Institute of Technology From goldfarb at unb.ca Sun Dec 10 10:52:29 1995 From: goldfarb at unb.ca (Lev Goldfarb) Date: Sun, 10 Dec 1995 11:52:29 -0400 (AST) Subject: NFL Summary In-Reply-To: <9512090027.AA26165@austen.watson.ibm.com> Message-ID: On Fri, 8 Dec 1995, Michael Perrone wrote: > NFL in a Nutshell: > ------------------ > If you make no assumptions about the target function [specifically, about the axiomatic structure of the sample space and the inductive generalization, i.e. which ones are the most general for the purpose] Strange as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. Objects in the universe do have structure, especially when they have to be represented, as we have learned from the data types in computer science: TO REPRESENT AN OBJECT IS TO MAKE SOME ASSUMPTIONS ABOUT THE OPERATIONS RELATED TO ITS MANIPULATION. Cheers, Lev Goldfarb From XIAODONG at rivendell.otago.ac.nz Sun Dec 10 20:46:21 1995 From: XIAODONG at rivendell.otago.ac.nz (Xiaodong Li, Otago University, New Zealand) Date: Mon, 11 Dec 1995 14:46:21 +1300 Subject: Paper available "Connectionist Model Based on an Optical Thin-Film Model" Message-ID: <01HYONVDU5GYLBVSXM@rivendell.otago.ac.nz> FTP-host: archive.cis.ohio-state.edu FTP-filename:/pub/neuroprose/xli.thinfilm.ps.Z The file xli.thinfilm.ps.Z is now available for ftp from the Neuroprose repository. Connectionist Learning Using an Optical Thin-Film Model (4 pages) Martin Purvis and Xiaodong Li Computer and Information Science University of Otago Dunedin, New Zealand ABSTRACT: An alternative connectionist architecture to the one based on the neuroanatomy of biological organisms is described. The proposed architecture is based on an optical thin-film multilayer model, with the thicknesses of thin-film layers serving as adjustable 'weights' for the computation.
Inputs are encoded into the corresponding refractive indices of individual thin-film layers, while the outputs are typically measured by the overall reflection coefficients off the thin-film layers, at different wavelengths. The nature of the model and some example calculations (a pattern recognition task and classification of the iris data set) that exhibit behaviour typical of conventional connectionist architectures are described. This model has also been used in solving the XOR and 16 four-bit parity problems, and it has demonstrated comparable performance to that of a conventional feed-forward neural network model using Back-propagation learning. This paper is also available in the proceedings of the Second New Zealand International Two-Stream Conference on Artificial Neural Networks and Expert Systems (ANNES'95), IEEE Computer Society Press, Los Alamitos, California, 1995, pp. 63-66. Comments are greatly appreciated. -- Xiaodong Li Email: Xiaodong at otago.ac.nz Http: http://divcom.otago.ac.nz:800/COM/INFOSCI/SECML/xdli/xiao.htm (Postscript file of this paper is also available here at my homepage) From prechelt at ira.uka.de Mon Dec 11 07:11:32 1995 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Mon, 11 Dec 1995 13:11:32 +0100 Subject: NN Benchmarking WWW homepage Message-ID: <"iraun1.ira.487:11.12.95.12.12.22"@ira.uka.de> The homepage of the very successful NIPS*95 workshop on benchmarking has now been converted into a repository for information about benchmarking issues: Status quo, methodology, facilities, and related info. I kindly ask everybody who has additional information that should be on the page (in particular sources or potential sources of learning data of all kinds) to submit that information to me. Other comments are also welcome. The URL is http://wwwipd.ira.uka.de/~prechelt/NIPS_bench.html The page is also still reachable via the benchmarking workshop link on the NIPS*95 homepage. Below is a textual version of the page.
Lutz Lutz Prechelt (http://wwwipd.ira.uka.de/~prechelt/) | Whenever you Institut f. Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get (Phone: +49/721/608-4068, FAX: +49/721/694092) | less simple. =============================================== Benchmarking of learning algorithms information repository page Abstract: Proper benchmarking of (neural network and other) learning architectures is a prerequisite for orderly progress in this field. In many published papers deficiencies can be observed in the benchmarking that is performed. A workshop about NN benchmarking at NIPS*95 addressed the status quo of benchmarking, common errors and how to avoid them, currently existing benchmark collections, and, most prominently, a new benchmarking facility including a results database. This page contains pointers to written versions or slides of most of the talks given at the workshop plus some related material. The page is intended to be a repository for such information to be used as a reference by researchers in the field. Note that most links lead to Postscript documents. Please send any additions or corrections you might have to Lutz Prechelt (prechelt at ira.uka.de). Workshop Chairs: Thomas G. Dietterich, Geoffrey Hinton, Wolfgang Maass, Lutz Prechelt [communicating chair], Terry Sejnowski Assessment of the status quo: * Lutz Prechelt. A quantitative study of current benchmarking practices. A quantitative survey of 400 journal articles from 1993 and 1994 on NN algorithms. Most articles used far too few problems during benchmarking. * Arthur Flexer. Statistical Evaluation of Neural Network Experiments: Minimum Requirements and Current Practice. Argues that what is reported about the benchmarks, and how it is reported, is insufficient. Methodology: * Tom Dietterich.
Experimental Methodology Benchmarking types, correct statistical testing, synthetic versus real-world data, understanding via algorithm mutation or data mutation, data generators. * Lutz Prechelt. Some notes on neural learning algorithm benchmarking. A few general remarks about volume, validity, reproducibility, and comparability of benchmarking; DOs and DON'Ts. * Brian Ripley. What can we learn from the study of the design of experiments? (Only two slides, though). * Brian Ripley. Statistical Ideas for Selecting Network Architectures. (Also somewhat related to benchmarking.) Benchmarking facilities: * Previously available NN benchmarking data collections CMU nnbench, UCI machine learning databases archive, Proben1, StatLog data, ELENA data. Advantages of these: UCI is large and growing and popular, Statlog has the largest and most orderly collection of results available (in a book, though), and Proben1 is the easiest to use and best supports reproducible experiments. Elena and nnbench have no particular advantages. Disadvantages: UCI and Proben1 have too few and too unstructured results available, Proben1 is also inflexible and small, Statlog is partially confidential and neither data nor results collection are growing. * Carl Rasmussen and Geoffrey Hinton. DELVE: A thoroughly designed benchmark collection A proposal of data, terminology, and procedures and a facility for the collection of benchmarking results. This is the newly proposed standard for benchmarking NN (and other) learning algorithms. DELVE is currently still under construction at the University of Toronto. Other sources of data: (Thanks to Nici Schraudolph) There is a large amount of game data about the board game Go available on the net. One starting point is here. Others are the Go game database project, and the Go game server. The database holds several hundred thousand games of Go and could for instance be used for advanced reinforcement learning projects.
Last correction: 1995/12/11 Please send additions and corrections to Lutz Prechelt, prechelt at ira.uka.de. To NIPS homepage. To original homepage of this workshop. From mpp at watson.ibm.com Mon Dec 11 08:42:59 1995 From: mpp at watson.ibm.com (Michael Perrone) Date: Mon, 11 Dec 1995 08:42:59 -0500 (EST) Subject: compressibility and generalization In-Reply-To: <199512080049.JAA10560@euclid.cs.titech.ac.jp> from "hicks@cs.titech.ac.jp" at Dec 8, 95 09:49:53 am Message-ID: <9512111342.AA25646@austen.watson.ibm.com> [hicks at cs.titech.ac.jp wrote:] > PSS. What is anti-cross validation? Suppose we are given a set of functions and a crossvalidation data set. The CV and Anti-CV algorithms are as follows: CV: Choose the function with the best performance on the CV set. Anti-CV: Choose the function with the worst performance on the CV set. (And for this year's NIPS motif: Anti-EM: Dorothy? Dorothy? :-) Regards, Michael ------------------------------------------------------------------------- Michael P. Perrone 914-945-1779 (office) IBM - Thomas J. Watson Research Center 914-945-4010 (fax) P.O. Box 704 / Rm 36-207 914-245-9746 (home) Yorktown Heights, NY 10598 mpp at watson.ibm.com ------------------------------------------------------------------------- From hicks at cs.titech.ac.jp Mon Dec 11 20:01:05 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 12 Dec 1995 10:01:05 +0900 Subject: compressibility and generalization In-Reply-To: "Michael Perrone"'s message of Mon, 11 Dec 1995 08:42:59 -0500 (EST) <9512111342.AA25646@austen.watson.ibm.com> Message-ID: <199512120101.KAA16136@euclid.cs.titech.ac.jp> "Michael Perrone" wrote: >[hicks at cs.titech.ac.jp wrote:] >> PSS. What is anti-cross validation? >Suppose we are given a set of functions and a crossvalidation data set. >The CV and Anti-CV algorithms are as follows: > CV: Choose the function with the best performance on the CV set. 
>Anti-CV: Choose the function with the worst performance on the CV set. case 1: * Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* In this case both Anti-CV and CV provide (ON AVERAGE) equal improvement in prediction ability: none. For that matter so will ANY method of selection. Moreover, if we plot a graph of the number of data used for training vs. the estimated error (using the residual data), we will (ON AVERAGE) see no decrease in estimated error. Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". case 2: * The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function. This is different from the ensemble and distribution of target functions, which is a much bigger question. We already know much about the ensemble of samples from a fixed target function. I am not avoiding the issue of the ensemble or distribution of target functions, but merely showing that we have 2 general cases, and that in both of them CV is never WORSE than anti-CV. It follows that whatever the distribution of targets is, CV is never worse (ON AVERAGE) than anti-CV. I don't believe this contradicts NFL in any way. It just clarifies the role that CV can play. Learning and monitoring prediction error go hand in hand. This is even more true for cases when the underlying function may be changing and the data has the form of an infinite stream.
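Case 2 can be illustrated with a small simulation. Everything in it is a hypothetical choice made for the sketch — the fixed linear target, the four candidate slopes, the noise level, and the sample sizes — but it shows CV and anti-CV selection averaged over many random validation samples of one fixed, structured target.

```python
import random

random.seed(0)

# A FIXED structured target, known only through noisy samples.
def target(x):
    return 2.0 * x + 1.0

# Four candidate models of varying quality (slope a = 2.0 is exact).
candidates = [lambda x, a=a: a * x + 1.0 for a in (0.5, 1.5, 2.0, 3.0)]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def trial(select_worst=False):
    # One noisy "random sampling of prediction ability" (a CV set).
    xs = [random.uniform(-1.0, 1.0) for _ in range(10)]
    ys = [target(x) + random.gauss(0.0, 0.5) for x in xs]
    scores = [mse(m, xs, ys) for m in candidates]
    pick = scores.index(max(scores) if select_worst else min(scores))
    # True generalization error of the chosen model on a noise-free grid.
    grid = [i / 50.0 - 1.0 for i in range(101)]
    return mse(candidates[pick], grid, [target(x) for x in grid])

cv_err = sum(trial() for _ in range(2000)) / 2000
anti_err = sum(trial(select_worst=True) for _ in range(2000)) / 2000
print(cv_err, anti_err)   # CV's average true error is far lower
```

With a structured target and correlated candidates, CV almost always selects the near-correct slope while anti-CV selects one of the extreme ones, so the averages separate cleanly; replacing `target` with pure noise collapses the two averages together, matching case 1.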
Craig Hicks Tokyo Institute of Technology From GIOIELLO at cres.it Mon Dec 11 19:13:43 1995 From: GIOIELLO at cres.it (GIOIELLO) Date: Tue, 12 Dec 1995 01:13:43 +0100 Subject: A neural net based OCR demo for both Windows/DOS and Mac OS is available Message-ID: <01HYP9T0BSPU934ROD@cres.it> Dear Netters, An OCR demo for Mac OS is available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-demo.cpt.hqx A Windows and DOS version is also available at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/demos/OCR-Win.zip This latter version also offers a richer set of capabilities. The OCR is based on a three-layer MLP. Conjugate gradient descent techniques were used to train the net. Training and test sets were those of NIST. The related papers can be found at the following URL: ftp://ftpcsai.diepa.unipa.it/pub/papers/handwritten Several VLSI architectures to implement the OCR device using a digital implementation of the proposed MLP are also described in the papers. An overview of the activities we carry on can be found at the following URL: http://wwwcsai.diepa.unipa.it/research/projects/vlsinn/handcare/handcare.html Best Regards, Giuseppe A. M. Gioiello E-Mail: gioiello at diepa.unipa.it URL: http://wwwcsai.diepa.unipa.it/people/doctors/gioiello/gioiello.html From ernst at kuk.klab.caltech.edu Tue Dec 12 12:02:22 1995 From: ernst at kuk.klab.caltech.edu (Ernst Niebur) Date: 12 Dec 1995 17:02:22 GMT Subject: Training opportunities in Computational Neuroscience at Johns Hopkins University Message-ID: The Zanvyl Krieger Mind/Brain Institute at Johns Hopkins University is an interdisciplinary research center devoted to the investigation of the neural mechanisms of mental function and particularly to the mechanisms of perception: How is complex information represented and processed in the brain, how is it stored and retrieved, and which brain centers are critical for these operations?
The Institute intends to significantly enhance its research program in Computational Neuroscience and encourages students with interest in this domain to apply for the graduate program in the Neuroscience department. Research opportunities exist in all of the laboratories of the Institute. Interdisciplinary projects, involving the student in more than one laboratory, are particularly encouraged. At present, MBI faculty include (listed with primary field of interest and methodology used): C. Ed Connor, PhD: Visual selective attention (electrophysiology in the awake behaving monkey). Stewart Hendry, PhD: Organization and plasticity of mammalian cerebral cortex (primate neuroanatomy). Steve S. Hsiao, PhD: Neurophysiology of tactile perception (electrophysiology in the awake behaving monkey). Kenneth O. Johnson, PhD: Neurophysiology of the somatosensory system (electrophysiology in the awake behaving monkey). Guy McKhann, MD (Director of MBI): Cognitive and neurologic outcomes after cardiac surgery; immunologic attack on peripheral motor axonal membranes in the human and experimental animal (neurology). Ernst Niebur, PhD: Theoretical Neuroscience (computational and mathematical modeling). Gian F Poggio, PhD: Analysis of Stereopsis and Texture (electrophysiology in the awake behaving monkey). Michael A. Steinmetz, PhD: Neurophysiological mechanisms in visual-spatial perception (electrophysiology in the awake behaving monkey). Ruediger von der Heydt, PhD: Neural mechanisms of visual perception (electrophysiology in the awake behaving monkey). Additional research opportunities exist in collaborative work with faculty in the Psychology Department (located next door to the Mind/Brain Institute), in particular with Drs. Howard Egeth (attention, perception, cognition), Michael Rudd (computational vision, psychophysics), Trisha Van Zandt (mathematical modelling, neural networks and memory), and Steven Yantis (visual perception, attention, mathematical modeling). 
All students accepted to the PhD program of the Neuroscience department receive full tuition remission plus a stipend at or above the National Institutes of Health predoctoral level. The Mind/Brain Institute is located on the very attractive Homewood campus in Northern Baltimore. Applicants should have a B.S. or B.A. with a major in any of the biological or physical sciences. Applicants are required to take the Graduate Record Examination (GRE), both the aptitude tests and an advanced test, or the Medical College Admission Test. Further information on the admission procedure can be obtained from the Department of Neuroscience: Director of Graduate Studies Neuroscience Training Program Department of Neuroscience The Johns Hopkins University School of Medicine 725 Wolfe Street Baltimore, MD 21205 Completed applications (including three letters of recommendation and either GRE scores or Medical College Admission Test scores) must be _received_ by January 1, 1996 at the above address. Candidates for whom this is impossible, or those who need additional information, should immediately contact Prof. Ernst Niebur The Zanvyl Krieger Mind/Brain Institute Johns Hopkins University 3400 N. Charles Street Baltimore, MD 21218 niebur at jhu.edu -- Ernst Niebur Krieger Mind/Brain Institute Asst. Prof. of Neuroscience Johns Hopkins University niebur at jhu.edu 3400 N. Charles Street (410)516-8643, -8640 (secr), -8648 (fax) Baltimore, MD 21218 From dhw at santafe.edu Tue Dec 12 17:25:06 1995 From: dhw at santafe.edu (David Wolpert) Date: Tue, 12 Dec 95 15:25:06 MST Subject: The last of a dying thread Message-ID: <9512122225.AA00709@sfi.santafe.edu> Some comments on the NFL thread. Huaiyu Zhu writes >>> 2. The *mere existence* of structure guarantees a (not uniformly-random) algorithm as likely to lose you a million as to win you a million, even in the long run. It is the *right kind* of structure that makes a good algorithm good. >>> This is a crucial point. 
It also seems to be one lost on many of the contributors to this thread, even those subsequent to Zhu's posting. Please note in particular that the knowledge that "the universe is highly compressible" can NOT, by itself, be used to circumvent NFL. I can only plead again: Those who are interested in this issue should look at the papers directly, so they have at least passing familiarity with the subject before discussing it. :-) ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z. Craig Hicks then writes: >>> However, I interpret the assertion that anti-cross validation can be expected to work as well as cross-validation to mean that we can equally well expect cross-validation to lie. That is, if cross-validation is telling us that the generalization error is decreasing, we can expect, on average, that the true generalization error is not decreasing. Isn't this a contradiction, if we assume that the samples are really randomly chosen? Of course, we can a posteriori always choose a worst case function which fits the samples taken so far, but contradicts the learned model elsewhere. But if we turn things around and randomly sample that deceptive function anew, the learned model will probably be different, and cross-validation will behave as it should. >>> That's part of the power of the NFL theorems - they prove that Hicks' intuition, an intuition many people share, is in fact wrong. >>> I think this follows from the principle that the empirical distribution over an ever larger number of samples converges to the true distribution of a single sample (assuming the true distribution is stationary). >>> Nope. The central limit theorem is not directly germane. See all the previous discussion on NFL and Vapnik. >>>> CV is nothing more than the random sampling of prediction ability. If the average over the ensemble of samplings of this ability on 2 different models A and B comes out showing that A is better than B, then by definition A is better than B.
This assumes only that the true domain and the ensemble of all samplings coincide. Therefore CV will not, on average, cause a LOSS in prediction ability. That is, when it fails, it fails gracefully, on average. It cannot be consistently deceptive. Fortunately, CV will report this (failure to generalize) by showing a zero correlation between prediction and true value on the off training set data. (Of course this is only the performance of CV on average over the ensemble of off training set data sets; CV may be deceptive for a single off training set data set.) >>> This is wrong (or at best misleading). Please read the NFL papers. In fact, if the head-to-head minimax hypothesis concerning xvalidation presented in those papers is correct, xvalidation is wrong more often than it is right. In which case CV is "deceptive" more often (!!!) than not. Lev Goldfarb wrote >>> Strange as it may sound at first, try to inductively learn the subgroup of some large group with the group structure completely hidden. No statistics will reveal the underlying group structure. >>> It may help if people read some of the many papers (Cox, de Finetti, Erickson and Smith, etc., etc.) that prove that the only consistent way of dealing with uncertainty is via probability theory. In other words, there is nothing *but* statistics, in the real world. (Perhaps occurring in prior knowledge that you're looking for a group, but statistics nonetheless.) David Wolpert From lemm at LORENTZ.UNI-MUENSTER.DE Wed Dec 13 09:46:52 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Wed, 13 Dec 1995 15:46:52 +0100 Subject: NFL and practice Message-ID: <9512131446.AA13879@xtp141.uni-muenster.de> Some remarks on Craig Hicks' arguments on cross-validation and NFL in general, from my point of view: One may discuss NFL for theoretical reasons, but the conditions under which NFL-Theorems hold are not those which are normally met in practice. 1.) In short, NFL assumes that data, i.e.
information of the form y_i=f(x_i), do not contain information about function values on a non-overlapping test set. This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... (with respect to Craig's two cases this average would include a third case: target and model are anticorrelated, so anti-cross-validation works better) and "vertical" likelihoods. So, in an NFL setting data never say anything about function values for new arguments. This seems rather trivial under this assumption, and one has to ask how natural such an NFL situation is. 2.) Information of the form y_i=f(x_i) is rather special and not what we normally have. There is much information which is not of this "single sharp data" type. (For examples, see below.) There is absolutely no reason why information which depends on more than one f(x_i) should not be incorporated. (This can be done using nonuniform priors or in a way more symmetrical to "sharp data".) NFL just describes the situation in which we don't have any such information but much of the (then quite useless) "sharp data". But these sharp data are no less (and maybe more) obscure than other forms of information. Information which is not of this "single sharp data" form but includes many or all f(x_i) to produce one answer normally induces correlations between target and generalizer if included into the generalizer. At the same time there is no real off training set anymore! Examples: 3) Information such as symmetries (even if only approximate), maxima, Fourier components (and much, much more ...) involves more than one f(x_i). Fourier components, for example, can be seen as sharp data but for different basis vectors, i.e. asking for momentum instead of location. This shows again that the definition of "sharp data" corresponds to choosing a "basis of questions" and is not a natural entity!!! 4) Real measurements (especially of continuous variables) normally also do NOT have the form y_i=f(x_i) !
They mostly perform some averaging over f(x_i) or at least they have some noise on the x_i (as small as you like, but present). In the latter case of "sharp" noise, posing the same question several times also gives you an average of several (nearby) y with different x_i of the underlying true function. In both cases the averaging is equivalent to regularization for the "effective" function which we can observe!!! This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. There is no need to see it just as a subjective prior! (The same could be said on a quantum-mechanical level, but that's another story.) It follows that NFL results do NOT hold for the "effective" functions in such situations, even if assuming NFL for the underlying true functions. 5.) NFL again: Averaging or noise in the input space of the x_i requires a probability distribution in that space which can be defined independently from a specific function. Noise means that x_i is a random variable dependent on an actual question z_i, i.e. p(actual argument = x_i | question=z_i), and it is f(z_i) which we can observe. If you don't accept a given p(x_i|z_i), I am sure you can average over "all possible" such relations with unrestricted "uniform" priors to find that it is impossible to obtain any information about any function without assuming a priori that you know something about what you are asking. This could be seen as another NFL-Theorem for questions: You do not even get information about a single function value if you don't know (assume, define) a priori what you are asking! 6.) With respect to the underlying "true" function, off-training-set error itself, an important concept for NFL, is in general no longer a measurable quantity if input noise or averaging is present!! (For simplicity let's assume that noise or averaging includes all questions x_i.
Then in the case of noise you only have a probability for the x_i to belong to the "true" training set and averaging includes all questions x_i.) So for the "true" functions there remains nothing NFL can say anything about, and for the "effective" functions NFL is not valid! To conclude: In many interesting cases "effective" function values contain information about other function values and NFL does not hold! The very special handling of "sharp data" in comparison to other information must be discussed in many more learning theories. Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From wray at ptolemy-ethernet.arc.nasa.gov Wed Dec 13 17:06:42 1995 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Wed, 13 Dec 95 14:06:42 PST Subject: one revised paper and NIPS slides by Buntine Message-ID: <9512132206.AA08307@ptolemy.arc.nasa.gov> Dear Connectionists, Please note the following two WWW resources. One, a forthcoming journal paper, and the other, slides from a NIPS'95 Workshop presentation. Also, please note my new address, email, and company. I am no longer at Heuristicrats. Wray Buntine Thinkbank, Inc. +1 (510) 540-6080 [voice] 1678 Shattuck Avenue, Suite 320 +1 (510) 540-6627 [fax] Berkeley, CA 94709 wray at Thinkbank.COM ============ Article URL: http://www.thinkbank.com/wray/graphbib.ps.Z (about 240Kb compressed) TITLE: A guide to the literature on learning probabilistic networks from data AUTHOR: Wray Buntine, Thinkbank JOURNAL: Accepted for IEEE Trans. on Knowledge and Data Eng., Final draft submitted. ABSTRACT: This literature review discusses different methods under the general rubric of learning Bayesian networks from data, and includes some overlapping work on more general probabilistic networks. Connections are drawn between the statistical, neural network, and uncertainty communities, and between the different methodological communities, such as Bayesian, description length, and classical statistics.
Basic concepts for learning and Bayesian networks are introduced and methods are then reviewed. Methods are discussed for learning parameters of a probabilistic network, for learning the structure, and for learning hidden variables. The presentation avoids formal definitions and theorems, as these are plentiful in the literature, and instead illustrates key concepts with simplified examples. KEYWORDS: Bayesian networks, graphical models, hidden variables, learning, learning structure, probabilistic networks, knowledge discovery =========== Talk URL: http://www.thinkbank.com/wray/refs.html (and look under Talks for NIPS) TITLE: Compiling Probabilistic Networks and Some Questions this Poses. AUTHOR: Wray Buntine WORKSHOP: NIPS'95 Workshop on Learning Graphical Models ABSTRACT: Probabilistic networks (or similar) provide a high-level language that can be used as the input to a compiler for generating a learning or inference algorithm. Example compilers are BUGS (inputs a Bayes net with plates) by Gilks, Spiegelhalter, et al., and MultiClass (inputs a dataflow graph) by Roy. This talk will cover three parts: (1) an outline of the arguments for such compilers for probabilistic networks, (2) an introduction to some compilation techniques, and (3) the presentation of some theoretical challenges that compilation poses. High-level language compilers are usually justified as a rapid prototyping tool. In learning, rapid prototyping arises for the following reasons: good priors for complex networks are not obvious and experimentation can be required to understand them; several algorithms may suggest themselves and experimentation is required for comparative evaluation. These and other justifications will be described in the context of some current research on learning probabilistic networks, and past research on learning classification trees and feed-forward neural networks. 
Techniques for compilation include the data flow graph, automatic differentiation, Markov chain Monte Carlo samplers of various kinds, and the generation of C code for certain exact inference tasks. With this background, I will then pose a number of research questions to the audience. =========== From bernabe at cnm.us.es Tue Dec 12 07:39:41 1995 From: bernabe at cnm.us.es (Bernabe Linares B.) Date: Tue, 12 Dec 95 13:39:41 +0100 Subject: two papers in neuroprose Message-ID: <9512121239.AA17985@cnm1.cnm.us.es> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/bernabe.art1-nn.ps.Z (30 pages, 257846 bytes) pub/neuroprose/bernabe.art1-vlsi.ps.Z (26 pages, 311686 bytes) The files "bernabe.art1-nn.ps.Z" and "bernabe.art1-vlsi.ps.Z" are now available for copying from the Neuroprose repository. They contain two papers which have been accepted for publication in the following journals: PAPER1: Journal: IEEE Transactions on VLSI Systems Title: "A Real-Time Clustering Microchip Neural Engine" File: bernabe.art1-vlsi.ps.Z PAPER2: Journal: Neural Networks Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" File: bernabe.art1-nn.ps.Z Authors: Teresa Serrano-Gotarredona and Bernabe Linares-Barranco Affiliation: National Microelectronics Center (CNM), Sevilla, SPAIN. Sorry, no hardcopies available. Brief description of papers follows: -------------------------------------------------------------------- PAPER1: ------- File: bernabe.art1-vlsi.ps.Z, 26 pages, 311686 bytes. Title: "A Real-Time Clustering Microchip Neural Engine" Abstract This paper presents an analog current-mode VLSI implementation of an unsupervised clustering algorithm. The clustering algorithm is based on the popular ART1 algorithm [1], but has been modified, resulting in a more VLSI-friendly algorithm [2], [3] that allows a more efficient hardware implementation with simple circuit operators, low memory requirements, modular chip assembly capability, and higher speed figures. 
The chip described in this paper implements a network that can cluster input patterns of 100 binary pixels into up to 18 different categories. Modular expansibility of the system is directly possible by assembling an NxM array of chips without any extra interfacing circuitry, so that the maximum number of clusters is 18xM and the maximum number of bits of the input pattern is Nx100. Pattern classification and learning are performed in 1.8us, which corresponds to an equivalent computing power of 4.4x10^9 connections per second plus connection-updates per second. The chip has been fabricated in a standard low-cost 1.6um double-metal single-poly CMOS process, has a die area of 1cm^2, and is mounted in a 120-pin PGA package. Although internally the chip is analog in nature, it interfaces to the outside world through digital signals, and thus has a true asynchronous digital behavior. Experimental chip test results are available, obtained through digital chip test equipment. Fault tolerance at the system level is demonstrated through the experimental testing of faulty chips. -------------------------------------------------------------------- PAPER2: ------- File: bernabe.art1-nn.ps.Z, 30 pages, 257846 bytes. Title: "A Modified ART1 Algorithm more suitable for VLSI Implementations" Abstract This paper presents a modification to the original ART1 algorithm [Carpenter, 1987a] that is conceptually similar, can be implemented in hardware with less sophisticated building blocks, and maintains the computational capabilities of the originally proposed algorithm. This modified ART1 algorithm (which we will call here ART1m) is the result of hardware-motivated simplifications investigated during the design of an actual ART1 chip [Serrano, 1994, 1996]. The purpose of this paper is simply to justify theoretically that the modified algorithm preserves the computational properties of the original one and to study the difference in behavior between the two approaches. 
-------------------------------------------------------------------- ftp instructions are: % ftp archive.cis.ohio-state.edu Name : anonymous Password: ftp> cd pub/neuroprose ftp> binary ftp> get bernabe.art1-nn.ps.Z ftp> get bernabe.art1-vlsi.ps.Z ftp> quit % uncompress bernabe.art1-nn.ps.Z % uncompress bernabe.art1-vlsi.ps.Z % lpr bernabe.art1-nn.ps % lpr bernabe.art1-vlsi.ps These files are also available from the node "ftp.cnm.us.es", user "anonymous", directory /pub/bernabe/publications, files: "NN_art1theory_96.ps.Z" and "TVLSI_art1chip_96.ps.Z". Any feedback will be appreciated. Thanks, Dr. Bernabe Linares-Barranco National Microelectronics Center (CNM) Dept. of Analog Design Ed. CICA, Av. Reina Mercedes s/n, 41012 Sevilla, SPAIN. Phone: 34-5-4239923, Fax: 34-5-4624506, E-mail: bernabe at cnm.us.es From bishopc at helios.aston.ac.uk Wed Dec 13 14:52:48 1995 From: bishopc at helios.aston.ac.uk (Prof. Chris Bishop) Date: Wed, 13 Dec 1995 19:52:48 +0000 Subject: New Book: Neural Networks for Pattern Recognition Message-ID: <1400.9512131952@sun.aston.ac.uk> -------------------------------------------------------------------- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -- NEW BOOK -------------------------------------------------------------------- "Neural Networks for Pattern Recognition" ----------------------------------------- Christopher M. Bishop (Oxford University Press) Full details at: http://neural-server.aston.ac.uk/NNPR/ This book provides the first comprehensive treatment of neural networks from the perspective of statistical pattern recognition. * 504 pages * 160 figures * 129 graded exercises * a self-contained introduction to statistical pattern recognition * an extensive treatment of Bayesian methods * paperback and hardback editions * 300 references Contents: --------- 1. Statistical Pattern Recognition 2. Probability Density Estimation 3. Single-layer Networks 4. The Multi-layer Perceptron 5. Radial Basis Functions 6. 
Error Functions 7. Parameter Optimization Algorithms 8. Pre-processing and Feature Extraction 9. Learning and Generalization 10. Bayesian Techniques ***** Instructors wishing to use this text as the basis for a course may request a complimentary examination copy from the publishers. (USA: fax request to 212-726-6442 with brief description of the course) ***** Ordering information: --------------------- ISBN 0-19-853864-2 paperback 0-19-853849-9 hardback USA: 45 dollars paperback ---- 98 dollars hardback Credit card orders: Tel: 1-800-451-7556 (toll free) By post, send payment to: Order Dept. Oxford University Press 2001 Evans Road Cary, NC 27513 USA (3 dollars shipping for first copy, 1 dollar each thereafter) Canada: Tel: 1-800-387-8020 (toll free) ------- UK: 25 pounds paperback --- 55 pounds hardback Tel: 01536 454 534 (from the UK) Tel: +44 1536 454 534 (from abroad) By post, send payment to: CWO Department Oxford University Press Saxon Way West, Corby Northants NN18 9ES, UK (3.53 pounds postage) By fax: 01536 746 337 (from the UK) +44 1536 746 337 (from abroad) ---------------------------------------------------------------------- Prof. Christopher M. Bishop Tel. +44 (0)121 333 4631 Neural Computing Research Group Fax. +44 (0)121 333 4586 Dept. of Computer Science c.m.bishop at aston.ac.uk & Applied Mathematics http://neural-server.aston.ac.uk/ Aston University Birmingham B4 7ET, UK ---------------------------------------------------------------------- From zhuh at helios.aston.ac.uk Thu Dec 14 13:12:43 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Thu, 14 Dec 1995 18:12:43 +0000 Subject: No free lunch for Cross Validation! 
Message-ID: <2237.9512141812@sun.aston.ac.uk> Dear Colleagues, A little while ago someone claimed that cross validation will benefit from the presence of any structure, and if there is no structure it does no harm; yet NFL explicitly states that a structure can be equally good or bad for any given method, depending on how they match each other; and yet it was further claimed that they do not conflict with each other. I was quite curious and did the following five-minute experiment to find out which is correct. Suppose we have a Gaussian variable x, with mean mu and unit variance. We have the following three estimators for estimating mu from a sample of size n. A: The sample mean. It is optimal both in the sense of Maximum Likelihood and Least Mean Squares. B: The maximum of the sample. It is a bad estimator in any reasonable sense. C: Cross validation to choose between A and B, with one extra data point. The numerical result, with n=16 and averaged over 10000 samples, gives mean squared errors: A: 0.0627 B: 3.4418 C: 0.5646 This clearly shows that cross validation IS harmful in this case, despite the fact that it is based on a larger sample. NFL still wins! Many of you might jump on me at this point: But this is a very artificial example, which is not what normally occurs in practice. To this I have two answers, short and long. The short answer is one of principle. Any counter-example, however artificial, clearly demolishes the hope that cross validation is a "universally beneficial method". The longer answer is divided into several parts, which hopefully will answer any potential criticism from any aspect: 1. The cross validation is performed on extra data points. We are not requiring it to perform as well as the mean on 17 data points. If it cannot extract more information from the one extra data point, a minimum requirement is that it keeps the information in the original 16 points. But it can't even do this. 2. The maximum of a sample is the 100th percentile. 
The median is the 50th percentile, which is in fact a quite reasonable estimator. Let us use a larger cross validation set (of size k), and replace B with a different percentile. The result is that, for the median, CV needs k>2 to work. For the 70th percentile CV needs k>16. The required k increases dramatically with the percentile. 3. It is not true that we have set up a case in which cross validation can't win. There is indeed a small probability that a sample can be so bad that the sample maximum is even a better estimate than the sample mean. However, to utilise such rare chances to good effect, k must be at least several hundred (maybe exponential) while n=16. We know such k exists since k=infinity certainly helps. Yet to adopt such a method is clearly absurd. 4. Although we have chosen estimator A to be the known optimal estimator in this case, it can be replaced by something else. For example, both A and B can be some reasonable averages over percentiles, so that without detailed analysis it may appear that doing cross validation might give a C which is better than both A and B. Such beliefs can be defeated by similar counter-examples. 5. The above scheme of cross validation may appear different from what is familiar, but here is a "practical example" which shows that it is indeed what people normally do. Suppose we have a random variable which is either Gaussian or Cauchy. Consider the following three estimators: A: Sample mean: It has 100% efficiency for Gaussian, and 0% efficiency for Cauchy. B: Sample median: It is 2/pi=63.66% efficient for Gaussian and 8/pi^2=81.06% efficient for Cauchy. C: Cross validation on an additional sample of size k, to choose between A and B. Intuitively it appears quite reasonable to expect cross validation to pick out the correct one most of the time, so that, if averaged over all samples, C ought to be superior to both A and B. But no!! This will depend on the PRIOR mixing probability of these two sub-models. 
If the variable is in fact always Gaussian, then we have just seen that if n=16, CV will be worse unless k>2. The same is even more true in the reverse case, since the mean is an essentially useless estimator for Cauchy. 6. In any of the above cases, "anti cross validation" would be even more disastrous. If you are not convinced by these arguments, or if you want to know more about efficiency, then maybe the following reference can help: Fisher, R.A.: Theory of statistical estimation, Proc. Camb. Phil. Soc., Vol. 22, pp. 700-725, 1925. If you are more or less convinced, I have the following speculation: Two centuries ago, the French Academy of Sciences (or is it the Royal Society?) made a decision that it would no longer examine inventions of "perpetual motion machines", on the ground that the Law of Energy Conservation was so reliable that it would defeat any such attempt. History proved that this was a wise decision, which assisted the effort of designing machines which utilise energy in fuel. Should we expect the same fate for "the universally beneficial methods" in the face of NFL? Should we put more effort into designing methods which use prior information? posterior information <= prior information + data information. 
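Zhu's five-minute experiment is easy to reproduce. The sketch below is my own reconstruction, under one stated assumption: estimator C picks whichever of A and B lies closer, in squared error, to the single held-out point, which is one natural reading of "cross validation with one extra data point".

```python
import random
import statistics

def simulate(n=16, trials=10000, seed=0):
    """Monte Carlo estimate of the mean squared error of the three estimators."""
    rng = random.Random(seed)
    mu = 0.0                               # true mean of the unit-variance Gaussian
    se = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(trials):
        sample = [rng.gauss(mu, 1.0) for _ in range(n)]
        a = statistics.fmean(sample)       # A: the sample mean
        b = max(sample)                    # B: the sample maximum
        v = rng.gauss(mu, 1.0)             # the one extra (validation) data point
        c = a if (a - v) ** 2 <= (b - v) ** 2 else b   # C: cross validation
        se["A"] += (a - mu) ** 2
        se["B"] += (b - mu) ** 2
        se["C"] += (c - mu) ** 2
    return {k: s / trials for k, s in se.items()}

mse = simulate()
print(mse)
```

With n=16 this lands close to the figures quoted above: roughly 0.06 for A, around 3.4 for B, and an intermediate value for C that is still far worse than A, because the validation point occasionally sits near the sample maximum and fools CV into choosing B.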
-- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From C.Campbell at bristol.ac.uk Thu Dec 14 11:21:26 1995 From: C.Campbell at bristol.ac.uk (I C G Campbell) Date: Thu, 14 Dec 1995 16:21:26 +0000 (GMT) Subject: New Web Page (Bristol University, UK) Message-ID: <199512141621.QAA11250@zeus.bris.ac.uk> The Neural Computing Research Group at Bristol University, UK has recently set up a WWW page describing their interests at: http://www.fen.bris.ac.uk/engmaths/research/neural/neural.html Our interests cover three main areas: theory of neural computation, modelling simple neurobiological systems and applications of neural computing in engineering. Collectively we have produced in excess of 100 publications related to neural computing in these topic areas. Further details about these publications, current research interests and research grants may be found on the above page. Merry Xmas Colin Campbell University of Bristol From robert at fit.qut.edu.au Thu Dec 14 19:24:04 1995 From: robert at fit.qut.edu.au (Robert Andrews) Date: Fri, 15 Dec 1995 10:24:04 +1000 Subject: Rule Extraction Mailing List Message-ID: <199512150024.KAA15975@ocean.fit.qut.edu.au> =-=-=-=-= RULE EXTRACTION FROM ARTIFICIAL NEURAL NETWORKS =-=-=-=-=-=-=-=- ANNOUNCEMENT OF MAILING LIST Rule Extraction from Artificial Neural Networks and the related field of Rule Refinement are topics of increasing interest and importance. This is to announce the formation of a moderated mailing list for researchers and students interested in these areas. 
If you are interested in becoming a subscriber to this list, please send the following information by return mail: Name: Organisation/Institution: E-mail Address: =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- Mr Robert Andrews School of Information Systems robert at fit.qut.edu.au Faculty of Information Technology R.Andrews at qut.edu.au Queensland University of Technology +61 7 864 1656 (voice) GPO Box 2434 _--_|\ +61 7 864 1969 (fax) Brisbane Q 4001 / QUT Australia \_.--._/ http://www.fit.qut.edu.au/staff/~robert v =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=- From l.s.smith at cs.stir.ac.uk Fri Dec 15 05:12:09 1995 From: l.s.smith at cs.stir.ac.uk (Dr L S Smith (Staff)) Date: Fri, 15 Dec 1995 10:12:09 GMT Subject: TR on generalization available Message-ID: <19951215T101209Z.KAA27913@katrine.cs.stir.ac.uk> Dear all: We have a new TR available by ftp from here: CCCN Technical report CCCN-21, December 1995. A Theoretical Study of the Generalization Ability of Feed-Forward Neural Networks. M J Roberts. By making assumptions about the probability distribution of the potentials in a feed-forward neural network, we have derived lower bounds for the generalization ability of the network in terms of the number of training patterns. The results are consistent with simulations carried out on a simple geometrical function. The URL is ftp://ftp.cs.stir.ac.uk/pub/tr/cccn/TR21.ps.Z If you really can't access this, hard copies are available, but only as a last resort. Dr Leslie S. 
Smith Dept of Computing and Mathematics, Univ of Stirling Stirling FK9 4LA Scotland lss at cs.stir.ac.uk (NeXTmail welcome) Tel (44) 1786 467435 Fax (44) 1786 464551 www http://www.cs.stir.ac.uk/~lss/ From bastiane at irit.fr Fri Dec 15 09:07:57 1995 From: bastiane at irit.fr (bastiane@irit.fr) Date: Fri, 15 Dec 1995 15:07:57 +0100 Subject: Call for papers for DYNN'96 Message-ID: <199512151407.PAA05193@irit.irit.fr> CALL FOR PAPERS FOR DYNN'96 International workshop on NEURAL NETWORKS DYNAMICS AND PATTERN RECOGNITION. Toulouse - France 12 and 13 of March 1996 Organized by ONERA-CERT Sponsored by DRET of French MOD, US Air Force Scientific Research and Pole Universitaire Europeen de Toulouse. Organizers: Manuel SAMUELIDES (ONERA-CERT), Bernard DOYON (INSERM), Gregory TARR (US AF), Simon THORPE (CNRS). Practical Information: Emmanuel DAUCE (dauce at cert.fr) *********************** OBJECTIVES OF THE WORKSHOP. *************************** This workshop is designed to allow information exchange and discussion between theoretical scientists working on models of neuronal dynamics and engineers who are looking for efficient devices to process sensor information. Continuous activation state units as well as Integrate and Fire neurons or oscillators are elementary components of Dynamical Neural Networks. Attractor neural networks as well as transitory data-driven dynamics will be considered. The common feature of these models is the conversion of spatial information into spatio-temporal data flow which allows specific processing. Mathematical models involved use dynamical systems and stochastic processes. They will be compared to the results of numerical simulations and the latest neuro-physiological data concerning the dynamics of biological neural nets. The main aim of the workshop is to encourage significant advances concerning the dynamics of biologically plausible neural networks and their applications to pattern recognition. 
*********************** ORGANIZATION OF THE WORKSHOP. ***************************** Scheduled talks will take place on the 12th and the 13th of March. There will be invited talks as well as submitted contributions. About 24 talks of 30 minutes will be scheduled with time for discussion and panels. Informal discussion and collective work may be scheduled on the 14th. Extended abstracts (one or two pages) of submitted contributions should be sent for acceptance by e-mail to dauce at cert.fr or by post to Manuel Samuelides, DERI ONERA-CERT, BP 4025, 31055 Toulouse CEDEX, FRANCE. Provisional list of invited lecturers: J.P.AUBIN, M.COTTRELL, J.DEMONGEOT, J.DAYHOFF, G.DREYFUS, M.HIRSCH, J.TAYLOR. (This list will be completed.) The number of attendees at the workshop is limited to 40 in order to allow lively exchange and real discussion. Copies of abstracts and slides will be provided to participants. The registration fees amount to FF 1,200, including 2 nights with American breakfast (11th and 12th) at a first-class hotel in downtown Toulouse (Holiday Inn Crowne Plaza), two lunches on the site of the workshop, the workshop banquet, transportation to and from CERT, coffee breaks, and the general costs of the workshop facilities and equipment. Payment should be made either by check payable to "AGENT COMPTABLE DU CERT ONERA" in French francs only or by bank transfer to "AGENT COMPTABLE DU CERT ONERA" Bank: Societe Generale Ramonville Saint Agne Account No. 30003 /02117/ 00037291008/93 Please state the workshop reference: DYNN'96 on all transactions. *********************** IMPORTANT DATES: **************** 15th of January: Deadline for contributions and declarations of interest. 31st of January: Notification of accepted contributions and distribution of the final programme of the workshop. 15th of February: Deadline for registration for the workshop. 
To avoid postage delay, e-mail will be accepted as a usual communication. If you want to attend DYNN'96 please use your computer to reply at once -------------------------------------------------------------------------------- Name Organization Address e-mail ( ) wishes the information about the final program ( ) wishes to attend DYNN'96 ( ) will submit a contribution entitled: ----------------------------------------------------------------------------- Please send your reply to the following e-mail dauce at cert.fr or to xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx x Professor Manuel SAMUELIDES x x DERI ONERA-CERT x x BP 4025 x x 31055 Toulouse CEDEX x x FRANCE x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Manuel SAMUELIDES ----------------------------------------------------------------- for research: Researcher at ONERA-CERT samuelid at cert.fr for teaching: Professor at ENSAE Manuel.Samuelides at supaero.fr Tel: (33) 62 17 81 06 Fax: (33) 62 17 83 30 From lemm at LORENTZ.UNI-MUENSTER.DE Fri Dec 15 09:28:49 1995 From: lemm at LORENTZ.UNI-MUENSTER.DE (Joerg_Lemm) Date: Fri, 15 Dec 1995 15:28:49 +0100 Subject: NFL and practice Message-ID: <9512151428.AA24811@xtp141.uni-muenster.de> Huaiyu Zhu responded to >> One may discuss NFL for theoretical reasons, but >> the conditions under which NFL-Theorems hold >> are not those which are normally met in practice. and wrote >Exactly the opposite. The theory behind NFL is trivial (in some sense). >The power of NFL is that it deals directly with what is routinely >practiced in the neural network community today. That depends on how you understand practice. E.g. in nearly all cases functions are somewhat smooth. This is a prior which exists in reality (for example because of input noise in the measuring process). And the situation would be hopeless if we did not use this fact in practice. (That is just what NFL also says.) But, if Huaiyu means that it is necessary to think about the priors in "practice" explicitly, then I fully agree! 
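Lemm's claim that input noise manufactures smoothness can be illustrated in a few lines. The toy below is my own construction (the step function as the rough "true" function and the Gaussian noise scale are arbitrary choices): it estimates the "effective" function g(x) = E[f(x + eps)] by Monte Carlo averaging and compares how sharply the two functions can jump between neighbouring grid points.

```python
import math
import random

def f(x):
    """A deliberately rough 'true' function: a +/-1 step function."""
    return 1.0 if math.sin(5.0 * x) > 0 else -1.0

# The "effective" function is what a measurement with input noise actually
# exposes: g(x) = E[f(x + eps)], eps ~ N(0, sigma^2), estimated by averaging.
rng = random.Random(0)
sigma, m = 0.2, 2000
xs = [i / 100.0 for i in range(200)]
rough = [f(x) for x in xs]
smooth = [sum(f(x + rng.gauss(0.0, sigma)) for _ in range(m)) / m for x in xs]

def max_jump(ys):
    """Largest change between neighbouring grid points: a crude smoothness proxy."""
    return max(abs(ys[i + 1] - ys[i]) for i in range(len(ys) - 1))

print(max_jump(rough), max_jump(smooth))
```

The rough function jumps by the full 2.0 wherever it steps, while the effective function, which is all the noisy measurement ever lets us see, changes only gradually: exactly the regularization-by-measurement effect Lemm describes, obtained without ever postulating a subjective smoothness prior.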
But what I wanted to say is: WE DO HAVE "PRIORS" (BETTER SAY CORRELATIONS BETWEEN ANSWERS TO DIFFERENT QUESTIONS) IN MOST CASES and they are NOT obscure, but very often at least as well MEASURABLE as "normal" sharp data y_i=f(x_i). Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >Joerg continued with examples of various priors of practical concern, >including smoothness, symmetry, positive correlation, iid samples, etc. >These are indeed very important priors which match the real world, >and they are the implicit assumptions behind most algorithms. > >What NFL tells us is: If your algorithm is designed for such a prior, >then say so explicitly so that a user can decide whether to use it. >You can't expect it to be also good for any other prior which you have >not considered. In fact, in a sense, you should expect it to perform >worse than a purely random algorithm on those other priors. Maybe the problem is that Huaiyu Zhu uses the word "PRIOR" for all information which is not of the sharp data form y_i=f(x_i). It suggests that we know something before starting our generalizer. NO, that is not the normal case!!! I mentioned many examples (like measurement with input noise) where "priors" are just normal information which should be used DURING learning like sharp data! (Sharp data might even not be available at all!) And of course using wrong "priors" is similar to using wrong sharp data. But I fully agree that most algorithms use "prior" information only implicitly and that there is a lot of theoretical work to do. In response to >> In many interesting cases "effective" function values contain information >> about other function values and NFL does not hold! 
Huaiyu Zhu continues >This is like saying "In many interesting cases we do have energy sources, >and we can make a machine running forever, so the natural laws against >`perpetual motion machines' do not hold." Indeed, it is a little bit like that, but a system without energy sources is a much better approximation for some real-world systems than a world without "priors" (i.e. without correlated answers over different questions)! So the energy law is useful, but models for worlds without correlated information are NOT, except maybe that they tell us to include the correlation properly! Joerg Lemm (Institute for Theoretical Physics I, University of Muenster, Germany) From shastri at ICSI.Berkeley.EDU Fri Dec 15 16:34:24 1995 From: shastri at ICSI.Berkeley.EDU (Lokendra Shastri) Date: Fri, 15 Dec 1995 13:34:24 PST Subject: Technical report --- negated knowledge and inconsistency Message-ID: <199512152134.NAA06683@kulfi.ICSI.Berkeley.EDU> Dealing with negated knowledge and inconsistency in a neurally motivated model of memory and reflexive reasoning. Lokendra Shastri and Dean J. Grannes TR-95-041 ICSI August 1995 Recently, SHRUTI has been proposed as a connectionist model of rapid reasoning. It demonstrates how a network of simple neuron-like elements can encode a large number of specific facts as well as systematic knowledge (rules) involving n-ary relations, quantification and concept hierarchies, and perform a class of reasoning with extreme efficiency. The model, however, does not deal with negated facts and rules involving negated antecedents and consequents. We describe an extension of SHRUTI that can encode positive as well as negated knowledge and use such knowledge during reflexive reasoning. 
The extended model explains how an agent can hold inconsistent knowledge in its long-term memory without being ``aware'' that its beliefs are inconsistent, but detect a contradiction whenever inconsistent beliefs that are within a certain inferential distance of each other become co-active during an episode of reasoning. Thus the model is not logically omniscient, but detects contradictions whenever it tries to use inconsistent knowledge. The extended model also explains how limited attentional focus or action under time pressure can lead an agent to produce an erroneous response. A biologically significant feature of the model is that it uses only local inhibition to encode negated knowledge. Like the basic model, the extended model encodes and propagates dynamic bindings using temporal synchrony. Key Words: long-term memory; rapid reasoning; dynamic bindings; synchrony; knowledge representation; neural oscillations; short-term memory; negation; inconsistent knowledge. ftp-server: ftp.icsi.berkeley.edu (128.32.201.55) ftp-file: /pub/techreports/1995/tr-95-041.ps.Z Lokendra Shastri International Computer Science Institute 1947 Center Street, Suite 600 Berkeley, CA 94704 http://www.icsi.berkeley.edu/~shastri ========================== Detailed instructions for retrieving the report: unix% ftp ftp.icsi.berkeley.edu Name (ftp.icsi.berkeley.edu:): anonymous Password: your_name at your_machine ftp> cd /pub/techreports/1995 ftp> binary ftp> get tr-95-041.ps.Z ftp> quit unix% uncompress tr-95-041.ps.Z unix% lpr tr-95-041.ps If your name server does not know about ftp.icsi.berkeley.edu, use 128.32.201.55 instead. All files in this archive can also be obtained through an e-mail interface in case direct ftp is not available. To obtain instructions, send mail containing the line `send help' to: ftpmail at ICSI.Berkeley.EDU As a last resort, hardcopies may be ordered for a small fee. Send mail to info at ICSI.Berkeley.EDU for more information. 
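The contradiction-detection behaviour described in the abstract can be caricatured in a few lines. The sketch below is a deliberate simplification of my own, not the SHRUTI architecture (no temporal synchrony, no collector/enabler nodes, no n-ary binding): each fact carries an explicit sign, long-term memory accepts inconsistent facts without complaint, and a clash is noticed only when both signs of the same proposition become co-active within a bounded number of inference steps.

```python
from collections import deque

class ToyMemory:
    """A cartoon of negation-aware memory: facts are signed triples like
    ("+", "flies", "tweety"); storage never checks global consistency."""

    def __init__(self):
        self.facts = set()
        self.rules = []                      # list of (antecedent, consequent)

    def assert_fact(self, fact):
        self.facts.add(fact)                 # the agent is not "aware" of clashes

    def add_rule(self, ante, cons):
        self.rules.append((ante, cons))

    def query(self, fact, depth=3):
        """Spread activation for at most `depth` steps; report whether `fact`
        became active and which propositions had both signs co-active."""
        active = set(self.facts)
        frontier = deque((f, 0) for f in self.facts)
        while frontier:
            f, d = frontier.popleft()
            if d >= depth:
                continue
            for ante, cons in self.rules:
                if ante == f and cons not in active:
                    active.add(cons)
                    frontier.append((cons, d + 1))
        clashes = {(p, a) for (s, p, a) in active
                   if (("-" if s == "+" else "+"), p, a) in active}
        return fact in active, clashes

m = ToyMemory()
m.assert_fact(("+", "bird", "tweety"))
m.assert_fact(("+", "penguin", "tweety"))
m.add_rule(("+", "bird", "tweety"), ("+", "flies", "tweety"))
m.add_rule(("+", "penguin", "tweety"), ("-", "flies", "tweety"))
answer, clashes = m.query(("+", "flies", "tweety"))
print(answer, clashes)
```

The memory happily stores rules that derive both "flies(tweety)" and its negation; the clash surfaces only during an episode of reasoning that activates both within the depth bound, loosely mirroring the bounded "inferential distance" of the report.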
From cherkaue at cs.wisc.edu Fri Dec 15 19:03:15 1995 From: cherkaue at cs.wisc.edu (cherkaue@cs.wisc.edu) Date: Fri, 15 Dec 1995 18:03:15 -0600 Subject: No free lunch for Cross Validation! Message-ID: <199512160003.SAA03324@mozzarella.cs.wisc.edu> In reply to Huaiyu Zhu's message > ... > >A little while ago someone claimed that > Cross validation will benefit from the presence of any structure, > and if there is no structure it does no harm; > > ... > >Suppose we have a Gaussian variable x, with mean mu and unit variance. >We have the following three estimators for estimating mu from a >sample of size n. > A: The sample mean. It is optimal both in the sense of Maximum >Likelihood and Least Mean Squares. > B: The maximum of sample. It is a bad estimator in any reasonable sense. > C: Cross validation to choose between A and B, with one extra data point. > >The numerical result with n=16 and averaged over 10000 samples, gives >mean squared error: > A: 0.0627 B: 3.4418 C: 0.5646 >This clearly shows that cross validation IS harmful in this case, >despite the fact it is based on a larger sample. NFL still wins! You forgot D: Anti-cross validation to choose between A and B, with one extra data point. I don't understand your claim that "cross validation IS harmful in this case." You seem to equate "harmful" with "suboptimal." Cross validation is a technique we use to guess the answer when we don't already know the answer. You give technique A the benefit of your prior knowledge of the true answer, but C must operate without this knowledge. A fair comparison would pit C against D, not C against A. As you say: >6. In any of the above cases, "anti cross validation" would be even >more disastrous. Kevin Cherkauer Computer Sciences Dept. 
University of Wisconsin-Madison cherkauer at cs.wisc.edu From pkso at castle.ed.ac.uk Sat Dec 16 10:06:41 1995 From: pkso at castle.ed.ac.uk (P Sollich) Date: Sat, 16 Dec 95 15:06:41 GMT Subject: Thesis on Query Learning available Message-ID: <9512161506.aa29855@uk.ac.ed.castle> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/sollich.thesis.tar.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: ASKING INTELLIGENT QUESTIONS --- THE STATISTICAL MECHANICS OF QUERY LEARNING Peter Sollich Department of Physics University of Edinburgh, U.K. Abstract: This thesis analyses the capabilities and limitations of query learning by using the tools of statistical mechanics to study learning in feed-forward neural networks. In supervised learning, one of the central questions is the issue of generalization: Given a set of training examples in the form of input-output pairs produced by an unknown teacher rule, how can one generate a student which generalizes, i.e., which correctly predicts the outputs corresponding to inputs not contained in the training set? The traditional paradigm has been to study learning from random examples, where training inputs are sampled randomly from some given distribution. However, random examples contain redundant information, and generalization performance can thus be improved by query learning, where training inputs are chosen such that each new training example will be maximally 'useful' as measured by a given objective function. We examine two common kinds of queries, chosen to optimize the objective functions, generalization error and entropy (or information), respectively. 
Within an extended Bayesian framework, we use the techniques of statistical mechanics to analyse the average case generalization performance achieved by such queries in a range of learning scenarios, in which the functional forms of student and teacher are inspired by models of neural networks. In particular, we study how the efficacy of query learning depends on the form of teacher and student, on the training algorithm used to generate students, and on the objective function used to select queries. The learning scenarios considered are simple but sufficiently generic to allow general conclusions to be drawn. We first study perfectly learnable problems, where the student can reproduce the teacher exactly. From an analysis of two simple model systems, the high-low game and the linear perceptron, we conclude that query learning is much less effective for rules with continuous outputs -- provided they are `invertible' in the sense that they can essentially be learned from a finite number of training examples -- than for rules with discrete outputs. Queries chosen to minimize the entropy generally achieve generalization performance close to the theoretical optimum afforded by minimum generalization error queries, but can perform worse than random examples in scenarios where the training algorithm is under-regularized, i.e., has too much `confidence' in corrupted training data. For imperfectly learnable problems, we first consider linear students learning from nonlinear perceptron teachers and show that in this case the structure of the student space determines the efficacy of queries chosen to minimize the entropy in {\em student} space. Minimum {\em teacher} space queries, on the other hand, perform worse than random examples due to lack of feedback about the progress of the student. 
For students with discrete outputs, we find that in the absence of information about the teacher space, query learning can lead to self-confirming hypotheses far from the truth, misleading the student to such an extent that it will not approximate the teacher optimally even for an infinite number of training examples. We investigate how this problem depends on the nature of the noise process corrupting the training data, and demonstrate that it can be alleviated by combining query learning with Bayesian techniques of model selection. Finally, we assess which of our conclusions carry over to more realistic neural networks, by calculating finite size corrections to the thermodynamic limit results and by analysing query learning in a simple two-layer neural network. The results suggest that the statistical mechanics analysis is often relevant to real-world learning problems, and that the potentially significant improvements in generalization performance achieved by query learning can be made available, in a computationally cheap manner, for realistic multi-layer neural networks. Criticism, comments and suggestions are welcome. Merry Christmas everyone! Peter Sollich -------------------------------------------------------------------------- Peter Sollich Department of Physics University of Edinburgh e-mail: P.Sollich at ed.ac.uk Kings Buildings phone: +44 - (0)131 - 650 5236 Mayfield Road Edinburgh EH9 3JZ, U.K. -------------------------------------------------------------------------- RETRIEVAL INSTRUCTIONS: Get `sollich.thesis.tar.Z' from the `Thesis' subdirectory of the neuroprose archive. Uncompress, and unpack the resulting tar file (on UNIX: uncompress sollich.thesis.tar.Z; tar xf - < sollich.thesis.tar). This will yield the postscript files listed below. Contact me if there are any problems with retrieval and or printing. QUICK GUIDE for busy readers: For a first look, see sollich_title.ps (has abstract and table of contents). 
File sollich_chapter1.ps contains a general introduction to query learning and an overview of the literature. Finally, for a summary of the main results and open questions, see sollich_chapter9.ps.

LIST OF FILES:
------------------------------------------------------------------------------
Filename              Pages  KB (compressed/uncompressed)  Contents
------------------------------------------------------------------------------
sollich_title.ps         8     37/75     Title, Declaration, Acknowledgements, Publications, Abstract, Table of contents
sollich_chapter1.ps      8     48/98     Introduction
sollich_chapter2.ps     10     48/101    A probabilistic framework for query selection
sollich_chapter3.ps     21    128/376    Perfectly learnable problems: Two simple examples
sollich_chapter4.ps     19    135/337    Imperfectly learnable problems: Linear students
sollich_chapter5.ps     40    228/565    Query learning assuming the inference model is correct
sollich_chapter6.ps     12    244/1050   Combining query learning and model selection
sollich_chapter7.ps     20    217/558    Towards realistic neural networks I: Finite size effects
sollich_chapter8.ps     24    136/299    Towards realistic neural networks II: Multi-layer networks
sollich_chapter9.ps      5     31/59     Summary and Outlook
sollich_bib.ps           8     37/68     Bibliography
------------------------------------------------------------------------------
From zhuh at helios.aston.ac.uk Mon Dec 18 08:11:50 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Mon, 18 Dec 1995 13:11:50 +0000 Subject: NFL and practice Message-ID: <4332.9512181311@sun.aston.ac.uk> I accidentally sent my reply to Joerg Lemm instead of Connectionists. Since he replied on Connectionists, I'll reply here as well, and include my original posting at the end. I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use. The key difference is: Is it legitimate to be vague about the prior? Put it another way: Do you claim the algorithm can pick up whatever prior automatically, instead of having it specified beforehand? My answer is NO, to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without a prior, as NFL tells us. For purely cosmetic reasons, it might be helpful to translate the useless "No Free Lunch Theorem" :-) Without specifying a particular prior, any algorithm is as good as random guessing, into the equivalent, but infinitely more useful, "You Have To Pay For Lunch Theorem" :-) For an algorithm to perform better than random guessing, a particular prior must be specified. On a more practical level, > E.g. in nearly all cases functions are somewhat smooth. Do you specify the scale on which it is smooth? > This is a prior which exists in reality (for example because > of input noise in the measuring process). If you average smoothness over all scales, in a certain uniform way, you get a prior which contains no smoothness at all. If you average them in a non-uniform way, you actually specify a non-uniform prior, which is the crucial piece of information for any algorithm to work at all.
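[Zhu's point that "somewhat smooth" is not yet a prior can be made concrete. The sketch below is my own illustration, not from the posting: smoothness enters as the length-scale of a Gaussian covariance, and the two priors it produces are genuinely different, so an algorithm must commit to one.]

```python
# Illustrative sketch (my construction): "the function is smooth" only
# becomes a usable prior once a scale is specified.  Here the scale is the
# length-scale of an RBF covariance; samples from the two priors below
# behave very differently even though both are "smooth".
import numpy as np

def rbf_prior_samples(xs, length_scale, n_samples=3, seed=0):
    """Draw sample functions from a zero-mean Gaussian prior whose
    smoothness is fixed by `length_scale`."""
    rng = np.random.default_rng(seed)
    d = xs[:, None] - xs[None, :]
    cov = np.exp(-0.5 * (d / length_scale) ** 2)
    cov += 1e-8 * np.eye(len(xs))          # jitter for numerical stability
    return rng.multivariate_normal(np.zeros(len(xs)), cov, size=n_samples)

xs = np.linspace(0.0, 1.0, 50)
wiggly = rbf_prior_samples(xs, length_scale=0.02)  # "smooth" on a tiny scale
gentle = rbf_prior_samples(xs, length_scale=0.5)   # "smooth" on a large scale

# Mean-square increment between neighbouring grid points: large for the
# short-scale prior, tiny for the long-scale one.
print(np.mean(np.diff(wiggly) ** 2), np.mean(np.diff(gentle) ** 2))
```

Averaging such priors uniformly over all length-scales would wash the smoothness out again, which is exactly Zhu's point.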
> And the situation would be hopeless > if we would not use this fact in practice. It would still be hopeless if we only used the fact of "somewhat smooth", instead of specifying how smooth. See the following for theory and examples: Zhu, H. and Rohwer, R.: Bayesian regression filters and the issue of priors, 1995. To appear in Neural Computing and Applications. ftp://cs.aston.ac.uk/neural/zhuh/reg_fil_prior.ps.Z My original posting is enclosed as the following: ----- Begin Included Message ----- From imlm at tuck.cs.fit.edu Mon Dec 18 16:39:40 1995 From: imlm at tuck.cs.fit.edu (IMLM Workshop (pkc)) Date: Mon, 18 Dec 1995 16:39:40 -0500 Subject: CFP: AAAI-96 Workshop on Integrating Multiple Learned Models Message-ID: <199512182139.QAA10740@tuck.cs.fit.edu> CALL FOR PAPERS/PARTICIPATION INTEGRATING MULTIPLE LEARNED MODELS FOR IMPROVING AND SCALING MACHINE LEARNING ALGORITHMS to be held in conjunction with AAAI 1996 Portland, Oregon August 1996 Most modern machine learning research uses a single model or learning algorithm at a time, or at most selects one model from a set of candidate models. Recently, however, there has been considerable interest in techniques that integrate the collective predictions of a set of models in some principled fashion. With such techniques, the predictive accuracy and/or training efficiency of the overall system can often be improved, since one can "mix and match" among the relative strengths of the models being combined. The goal of this workshop is to gather researchers actively working in the area of integrating multiple learned models, to exchange ideas and foster collaborations and new research directions. In particular, we seek to bring together researchers interested in this topic from the fields of Machine Learning, Knowledge Discovery in Databases, and Statistics. Any aspect of integrating multiple models is appropriate for the workshop.
However, we intend the focus of the workshop to be improving prediction accuracies, and improving training performance in the context of large training databases. More precisely, submissions are sought in, but not limited to, the following topics:

1) Techniques that generate and/or integrate multiple learned models. In particular, techniques that do so by:
   * using different training data distributions (in particular by training over different partitions of the data)
   * using different output classification schemes (for example using output codes)
   * using different hyperparameters or training heuristics (primarily as a tool for generating multiple models)

2) Systems and architectures to implement such strategies. In particular:
   * parallel and distributed multiple learning systems
   * multi-agent learning over inherently distributed data

A paper need not be submitted to participate in the workshop, but space may be limited, so contact the organizers as early as possible if you wish to participate. The workshop format is planned to encompass a full day of half-hour presentations with discussion periods, ending with a brief period for summary and discussion of future activities. Notes or proceedings for the workshop may be provided, depending on the submissions received.

Submission requirements:
i) A short paper of not more than 2000 words detailing recent research results must be received by March 18, 1996.
ii) The paper should include an abstract of not more than 150 words, and a list of keywords. Please include the name(s), email address(es), address(es), and phone number(s) of the author(s) on the first page. The first author will be the primary contact unless otherwise stated.
iii) Electronic submissions in postscript or ASCII via email are preferred. Three printed copies (preferably double-sided) of your submission are also accepted.
iv) Please also send the title, name(s) and email address(es) of the author(s), abstract, and keywords in ASCII via email.
Submission address: imlm at cs.fit.edu Philip Chan IMLM Workshop Computer Science Florida Institute of Technology 150 W. University Blvd. Melbourne, FL 32901-6988 407-768-8000 x7280 (x8062) 407-984-8461 (fax) Important Dates: Paper submission deadline: March 18, 1996 Notification of acceptance: April 15, 1996 Final copy: May 13, 1996 Chairs: Salvatore Stolfo, Columbia University sal at cs.columbia.edu David Wolpert, Santa Fe Institute dhw at santafe.edu Philip Chan, Florida Institute of Technology pkc at cs.fit.edu General Inquiries: Please address general inquiries to one of the co-chairs or send them to: imlm at cs.fit.edu Up-to-date workshop information is maintained on WWW at: http://cs.fit.edu/~imlm/ or http://www.cs.fit.edu/~imlm/ From ces at negi.riken.go.jp Mon Dec 18 20:36:45 1995 From: ces at negi.riken.go.jp (ces@negi.riken.go.jp) Date: Tue, 19 Dec 95 10:36:45 +0900 Subject: PhD Thesis Announcement : nonlinear filters Message-ID: <9512190136.AA21982@negi.riken.go.jp>  FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z Dear fellow connectionists, the following Ph.D. thesis is now available for copying from the neuroprose archive: (Sorry, no hardcopies available.) - ----------------------------------------------------------------------- Applications of nonlinear filters with the linear-in-the-parameter structure Eng-Siong CHNG Department of Electrical Engineering University of Edinburgh, U.K. Abstract: The subject of this thesis is the application of nonlinear filters, with the linear-in-the-parameter structure, to time series prediction and channel equalisation problems. In particular, the Volterra and the radial basis function (RBF) expansion techniques are considered to implement the nonlinear filter structures. These approaches, however, will generate filters with very large numbers of parameters. 
As large filter models require significant implementation complexity, they are undesirable for practical implementations. To reduce the size of the filter, the orthogonal least squares (OLS) algorithm is considered to perform model selection. Simulations were conducted to study the effectiveness of subset models found using this algorithm, and the results indicate that this selection technique is adequate for many practical applications. The other aspect of the OLS algorithm studied is its implementation requirements. Although the OLS algorithm is very efficient, the required computational complexity is still substantial. To reduce the processing requirement, some fast OLS methods are examined. Two major applications of nonlinear filters are considered in this thesis. The first involves the use of nonlinear filters to predict time series which possess nonlinear dynamics. To study the performance of the nonlinear predictors, simulations were conducted to compare the performance of these predictors with conventional linear predictors. The simulation results confirm that nonlinear predictors normally perform better than linear predictors. Within this study, the application of RBF predictors to time series that exhibit homogeneous nonstationarity is also considered. This type of time series possesses the same characteristic throughout the time sequence apart from local variations of mean and trend. The second application involves the use of filters for symbol-decision channel equalisation. The decision function of the optimal symbol-decision equaliser is first derived to show that it is nonlinear, and that it may be realised explicitly using a RBF filter. Analysis is then carried out to illustrate the difference between the optimum equaliser's performance and that of the conventional linear equaliser. In particular, the effects of delay order on the equaliser's decision boundaries and bit error rate (BER) performance are studied. 
The minimum mean square error (MMSE) optimisation criterion for training the linear equaliser is also examined to illustrate the sub-optimum nature of such a criterion. To improve the linear equaliser's performance, a method which adapts the equaliser by minimising the BER is proposed. Our results indicate that the linear equaliser's performance is normally improved by using the minimum BER criterion. The decision feedback equaliser (DFE) is also examined. We propose a transformation using the feedback inputs to change the DFE problem to a feedforward equaliser problem. This unifies the treatment of the equaliser structures with and without decision feedback. ----------------------------------------------------------- Criticism, comments and suggestions are welcome. Merry Christmas everyone! Eng Siong - -------------------------------------------------------------------------- Eng Siong CHNG Lab. for ABS, Frontier Research Programme, RIKEN, email : ces at negi.riken.go.jp 2-1 Hirosawa, Wako-Shi, Saitama 351-01, JAPAN. - -------------------------------------------------------------------------- RETRIEVAL INSTRUCTIONS: FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/chng.thesis.ps.Z File size : 1715073 bytes Number of pages : 165 pages

unix> ftp archive.cis.ohio-state.edu
Connected to archive.cis.ohio-state.edu.
220 archive.cis.ohio-state.edu FTP server ready.
Name: anonymous
331 Guest login ok, send ident as password.
Password: neuron
230 Guest login ok, access restrictions apply.
ftp> binary
200 Type set to I.
ftp> cd pub/neuroprose/Thesis
250 CWD command successful.
ftp> get chng.thesis.ps.Z
200 PORT command successful.
150 Opening BINARY mode data connection for chng.thesis.ps.Z
226 Transfer complete.
ftp> quit
221 Goodbye.
unix> uncompress chng.thesis.ps.Z
unix> lpr chng.thesis.ps    (postscript printer)

Contact me if there are any problems with retrieval and/or printing.
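[As a concrete illustration of the linear-in-the-parameter structure described in the abstract above — a minimal sketch of my own, not code from the thesis, with all model sizes and widths chosen for illustration: a radial basis function filter is nonlinear in its inputs but linear in its weights, so fitting reduces to ordinary least squares.]

```python
# Minimal sketch (illustrative parameters, not from the thesis): an RBF
# filter for one-step-ahead time-series prediction.  The filter output is
# a weighted sum of Gaussian basis functions of the previous sample, so
# the weights are found by a single linear least-squares solve.
import numpy as np

def rbf_design_matrix(x, centres, width):
    """Each column is one Gaussian basis function evaluated on inputs x."""
    return np.exp(-((x[:, None] - centres[None, :]) ** 2) / (2 * width ** 2))

# A noisy nonlinear series: y[t] = sin(2.5 * y[t-1]) + small noise.
rng = np.random.default_rng(1)
y = np.zeros(300)
for t in range(1, 300):
    y[t] = np.sin(2.5 * y[t - 1]) + 0.05 * rng.standard_normal()

past, target = y[:-1], y[1:]
centres = np.linspace(past.min(), past.max(), 12)
Phi = rbf_design_matrix(past, centres, width=0.3)
w, *_ = np.linalg.lstsq(Phi, target, rcond=None)  # linear-in-the-parameters fit

pred = Phi @ w
print("RBF predictor MSE:", np.mean((pred - target) ** 2))
```

Model selection (e.g. by OLS, as in the thesis) would then prune the basis functions; the point here is only that the fit itself is linear.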
------- End of Forwarded Message From hag at santafe.edu Mon Dec 18 21:22:57 1995 From: hag at santafe.edu (Howard A. Gutowitz) Date: Mon, 18 Dec 1995 19:22:57 -0700 (MST) Subject: Exploring the Space of CA Message-ID: <9512190222.AA29140@sfi.santafe.edu> Announcing: "Exploring the Space of Cellular Automata" Cellular automata can be thought of as a restricted kind of neural net, in which the cells take on only a finite set of values, and connections are local and regular. This is a set of interactive web pages designed to help you learn about CA, and the use of the lambda parameter to find critical regions in the space of CA. Credits: Concept: Chris Langton. CA simulation program: Patrick Hayden. cgi interface: Eric Carr. Text: Chris Langton, Howard Gutowitz, and Eric Carr. Available from: http://alife.santafe.edu/alife/topics/ca/caweb -- Howard Gutowitz | hag at neurones.espci.fr ESPCI | http://www.santafe.edu/~hag Laboratoire d'Electronique | home: (331) 4707-3843 10 rue Vauquelin | office: (331) 4079-4697 75005 Paris, France | fax: (331) 4079-4425 From hicks at cs.titech.ac.jp Mon Dec 18 23:58:07 1995 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Tue, 19 Dec 1995 13:58:07 +0900 Subject: NFL, practice, and CV Message-ID: <199512190458.NAA28669@euclid.cs.titech.ac.jp> Huaiyu Zhu wrote: >You can't make every term positive in your balance sheet, if the grand >total is bound to be zero. There ARE functions which are always non-negative, but which under an appropriate measure integrate to 0. It only requires that 1) the support of the non-negative values is vanishingly small, 2) the non-negative values are bounded So the above statement by Dr. Zhu is not true. In fact I think this ability for pointwise positive values to disappear under integration is key to the "zero-sum" aspect of the NFL theorem holding true, despite the fact that we obviously see so many examples of working algorithms.
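[A textbook measure-theory example of the kind of function Hicks describes, added here for concreteness (it is not from the posting): the indicator of the rationals is bounded and non-negative, strictly positive on a dense set of arguments, and still has Lebesgue integral zero, because that set has measure zero.]

```latex
% Added illustration (standard measure-theory example, not from the posting):
% a bounded, non-negative function, strictly positive on a dense set, whose
% integral nevertheless vanishes -- matching conditions 1) and 2) above.
f(x) \;=\;
\begin{cases}
  1, & x \in \mathbb{Q} \cap [0,1],\\
  0, & \text{otherwise},
\end{cases}
\qquad\text{yet}\qquad
\int_0^1 f(x)\,dx \;=\; 0 .
```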
My key point: A zero-sum (infinite) universe doesn't require negative values. ---- There is another important issue which needs to be clarified, and that is the definition of CV and the kinds of problems to which it can be applied. Now anybody can make whatever definition they want, and then come to some conclusions based upon that definition, and that conclusion may be correct given that definition. However, there are also advantages to sharing a common intellectual currency. I quote below from "An Introduction to the Bootstrap" by Efron and Tibshirani, 1993, Chapter 17.1. It describes well what I meant when I talked about monitoring prediction error in a previous posting, and describes CV as a method for doing that. ================================================== In our discussion so far we have focused on a number of measures of statistical accuracy: standard errors, biases, and confidence intervals. All of these are measures of accuracy for parameters of a model. Prediction error is a different quantity that measures how well a model predicts the response value of a future observation. It is often used for model selection, since it is sensible to choose a model that has the lowest prediction error among a set of candidates. Cross-validation is a standard tool for estimating prediction error. It is an old idea (predating the bootstrap) that has enjoyed a comeback in recent years with the increase in available computing power and speed. In this chapter we discuss cross-validation, the bootstrap, and some other closely related techniques for estimation of prediction error. In regression models, prediction error refers to the expected squared difference between a future response and its prediction from the model: PE = E(y - \hat{y})^2. The expectation refers to repeated sampling from the true population. Prediction error also arises in the classification problem, where the response falls into one of k unordered classes.
For example, the possible responses might be Republican, Democrat, or Independent in a political survey. In classification problems prediction error is commonly defined as the probability of an incorrect classification PE = Prob(\hat{y} \neq y), also called the misclassification rate. The methods described in this chapter apply to both definitions of prediction error, and also to others. ================================================== Craig Hicks Tokyo Institute of Technology From zhuh at helios.aston.ac.uk Tue Dec 19 10:14:20 1995 From: zhuh at helios.aston.ac.uk (zhuh) Date: Tue, 19 Dec 1995 15:14:20 +0000 Subject: NFL, practice, and CV Message-ID: <8208.9512191514@sun.aston.ac.uk> This is in reply to the criticism by Craig Hicks and Kevin Cherkauer, and will be my last posting in this thread. Craig Hicks thought that my statement (A) > >You can't make every term positive in your balance sheet, if the grand > >total is bound to be zero. is contradictory to his statements (B) > There ARE functions which are always non-negative, but which under > an appropriate measure integrate to 0. > It only requires that > > 1) the support of the non-negative values is vanishingly small, > 2) the non-negative values are bounded But they are actually talking about different things. There is a big difference between positive and non-negative. For all practical purposes, the functions described by (B) can be regarded as identically zero. Translating back to the original topic, statement (B) becomes (C) There are algorithms which are always no worse than random guessing, on any prior, provided that 1) The priors on which it performs better than random guessing have zero probability of occurring in practice. 2) It cannot be infinitely better on these priors. It is true that something improbable may still be possible, but this is only of academic interest.
In most modern treatments of function spaces, functions are only identified up to a set of measure zero, so that phrases like "almost everywhere" or "almost surely" are redundant. I suspect that due to the way the NFL theorems are proved, even (C) is impossible, but this does not matter anyway, because (C) itself is of no practical interest whatsoever. > ---- Considering cross validation, Craig wrote > > There is another important issue which needs to be clarified, and that is the > definition of CV and the kinds of problems to which it can be applied. Now > anybody can make whatever definition they want, and then come to some > conclusions based upon that definition, and that conclusion may be correct > given that definition. However, there are also advantages to sharing a common > intellectual currency. Risking a little over-simplification, I would like to summarise the two usages of CV as follows: (CV1) A method for evaluating estimates, (CV2) A method for evaluating estimators. The key difference is that in (CV1) a decision is made for each sample, while in (CV2) a decision is made for all samples. If (CV1) is applied to two algorithms A and B, then we can always define a third algorithm C, by always choosing the estimate given by either A or B which is favoured by (CV1). But my previous counter-example shows that, averaging over all samples, C can be worse than A. One may seek refuge in statements like "optimal decision for each sample does not mean optimal decision for all samples". Well, such incoherent inference is the defining characteristic of non-Bayesian statistics. In Bayesian decision theory it is well known that a method is optimal iff it is optimal on almost all samples (excluding various measure-zero anomalies). The case of (CV2) is quite different. It is of a higher level than algorithms like A and B.
It is in fact a statistical estimator mapping (D,A,f) to a real number r, where D is a finite data set, A is a given algorithm, f is an objective function, and r is the predicted average performance. It should therefore be compared with other such methods. This appears not to be a topic considered in this discussion. -------------- Kevin Cherkauer wrote > > You forgot > > D: Anti-cross validation to choose between A and B, with one extra data > point. Well, I did not forget that, as you have quoted below, point 6. > > I don't understand your claim that "cross validation IS harmful in this case." > You seem to equate "harmful" with "suboptimal." See my original answer, points 1. and 4. > Cross validation is a technique > we use to guess the answer when we don't already know the answer. This is true for any statistical estimator. > You give > technique A the benefit of your prior knowledge of the true answer, but C must > operate without this knowledge. The prior knowledge is that the distribution is a unit Gaussian with unspecified mean; the true answer is its mean. No, they are not the same thing. C also operates with the knowledge that the distribution is a unit Gaussian, but it refuses to use this knowledge (which implies A is better than B). Instead, it insists on evaluating A and B on a cross-validation set. That's why it performs miserably. > A fair comparison would pit C against D, not C > against A. As you say: > > >6. In any of the above cases, "anti cross validation" would be even > >more disastrous. If the definition were that "An algorithm is good if it is no worse than the worst algorithm", then I would have no objection. Well, almost any algorithm would be good in this sense. However, if the phrase "in any of the above cases" is dropped without putting a prior restriction in place as a remedy, then it is also true that every algorithm is as bad as the worst algorithm. Huaiyu PS.
I think I have already talked enough about this subject so I'll shut up from now on, unless there's anything new to say. More systematic treatment of these subjects instead of counter-examples can be found in the ftp site below. -- Huaiyu Zhu, PhD email: H.Zhu at aston.ac.uk Neural Computing Research Group http://neural-server.aston.ac.uk/People/zhuh Dept of Computer Science ftp://cs.aston.ac.uk/neural/zhuh and Applied Mathematics tel: +44 121 359 3611 x 5427 Aston University, fax: +44 121 333 6215 Birmingham B4 7ET, UK From minton at ISI.EDU Tue Dec 19 14:53:27 1995 From: minton at ISI.EDU (minton@ISI.EDU) Date: Tue, 19 Dec 95 11:53:27 PST Subject: JAIR article Message-ID: <9512191953.AA11913@sungod.isi.edu> Readers of this mailing list may be interested in the following JAIR article, which was just published: Weiss, S.M. and Indurkhya, N. (1995) "Rule-based Machine Learning Methods for Functional Prediction", Volume 3, pages 383-403. PostScript: volume3/weiss95a.ps (527K) compressed, volume3/weiss95a.ps.Z (166K) Abstract: We describe a machine learning method for predicting the value of a real-valued function, given the values of multiple input variables. The method induces solutions from samples in the form of ordered disjunctive normal form (DNF) decision rules. A central objective of the method and representation is the induction of compact, easily interpretable solutions. This rule-based decision model can be extended to search efficiently for similar cases prior to approximating function values. Experimental results on real-world data demonstrate that the new techniques are competitive with existing machine learning and statistical methods and can sometimes yield superior regression performance. 
The PostScript file is available via: -- comp.ai.jair.papers -- World Wide Web: The URL for our World Wide Web server is http://www.cs.washington.edu/research/jair/home.html -- Anonymous FTP from either of the two sites below: CMU: p.gp.cs.cmu.edu directory: /usr/jair/pub/volume3 Genoa: ftp.mrg.dist.unige.it directory: pub/jair/pub/volume3 -- automated email. Send mail to jair at cs.cmu.edu or jair at ftp.mrg.dist.unige.it with the subject AUTORESPOND, and the body GET VOLUME3/FILE-NM (e.g., GET VOLUME3/MOONEY95A.PS) Note: Your mailer might find our files too large to handle. Also, note that compressed files cannot be emailed, since they are binary files. -- JAIR Gopher server: At p.gp.cs.cmu.edu, port 70. For more information about JAIR, check out our WWW or FTP sites, or send electronic mail to jair at cs.cmu.edu with the subject AUTORESPOND and the message body HELP, or contact jair-ed at ptolemy.arc.nasa.gov. From lucas at scr.siemens.com Tue Dec 19 12:26:15 1995 From: lucas at scr.siemens.com (Lucas Parra) Date: Tue, 19 Dec 1995 12:26:15 -0500 (EST) Subject: Preprint: Symplectic Nonlinear Component Analysis Message-ID: <199512191726.MAA04146@owl.scr.siemens.com> Dear fellow connectionists, a preprint of the following NIPS*95 paper is available at: ftp://archive.cis.ohio-state.edu/pub/neuroprose/parra.nips95.ps.Z Symplectic Nonlinear Component Analysis Lucas C. Parra Siemens Corporate Research lucas at scr.siemens.com Statistically independent features can be extracted by finding a factorial representation of a signal distribution. Principal Component Analysis (PCA) accomplishes this for linear correlated and Gaussian distributed signals. Independent Component Analysis (ICA), formalized by Comon (1994), extracts features in the case of linear statistical dependent but not necessarily Gaussian distributed signals. Nonlinear Component Analysis finally should find a factorial representation for nonlinear statistical dependent distributed signals. 
This paper proposes for this task a novel feed-forward, information conserving, nonlinear map - the explicit symplectic transformations. It also solves the problem of non-Gaussian output distributions by considering single coordinate higher order statistics. From jlm at crab.psy.cmu.edu Wed Dec 20 18:16:31 1995 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Wed, 20 Dec 95 18:16:31 EST Subject: Technical Report Available Message-ID: <9512202316.AA19275@crab.psy.cmu.edu.psy.cmu.edu> The following Technical Report is available electronically from our FTP server or in hard copy form. Instructions for obtaining copies may be found at the end of this post. ======================================================================== On the Time Course of Perceptual Choice: A Model Based on Principles of Neural Computation Marius Usher & James L. McClelland Carnegie Mellon University and the Center for the Neural Basis of Cognition Technical Report PDP.CNS.95.5 December 1995 The time course of information processing is discussed in a model based on leaky, stochastic, non-linear accumulation of activation in mutually inhibitory processing units. The model addresses data from choice tasks using both time-controlled (e.g., deadline or response signal) and standard reaction time paradigms, and accounts simultaneously for aspects of data from both paradigms. In special cases, the model becomes equivalent to a classical diffusion process, but in general a more complex type of diffusion occurs. Mutual inhibition counteracts the effects of information leakage, allows flexible choice behavior regardless of the number of alternatives, and contributes to accounts of additional data from tasks requiring choice with conflict stimuli and word identification tasks. 
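[The leaky, stochastic, mutually inhibitory accumulation described in the abstract can be sketched in a few lines. This is my own illustration of the general idea; all parameter names and values are assumptions, not taken from the technical report.]

```python
# Illustrative sketch (parameters invented): leaky, stochastic accumulators
# with mutual inhibition race to a response threshold; the winner is the
# choice and the crossing time is the decision time.
import numpy as np

def race(inputs, leak=0.2, inhibition=0.3, noise=0.1,
         threshold=1.0, dt=0.01, max_steps=10000, seed=2):
    rng = np.random.default_rng(seed)
    x = np.zeros(len(inputs))
    for step in range(max_steps):
        lateral = inhibition * (x.sum() - x)       # inhibition from the others
        dx = (np.asarray(inputs) - leak * x - lateral) * dt \
             + noise * np.sqrt(dt) * rng.standard_normal(len(x))
        x = np.maximum(x + dx, 0.0)                # activations stay non-negative
        if x.max() >= threshold:
            return int(np.argmax(x)), (step + 1) * dt   # (choice, decision time)
    return int(np.argmax(x)), max_steps * dt           # deadline reached

choice, rt = race([1.0, 0.6])   # unit 0 receives the stronger evidence
print(choice, rt)
```

With inhibition set to zero and two units, the difference of the accumulators behaves like a classical diffusion, which is the special case the abstract mentions.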
======================================================================

Retrieval information for pdp.cns TRs:

unix> ftp 128.2.248.152          # hydra.psy.cmu.edu
Name: anonymous
Password:
ftp> cd pub/pdp.cns
ftp> binary
ftp> get pdp.cns.95.5.ps.Z       # gets this tr
ftp> quit
unix> zcat pdp.cns.95.5.ps.Z | lpr   # or however you print postscript

NOTE: The compressed file is 567,075 bytes long. Uncompressed, the file is 1,768,398 bytes long. The printed version is 53 total pages long.

For those who do not have FTP access, physical copies can be requested from Barbara Dorney.

For a list of available PDP.CNS Technical Reports: > get README
For the titles and abstracts: > get ABSTRACTS

From dhw at santafe.edu Wed Dec 20 20:00:48 1995
From: dhw at santafe.edu (David Wolpert)
Date: Wed, 20 Dec 95 18:00:48 MST
Subject: NFL once again, I'm afraid
Message-ID: <9512210100.AA06007@sfi.santafe.edu>

First and foremost, I would like to request that this NFL thread fade out. It is only sowing confusion - people should read the papers on NFL to understand NFL.

[[ Moderator's note: I concur. We've had enough "No Free Lunch" discussion for a while; people are starting to protest. Future discussion should be done in email. -- Dave Touretzky, CONNECTIONISTS moderator ]]

Full stop. *After* that, after there is common grounding, we can all debate. There is much else that connectionists is more appropriate for in the meantime. (To repeat: ftp.santafe.edu, pub/dhw_ftp, nfl.1.ps.Z and nfl.2.ps.Z.) Please, I'm on my knees, use the time that would have been spent thrashing at connectionists in a more fruitful fashion. Like by reading the NFL papers.
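For readers who want the core uniform-prior claim in concrete form before opening the papers, here is a toy enumeration. It is our own illustration, not an example from the NFL papers; the two learners ("majority" and its deliberately perverse "anti-majority" twin) are made up for the demonstration. Averaged uniformly over every Boolean target on a five-point input space, both have exactly 50% off-training-set error.

```python
from itertools import product

X = range(5)                 # tiny input space
train = [0, 1, 2]            # training inputs; test on the rest (off-training-set)
test = [x for x in X if x not in train]

def majority(labels):        # predict the most common training label everywhere
    return int(sum(labels) * 2 >= len(labels))

def anti_majority(labels):   # the deliberately "bad" learner
    return 1 - majority(labels)

def mean_ots_error(learner):
    """Zero-one off-training-set error, averaged uniformly over
    every possible target f: X -> {0, 1}."""
    errs = []
    for f in product([0, 1], repeat=len(X)):   # all 2^5 targets
        guess = learner([f[x] for x in train])
        errs.append(sum(guess != f[x] for x in test) / len(test))
    return sum(errs) / len(errs)

# Both come out to exactly 0.5: off the training set, under the uniform
# average over targets, the training labels carry no usable information.
```

The same enumeration with any other pair of learners (including one driven by cross-validation over candidate algorithms) gives the same tie, which is the point of the uniform-prior NFL theorem.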
:-)

***

Hicks writes:

>>> case 1: *Either the target function is (noise/uncompressible/has no structure), or none of the candidate functions have any correlation with the target function.* Since CV provides an estimated prediction error, it can also tell us "you might as well be using anti-cross validation, or random selection for that matter, because it will be equally useless". >>>

This is wrong. Construct the following algorithm: "If CV says one of the algorithms under consideration has particularly low error in comparison to the others, use that algorithm. Otherwise, choose randomly among the algorithms." Averaged over all targets, this will do exactly as well as the algorithm that always guesses randomly among the algorithms. (For zero-one loss, either OTS error or IID error with a big input space, etc.) So you cannot rely on CV's error estimate *at all* (unless you impose a prior over targets or some such).

Alternatively, keep in mind the following simple argument: In its uniform-prior(targets) formulation, NFL holds even for error distributions conditioned on *any* property of the training set. So in particular, you can condition on having a training set for which CV says "yep, I'm sure; choose that one". And NFL still holds. So even in those cases where CV "is sure", by following CV you'll die as often as not.

>>> case 2: *The target (is compressible/has structure), and some of the candidate functions are positively correlated with the target function.* In this case CV will outperform anti-CV (ON AVERAGE). >>>

This is wrong. As has been mentioned many times, having structure in the target, by itself, gains you nothing. And as has also been mentioned, if "the candidate functions are positively correlated with the target function", then in fact *anti-CV wins*. READ THE PAPERS.

>>> By ON AVERAGE I mean the expectation across the ensemble of samples for a FIXED target function.
This is different from the ensemble and distribution of target functions, which is a much bigger question. >>>

This distinction is irrelevant. There are versions of NFL that address both of these cases (as well as many others). READ THE PAPERS.

*****

Lemm writes:

>>> 1.) In short, NFL assumes that data, i.e. information of the form y_i = f(x_i), do not contain information about function values on a non-overlapping test set. >>>

This is wrong. See all the previous discussion about how NFL holds even if you restrict yourself to targets with a lot of structure. The problem is that the structure can hurt just as easily as help. There is no need for the data set to contain no information about the test set - simply that the limited types of information can "confuse" the learning algorithm at hand. READ THE PAPERS.

>>> This is done by postulating "unrestricted uniform" priors, or uniform hyperpriors over nonuniform priors... >>>

This is wrong. There is (obviously) a version of NFL that holds for uniform priors. And there is another version in which one averages over all priors - so the uniform prior has measure 0. But one can also restrict oneself to averaging only over those priors "with a lot of structure", and again get NFL. And there are many other versions of NFL in which there is *no* prior, because things are conditioned on a fixed target. Exactly as in (non-Bayesian) sampling-theory statistics. Some of those alternative NFL results involve saying "if you're conditioning on a target, there are as many such targets where you die as where you do well". Other NFL results never vary the target *in any sense*, even to compare different targets. Rather they vary something concerning the generalizer. This is the case with the more sophisticated cross-validation results, for example. READ THE PAPERS.

>>> There is much information which is not of this "single sharp data" type. (See examples below.)
>>>

*Obviously* if you have extra information and/or knowledge beyond that in the training set, you can (often) do better than random. That's what Bayesian analysis is all about. More generally, as I have proven in [1], the probability of error can be written as a non-Euclidean inner product between the learning algorithm and the posterior. So obviously if your posterior is structured in an appropriate manner, that can be exploited by the algorithm. This was never the issue, however. The issue had to do with "blind" supervised learning, in which one has no such additional information. Like in COLT, for example. You're arguing apples and oranges here.

>>> 4) Real measurements (especially of continuous variables) normally do also NOT have the form y_i = f(x_i)! They mostly perform some averaging over f(x_i), or at least they have some noise on the x_i (as small as you like, but present). >>>

Again, this is obvious. And stated explicitly in the papers, moreover. And completely irrelevant to the current discussion. The issue at hand has *always* been "sharp" data. And if you look at what's done in the neural net community, or in COLT, 95% of it assumes "sharp data". Indeed, there are many other assumptions almost always made and almost never true that Lemm has missed. Like making a "weak filtering assumption": assuming the target and the distribution over inputs are independent. But again, just like in COLT, we're starting simple here, with such assumptions intact. READ THE PAPERS.

>>> This shows that smoothness of the expectation (in contrast to uniform priors) is the result of the measurement process and therefore is a real phenomenon for "effective" functions. >>>

To give one simple example, what about categorical data, where there is not even a partial ordering over the inputs? What does "locally smooth" even mean then?
And even if we're dealing with real-valued spaces, if there's input-space noise, NFL simply changes to a statement concerning test-set elements that are sufficiently far (on the scale of the input-space noise) from the elements of the training set. The input-space noise makes the math messier, but doesn't change the underlying phenomenon. (Readers interested in previous work on the relationship between local (!) regularization, smoothness, and input noise should see Bishop's Neural Computation article of about 6 months ago.)

>>> Even more: situations without "priors" are VERY artificial. So if we specify the "priors" (and the lesson from NFL is that we should if we want to make a good theory) then we cannot use NFL anymore. (What should it be used for then?) >>>

Sigh.

1) I am a Bayesian whenever feasible. (In fact, I've been taken to task for being "too Bayesian".) But situations without obvious priors - or where eliciting the priors is not trivial and you don't have the time - are in fact *very* common. A simple example is a project I am currently involved in for detecting phone fraud for MCI. Quick, tell me the prior probability that a fraudulent call arises from area code 617 vs. the prior probability that a non-fraudulent call does...

2) Essentially all of COLT is non-Bayesian. (Although some of it makes assumptions about things like the support of the priors.) You haven't a prayer of really understanding what COLT has to say without keeping in mind the admonitions of NFL.

3) As I've now said until I'm blue in the face, NFL is only the starting point. What it's "good for", beyond proving to people that they must pay attention to their assumptions, be wary of COLT-type claims, etc., is: head-to-head minimax theory, scrambled-algorithms theory, hypothesis-averaging theory, etc., etc., etc. READ THE PAPERS.

****

Zhu writes:

>>> I quite agree with Joerg's observation about learning algorithms in practice, and the priors they use.
The key difference is: Is it legitimate to be vague about the prior? Put it another way: Do you claim the algorithm can pick up whatever prior automatically, instead of its being specified beforehand? My answer is NO, to both questions, because for an algorithm to be good on any prior is exactly the same as for an algorithm to be good without a prior, as NFL told us. >>>

Yes! Everybody, LISTEN TO ZHU!!!!

David Wolpert

[1] Wolpert, D. "The Relationship Between PAC, the Statistical Physics Framework, the Bayesian Framework, and the VC Framework", in "The Mathematics of Generalization", D. Wolpert (Ed.), Addison-Wesley, 1995.

From terry at salk.edu Wed Dec 20 20:34:15 1995
From: terry at salk.edu (Terry Sejnowski)
Date: Wed, 20 Dec 95 17:34:15 PST
Subject: Senior Position at GSU
Message-ID: <9512210134.AA16333@salk.edu>

Forwarded to Connectionists:

Date: Mon, 18 Dec 1995 15:00:23 -0500 (EST)
From: Donald Edwards
Subject: job

Dear friends and colleagues,

I am writing to let you know of a senior position in computational neuroscience available here in the Department of Biology at Georgia State University. This person would join neurobiologists, physicists, mathematicians and computer scientists in the newly established Center for Neural Communication and Computation, and would participate in the graduate program in Neurobiology in the Department of Biology. This person would also help guide the construction, equipping and staffing of a Laboratory for Computational Neuroscience, for which funds have already been obtained from the Georgia Research Alliance.

Georgia State University is located in downtown Atlanta. For more information, please contact me at this address, or call (404) 651-3148. To apply, please send a letter of intent, c.v., and two letters of reference to: Search Committee for Computational Neuroscience, Department of Biology, Georgia State University, Atlanta, GA 30302-4010. FAX: (404) 651-2509. Please share this message with anyone who might be interested.
Thanks for your consideration,
Don Edwards

From erik at kuifje.bbf.uia.ac.be Thu Dec 21 12:48:50 1995
From: erik at kuifje.bbf.uia.ac.be (Erik De Schutter)
Date: Thu, 21 Dec 95 17:48:50 GMT
Subject: Crete Course in Computational Neuroscience
Message-ID: <9512211748.AA27308@kuifje.bbf.uia.ac.be>

CRETE COURSE IN COMPUTATIONAL NEUROSCIENCE
AUGUST 25 - SEPTEMBER 21, 1996
CRETE, GREECE

DIRECTORS: Erik De Schutter (University of Antwerp, Belgium), Idan Segev (Hebrew University, Jerusalem, Israel), Jim Bower (California Institute of Technology, USA), Adonis Moschovakis (University of Crete, Greece)

The Crete Course in Computational Neuroscience introduces students to the practical application of computational methods in neuroscience, in particular how to create biologically realistic models of neurons and networks. The course consists of two complementary parts. A distinguished international faculty gives morning lectures on topics in experimental and computational neuroscience. The rest of the day is spent learning how to use simulation software and how to implement a model of the system the student wishes to study.

The first week of the course introduces students to the most important techniques in modeling single cells, networks and neural systems. Students learn how to use the GENESIS, NEURON, XPP and other software packages on their individual Unix workstations. During the following three weeks the lectures will be more general, moving from modeling single cells and subcellular processes through the simulation of simple circuits and large neuronal networks and, finally, to system-level models of the cortex and the brain. The course ends with a presentation of the student modeling projects.

The Crete Course in Computational Neuroscience is designed for advanced graduate students and postdoctoral fellows in a variety of disciplines, including neurobiology, physics, electrical engineering, computer science and psychology.
Students are expected to have a basic background in neurobiology as well as some computer experience. A total of 25 students will be accepted, the majority of whom will be from the European Union and affiliated countries. A tuition fee of 500 ECU ($700) covers travel to Crete, lodging and all course-related expenses for European nationals. We encourage students from the Far East and the USA to also apply to this international course.

More information and application forms can be obtained:
- WWW access: http://bbf-www.uia.ac.be/CRETE/Crete_index.html
- by mail: Prof. E. De Schutter, Born-Bunge Foundation, University of Antwerp - UIA, Universiteitsplein 1, B2610 Antwerp, Belgium
- email: crete_course at kuifje.bbf.uia.ac.be

APPLICATION DEADLINE: April 10th, 1996. Applicants will be notified of the results of the selection procedures before May 1st.

FACULTY: M. Abeles (Hebrew University, Jerusalem, Israel), D.J. Amit (University of Rome, Italy and Hebrew University, Israel), R.E. Burke (NIH, USA), C.E. Carr (University of Maryland, USA), A. Destexhe (Université Laval, Canada), R.J. Douglas (Institute of Neuroinformatics, Zurich, Switzerland), T. Flash (Weizmann Institute, Rehovot, Israel), A. Grinvald (Weizmann Institute, Israel), J.J.B. Jack (Oxford University, England), C. Koch (California Institute of Technology, USA), H. Korn (Institut Pasteur, France), A. Lansner (Royal Institute of Technology, Sweden), R. Llinas (New York University, USA), E. Marder (Brandeis University, USA), M. Nicolelis (Duke University, USA), J.M. Rinzel (NIH, USA), W. Singer (Max-Planck Institute, Frankfurt, Germany), S. Tanaka (RIKEN, Japan), A.M. Thomson (Royal Free Hospital, England), S. Ullman (Weizmann Institute, Israel), Y. Yarom (Hebrew University, Israel).

The Crete Course in Computational Neuroscience is supported by the European Commission (4th Framework Training and Mobility of Researchers program) and by The Brain Science Foundation (Tokyo).
Local administrative organization: the Institute of Applied and Computational Mathematics of FORTH (Crete, GR).

From udah075 at kcl.ac.uk Thu Dec 21 12:53:21 1995
From: udah075 at kcl.ac.uk (Rasmus Petersen)
Date: Thu, 21 Dec 95 17:53:21 GMT
Subject: studentships for European students
Message-ID: <3027.9512211753@maths1.mth.kcl.ac.uk>

**************************************************************

Studentships - For EU Students - Please note new age limit

It was agreed by the Human Resources Committee and endorsed by the Executive Board of NEuroNet in Paris that up to 10,000 ECU be allocated for studentships each year. These provide support for registration, accommodation and travel to designated workshops and conferences with a significant tutorial component. (The studentships are a fixed value.)

Up to 22 studentships of 450 ECU each will be available for the NEuroFuzzy '96 workshop and tutorials in Prague from 16th-18th April 1996. Applications for these studentships must be received in the NEuroNet Office before 31st December 1995. Successful applicants will be notified in January 1996.

Up to 20 studentships of 500 ECU each will be available for the ICANN '96 conference in Bochum, Germany from 16th-19th July 1996. Applications for these studentships must be received in the NEuroNet Office before 3rd March 1996. Successful applicants will be notified in April 1996.

Applicants for studentships are limited to full-time students who are EU nationals and aged 30 years or less. (Priority will be given to applicants aged under 25.) All applications should be accompanied by a letter of support from the applicant's Head of Department and should contain verification of the applicant's age, status as a student, and nationality. All applications will be reviewed by the Human Resources Committee of NEuroNet.
Please apply in writing to the NEuroNet Administrator:

Ms Terhi Garner
NEuroNet
Department of Electronic and Electrical Engineering
King's College London
Strand, London WC2R 2LS, UK
Fax: +44 (0) 171 873 2559

***********************************************************************

From dhw at santafe.edu Fri Dec 29 19:54:42 1995
From: dhw at santafe.edu (dhw@santafe.edu)
Date: Fri, 29 Dec 95 17:54:42 MST
Subject: Postdoc opening
Message-ID: <9512300054.AA17781@yaqui>

The Santa Fe Institute is soliciting applications for a TXN postdoctoral fellow. The fellow is expected to perform research in Machine Learning, Artificial Intelligence, or related areas of statistics. Information about the SFI can be found at http://www.santafe.edu/.

Candidates should have a Ph.D. (or expect to receive one soon) and should have backgrounds in computer science, mathematics, statistics, or related fields. Applicants should submit a curriculum vitae, list of publications, statement of research interests, and three letters of recommendation. Please submit your materials in one complete package. Incomplete applications will not be considered. All application materials must be received by March 1, 1996. Decisions will be made by April 1996.

Send complete application packages only, preferably hard copy, to:

TXN Postdoctoral Committee
Attention: David Wolpert
Santa Fe Institute
1399 Hyde Park Road
Santa Fe, New Mexico 87501

Include your e-mail address and/or fax number. The SFI is an equal opportunity employer. Women and minorities are encouraged to apply.

From bozinovs at delusion.cs.umass.edu Sun Dec 31 17:55:53 1995
From: bozinovs at delusion.cs.umass.edu (bozinovs@delusion.cs.umass.edu)
Date: Sun, 31 Dec 1995 17:55:53 -0500
Subject: New Book
Message-ID: <9512312255.AA25407@delusion.cs.umass.edu>

Dear Connectionists,

Happy New Year to everybody! At the end of the year it is my pleasure to announce a new book in the field.
Advertisement:

*********************************************************************
New Book! New Book! New Book! New Book! New Book! New Book!
---------------------------------------------------------------------

CONSEQUENCE DRIVEN SYSTEMS
by Stevo Bozinovski

*201 pages *79 figures *27 algorithm descriptions *8 tables

Among its special features, the book:
---------------------------------------
** provides a unified theory of response-sensitive teaching and learning
** as a result of that theory, describes a generic architecture of a neuro-genetic agent capable of performing in 1) consequence-sensitive teaching, 2) reinforcement learning, and 3) self-reinforcement learning paradigms
** describes the Crossbar Adaptive Array (CAA) architecture, a 1981 neural network developed within the Adaptive Networks Group, as an example of a neuro-genetic agent
** explains how the CAA architecture was the first neural network that solved a delayed reinforcement learning task, the Dungeons-and-Dragons task, in 1981
** explains how the 1981 learning method (shown on the cover of the book) is actually the well-known Q-learning method, rediscovered in 1989
** introduces the Benefit-Cost CAA (B-C CAA) as an extension of the 1981 Benefit-only CAA architecture
** introduces the at-subgoal-go-back algorithm as a modification of the 1981 at-goal-go-back CAA algorithm
** introduces a new type of neuron, denoted the Provoking Adaptive Unit, for dealing with tasks of Distributed Consequence Programming
** illustrates the usage of those neurons as routers in a routing-in-networks-with-faults task
** uses parallel programming techniques in describing the algorithms throughout the book
-----------------------------------------

Ordering information:
ISBN 9989-684-06-5, Gocmar Press, 1995
price: $15, paperback

For further information contact the author: bozinovs at cs.umass.edu
**********************************************************************
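For readers unfamiliar with the Q-learning method the announcement refers to, it is the one-step tabular update Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)). The following sketch of that update is our own illustration on a made-up four-state corridor task, not an example from the book; all names and parameter values are assumptions.

```python
import random

# A 4-state corridor: actions 0 (left) / 1 (right); reward 1.0 on reaching state 3.
def step(state, action):
    nxt = max(0, state - 1) if action == 0 else min(3, state + 1)
    return nxt, (1.0 if nxt == 3 else 0.0), nxt == 3

def q_learn(episodes=500, alpha=0.5, gamma=0.9, eps=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    random.seed(seed)
    Q = [[0.0, 0.0] for _ in range(4)]          # Q[state][action]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[s][act])
            s2, r, done = step(s, a)
            # the one-step Q-learning update (terminal states bootstrap to 0)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) * (not done) - Q[s][a])
            s = s2
    return Q

Q = q_learn()
```

After training, the greedy policy is "go right" in every non-terminal state, with Q-values discounted by gamma per step of distance from the goal.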
CONTENTS:

1. INTRODUCTION
   1.1. The framework
   1.2. Agents and architectures
   1.3. Neural architectures
        1.3.1. Greedy policy neural architectures
        1.3.2. Recurrent architectures
        1.3.3. Crossbar architectures
        1.3.4. Subsumption architecture adaptive arrays
   1.4. Problems. Emotional Graphs
   1.5. Games. Emotional Petri Nets
   1.6. Parallel programming
   1.7. Bibliographical and other notes
2. CONSEQUENCE LEARNING AGENTS: A STRUCTURAL THEORY
   2.1. The agent-environment interface
   2.2. A taxonomy of learning paradigms
   2.3. Classes of consequence learning agents
   2.4. A generic consequence learning architecture
   2.5. Learning rules and routines
   2.6. Bibliographical and other notes
3. CONSEQUENCE DRIVEN TEACHING
   3.1. Class T agents
   3.2. Learners
   3.3. Teachers
        3.3.1. Toward a theory of teaching systems
        3.3.2. Teaching strategies
   3.4. Curriculums
        3.4.1. Curriculum grammars and languages
        3.4.2. Curriculum space approach
   3.5. Pattern classification teaching as integer programming
   3.6. Pattern classification teaching as dynamic programming
   3.7. Bibliographical and other notes
4. EXTERNAL REINFORCEMENT LEARNING
   4.1. Reinforcement learning NG agents
   4.2. Associative Search Network (ASN)
        4.2.1. Basic ASN
        4.2.2. Reinforcement predictive ASN
   4.3. Actor-Critic architecture
   4.4. Bibliographical and other notes
5. SELF-REINFORCEMENT LEARNING
   5.1. Conceptual framework
   5.2. Self-reinforcement learning and the NG agents
   5.3. The Crossbar Adaptive Array architecture
   5.4. How it works
        5.4.1. Defining primary goals from the genetic environment
        5.4.2. Secondary reinforcement mechanism
        5.4.3. The CAA learning method
   5.5. Example of a CAA architecture
   5.6. Solving problems with a CAA architecture
        5.6.1. Learning in emotional graphs: Maze running
        5.6.2. Learning in loosely defined emotional graphs: Pole balancing
   5.7. Another example of a CAA architecture
   5.8. Using entropy in Markov Decision Processes
   5.9. Issues on the genetic environment
        5.9.1. CAA architecture as an optimization architecture
        5.9.2.
Complementarity with the Genetic Algorithms
        5.9.3. Self-reinforcement: Genetic environment approach
   5.10. Bibliographical and other notes
6. CONSEQUENCE PROGRAMMING
   6.1. Dynamic Programming and Markov Decision Problems
   6.2. Introducing cost in the CAA architecture
   6.3. Q-learning
   6.4. A taxonomy of the CAA-method based learning algorithms
   6.5. Producing optimal solutions in a stochastic environment
   6.6. Distributed Consequence Programming: A neural theory
        6.6.1. Provoking units: Axon provoked neurons
        6.6.2. An illustration: Routing in client-server networks with faults
   6.7. Bibliographical and other notes
7. SUMMARY
8. REFERENCES
9. INDEX
*********************************************************************