From ling at csd.uwo.ca Tue Feb 1 03:37:10 1994 From: ling at csd.uwo.ca (Charles X. Ling) Date: Tue, 1 Feb 94 03:37:10 EST Subject: some questions on training neural nets... Message-ID: <9402010837.AA01695@godel.csd.uwo.ca> Hi neural net experts, I am using backprop (and variations of it) quite often although I have not followed neural net (NN) research as well as I wanted. Some rather basic issues in training NN still puzzle me a lot, and I hope to get advice and help from the experts in the area. Sorry for being ignorant. Say we are learning a function F (such as a Boolean function of n vars). The training set (TR) and testing set (TS) are drawn randomly according to the same probability distribution, with no noise added in. 1. Is it true that, since there is no noise, the smaller the training error on TR, the better it would predict in general on TS? That is, stopping training earlier is not needed (so cross-validation is not needed). 2. Is it true that, to get reliable prediction (good or bad), we should always choose net architecture with a minimum number of hidden units (or weights via weight decaying)? Will cross-validation help if we have too much freedom in the net (could results on the validation set be coincident)? 3. If, for some reason, cross-validation is needed, and TR is split to TR1 (for training) and TR2 (for validation), what would be the proper ways to do cross-validation? Training on TR1 uses only partial information in TR, but training TR1 to find right parameters and then training on TR1+TR2 may require parameters different from the estimation of training TR1. 4. In case the net has too much freedom (even different random seeds produce very different predictive accuracies), how can we effectively reduce the variations? Weight decaying seems to be a powerful tool, any others? What kind of "simple" functions weight decaying is biased to? Thanks very much for help Charles From marwan at sedal.sedal.su.OZ.AU Tue Feb 1 21:07:09 1994 From: marwan at sedal.sedal.su.OZ.AU (Marwan Jabri) Date: Tue, 1 Feb 94 21:07:09 EST Subject: job openning Message-ID: <9402011007.AA09253@sedal.sedal.su.OZ.AU> The advertisment below could be of interest to a person with Unix and connectionism skills. --------------------------------------------------------------------- Systems Engineering and Design Automation Laboratory Sydney University Electrical Engineering Computer Systems Officer (in other words, a software engineer!) Reference No: B04/17 Applications are invited for the position of Computer Systems Officer with the Systems Engineering and Design Automation Laboratory (SEDAL) at Sydney University Electrical Engineering. The position is aimed at: - Supporting the administration of a computer network (Sun and DEC workstations); - Developing software in the areas of neural computing, video coding and parallel computers. The appointee must have knowledge and experience of C programming under Unix, DOS and Windows, and a degree in electronics or computer science. Experience in the areas of neural computing and/or video coding is highly desirable. Appointment will be for one year in the first instance with the possibility of renewal for up to a further four years subject to need and funding. Further information from Marwan Jabri on (+61-2) 692 2240, fax (+61-2) 660 1228 or email: marwan at sedal.su.oz.au. 
Salary: Level 5 $28,899 - $32,598 per annum Closing: 10 February 1994 To apply, an application quoting reference number, including CV, qualifications and the names, addresses, phone numbers and email addresses of two referees should be sent to Personnel Officier Personnel Services K07 The University of Sydney NSW 2006 Australia ---------------------------------------------------------------------- Equal opportunity and no smoking in the workplace are University Policies. The University Resevers the right not to proceed with an appointment for financial or other reasons. From prechelt at ira.uka.de Tue Feb 1 09:08:12 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 01 Feb 1994 15:08:12 +0100 Subject: donations to bibliography server Message-ID: <"irafs2.ira.632:01.01.94.14.08.31"@ira.uka.de> A colleague of mine here at University of Karlsruhe is currently building a large bibliographic database that is available free of charge on the internet. It currently contains about 210000 entries from various fields of computer science (mostly parallel processing, graphics, theoretical computer science, computational geometry, human computer interaction) Although there are several thousand entries on Artificial Intelligence topics, connectionism is not covered very well yet (Neural Computation's contents are present and some personal bibliographies). To extend this database by at least some basic information about neural network and other connectionist research, it would be fine if somebody could donate bibliographies on these topics which are (almost) comprehensive in some respect. In particular, I think it would be a very good start to have complete contents of NIPS, IJCNN, and Neural Networks (and perhaps, other journals such as Complex Systems). If anybody is able and willing to donate such bibliographies, please send me email. BibTeX format would be best, but refer or other parsable formats are OK, too. For information on the bibliography service, send mail with a single line containing the word 'help' in the body to bibserv at ira.uka.de [ The query service is still in a test stage and is not yet available to people located outside email domain '.de' (Germany) due to resource restrictions. The bibliographies themselves, however, are available for anonymous ftp from ftp.ira.uka.de:/pub/bibliography ] Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; 76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From schraudo at salk.edu Tue Feb 1 03:04:05 1994 From: schraudo at salk.edu (Nici Schraudolph) Date: Tue, 1 Feb 94 00:04:05 PST Subject: Neural Computation BibTeX database available Message-ID: <9402010804.AA02809@salk.edu> I've made a database of BibTeX entries for all articles published in the first five volumes of the journal Neural Computation; it's available by anonymous ftp from mitpress.mit.edu (18.173.0.28), file NC.bib.Z in the pub/NeuralComp directory. Share and enjoy, - Nici Schraudolph. 
From prechelt at ira.uka.de Wed Feb 2 04:12:48 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 10:12:48 +0100 Subject: Techreport on CuPit available Message-ID: <"irafs2.ira.960:02.01.94.09.13.09"@ira.uka.de> The technical report Lutz Prechelt: "CuPit --- A Parallel Language for Neural Algorithms: Language Reference and Tutorial" is now available for anonymous ftp from ftp.ira.uka.de /pub/uni-karlsruhe/papers/cupit.ps.gz (154 Kb, 75 pages) It is NOT on neuroprose, because its topic does not quite fit into neuroprose's scope. Abstract: ---------- CuPit is a parallel programming language with two main design goals: 1. to allow the simple, problem-adequate formulation of learning algorithms for neural networks with focus on algorithms that change the topology of the underlying neural network during the learning process and 2. to allow the generation of efficient code for massively parallel machines from a completely machine-independent program description, in particular to maximize both data locality and load balancing even for irregular neural networks. The idea to achieve these goals lies in the programming model: CuPit programs are object-centered, with connections and nodes of a graph (which is the neural network) being the objects. Algorithms are based on parallel local computations in the nodes and connections and communication along the connections (plus broadcast and reduction operations). This report describes the design considerations and the resulting language definition and discusses in detail a tutorial example program. ---------- Remember to use 'binary' mode for ftp. To uncompress the Postscript file, you need to have the GNU gzip utility. Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From prechelt at ira.uka.de Wed Feb 2 03:58:56 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 09:58:56 +0100 Subject: Encoding missing values Message-ID: <"irafs2.ira.708:02.01.94.08.59.37"@ira.uka.de> I am currently thinking about the problem of how to encode data with attributes for which some of the values are missing in the data set for neural network training and use. An example of such data is the 'heart-disease' dataset from the UCI machine learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory "/pub/machine-learning-databases"). There are 920 records altogether with 14 attributes each. Only 299 of the records are complete, the others have one or several missing attribute values. 11% of all values are missing. I consider only networks that handle arbitrary numbers of real-valued inputs here (e.g. all backpropagation-suited network types etc). I do NOT consider missing output values. In this setting, I can think of several ways how to encode such missing values that might be reasonable and depend on the kind of attribute and how it was encoded in the first place: 1. Nominal attributes (that have n different possible values) 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one being 1 all others 0. This encoding is very general, but has the disadvantage of producing networks with very many connections. Missing values can either be represented as 'all zero' or by simply treating 'is missing' as just another possible input value, resulting in a "1-of-(n+1)" encoding. 
1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a binary representation of the numbers 0...n-1 (or 1...n). Missing values can either be represented as just another possible input value (probably all-bits-zero is best) or by adding an additional network input which is 1 for 'is missing' and 0 for 'is present'. The original inputs should probably be all zero in the 'is missing' case. 2. continuous attributes (or attributes treated as continuous) 2.1 encoded as a single network input, perhaps using some monotone transformation to force the values into a certain distribution. Missing values are either encoded as a kind of 'best guess' (e.g. the average of the non-missing values for this attribute) or by using an additional network input being 0 for 'missing' and 1 for 'present' (or vice versa) and setting the original attribute input either to 0 or to the 'best guess'. (The 'best guess' variant also applies to nominal attributes above) 3. binary attributes (truth values) 3.1 encoded by one input: 0=false 1=true or vice versa Treat like (2.1) 3.2 encoded by one input: -1=false 1=true or vice versa In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. 3.3 treat like nominal attribute with 2 possible values 4. ordinal attributes (having n different possible values, which are ordered) 4.1 treat either like continuous or like nominal attribute. If (1.2) is chosen, a Gray-Code should be used. Continuous representation is risky unless a 'sensible' quantification of the possible values is available. So far to my considerations. Now to my questions. a) Can you think of other encoding methods that seem reasonable ? Which ? b) Do you have experience with some of these methods that is worth sharing ? c) Have you compared any of the alternatives directly ? Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; 76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From marshall at cs.unc.edu Wed Feb 2 12:41:49 1994 From: marshall at cs.unc.edu (Jonathan A. Marshall) Date: Wed, 2 Feb 94 12:41:49 -0500 Subject: Papers on visual occlusion and neural networks Message-ID: <9402021741.AA17887@marshall.cs.unc.edu> Dear Colleagues, Below I list two new papers that I have added to the Neuroprose archives (thanks to Jordan Pollack!). In addition, I list two of my older papers in Neuroprose. You can retrieve a copy of these papers -- follow the instructions at the end of this message. --Jonathan ---------------------------------------------------------------------------- marshall.occlusion.ps.Z (5 pages) A SELF-ORGANIZING NEURAL NETWORK THAT LEARNS TO DETECT AND REPRESENT VISUAL DEPTH FROM OCCLUSION EVENTS JONATHAN A. MARSHALL and RICHARD K. ALLEY Department of Computer Science, CB 3175, Sitterson Hall University of North Carolina, Chapel Hill, NC 27599-3175, U.S.A. marshall at cs.unc.edu, alley at cs.unc.edu Visual occlusion events constitute a major source of depth information. We have developed a neural network model that learns to detect and represent depth relations, after a period of exposure to motion sequences containing occlusion and disocclusion events. The network's learning is governed by a new set of learning and activation rules. The network develops two parallel opponent channels or "chains" of lateral excitatory connections for every resolvable motion trajectory. 
One channel, the "On" chain or "visible" chain, is activated when a moving stimulus is visible. The other channel, the "Off" chain or "invisible" chain, is activated when a formerly visible stimulus becomes invisible due to occlusion. The On chain carries a predictive modal representation of the visible stimulus. The Off chain carries a persistent, amodal representation that predicts the motion of the invisible stimulus. The new learning rule uses disinhibitory signals emitted from the On chain to trigger learning in the Off chain. The Off chain neurons learn to interact reciprocally with other neurons that indicate the presence of occluders. The interactions let the network predict the disappearance and reappearance of stimuli moving behind occluders, and they let the unexpected disappearance or appearance of stimuli excite the representation of an inferred occluder at that location. Two results that have emerged from this research suggest how visual systems may learn to represent visual depth information. First, a visual system can learn a nonmetric representation of the depth relations arising from occlusion events. Second, parallel opponent On and Off channels that represent both modal and amodal stimuli can also be learned through the same process. [In Bowyer KW & Hall L (Eds.), Proceedings of the AAAI Fall Symposium on Machine Learning and Computer Vision, Research Triangle Park, NC, October 1993, 70-74.] ---------------------------------------------------------------------------- marshall.context.ps.Z (46 pages) ADAPTIVE PERCEPTUAL PATTERN RECOGNITION BY SELF-ORGANIZING NEURAL NETWORKS: CONTEXT, UNCERTAINTY, MULTIPLICITY, AND SCALE JONATHAN A. MARSHALL Department of Computer Science, CB 3175, Sitterson Hall University of North Carolina, Chapel Hill, NC 27599-3175, U.S.A. marshall at cs.unc.edu A new context-sensitive neural network, called an "EXIN" (excitatory+ inhibitory) network, is described. EXIN networks self-organize in complex perceptual environments, in the presence of multiple superimposed patterns, multiple scales, and uncertainty. The networks use a new inhibitory learning rule, in addition to an excitatory learning rule, to allow superposition of multiple simultaneous neural activations (multiple winners), under strictly regulated circumstances, instead of forcing winner-take-all pattern classifications. The multiple activations represent uncertainty or multiplicity in perception and pattern recognition. Perceptual scission (breaking of linkages) between independent category groupings thus arises and allows effective global context-sensitive segmentation and constraint satisfaction. A Weber Law neuron-growth rule lets the network learn and classify input patterns despite variations in their spatial scale. Applications of the new techniques include segmentation of superimposed auditory or biosonar signals, segmentation of visual regions, and representation of visual transparency. [Submitted for publication.] ---------------------------------------------------------------------------- marshall.steering.ps.Z (16 pages) CHALLENGES OF VISION THEORY: SELF-ORGANIZATION OF NEURAL MECHANISMS FOR STABLE STEERING OF OBJECT-GROUPING DATA IN VISUAL MOTION PERCEPTION JONATHAN A. MARSHALL [Invited paper, in Chen S-S (Ed.), Stochastic and Neural Methods in Signal Processing, Image Processing, and Computer Vision, Proceedings of the SPIE 1569, San Diego, July 1991, 200-215.] 
---------------------------------------------------------------------------- martin.unsmearing.ps.Z (8 pages) UNSMEARING VISUAL MOTION: DEVELOPMENT OF LONG-RANGE HORIZONTAL INTRINSIC CONNECTIONS KEVIN E. MARTIN and JONATHAN A. MARSHALL [In Hanson SJ, Cowan JD, & Giles CL, Eds., Advances in Neural Information Processing Systems, 5. San Mateo, CA: Morgan Kaufmann Publishers, 1993, 417-424.] ---------------------------------------------------------------------------- RETRIEVAL INSTRUCTIONS % ftp archive.cis.ohio-state.edu Name (cheops.cis.ohio-state.edu:yourname): anonymous Password: (use your email address) ftp> cd pub/neuroprose ftp> binary ftp> get marshall.occlusion.ps.Z ftp> get marshall.context.ps.Z ftp> get marshall.steering.ps.Z ftp> get martin.unsmearing.ps.Z ftp> quit % uncompress marshall.occlusion.ps.Z ; lpr marshall.occlusion.ps % uncompress marshall.context.ps.Z ; lpr marshall.context.ps % uncompress marshall.steering.ps.Z ; lpr marshall.steering.ps % uncompress martin.unsmearing.ps.Z ; lpr martin.unsmearing.ps From tgd at chert.CS.ORST.EDU Wed Feb 2 13:02:30 1994 From: tgd at chert.CS.ORST.EDU (Tom Dietterich) Date: Wed, 2 Feb 94 10:02:30 PST Subject: some questions on training neural nets... In-Reply-To: "Charles X. Ling"'s message of Tue, 1 Feb 94 03:37:10 EST <9402010837.AA01695@godel.csd.uwo.ca> Message-ID: <9402021802.AA00565@curie.CS.ORST.EDU> From: "Charles X. Ling" Date: Tue, 1 Feb 94 03:37:10 EST Hi neural net experts, I am using backprop (and variations of it) quite often although I have not followed neural net (NN) research as well as I wanted. Some rather basic issues in training NN still puzzle me a lot, and I hope to get advice and help from the experts in the area. Sorry for being ignorant. Say we are learning a function F (such as a Boolean function of n vars). The training set (TR) and testing set (TS) are drawn randomly according to the same probability distribution, with no noise added in. 1. Is it true that, since there is no noise, the smaller the training error on TR, the better it would predict in general on TS? That is, stopping training earlier is not needed (so cross-validation is not needed). No, this is not true. Even in the noise-free case, the bias/variance tradeoff is operating and it is possible to overfit the training data. Consider for example an algorithm that just memorized the training set and guessed "false" on all unseen examples. It has obviously overfit, and it will obviously do poorly even in the absence of noise. 2. Is it true that, to get reliable prediction (good or bad), we should always choose net architecture with a minimum number of hidden units (or weights via weight decaying)? Will cross-validation help if we have too much freedom in the net (could results on the validation set be coincident)? There are many ways to manage the bias/variance tradeoff. I would say that there is nothing approaching complete agreement on the best approaches (and more fundamentally, the best approach varies from one application to another, since this is really a form of prior). The approaches can be summarized as * early stopping * error function penalties * size optimization - growing - pruning - other Early stopping usually employs cross-validation to decide when to stop training. (see below). In my experience, training an overlarge network with early stopping gives better performance than trying to find the minimum network size. It has the disadvantage that training costs are very high. 
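To make the early-stopping recipe concrete, here is a rough sketch of the procedure in Python: train an overlarge net on a training part TR1 while monitoring error on a held-out part TR2, and keep the weights that did best on TR2. The tiny NumPy network, the synthetic noise-free Boolean task, and all hyperparameters below are illustrative assumptions, not anyone's actual setup.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic noise-free task: majority vote over 8 Boolean inputs.
X = rng.integers(0, 2, size=(300, 8)).astype(float)
y = (X.sum(axis=1) > 4).astype(float).reshape(-1, 1)

# Split the available data into TR1 (training) and TR2 (validation).
X1, y1 = X[:200], y[:200]
X2, y2 = X[200:], y[200:]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Deliberately overlarge hidden layer; early stopping limits its effective capacity.
n_in, n_hid = 8, 20
W1 = rng.normal(0, 0.1, (n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.1, (n_hid, 1));    b2 = np.zeros(1)

def forward(X):
    h = sigmoid(X @ W1 + b1)
    return h, sigmoid(h @ W2 + b2)

def mse(X, y):
    return float(np.mean((forward(X)[1] - y) ** 2))

best_val, best_weights, lr = np.inf, None, 0.5
for epoch in range(2000):
    h, out = forward(X1)
    # Batch backprop for squared error with sigmoid units.
    d_out = (out - y1) * out * (1 - out)
    d_hid = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out / len(X1); b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X1.T @ d_hid / len(X1); b1 -= lr * d_hid.mean(axis=0)
    val = mse(X2, y2)
    if val < best_val:                 # keep the weights that did best on TR2
        best_val = val
        best_weights = (W1.copy(), b1.copy(), W2.copy(), b2.copy())

W1, b1, W2, b2 = best_weights
print("TR1 error %.4f, TR2 error %.4f" % (mse(X1, y1), best_val))

The reason for keeping the best-so-far weights rather than the final ones is that the TR2 error typically reaches a minimum and then creeps back up as the oversized net starts to overfit TR1.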
Error function penalties such as weight decay and soft weight-sharing have been very effective in some applications. In my experience, they introduce additional training problems, because the error surface can develop more local minima. A solution to this is to gradually increase the penalties during training, but this requires more hands-on work than I have patience for. Size optimization attempts to find the optimal number of units and/or number of weights. Cascade-correlation and related algorithms grow the network, optimal brain damage and optimal brain surgeon prune the network, and then of course one can use cross-validation and just generate-and-test different network sizes. An advantage of "right-sizing" is that training time can be considerably reduced (at least the time per epoch). A problem with right-sizing, I believe, is that simply counting units or weights is not necessarily a good measure of network size. The work by Weigend (see 1993 summer school proceedings) suggests that early stopping provides a better method for modulating the effective number of parameters in the network. The OBD/OBS methods do not "just count weights", but instead assess the significance of the weights, so even non-zero weights that are useless can be removed. 3. If, for some reason, cross-validation is needed, and TR is split to TR1 (for training) and TR2 (for validation), what would be the proper ways to do cross-validation? Training on TR1 uses only partial information in TR, but training TR1 to find right parameters and then training on TR1+TR2 may require parameters different from the estimation of training TR1. I use the TR1+TR2 approach. On large data sets, this works well. On small data sets, the cross-validation estimates themselves are very noisy, so I have not found it to be as successful. I compute the stopping point using the sum squared error per training example, so that it scales. I think it is an open research problem to know whether this is the right thing to do. On a large speech recognition data set, after doing cross-validation training, we later checked to see if we had stopped at the right point (by monitoring using the test set). The cross-validation point was nearly exactly right. This was a case with a large data set. 4. In case the net has too much freedom (even different random seeds produce very different predictive accuracies), how can we effectively reduce the variations? Weight decaying seems to be a powerful tool, any others? What kind of "simple" functions weight decaying is biased to? Thanks very much for help Charles --Tom From karun at faline.bellcore.com Thu Feb 3 10:15:55 1994 From: karun at faline.bellcore.com (N. Karunanithi) Date: Thu, 3 Feb 1994 10:15:55 -0500 Subject: Encoding missing values Message-ID: <199402031515.KAA29100@faline.bellcore.com> > I am currently thinking about the problem of how to encode data with > a ttributes for which some of the values are missing in the data set for > neural network training and use. I am also having the same problem. I would like to get a copy responses. >1. Nominal attributes (that have n different possible values) > 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one > being 1 all others 0. > This encoding is very general, but has the disadvantage of producing > networks with very many connections. > Missing values can either be represented as 'all zero' or by simply > treating 'is missing' as just another possible input value, resulting > in a "1-of-(n+1)" encoding. 
> 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a > binary representation of the numbers 0...n-1 (or 1...n). > Missing values can either be represented as just another possible input > value (probably all-bits-zero is best) or by adding an additional network > input which is 1 for 'is missing' and 0 for 'is present'. The original > inputs should probably be all zero in the 'is missing' case. > Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. >2. continuous attributes (or attributes treated as continuous) > 2.1 encoded as a single network input, perhaps using some monotone transformation > to force the values into a certain distribution. > Missing values are either encoded as a kind of 'best guess' (e.g. the > average of the non-missing values for this attribute) or by using > an additional network input being 0 for 'missing' and 1 for 'present' > (or vice versa) and setting the original attribute input either to 0 > or to the 'best guess'. (The 'best guess' variant also applies to > nominal attributes above) This representation requires a GUESS. A nominal transformation may not be a proper representation in some cases. Assume that the output values range over a large numerical interval. For example, from 0.0 to 10,000.0. If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then in theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had a lot of problems with this approach. Maybe a log transformation before scaling would not be a bad choice. If you use a closed scaling you may have problems whenever a future value exceeds the maximum value of the numerical interval. For example, assume that the attribute is time, say in milliseconds. Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. > 3. binary attributes (truth values) > 3.1 encoded by one input: 0=false 1=true or vice versa > Treat like (2.1) > 3.2 encoded by one input: -1=false 1=true or vice versa > In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. > 3.3 treat like nominal attribute with 2 possible values No comments. > 4. ordinal attributes (having n different possible values, which are ordered) > 4.1 treat either like continuous or like nominal attribute. > If (1.2) is chosen, a Gray-Code should be used. > Continuous representation is risky unless a 'sensible' quantification > of the possible values is available. I have compared Binary Encoding (1.2), Gray-Coded representation and straightforward scaling. Closed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find a significant improvement in prediction accuracy. (Refer to: N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp 563-574. N. Karunanithi and Y. K.
Malaiya, "The Scaling Problem in Neural Networks for Software Reliability Prediction", Proc. IEEE Int. Symposium on Rel. Eng., Oct. 1992, pp. 776-82. ) > So far to my considerations. Now to my questions. > > a) Can you think of other encoding methods that seem reasonable ? Which ? > > b) Do you have experience with some of these methods that is worth sharing ? > > c) Have you compared any of the alternatives directly ? > > Lutz I have not found a simple solution that is general. I think representation in general and the missing information in specific are open problems within connectionist research. I am not sure we will have a magic bullet for all problems. The best approach is to come up with a specific solution for a given problem. -Karun From Thierry.Denoeux at hds.univ-compiegne.fr Thu Feb 3 03:36:47 1994 From: Thierry.Denoeux at hds.univ-compiegne.fr (Thierry.Denoeux@hds.univ-compiegne.fr) Date: Thu, 3 Feb 1994 09:36:47 +0100 Subject: Encoding missing values Message-ID: <199402030836.AA29123@kaa.hds.univ-compiegne.fr> Dear Lutz, dear connectionists, In a recent mailing, Lutz Prechelt mentioned the interesting problem of how to encode attributes with missing values as inputs to a neural network. I have recently been faced to that problem while applying neural nets to rainfall prediction using weather radar images. The problem was to classify pairs of "echoes" -- defined as groups of connected pixels with reflectivity above some threshold -- taken from successive images as corresponding to the same rain cell or not. Each pair of echoes was discribed by a list of attributes. Some of these attributes, refering to the past of a sequence, were not defined for some instances. To encode these attributes with potentially missing values, we applied two different methods actually suggested by Lutz: - the replacement of the missing value by a "best-guess" value - the addition of a binary input indicating whether the corresponding attribute was present or absent. Significantly better results were obtained by the second method. This work was presented at ICANN'93 last september: X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, p. 962-967. Thierry Denoeux +------------------------------------------------------------------------+ | tdenoeux at hds.univ-compiegne.fr Thierry DENOEUX | | Departement de Genie Informatique | | Centre de Recherches de Royallieu | | tel (+33) 44 23 44 96 Universite de Technologie de Compiegne | | fax (+33) 44 23 44 77 B.P. 649 | | 60206 COMPIEGNE CEDEX | | France | +------------------------------------------------------------------------+ From rreilly at nova.ucd.ie Thu Feb 3 10:38:08 1994 From: rreilly at nova.ucd.ie (Ronan Reilly) Date: Thu, 3 Feb 1994 15:38:08 +0000 Subject: Fourth Irish Neural Networks Conference - INNC'94 Message-ID: FOURTH IRISH NEURAL NETWORK CONFERENCE - INNC'94 University College Dublin, Ireland September 12-13, 1994 FIRST CALL FOR PAPERS Papers are solicited for the Fourth Irish Neural Network Conference (INNC'94). They can be in any area of theoretical or applied neural networks. A non-exhaustive list of topic headings include: Learning algorithms Cognitive modelling Neurobiology Natural language processing Vision Signal processing Time series analysis Hardware implementations An extended abstract of not more than 500 words should be sent, preferably by e-mail, to: Ronan Reilly - INNC'94 Dept. 
of Computer Science University College Dublin Belfield Dublin 4 IRELAND e-mail: rreilly at nova.ucd.ie The deadline for receipt of abstracts is March 31, 1994. Authors will be contacted regarding acceptance by April 30, 1994. Full papers will be required by August 31, 1994. From finnoff at predict.com Thu Feb 3 11:40:51 1994 From: finnoff at predict.com (William Finnoff) Date: Thu, 3 Feb 94 09:40:51 MST Subject: some questions on training neural nets... Message-ID: <9402031640.AA01243@predict.com> Charles X. Ling writes: > Hi neural net experts, > > I am using backprop (and variations of it) quite often although I have > not followed neural net (NN) research as well as I wanted. Some rather > basic issues in training NN still puzzle me a lot, and I hope to get advice > and help from the experts in the area. Sorry for being ignorant.... In addition to Tom's pertinent comments, (tgd at chert.cs.orst.edu, Thu Feb 3) I would suggest consulting the following references which contain discussions of various issues pretaining to /model selection/overfitting/stopped training/ complexity control/bias variance dilema. (This list is by no means complete). References 2), 4), 13), 15) and 17) are particularly relevant to the questions raised. 1) Baldi, P. and Chauvin, Y. (1991). Temporal evolution of generalization during learning in linear networks, {\it Neural Computation} 3, 589-603. 2) Finnoff, W., Hergert, F. and Zimmermann, H.G., Improving generalization performance by nonconvergent model selection methods, {\it Neural Networks}, vol.6, nr.6, pp. 771-783, 1993. 3) Finnoff, W. and Zimmermann, H.G. (1991). Detecting structure in small datasets by network fitting under complexity constraints. To appear in {\it Proc. of 2nd Ann. Workshop on Computational Learning Theory and Natural Learning Systems}, Berkley. 4) Geman, S., Bienenstock, E. and Doursat R., (1992). Neural networks and the bias/variance dilemma, {\it Neural Computation} 4, 1-58. 5) Guyon, I., Vapnik, V., Boser, B., Bottou, L. and Solla, S. (1992). Structural risk minimization for character recognition. In J. Moody, J. Hanson and R. Lippmann (Eds.), {\it Advances in Neural Information Processing Systems IV} (pp. 471-479). San Mateo: Morgan Kaufman. 6) Hanson, S. J., and Pratt, L. Y. (1989). Comparing biases for minimal network construction with back-propagation, In D. S. Touretzky, (Ed.), {\it Advances in Neural Information Processing I} (pp.177-185). San Mateo: Morgan Kaufman. 7) Hergert, F., Finnoff, W. and Zimmermann, H.G. (1992). A comparison of weight elimination methods for reducing complexity in neural networks. {\it Proc. Int. Joint Conf. on Neural Networks}, Baltimore. 8) Hergert, F., Zimmermann, H.G., Kramer, U., and Finnoff, W. (1992). Domain independent testing and performance comparisons for neural networks. In I. Aleksander and J. Taylor (Eds.) {\it Artificial Neural Networks II} (pp.1071-1076). London: North Holland. 9) Le Cun, Y., Denker J. and Solla, S. (1990). Optimal Brain Damage. In D. Touretzky (Ed.) {\it Advances in Neural Information Processing Systems II} (pp.598-605). San Mateo: Morgan Kaufman. 10) MacKay, D. (1991). {\it Bayesian Modelling and Neural Networks}, Dissertation, Computational and Neural Systems, California Inst. of Tech. 139-74, Pasadena. 11) Moody, J. (1992). Generalization, weight decay and architecture selection for nonlinear learning systems. In J. Moody, J. Hanson and R. Lippmann (Eds.), {\it Advances in Neural Information Processing Systems IV} (pp. 471-479). San Mateo: Morgan Kaufman. 
12) Morgan, N. and Bourlard, H. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. In D. Touretzky (Ed.) {\it Advances in Neural Information Processing Systems II} (pp.598-605). San Mateo: Morgan Kaufman. 13) Sj\"oberg, J. and Ljung, L. (1992). Overtraining, regularization and searching for minimum in neural networks, {Report LiTH-ISY-I-1297, Dep. of Electrical Engineering}, Link\"oping University, S-581 83 Link\"oping, Sweden. 14) Stone, C.J. (1977). Cross-validation: A review. {\it Math. Operations res. Statist. Ser.}, 9, 1-51. 15) Vapnik, V. (1992). Principles of risk minimization for learning theory. In J. Moody, J. Hanson and R. Lippmann (Eds.), {\it Advances in Neural Information Processing Systems IV} (pp. 831-838 ). San Mateo: Morgan Kaufman. 16) Weigend, A. and Rumelhart, D. (1991). The effective dimension of the space of hidden units, in {\it Proc. Int. Joint Conf. on Neural Networks}, Singapore. 17) Weigend, A., Rumelhart, D., and Huberman, B. (1991). Generalization by weight elimination with application to forecasting. In R. Lippman, J. Moody and D. Touretzy (Eds.), {\it Advances in Neural Information Processing III} (pp.875-882). San Mateo: Morgan Kaufman. 18) White, H. (1989). Learning in artificial neural networks: A statistical perspective, {\it Neural Computation} 1, 425-464. -William %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% William Finnoff Prediction Co. 320 Aztec St., Suite B Santa Fe, NM, 87501, USA Tel.: (505)-984-3123 Fax: (505)-983-0571 e-mail: finnoff at predict.com From jlm at crab.psy.cmu.edu Thu Feb 3 11:27:41 1994 From: jlm at crab.psy.cmu.edu (James L. McClelland) Date: Thu, 3 Feb 94 11:27:41 EST Subject: CMU-Pitt Center for the Neural Basis of Cognition Message-ID: <9402031627.AA08304@crab.psy.cmu.edu.psy.cmu.edu> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Carnegie Mellon University and the University of Pittsburgh Announce the Creation of the Center for the Neural Basis of Cognition The Center is dedicated to the study of the neural basis of cognitive processes, including learning and memory, language and thought, perception, attention, and planning; to the study of the development of the neural substrate of these processes; to the study of disorders of these processes and their underlying neuropathology; and to the promotion of applications of the results of these studies to artificial intelligence, technology, and medicine. The Center will synthesize the disciplines of basic and clinical neuroscience, cognitive psychology, and computer science, combining neurobiological, behavioral, computa- tional and brain imaging methods. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Faculty Openings in the Center The Center seeks faculty and research scientists whose work relates to the mission stated above. Recruiting is beginning immediately, and will continue for several years. Appointments can be at any level and will be coordinated with one or more departments at either university. Coordinating departments include Biological Sciences, Computer Science, and Psychology at Carnegie Mellon and the departments of Behavioral Neuroscience, Neurobiology, Neurology, Psychiatry and Psychology at the University of Pittsburgh. Other affiliations may be possible. Candidates should send an application to either of the Co-Directors of the Center, listed below. 
The application should include a statement of interest indicating how the candidate's work fits the mission of the center and suggesting possible departmental affiliations, as well as a CV, copies of publications, and three letters of reference. Both uni- versities are EEO/AA Employers. James L. McClelland Robert Y. Moore Department of Psychology Center for Neuroscience Baker Hall 345-F Biomedical Science Tower 1656 Carnegie Mellon University University of Pittsburgh Pittsburgh, PA 15213 Pittsburgh, PA 15261 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ From wahba at stat.wisc.edu Thu Feb 3 20:42:28 1994 From: wahba at stat.wisc.edu (Grace Wahba) Date: Thu, 3 Feb 94 19:42:28 -0600 Subject: nips6 paper on ss-anova in archive Message-ID: <9402040142.AA06981@hera.stat.wisc.edu> Dear Colleagues Our paper for the 1993 Neural Information Processing Society (NIPS) Proceedings is in the neuroprose archive under wahba.nips6.ps.Z Title: Structured Machine Learning For `Soft' Classification with Smoothing Spline ANOVA and Stacked Tuning, Testing and Evaluation. Authors: G. Wahba, Y. Wang, C. Gu, R. Klein and B. Klein Summary We describe the use of smoothing spline analysis of variance (SS-ANOVA) in the penalized log likelihood context, for learning (estimating) the probability $p$ of a `$1$' outcome, given a training set with attribute vectors and 0-1 outcomes. $p$ is of the form $p(t) = e^{f(t)}/(1+e^{f(t)})$, where, if $t$ is a vector of attributes, $f$ is learned as a sum of smooth functions of one attribute plus a sum of smooth functions of two attributes, etc. The smoothing parameters governing $f$ are obtained by an iterative unbiased risk or iterative GCV method. Confidence intervals for these estimates are available. The method is applied to estimate the risk of progression of diabetic retinopathy given predictor variables of age, body mass index and glycosylated hemoglobin. RETRIEVAL INSTRUCTIONS for NEUROPROSE ARCHIVE % ftp archive.cis.ohio-state.edu Name (cheops.cis.ohio-state.edu:yourname): anonymous Password: (use your email address) ftp> cd pub/neuroprose ftp> binary ftp> get wahba.nips6.ps.Z ftp> quit % uncompress wahba.nips6.ps.Z % lpr wahba.nips6.ps Some other papers of yours truly, friends and students, and an idiosyncratic bibliography of possible interest to connectionists are available by ftp. Get the (ascii) file Contents to see what's there. RETRIEVAL INSTRUCTIONS for WAHBA's public directory % ftp ftp.stat.wisc.edu Name (ftp.stat.wisc.edu:yournamehere): anonymous Password: (use your email address) ftp> binary ftp> cd pub/wahba ftp> get Contents ... read Contents and retrieve files of interest From pollack at cis.ohio-state.edu Thu Feb 3 17:17:14 1994 From: pollack at cis.ohio-state.edu (Jordan B Pollack) Date: Thu, 3 Feb 1994 17:17:14 -0500 Subject: new neuroprose/Thesis subdirectory Message-ID: <199402032217.RAA01292@dendrite.cis.ohio-state.edu> *** do not forward ** The filesystem on which neuroprose resides has overflowed. A set of very large files (all the files with *thesis* in their filename), have been moved to a new subdirectory. jordan From bill at nsma.arizona.edu Thu Feb 3 23:53:26 1994 From: bill at nsma.arizona.edu (Bill Skaggs) Date: Thu, 03 Feb 1994 21:53:26 -0700 (MST) Subject: Encoding missing values Message-ID: <9402040453.AA24599@nsma.arizona.edu> There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. 
You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. -- Bill From tal at goshawk.lanl.gov Fri Feb 4 10:22:12 1994 From: tal at goshawk.lanl.gov (Tal Grossman) Date: Fri, 4 Feb 1994 08:22:12 -0700 Subject: some questions on training neural nets... Message-ID: <199402041522.IAA22945@goshawk.lanl.gov> Dear Charles X. Ling, You say: "Some rather basic issues in training NN still puzzle me a lot, and I hope to get advice and help from the experts in the area." Well... the questions you have asked still puzzle the experts as well, and good answers, where they exist, are very much case dependent. As Tom Dietterich wrote, in general "Even in the noise-free case, the bias/variance tradeoff is operating and it is possible to overfit the training data", therefore you can not expect just any large net to generalize well. It was also observed recently that... When having a large enough set of examples (so one can have a good enough sample for the training and the validation set), you can obtain better generalization with larger nets by using cross validation to decide when to stop training, as is demonstrated in the paper of A. Weigend : Weigend A.S. (1994), in the {\em Proc. of the 1993 Connectionist Models Summer School}, edited by M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman and A.S. Weigend, pp. 335-342 (Erlbaum Associates, Hillsdale NJ, 1994). Rich Caruana has presented similar results in the "Complexity Issues" workshop in the last NIPS post-conference. But... Larger networks can generalize as good as, or even better than small networks even without cross-validation. A simple experiment that demonstrates that was presented in : T. Grossman, R. Meir and E. Domany, Learning by choice of Internal Representations, Complex Systems 2, 555-575 (1988). In that experiment, networks with different number of hidden units were trained to perform the symmetry task by using a fraction of the possible examples as the training set, training the net to 100% performance on the TR set and testing the performance on the rest (off training set generalization). No early stopping, no cross validation. The symmetry problem can be solved by 2 hidden units - so this is the minimal architecture required for this specific function. However, it was found that it is NOT the best generalizing architecture. The generalization rates of all the architectures (H=2..N, the size of the input) were similar, with the larger networks somewhat better. Now, this is a special case. One can explain it by observing that the symmetry problem can also be solved by a network of N hidden units, with smaller weights, and not only by effectively "zeroing" the contributions of all but two units (see an example in Minsky and Papert's Perceptrons). Probably by all the other architectures as well. So, considering the mapping from weight space to function space, it is very likely that training a large network on partial data will take you closer (in function space) to your target function F (symmetry in that case) than training a small one. The picture can be different in other cases... 
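For readers who want to try this kind of comparison themselves, here is a minimal sketch in the same spirit: nets with different numbers of hidden units are trained on a random half of the symmetry patterns and scored on the unseen remainder. The use of scikit-learn's MLPClassifier (plain backprop), the input size and the 50/50 split are assumptions made for illustration only; the original experiment used the choice-of-internal-representations algorithm and trained to 100% on the training set.

import itertools
import numpy as np
from sklearn.neural_network import MLPClassifier

N = 8  # input size; the target is the symmetry predicate on N-bit strings
X = np.array(list(itertools.product([0, 1], repeat=N)), dtype=float)
y = np.array([int(all(row[i] == row[N - 1 - i] for i in range(N // 2))) for row in X])

rng = np.random.default_rng(0)
idx = rng.permutation(len(X))
train, test = idx[:len(X) // 2], idx[len(X) // 2:]   # off-training-set test

for n_hidden in (2, 4, 8):
    net = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=5000, random_state=0)
    net.fit(X[train], y[train])
    print(n_hidden, "hidden units: off-training-set accuracy %.2f" % net.score(X[test], y[test]))

Note that symmetric strings are rare, so raw accuracy is dominated by the majority class; the point of the sketch is only the experimental protocol.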
One has to remember that the training/generalization problem (including the bias/variance tradeoff problem) is, in general, a complex interaction between three entities: 1. The target function (or the task). 2. The learning model, and the class of functions that is realizable by this model (and its associated learning algorithm). 3. The training set, and how well it represents the task. Even the simple question: is my training set large enough (or good enough)? is not simple at all. One might think that it should be larger than, say, twice the number of free parameters (weights) in your model/network architecture. It turns out that not even this is enough in general. Allow me to advertise here the paper presented by A. Lapedes and myself at the last NIPS, where we present a method for testing a "general" classification algorithm (i.e. any classifier such as a neural net, a decision tree, etc. and its learning algorithm, which may include pruning or net construction), which we call the "noise sensitivity signature" (NSS; see abstract below). In addition to introducing this new model selection method, which we believe can be a good alternative to cross-validation in data-limited cases, we present the following experiment: the target function is a network with 20:5:1 architecture (weights chosen at random). The training set is provided by choosing M random input patterns and classifying them with the teacher net. We then train other nets with various architectures, ranging from 1 to 8 hidden units, on the training set (without controlled stopping, but with tolerance in the error function). A different (and large) set of classified examples is used to determine the generalization performance of the trained nets (averaged over several realizations with different initial weights). Some of the results are: 1. With different training set sizes M=400,700,1000, the optimal architecture is different. A smaller training set yields a smaller optimal network, according to the independent test set measure. 2. Even with M=1000 (much more than twice the number of weights), the optimal learning net is still smaller than the original teacher net. 3. There are differences of up to a few percent in generalization performance of the different learning nets for all training set sizes. In particular, nets that are larger than the optimal one do worse as their size increases. Depending on your problem, a few percent can be insignificant or can make a real difference. In some real applications, 1-2 % can be the difference between a contract and a paper... In such cases you would like to tune your model (i.e., to identify the optimal architecture) as best as you can. 4. Using the NSS it was possible to recognize the optimal architectures for each training set, without using extra data. Some conclusions are: 1. If one uses a validation set to choose the architecture (not for stopping) - for example by using the extra 1000 examples - then the architecture that will be picked when using the 700-example training set is going to be smaller (and worse) than the one picked when using the 1000-example training set. In other words, if your data is just 1000 examples and you devote 300 of them to be your validation set, then even if those 300 give a good estimate of the generalization of the trained net, when you choose the model according to this test set you end up with the optimal model for 700 training examples, which is less good than the optimal model you can obtain when training with all 1000 examples.
It means that in many cases you need more examples than one might expect in order to obtain a well tuned model. Especially if you are using a considerable fraction of it as a validation set. 2. Using NSS one would find the right architecture for the total number of examples you have - paying a factor of about 30 on training effort. 3. You can use "set 1 aside" cross validation in order to select your model. This will probably overcome the bias caused by giving up a large fraction of the examples. However, in order to obtain a reliable estimate of the performance the training process will have to be repeated many times, probably more than what is needed in order to calculate the NSS. It is important to emphasize again: The above results were obtained for that specific experiment. We have obtained similar results with different tasks (e.g. DNA structure classification) and with different learning machines (e.g. decision trees), but still, these results prove nothing "in general", except may be, that life is complicated and full of uncertainty... A more careful comparison with cross validation as a stopping method, and using NSS in other scenarios (like function fitting) is under investigation. If anyone is interested in using the NSS method in combination with pruning methods (e.g. to test the stopping criteria), I will be glad to help. I will be grateful for any other information/ref about similar experiments. I hope all the above did not add too much to your puzzlement. Good luck with your training, Tal ------------------------------------------------ The paper I mentioned above is: Learning Theory seminar: Thursday Feb.10. 15:15. CNLS Conference room. title: Use of Bad Training Data For Better Predictions. by : Tal Grossman and Alan Lapedes (Complex Systems group, LANL) Abstract: We present a method for calculating the ``noise sensitivity signature'' of a learning algorithm which is based on scrambling the output classes of various fractions of the training data. This signature can be used to indicate a good (or bad) match between the complexity of the classifier and the complexity of the data and hence to improve the predictive accuracy of a classification algorithm. Use of noise sensitivity signatures is distinctly different from other schemes to avoid overtraining, such as cross-validation, which uses only part of the training data, or various penalty functions, which are not data-adaptive. Noise sensitivity signature methods use all of the training data and are manifestly data-adaptive and non-parametric. They are well suited for situations with limited training data It is going to appear in the Proc. of NIPS 6. An expanded version of it will (hopefully) be placed in the neuroprose archive within a week or two. Until then I can send a ps file of it to the interested. From sef+ at cs.cmu.edu Fri Feb 4 10:25:51 1994 From: sef+ at cs.cmu.edu (Scott E. Fahlman) Date: Fri, 04 Feb 94 10:25:51 EST Subject: Encoding missing values In-Reply-To: Your message of Thu, 03 Feb 94 21:53:26 -0700. <9402040453.AA24599@nsma.arizona.edu> Message-ID: There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. 
When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. Good point, but perhaps in need of clarification for some readers: There are two ways of training a Boltzmann machine. In one (the original form), there is no distinction between input and output units. During training we alternate between an instruction phase, in which all of the externally visible units are clamped to some pattern, and a normalization phase, in which the whole network is allowed to run free. The idea is to modify the weights so that, when running free, the external units assume the various pattern values in the training set in their proper frequencies. If only some subset of the externally visible units are clamped to certain values, the net will produce compatible completions in the other units, again with frequencies that match this part of the training set. A net trained in this way will (in principle -- it might take a *very* long time for anything complicated) do what you suggest: Complete an "input" pattern and produce a compatible output at the same time. This works even if the input is *totally* missing. I believe it was Geoff Hinton who realized that a Boltzmann machine could be trained more efficiently if you do make a distinction between input and output units, and don't waste any of the training effort learning to reconstruct the input. In this model, the instruction phase clamps both input and output units to some pattern, while the normalization phase clamps only the input units. Since the input units are correct in both cases, all of the network's learning power (such as it is) goes into producing correct patterns on the output units. A net trained in this way will not do input-completion. I bring this up because I think many people will only have seen the latter kind of Boltzmann training, and will therefore misunderstand your observation. By the way, one alternative method I have seen proposed for reconstructing missing input values is to first train an auto-encoder (with some degree of bottleneck to get generalization) on the training set, and then feed the output of this auto-encoder into the classification net. The auto-encoder should be able to replace any missing values with some degree of accuracy. I haven't played with this myself, but it does sound plausible. If anyone can point to a good study of this method, please post it here or send me E-mail. -- Scott =========================================================================== Scott E. Fahlman Internet: sef+ at cs.cmu.edu Senior Research Scientist Phone: 412 268-2575 School of Computer Science Fax: 412 681-5739 Carnegie Mellon University Latitude: 40:26:33 N 5000 Forbes Avenue Longitude: 79:56:48 W Pittsburgh, PA 15213 =========================================================================== From zoubin at psyche.mit.edu Fri Feb 4 11:04:32 1994 From: zoubin at psyche.mit.edu (Zoubin Ghahramani) Date: Fri, 4 Feb 94 11:04:32 EST Subject: Encoding missing values Message-ID: <9402041604.AA28037@psyche.mit.edu> Dear Lutz, Thierry, Karun, and connectionists, I have also been looking into the issue of encoding and learning from missing values in a neural network. The issue of handling missing values has been addressed extensively in the statistics literature for obvious reasons. To learn despite the missing values, the data has to be filled in, or the missing values integrated over.
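As a point of reference, here is a minimal sketch of the simplest fill-in scheme mentioned in this thread for continuous attributes: replace each missing value by the mean of the observed values for that attribute (a 'best guess') and append a 0/1 'is missing' indicator input per attribute. The NumPy helper, its name and the use of NaN to mark missing entries are illustrative assumptions, not code from any of the systems discussed.

import numpy as np

def encode_with_indicator(X):
    """X: 2-D float array with NaN marking missing entries.
    Returns mean-imputed inputs plus one 0/1 indicator column per attribute."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                        # True where the value was absent
    col_means = np.nanmean(X, axis=0)            # 'best guess' per attribute
    X_filled = np.where(missing, col_means, X)   # impute the column mean
    return np.hstack([X_filled, missing.astype(float)])

# Example: 3 patterns, 2 attributes, one value missing in each of the first two rows.
X_raw = np.array([[1.0, np.nan],
                  [np.nan, 4.0],
                  [3.0, 6.0]])
print(encode_with_indicator(X_raw))   # imputed inputs followed by the missing flags

Thierry's result above suggests that the indicator inputs themselves carry useful information.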
The basic question is how to fill in the missing data. There are many different methods for doing this in stats (mean imputation, regression imputation, Bayesian methods, EM, etc.). For good reviews see (Little and Rubin 1987; Little, 1992). I do not in general recommend encoding "missing" as yet another value to be learned over. Missing means something in a statistical sense -- that the input could be any of the values with some probability distribution. You could, for example, augment the original data filling in different values for the missing data points according to a prior distribution. Then the training would assign different weights to the artificially filled-in data points depending on how well they predict the output (their posterior probability). This is essentially the method proposed by Buntine and Weigand (1991). Other approaches have been proposed by Tresp et al. (1993) and Ahmad and Tresp (1993). I have just written a paper on the topic of learning from incomplete data. In this paper I bring a statistical algorithm for learning from incomplete data, called EM, into the framework of nonlinear function approximation and classification with missing values. This approach fits the data iteratively with a mixture model and uses that same mixture model to effectively fill in any missing input or output values at each step. You can obtain the preprint by ftp psyche.mit.edu login: anonymous cd pub get zoubin.nips93.ps To obtain code for the algorithm please contact me directly. Zoubin Ghahramani zoubin at psyche.mit.edu ----------------------------------------------------------------------- Ahmad, S and Tresp, V (1993) "Some Solutions to the Missing Feature Problem in Vision." In Hanson, S.J., Cowan, J.D., and Giles, C.L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. Buntine, WL, and Weigand, AS (1991) "Bayesian back-propagation." Complex Systems. Vol 5 no 6 pp 603-43 Ghahramani, Z and Jordan MI (1994) "Supervised learning from incomplete data via an EM approach" To appear in Cowan, J.D., Tesauro, G., and Alspector,J. (eds.). Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers, San Francisco, CA, 1994. Little, RJA (1992) "Regression With Missing X's: A Review." Journal of the American Statistical Association. Volume 87, Number 420. pp. 1227-1237 Little, RJA. and Rubin, DB (1987). Statistical Analysis with Missing Data. Wiley, New York. Tresp, V, Hollatz J, Ahmad S (1993) "Network structuring and training using rule-based knowledge." In Hanson, S.J., Cowan, J.D., and Giles, C.~L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. From Volker.Tresp at zfe.siemens.de Fri Feb 4 13:09:46 1994 From: Volker.Tresp at zfe.siemens.de (Volker Tresp) Date: Fri, 4 Feb 1994 19:09:46 +0100 Subject: missing data Message-ID: <199402041809.AA14305@inf21.zfe.siemens.de> In response to the questions raised by Lutz Prechelt concerning the missing data problem: In general, the solution to the missing-data problem depends on the missing-data mechanism. For example, if you sample the income of a population and rich people tend to refuse the answer the mean of your sample is biased. To obtain an unbiased solution you would have to take into account the missing-data mechanism. 
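A small simulation makes the point; the log-normal incomes and the particular refusal probability below are made-up assumptions, chosen only so that richer respondents refuse more often.

import numpy as np

rng = np.random.default_rng(1)
income = rng.lognormal(mean=10.0, sigma=0.7, size=100_000)      # true population
# Probability of refusing to answer grows with (log) income.
p_refuse = 1.0 / (1.0 + np.exp(-(np.log(income) - 10.0)))
observed = income[rng.random(income.size) > p_refuse]           # the survey sample

print("true mean     %.0f" % income.mean())
print("observed mean %.0f" % observed.mean())                   # systematically too low

Simply averaging (or mean-imputing) the observed values ignores the mechanism that generated the gaps and therefore underestimates the true mean.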
The missing-data mechanism can be ignored if it is independent of the input and the output (in the example: the likelihood that a person refuses to answer is independent of the person's income). Most approaches assume that the missing-data mechanism can be ignored.

There exist a number of ad hoc solutions to the missing-data problem, but it is also possible to approach the problem from a statistical point of view. In our paper (which will be published in the upcoming NIPS volume and which will be available on neuroprose shortly) we discuss a systematic likelihood-based approach. NN-regression can be framed as a maximum likelihood learning problem if we assume the standard signal plus Gaussian noise model P(x, y) = P(x) P(y|x) \propto P(x) exp(-1/(2 \sigma^2) (y - NN(x))^2). By deriving the probability density function for a pattern with missing features we can formulate a likelihood function including patterns with complete and incomplete features. The solution requires an integration over the missing input. In practice, the integral is approximated numerically. For networks of Gaussian basis functions, it is possible to obtain closed-form solutions (by extending the EM algorithm).

Our paper also discusses why and when ad hoc solutions -- such as substituting the mean for an unknown input -- are harmful. For example, if the mapping is approximately linear, substituting the mean might work quite well. In general, though, it introduces bias.

Training with missing and noisy input data is described in: ``Training Neural Networks with Deficient Data,'' V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G., and Alspector, J. (eds.), {\em Advances in Neural Information Processing Systems 6}, Morgan Kaufmann, 1994. A related paper by Zoubin Ghahramani and Michael Jordan will also appear in the upcoming NIPS volume.

Recall with missing and noisy data is discussed in (available in neuroprose as ahmad.missing.ps.Z): ``Some Solutions to the Missing Feature Problem in Vision,'' S. Ahmad and V. Tresp, in {\em Advances in Neural Information Processing Systems 5,} S. J. Hanson, J. D. Cowan, and C. L. Giles eds., San Mateo, CA, Morgan Kaufmann, 1993.

Volker Tresp Subutai Ahmad Ralph Neuneier tresp at zfe.siemens.de ahmad at interval.com ralph at zfe.siemens.de

From wray at ptolemy-ethernet.arc.nasa.gov Fri Feb 4 15:19:44 1994 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Fri, 4 Feb 94 12:19:44 PST Subject: Encoding missing values In-Reply-To: <199402031515.KAA29100@faline.bellcore.com> (karun@faline.bellcore.com) Message-ID: <9402042019.AA05621@ptolemy.arc.nasa.gov>

regarding this missing value question raised thusly .... by Thierry Denoeux, Lutz Prechelt, and others >>>>>>>>>>>>>>> > So far to my considerations. Now to my questions. > > a) Can you think of other encoding methods that seem reasonable ? Which ? > > b) Do you have experience with some of these methods that is worth sharing ? > > c) Have you compared any of the alternatives directly ? > > Lutz + > I have not found a simple solution that is general. I think > representation in general and the missing information in specific > are open problems within connectionist research. I am not sure we will > have a magic bullet for all problems. The best approach is to come up > with a specific solution for a given problem.
-> Karun >>>>>>>>>>

This missing value problem is of course shared amongst all the learning communities, artificial intelligence, statistics, pattern recognition, etc., not just neural networks. A classic study in this area, which includes most suggestions I've read here so far, is @inproceedings{quinlan:ml6, AUTHOR = "J.R. Quinlan", TITLE = "Unknown Attribute Values in Induction", YEAR = 1989, BOOKTITLE = "Proceedings of the Sixth International Machine Learning Workshop", PUBLISHER = "Morgan Kaufmann", ADDRESS = "Cornell, New York"}

The most frequently cited methods I've seen, and they're so common amongst the different communities it's hard to lay credit:
1) replace missing values by some best guess
2) fracture the example into a set of fractional examples, each with the missing value filled in somehow
3) call the missing value another input value

3 is a good thing to do if the values are "informatively" missing, i.e. if someone leaves the entry "telephone number" blank in a questionnaire, then maybe they don't have a telephone; but it is probably not good otherwise, unless you have loads of data and don't mind all the extra example types generated (as already mentioned). 1 is a quick and dirty hack at 2. How good it is depends on your application. 2 is an approximation to the "correct" approach for handling "non-informative" missing values according to the standard "mixture model". The mathematics for this is general and applies to virtually any learning algorithm: trees, feed-forward nets, linear regression, whatever. We do it for feed-forward nets in @article{buntine.weigend:bbp, AUTHOR = "W.L. Buntine and A.S. Weigend", TITLE = "Bayesian Back-Propagation", JOURNAL = "Complex Systems", Volume = 5, PAGES = "603--643", Number = 1, YEAR = "1991" } and see Tresp, Ahmad & Neuneier in NIPS'94 for an implementation. But no doubt someone probably published the general idea back in the 50's.

I certainly wouldn't call missing values an open problem. Rather, "efficient implementations of the standard approaches" is, in some cases, an open problem.

Wray Buntine NASA Ames Research Center phone: (415) 604 3389 Mail Stop 269-2 fax: (415) 604 3594 Moffett Field, CA, 94035-1000 email: wray at kronos.arc.nasa.gov

From stork at cache.crc.ricoh.com Fri Feb 4 11:57:37 1994 From: stork at cache.crc.ricoh.com (David G. Stork) Date: Fri, 4 Feb 94 08:57:37 -0800 Subject: Missing features... Message-ID: <9402041657.AA12260@neva.crc.ricoh.com>

There is a provably optimal method for performing classification with missing inputs, described in Chapter 2 of "Pattern Classification and Scene Analysis" (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, which avoids the ad-hoc heuristics that have been described by others. Those interested in obtaining Chapter two via ftp should contact me.

Dr. David G. Stork Chief Scientist and Head, Machine Learning and Perception Ricoh California Research Center 2882 Sand Hill Road Suite 115 Menlo Park, CA 94025-7022 USA 415-496-5720 (w) 415-854-8740 (fax) stork at crc.ricoh.com

From wray at ptolemy-ethernet.arc.nasa.gov Fri Feb 4 15:47:25 1994 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Fri, 4 Feb 94 12:47:25 PST Subject: some questions on training neural nets... In-Reply-To: <9402031640.AA01243@predict.com> (message from William Finnoff on Thu, 3 Feb 94 09:40:51 MST) Message-ID: <9402042047.AA06120@ptolemy.arc.nasa.gov>

Tom Dietterich and William Finnoff covered a lot of issues.
I'd just like to highlight two points:
* this is a contentious area
* there are several opposing factors at play that confuse our understanding of this

================ detail

Basically, this comment below is SO true. > There are many ways to manage the bias/variance tradeoff. I would say > that there is nothing approaching complete agreement on the best > approaches (and more fundamentally, the best approach varies from one > application to another, since this is really a form of prior). The > approaches can be summarized as

The bias/variance tradeoff lies at the heart of almost all disagreements between different learning philosophies such as classical, Bayesian, minimum description length, resampling schemes (now often viewed as empirical Bayesian), statistical physics approaches, and the various "implementation" schemes. One thing to note is that there are several quite separate forces in operation here:

computational and search issues: (e.g. maybe early stopping works better because it's a more efficient way of searching the space of smaller networks?)

prior issues: (e.g. have you thrown in 20 attributes you happen to think might apply, but probably 15 are irrelevant; OR did a medical specialist carefully pick all 10 attributes and assure you every one is important; OR is a medical specialist able to solve the task blind, just by reading the 20 attribute values (without seeing the patient), etc.) (e.g. are 30 hidden units adequate for the structure of the task?)

asking the right question: (e.g. sometimes the question of what the "best" network is can be a bit silly when you have a small amount of data; perhaps you should be trying to find 10 reasonable alternative networks and pool their results (a la Michael Perrone's NIPS'93 workshop))

understanding your representation: (e.g. with rule based systems, each rule has a good interpretation so the question of how to prune, etc., is something you can understand well, BUT with a large feed-forward network, understanding the structure of the space is more involved, e.g. if I set these 2 weights to zero what the hell happens to my proposed solution) (e.g. this confuses the problem of designing good regularizers/priors/network-encodings).

The problem is that theory people tend to focus on one, maybe two, of these, whereas application people tend to confuse them together.

Wray Buntine NASA Ames Research Center phone: (415) 604 3389 Mail Stop 269-2 fax: (415) 604 3594 Moffett Field, CA, 94035-1000 email: wray at kronos.arc.nasa.gov

From kak at gate.ee.lsu.edu Fri Feb 4 17:24:34 1994 From: kak at gate.ee.lsu.edu (Subhash Kak) Date: Fri, 4 Feb 94 16:24:34 CST Subject: Encoding missing values Message-ID: <9402042224.AA23849@gate.ee.lsu.edu>

Missing values in feedback networks raise interesting questions: Should these values be considered "don't know" values or should these be generated in some "most likelihood" fashion? These issues are discussed in the following paper: S.C. Kak, "Feedback neural networks: new characteristics and a generalization", Circuits, Systems, Signal Processing, vol. 12, no. 2, 1993, pp. 263-278.
-Subhash Kak From moody at chianti.cse.ogi.edu Fri Feb 4 18:50:07 1994 From: moody at chianti.cse.ogi.edu (John Moody) Date: Fri, 4 Feb 94 15:50:07 -0800 Subject: PhD and Masters Programs at the Oregon Graduate Institute Message-ID: <9402042350.AA19148@chianti.cse.ogi.edu> Fellow Connectionists: The Oregon Graduate Institute of Science and Technology (OGI) has openings for a few outstanding students in its Computer Science and Electrical Engineering Masters and Ph.D programs in the areas of Neural Networks, Learning, Signal Processing, Time Series, Control, Speech, Language, and Vision. Faculty and postdocs in these areas include Etienne Barnard, Ron Cole, Mark Fanty, Dan Hammerstrom, Hynek Hermansky, Todd Leen, Uzi Levin, John Moody, David Novick, Misha Pavel, Joachim Utans, Eric Wan, and Lizhong Wu. Short descriptions of our research interests are appended below. OGI is a young, but rapidly growing, private research institute located in the Portland area. OGI offers Masters and PhD programs in Computer Science and Engineering, Applied Physics, Electrical Engineering, Biology, Chemistry, Materials Science and Engineering, and Environmental Science and Engineering. Inquiries about the Masters and PhD programs and admissions for either Computer Science or Electrical Engineering should be addressed to: Margaret Day, Director Office of Admissions and Records Oregon Graduate Institute PO Box 91000 Portland, OR 97291 Phone: (503)690-1028 Email: margday at admin.ogi.edu The final deadline for receipt of all applications materials for the Ph.D. programs is March 1, 1994, so it's not too late to apply! Masters program applications are accepted continuously. +++++++++++++++++++++++++++++++++++++++++++++++++++++++ Oregon Graduate Institute of Science & Technology Department of Computer Science and Engineering & Department of Electrical Engineering and Applied Physics Research Interests of Faculty in Adaptive & Interactive Systems (Neural Networks, Signal Processing, Control, Speech, Language, and Vision) Etienne Barnard (Assistant Professor): Etienne Barnard is interested in the theory, design and implementation of pattern-recognition systems, classifiers, and neural networks. He is also interested in adaptive control systems -- specifically, the design of near-optimal controllers for real- world problems such as robotics. Ron Cole (Professor): Ron Cole is director of the Center for Spoken Language Understanding at OGI. Research in the Center currently focuses on speaker- independent recognition of continuous speech over the telephone and automatic language identification for English and ten other languages. The approach combines knowledge of hearing, speech perception, acoustic phonetics, prosody and linguistics with neural networks to produce systems that work in the real world. Mark Fanty (Research Assistant Professor): Mark Fanty's research interests include continuous speech recognition for the telephone; natural language and dialog for spoken language systems; neural networks for speech recognition; and voice control of computers. Dan Hammerstrom (Associate Professor): Based on research performed at the Institute, Dan Hammerstrom and several of his students have spun out a company, Adaptive Solutions Inc., which is creating massively parallel computer hardware for the acceleration of neural network and pattern recognition applications. There are close ties between OGI and Adaptive Solutions. 
Dan is still on the faculty of the Oregon Graduate Institute and continues to study next generation VLSI neurocomputer architectures. Hynek Hermansky (Associate Professor); Hynek Hermansky is interested in speech processing by humans and machines with engineering applications in speech and speaker recognition, speech coding, enhancement, and synthesis. His main research interest is in practical engineering models of human information processing. Todd K. Leen (Associate Professor): Todd Leen's research spans theory of neural network models, architecture and algorithm design and applications to speech recognition. His theoretical work is currently focused on the foundations of stochastic learning, while his work on Algorithm design is focused on fast algorithms for non-linear data modeling. Uzi Levin (Senior Research Scientist): Uzi Levin's research interests include neural networks, learning systems, decision dynamics in distributed and hierarchical environments, dynamical systems, Markov decision processes, and the application of neural networks to the analysis of financial markets. John Moody (Associate Professor): John Moody does research on the design and analysis of learning algorithms, statistical learning theory (including generalization and model selection), optimization methods (both deterministic and stochastic), and applications to signal processing, time series, and finance. David Novick (Assistant Professor): David Novick conducts research in interactive systems, including computational models of conversation, technologically mediated communication, and human-computer interaction. A central theme of this research is the role of meta-acts in the control of interaction. Current projects include dialogue models for telephone-based information systems. Misha Pavel (Associate Professor): Misha Pavel does mathematical and neural modeling of adaptive behaviors including visual processing, pattern recognition, visually guided motor control, categorization, and decision making. He is also interested in the application of these models to sensor fusion, visually guided vehicular control, and human-computer interfaces. Joachim Utans (Post-Doctoral Research Associate): Joachim Utans's research interests include computer vision and image processing, model based object recognition, neural network learning algorithms and optimization methods, model selection and generalization, with applications in handwritten character recognition and financial analysis. Lizhong Wu (Post-Doctoral Research Associate): Lizhong Wu's research interests include neural network theory and modeling, time series analysis and prediction, pattern classification and recognition, signal processing, vector quantization, source coding and data compression. He is now working on the application of neural networks and nonparametric statistical paradigms to finance. Eric A. Wan (Assistant Professor): Eric Wan's research interests include learning algorithms and architectures for neural networks and adaptive signal processing. He is particularly interested in neural applications to time series prediction, adaptive control, active noise cancellation, and telecommunications. From hicks at cs.titech.ac.jp Sun Feb 6 17:22:17 1994 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Sun, 6 Feb 94 17:22:17 JST Subject: Methods for improving generalization (was Re: some questions on ...) Message-ID: <9402060822.AA11860@maruko.cs.titech.ac.jp> Dear Mr. 
Grossman, I read with great interest your analysis of overlearning and your research into achieving better generalization with less data. However, I only want to point out an omission in your background description. In the abstract of your paper "Use of Bad Training Data For Better Predictions" you write:

>Use of noise sensitivity signatures is distinctly different from other schemes >to avoid overtraining, such as cross-validation, which uses only part of the >training data, or various penalty functions, which are not data-adaptive. >Noise sensitivity signature methods use all of the training data and >are manifestly data-adaptive and non-parametric.

When you say penalty functions, the first thing that comes to mind is a penalty on the sum of squared weights. This method is indeed not data-adaptive. However, an interesting article in Neural Computation 4, pp. 473-493, "Simplifying Neural Networks by Soft Weight-Sharing", proposes a weight penalty method which is adaptive. Basically, the weights are grouped together in Gaussian clusters whose mean and variance are allowed to adapt to the data. The experimental results they published show improvement over both cross-validation and weight decay. I am looking forward to reading your paper when it is available.

Yours Respectfully, Craig Hicks

Craig Hicks hicks at cs.titech.ac.jp | Kore ya kono Yuku mo kaeru mo Ogawa Laboratory, Dept. of Computer Science | Wakarete wa Shiru mo shiranu mo Tokyo Institute of Technology, Tokyo, Japan | Ausaka no seki lab:03-3726-1111 ext.2190 home:03-3785-1974 | (from hyaku-nin-issyu) fax: +81(3)3729-0685 (from abroad) 03-3729-0685 (from Japan)

From pluto at cs.ucsd.edu Fri Feb 4 17:01:47 1994 From: pluto at cs.ucsd.edu (Mark Plutowski) Date: Fri, 04 Feb 1994 14:01:47 -0800 Subject: some questions on training neural nets... Message-ID: <9402042201.AA16326@odin.ucsd.edu>

I have another reference to add that may be helpful to those interested in the cross-validation issue raised in the following discussion, which I have edited in what follows to focus on the particular issue this reference addresses:

------- Forwarded Message From tgd at chert.CS.ORST.EDU Wed Feb 2 13:02:30 1994 From: tgd at chert.CS.ORST.EDU (Tom Dietterich) Date: Wed, 2 Feb 94 10:02:30 PST Subject: some questions on training neural nets... In-Reply-To: "Charles X. Ling"'s message of Tue, 1 Feb 94 03:37:10 EST <9402010837.AA01695@godel.csd.uwo.ca> Message-ID: <9402021802.AA00565@curie.CS.ORST.EDU>

In answer to the following: From: "Charles X. Ling" Date: Tue, 1 Feb 94 03:37:10 EST Hi neural net experts, Will cross-validation help ? [...] (could results on the validation set be coincident)?

Tom Dietterich replies: [stuff deleted] There are many ways to manage the bias/variance tradeoff. I would say that there is nothing approaching complete agreement on the best approaches (and more fundamentally, the best approach varies from one application to another, since this is really a form of prior). The approaches can be summarized as * early stopping * error function penalties * size optimization - growing - pruning - other

Early stopping usually employs cross-validation to decide when to stop training (see below). In my experience, training an overlarge network with early stopping gives better performance than trying to find the minimum network size. It has the disadvantage that training costs are very high. [stuff deleted] 3.
If, for some reason, cross-validation is needed, and TR is split to TR1 (for training) and TR2 (for validation), what would be the proper ways to do cross-validation? Training on TR1 uses only partial information in TR, but training TR1 to find right parameters and then training on TR1+TR2 may require parameters different from the estimation of training TR1.

I use the TR1+TR2 approach. On large data sets, this works well. On small data sets, the cross-validation estimates themselves are very noisy, so I have not found it to be as successful. I compute the stopping point using the sum squared error per training example, so that it scales. I think it is an open research problem to know whether this is the right thing to do. [the reply continues..]

------- End of Forwarded Message

In response to the last point, I supply a reference that provides theoretical guidance from a statistical perspective. It proves that cross-validation estimates Integrated Mean Squared Error (IMSE) within a constant due to noise.

What this means: IMSE is a version of the mean squared error that accounts for the finite size of the training set. Think of it as the expected squared error obtained by training a network on random training sets of a particular size. It is an ideal (i.e., in general, unobservable) measure of generalization. IMSE embodies the bias and variance tradeoff. It can be decomposed into the sum of two terms, which directly quantify the bias + variance. Therefore, if IMSE embodies the measure of generalization that is relevant to you (which will depend on your learning task), then least-squares cross-validation provides a realizable estimate of generalization.

Summary of the main results of the paper: It proves that two versions of cross-validation (one being the "hold-out set" version discussed above, and the other being the "delete-1" version) provide unbiased and strongly consistent estimates of IMSE. This is statistical jargon meaning that, on average, the estimate is accurate (i.e., the expectation of the estimate for given training set size equals the IMSE + a noise term) and asymptotically precise (in that as the training set and test set size grow large, the estimate converges to the IMSE within the constant factor due to noise, with probability 1). Note that it does not say anything about the rate at which the variance of the estimate converges to the truth; therefore, it is possible that other IMSE-approximate measures may excel for small training set sizes (e.g., resampling methods such as bootstrap and jackknife). However, it is the first result generally applicable to nonlinear regression that the authors are aware of, extending the well-known (in the statistical and econometric literature) work by C.J. Stone and others that proves similar results for particular learning tasks or for particular models.

The statement of the results will appear in NIPS 6. I will post the soon-to-be-completed extended version to Neuroprose if anyone wants to see it sooner or needs access to the proofs. I hope this is helpful,

= Mark Plutowski Institute for Neural Computation, and Department of Computer Science and Engineering University of California, San Diego La Jolla, California. USA.

Here is the reference: Plutowski, Mark~E., Shinichi Sakata, and Halbert White. (1994). ``Cross-validation estimates IMSE.'' Cowan, J.D., Tesauro, G., and Alspector, J. (eds.), {\em Advances in Neural Information Processing Systems 6}, San Mateo, CA: Morgan Kaufmann Publishers.
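For readers who want to see the "hold-out set" version of cross-validation in its simplest early-stopping role, here is a minimal sketch (Python/NumPy; the toy sine data, the small tanh network and every constant are invented for illustration, not taken from the paper). It trains on TR1, evaluates the mean squared error on TR2 after each epoch, and records the epoch at which that hold-out estimate is lowest:

import numpy as np

rng = np.random.default_rng(1)

# Toy regression task: noisy sine. TR1 = training part, TR2 = hold-out part.
X = rng.uniform(-3.0, 3.0, size=(200, 1))
y = np.sin(X) + 0.2 * rng.normal(size=X.shape)
X1, y1, X2, y2 = X[:150], y[:150], X[150:], y[150:]

# One-hidden-layer tanh network trained by batch gradient descent.
H = 20
W1 = 0.5 * rng.normal(size=(1, H)); b1 = np.zeros(H)
W2 = 0.5 * rng.normal(size=(H, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)
    return h, h @ W2 + b2

def mse(X, y):
    return float(np.mean((forward(X)[1] - y) ** 2))

lr, best_val, best_epoch = 0.05, np.inf, -1
for epoch in range(2000):
    h, out = forward(X1)
    err = (out - y1) / len(X1)        # gradient of (1/2)*MSE w.r.t. the output
    dW2 = h.T @ err;  db2 = err.sum(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    dW1 = X1.T @ dh;  db1 = dh.sum(0)
    W1 -= lr * dW1; b1 -= lr * db1; W2 -= lr * dW2; b2 -= lr * db2

    val = mse(X2, y2)                 # hold-out (cross-validation) estimate
    if val < best_val:
        best_val, best_epoch = val, epoch

print("final training MSE:", round(mse(X1, y1), 4))
print("best hold-out MSE: ", round(best_val, 4), "at epoch", best_epoch)

In practice one would average over several splits, retrain on TR1+TR2 afterwards, or use the per-example error scaling Dietterich describes; the point of the sketch is only that the quantity monitored on TR2 is exactly the hold-out cross-validation estimate whose relation to IMSE the paper establishes.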
From esann at dice.ucl.ac.be Sun Feb 6 15:19:56 1994 From: esann at dice.ucl.ac.be (esann@dice.ucl.ac.be) Date: Sun, 6 Feb 94 21:19:56 +0100 Subject: ESANN'94: European Symposium on ANNs Message-ID: <9402062019.AA07827@ns1.dice.ucl.ac.be> ****************************************************************** * European Symposium * * on Artificial Neural Networks * * * * Brussels (Belgium) - April 20-21-22, 1994 * * * * Preliminary Program and registration form * ****************************************************************** Foreword ******** The actual developments in the field of artificial neural networks mark a watershed in its relatively young history. Far from the blind passion for disparate applications some years ago, the tendency is now to an objective assessment of this emerging technology, with a better knowledge of the basic concepts, and more appropriate comparisons and links with classical methods of computing. Neural networks are not restricted to the use of back-propagation and multi-layer perceptrons. Self-organization, adaptive signal processing, vector quantization, classification, statistics, image and speech processing are some of the domains where neural networks techniques may be successfully used; but a beneficial use goes through an in-depth examination of both the theoretical basis of the neural techniques and standard methods commonly used in the specified domain. ESANN'94 is the second symposium covering these specified aspects of neural networks computing. After a successful edition in 1993, ESANN'94 will open new perspectives, by focusing on theoretical and mathematical aspects of neural networks, biologically-inspired models, statistical aspects, and relations between neural networks and both information and signal processing (classification, vector quantization, self-organization, approximation of functions, image and speech processing,...). The steering and program committees of ESANN'94 are pleased to invite you to participate to this symposium. More than a formal conference presenting the last developments in the field, ESANN'94 will be also a forum for open discussions, round tables and opportunities for future collaborations. We hope to have the pleasure to meet you in April, in the splendid town of Brussels, and that your stay in Belgium will be as scientifically beneficial as agreeable. Symposium information ********************* Registration fees for symposium ------------------------------- registration before registration after 18th March 1994 18th March 1994 Universities BEF 14500 BEF 15500 Industries BEF 18500 BEF 19500 Registration fees include attendance to all sessions, the ESANN'94 banquet, a copy of the conference proceedings, daily lunches (20-22 April '94), and coffee breaks twice a day during the symposium. Advance registration is mandatory. Young researchers may apply for grants offered by the European Community (restricted to citizens or residents of a Western European country or, tentatively, Central or Eastern European country - deadline for applications: March 11th, 1994 - please write to the conference secretariat for details). Advance payments (see registration form) must be made to the conference secretariat by bank transfers in Belgian Francs (free of charges) or by sending a cheque (add BEF 500 for processing fees). Language -------- The official language of the conference is English. It will be used for all printed material, presentations and discussions. 
Proceedings ----------- A copy of the proceedings will be provided to all Conference Registrants. All technical papers will be included in the proceedings. Additional copies of the proceedings (ESANN'93 and ESANN'94) may be purchased at the following rate: ESANN'94 proceedings: BEF 2000 ESANN'93 proceedings: BEF 1500. Add BEF 500 to any order for p.&p. and/or bank charges. Please write to the conference secretariat for ordering proceedings. Conference dinner ----------------- A banquet will be offered on Thursday 21th to all conference registrants in a famous and typical place of Brussels. Additional vouchers for the banquet may be purchased on Wednesday 20th at the conference. Cancellation ------------ If cancellation is received by 25th March 1994, 50% of the registration fees will be returned. Cancellation received after this date will not be entitled to any refund. General information ******************* Brussels, Belgium ----------------- Brussels is not only the host city of the European Commission and of hundreds of multinational companies; it is also a marvelous historical town, with typical quarters, famous monuments known throughout the world, and the splendid "Grand-Place". It is a cultural and artistic center, with numerous museums. Night life in Brussels is considerable. There are of lot of restaurants and pubs open late in the night, where typical Belgian dishes can be tasted with one of the more than 1000 different beers. Hotel accommodation ------------------- Special rates for participants to ESANN'94 have been arranged at the MAYFAIR HOTEL, a De Luxe 4 stars hotel with 99 fully air conditioned guest rooms, tastefully decorated to the highest standards of luxury and comfort. The hotel includes two restaurants, a bar and private parking. Public transportation (trams n93 & 94) goes directly from the hotel to the conference center (Parc stop) Single room BEF 2800 Double room or twin room BEF 3500 Prices include breakfast, taxes and service. Rooms can only be confirmed upon receipt of booking form (see at the end of this booklet) and deposit. Located on the elegant Avenue Louise, the exclusive Hotel Mayfair is a short walk from the "uppertown" luxurious shopping district. Also nearby is the 14th century Cistercian abbey and the magnificent "Bois de la Cambre" park with its open-air cafes - ideal for a leisurely stroll at the end of a busy day. HOTEL MAYFAIR tel: +32 2 649 98 00 381 av. Louise fax: +32 2 649 22 49 1050 Brussels - Belgium Conference location ------------------- The conference will be held at the "Chancellerie" of the Generale de Banque. A map is included in the printed programme. Generale de Banque - Chancellerie 1 rue de la Chancellerie 1000 Brussels - Belgium Conference secretariat D facto conference services tel: + 32 2 245 43 63 45 rue Masui fax: + 32 2 245 46 94 B-1210 Brussels - Belgium E-mail: esann at dice.ucl.ac.be PROGRAM OF THE CONFERENCE ************************* Wednesday 20th April 1994 ------------------------- 9H30 Registration 10H00 Opening session Session 1: Neural networks and chaos Chairman: M. Hasler (Ecole Polytechnique Fdrale de Lausanne, Switzerland) 10H10 "Concerning the formation of chaotic behaviour in recurrent neural networks" T. Kolb, K. Berns Forschungszentrum Informatik Karlsruhe (Germany) 10H30 "Stability and bifurcation in an autoassociative memory model" W.G. Gibson, J. Robinson, C.M. Thomas University of Sidney (Australia) 10H50 Coffee break Session 2: Theoretical aspects 1 Chairman: C. 
Jutten (Institut National Polytechnique de Grenoble, France) 11H30 "Capabilities of a structured neural network. Learning and comparison with classical techniques" J. Codina, J. C. Aguado, J.M. Fuertes Universitat Politecnica de Catalunya (Spain) 11H50 "Projection learning: alternative approaches to the computation of the projection" K. Weigl, M. Berthod INRIA Sophia Antipolis (France) 12H10 "Stability bounds of momentum coefficient and learning rate in backpropagation algorithm"" Z. Mao, T.C. Hsia University of California at Davis (USA) 12H30 Lunch Session 3: Links between neural networks and statistics Chairman: J.C. Fort (Universit Nancy I, France) 14H00 "Model selection for neural networks: comparing MDL and NIC"" G. te Brake*, J.N. Kok*, P.M.B. Vitanyi** *Utrecht University, **Centre for Mathematics and Computer Science, Amsterdam (Netherlands) 14H20 "Estimation of performance bounds in supervised classification" P. Comon*, J.L. Voz**, M. Verleysen** *Thomson-Sintra Sophia Antipolis (France), **Universit Catholique de Louvain, Louvain-la-Neuve (Belgium) 14H40 "Input Parameters' estimation via neural networks" I.V. Tetko, A.I. Luik Institute of Bioorganic & Petroleum Chemistry, Kiev (Ukraine) 15H00 "Combining multi-layer perceptrons in classification problems" E. Filippi, M. Costa, E. Pasero Politecnico di Torino (Italy) 15H20 Coffee break Session 4: Algorithms 1 Chairman: J. Hrault (Institut National Polytechnique de Grenoble, France) 16H00 "Diluted neural networks with binary couplings: a replica symmetry breaking calculation of the storage capacity" J. Iwanski, J. Schietse Limburgs Universitair Centrum (Belgium) 16H20 "Storage capacity of the reversed wedge perceptron with binary connections" G.J. Bex, R. Serneels Limburgs Universitair Centrum (Belgium) 16H40 "A general model for higher order neurons" F.J. Lopez-Aligue, M.A. Jaramillo-Moran, I. Acedevo-Sotoca, M.G. Valle Universidad de Extremadura, Badajoz (Spain) 17H00 "A discriminative HCNN modeling" B. Petek University of Ljubljana (Slovenia) Thursday 21th April 1994 ------------------------ Session 5: Biological models Chairman: P. Lansky (Academy of Science of the Czech Republic) 9H00 "Biologically plausible hybrid network design and motor control" G.R. Mulhauser University of Edinburgh (Scotland) 9H20 "Analysis of critical effects in a stochastic neural model" W. Mommaerts, E.C. van der Meulen, T.S. Turova K.U. Leuven (Belgium) 9H40 "Stochastic model of odor intensity coding in first-order olfactory neurons" J.P. Rospars*, P. Lansky** *INRA Versailles (France), **Academy of Sciences, Prague (Czech Republic) 10H00 "Memory, learning and neuromediators" A.S. Mikhailov Fritz-Haber-Institut der MPG, Berlin (Germany), and Russian Academy of Sciences, Moscow (Russia) 10H20 "An explicit comparison of spike dynamics and firing rate dynamics in neural network modeling" F. Chapeau-Blondeau, N. Chambet Universit d'Angers (France) 10H40 Coffee break Session 6: Algorithms 2 Chairman: T. Denoeux (Universit Technologique de Compigne, France) 11H10 "A stop criterion for the Boltzmann machine learning algorithm" B. Ruf Carleton University (Canada) 11H30 "High-order Boltzmann machines applied to the Monk's problems" M. Grana, V. Lavin, A. D'Anjou, F.X. Albizuri, J.A. Lozano UPV/EHU, San Sebastian (Spain) 11H50 "A constructive training algorithm for feedforward neural networks with ternary weights" F. Aviolat, E. 
Mayoraz Ecole Polytechnique Fdrale de Lausanne (Switzerland) 12H10 "Synchronization in a neural network of phase oscillators with time delayed coupling" T.B. Luzyanina Russian Academy of Sciences, Moscow (Russia) 12H30 Lunch Session 7: Evolutive and incremental learning Chairman: T.J. Stonham (Brunel University, UK) - to be confirmed 14H00 "Reinforcement learning and neural reinforcement learning" S. Sehad, C. Touzet Ecole pour les Etudes et la Recherche en Informatique et Electronique, Nmes (France) 14H20 "Improving piecewise linear separation incremental algorithms using complexity reduction methods" J.M. Moreno, F. Castillo, J. Cabestany Universitat Politecnica de Catalunya (Spain) 14H40 "A comparison of two weight pruning methods" O. Fambon, C. Jutten Institut National Polytechnique de Grenoble (France) 15H00 "Extending immediate reinforcement learning on neural networks to multiple actions" C. Touzet Ecole pour les Etudes et la Recherche en Informatique et Electronique, Nmes (France) 15H20 "Incremental increased complexity training" J. Ludik, I. Cloete University of Stellenbosch (South Africa) 15H40 Coffee break Session 8: Function approximation Chairman: E. Filippi (Politecnico di Torino, Italy) - to be confirmed 16H20 "Approximation of continuous functions by RBF and KBF networks" V. Kurkova, K. Hlavackova Academy of Sciences of the Czech Republic 16H40 "An optimized RBF network for approximation of functions" M. Verleysen*, K. Hlavackova** *Universit Catholique de Louvain, Louvain-la-Neuve (Belgium), **Academy of Science of the Czech Republic 17H00 "VLSI complexity reduction by piece-wise approximation of the sigmoid function" V. Beiu, J.A. Peperstraete, J. Vandewalle, R. Lauwereins K.U. Leuven (Belgium) 20H00 Conference dinner Friday 22th April 1994 ---------------------- Session 9: Algorithms 3 Chairman: J. Vandewalle (K.U. Leuven, Belgium) - to be confirmed 9H00 "Dynamic pattern selection for faster learning and controlled generalization of neural networks" A. Rbel Technische Universitt Berlin (Germany) 9H20 "Noise reduction by multi-target learning" J.A. Bullinaria Edinburgh University (Scotland) 9H40 "Variable binding in a neural network using a distributed representation" A. Browne, J. Pilkington South Bank University, London (UK) 10H00 "A comparison of neural networks, linear controllers, genetic algorithms and simulated annealing for real time control" M. Chiaberge*, J.J. Merelo**, L.M. Reyneri*, A. Prieto**, L. Zocca* *Politecnico di Torino (Italy), **Universidad de Granada (Spain) 10H20 "Visualizing the learning process for neural networks" R. Rojas Freie Universitt Berlin (Germany) 10H40 Coffee break Session 10: Theoretical aspects 2 Chairman: M. Cottrell (Universit Paris I, France) 11H20 "Stability analysis of diagonal recurrent neural networks" Y. Tan, M. Loccufier, R. De Keyser, E. Noldus University of Gent (Belgium) 11H40 "Stochastics of on-line back-propagation" T. Heskes University of Illinois at Urbana-Champaign (USA) 12H00 "A lateral contribution learning algorithm for multi MLP architecture" N. Pican*, J.C. Fort**, F. Alexandre* *INRIA Lorraine, **Universit Nancy I (France) 12H20 Lunch Session 11: Self-organization Chairman: F. Blayo (EERIE Nmes, France) 14H00 "Two or three things that we know about the Kohonen algorithm" M. Cottrell*, J.C. Fort**, G. Pags*** Universits *Paris 1, **Nancy 1, ***Paris 6 (France) 14H20 "Decoding functions for Kohonen maps" M. Alvarez, A. 
Varfis CEC Joint Research Center, Ispra (Italy) 14H40 "Improvement of learning results of the selforganizing map by calculating fractal dimensions" H. Speckmann, G. Raddatz, W. Rosenstiel University of Tbingen (Germany) 15H00 Coffee break Session 11 (continued): Self-organization Chairman: F. Blayo (EERIE Nmes, France) 15H40 "A non linear Kohonen algorithm" J.-C. Fort*, G. Pags** *Universit Nancy 1, **Universits Pierre et Marie Curie, et Paris 12 (France) 16H00 "Self-organizing maps based on differential equations" A. Kanstein, K. Goser Universitt Dortmund (Germany) 16H20 "Instabilities in self-organized feature maps with short neighbourhood range" R. Der, M. Herrmann Universitt Leipzig (Germany) ESANN'94 Registration and Hotel Booking Form ******************************************** Registration fees ----------------- registration before registration after 18th March 1994 18th March 1994 Universities BEF 14500 BEF 15500 Industries BEF 18500 BEF 19500 University fees are applicable to members and students of academic and teaching institutions. Each registration will be confirmed by an acknowledgment of receipt, which must be given to the registration desk of the conference to get entry badge, proceedings and all materials. Registration fees include attendance to all sessions, the ESANN'94 banquet, a copy of the conference proceedings, daily lunches (20-22 April '94), and coffee breaks twice a day during the symposium. Advance registration is mandatory. Students and young researchers from European countries may apply for European Community grants. Hotel booking ------------- Hotel MAYFAIR (4 stars) - 381 av. Louise - 1050 Brussels Single room : BEF 2800 Double room (large bed) : BEF 3500 Twin room (2 beds) : BEF 3500 Prices include breakfast, service and taxes. A deposit corresponding to the first night is mandatory. Registration to ESANN'94 (please give full address and tick appropriate) ------------------------------------------------------------------------ Ms., Mr., Dr., Prof.:............................................... Name:............................................................... First Name:......................................................... Institution:........................................................ ................................................................... Address:............................................................ ................................................................... ZIP:................................................................ Town:............................................................... Country:............................................................ Tel:................................................................ Fax:................................................................ E-mail:............................................................. VAT n:............................................................. 
Universities: O registration before 18th March 1994: BEF 14500 O registration after 18th March 1994: BEF 15500 Industries: O registration before 18th March 1994: BEF 18500 O registration after 18th March 1994: BEF 19500 Hotel Mayfair booking (please tick appropriate) O single room deposit: BEF 2800 O double room (large bed) deposit: BEF 3500 O twin room (twin beds) deposit: BEF 3500 Arrival date: ..../..../1994 Departure date: ..../..../1994 O Additional payment if fees are paid through bank abroad check: BEF 500 Total BEF ____ Payment (please tick): O Bank transfer, stating name of participant, made payable to: Gnrale de Banque ch. de Waterloo 1341 A B-1180 Brussels - Belgium Acc.no: 210-0468648-93 of D facto (45 rue Masui, B-1210 Brussels) Bank transfers must be free of charges. EVENTUAL CHARGES MUST BE PAID BY THE PARTICIPANT. O Cheques/Postal Money Orders made payable to: D facto 45 rue Masui B-1210 Brussels - Belgium A SUPPLEMENTARY FEE OF BEF 500 MUST BE ADDED if the payment is made through bank abroad cheque or postal money order. Only registrations accompanied by a cheque, a postal money order or the proof of bank transfer will be considered. Registration and hotel booking form, together with payment, must be send as soon as possible, and in no case later than 8th April 1994, to the conference secretariat: &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& & D facto conference services - ESANN'94 & & 45, rue Masui - B-1210 Brussels - Belgium & &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& Support ******* ESANN'94 is organized with the support of: - Commission of the European Communities (DG XII, Human Capital and Mobility programme) - IEEE Region 8 - IFIP WG 10.6 on neural networks - Region of Brussels-Capital - EERIE (Ecole pour les Etudes et la Recherche en Informatique et Electronique - Nmes) - UCL (Universit Catholique de Louvain - Louvain-la-Neuve) - REGARDS (Research Group on Algorithmic, Related Devices and Systems - UCL) Steering committee ****************** Franois Blayo EERIE, Nmes (F) Marie Cottrell Univ. Paris I (F) Nicolas Franceschini CNRS Marseille (F) Jeanny Hrault INPG Grenoble (F) Michel Verleysen UCL Louvain-la-Neuve (B) Scientific committee ******************** Luis Almeida INESC - Lisboa (P) Jorge Barreto UCL Louvain-en-Woluwe (B) Herv Bourlard L. & H. Speech Products (B) Joan Cabestany Univ. Polit. de Catalunya (E) Dave Cliff University of Sussex (UK) Pierre Comon Thomson-Sintra Sophia (F) Holk Cruse Universitt Bielefeld (D) Dante Del Corso Politecnico di Torino (I) Marc Duranton Philips / LEP (F) Jean-Claude Fort Universit Nancy I (F) Karl Goser Universitt Dortmund (D) Martin Hasler EPFL Lausanne (CH) Philip Husbands University of Sussex (UK) Christian Jutten INPG Grenoble (F) Petr Lansky Acad. of Science of the Czech Rep. 
(CZ) Jean-Didier Legat UCL Louvain-la-Neuve (B) Jean Arcady Meyer Ecole Normale Suprieure - Paris (F) Erkki Oja Helsinky University of Technology (SF) Guy Orban KU Leuven (B) Gilles Pags Universit Paris I (F) Alberto Prieto Universitad de Granada (E) Pierre Puget LETI Grenoble (F) Ronan Reilly University College Dublin (IRE) Tamas Roska Hungarian Academy of Science (H) Jean-Pierre Rospars INRA Versailles (F) Jean-Pierre Royet Universit Lyon 1 (F) John Stonham Brunel University (UK) Lionel Tarassenko University of Oxford (UK) John Taylor King's College London (UK) Vincent Torre Universita di Genova (I) Claude Touzet EERIE Nmes (F) Joos Vandewalle KUL Leuven (B) Eric Vittoz CSEM Neuchtel (CH) Christian Wellekens Eurecom Sophia-Antipolis (F) _____________________________ Michel Verleysen D facto conference services 45 rue Masui 1210 Brussels Belgium tel: +32 2 245 43 63 fax: +32 2 245 46 94 E-mail: esann at dice.ucl.ac.be _____________________________ From lba at ilusion.inesc.pt Mon Feb 7 04:57:07 1994 From: lba at ilusion.inesc.pt (Luis B. Almeida) Date: Mon, 7 Feb 94 10:57:07 +0100 Subject: Encoding missing values Message-ID: <9402070957.AA18932@ilusion.inesc.pt> Bill Skaggs writes: There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. The same effect, of trying to guess the missing inputs, can also be obtained with a recurrent multilayer perceptron, trained with recurrent backprop. This is the reason why the pattern completion results that I described in my 1987 ICNN paper (ref. below) were rather good. L. B. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment", Proc IEEE First International Conference on Neural Networks, San Diego, Ca., 1987. Luis B. Almeida INESC Phone: +351-1-544607, +351-1-3100246 Apartado 10105 Fax: +351-1-525843 P-1017 Lisboa Codex Portugal lba at inesc.pt ----------------------------------------------------------------------------- *** Indonesians are killing innocent people in East Timor *** From jordan at psyche.mit.edu Mon Feb 7 20:47:09 1994 From: jordan at psyche.mit.edu (Michael Jordan) Date: Mon, 7 Feb 94 20:47:09 EST Subject: Encoding missing values Message-ID: > There is at least one kind of network that has no problem (in > principle) with missing inputs, namely a Boltzmann machine. > You just refrain from clamping the input node whose value is > missing, and treat it like an output node or hidden unit. > > This may seem to be irrelevant to anything other than Boltzmann > machines, but I think it could be argued that nothing very much > simpler is capable of dealing with the problem. The above is a nice observation that is worth emphasizing; I agree with all of it except the comment about being irrelevant to anything else. The Boltzmann machine is actually relevant to everything else. 
What the Boltzmann algorithm is doing with the missing value is essentially the same as what the EM algorithm for mixtures (that Ghahramani and Tresp referred to) is doing, and epitomizes the general case of an iterative "filling in" algorithm. The Boltzmann machine learning algorithm is a generalized EM (GEM) algorithm. During the E step the system computes the conditional correlation function for the nodes under the Boltzmann distribution, where the conditioning variables are the known data (the values of the clamped units) and the current values of the parameters (weights). This "fills in" the relevant statistic (the correlation function) and allows it to be used in the generalized M step (the contrastive Hebb rule). Moreover, despite the fancy terminology, these algorithms are nothing more (nor less) than maximum likelihood estimation, where the likelihood function is the likelihood of the parameters *given the data that was actually observed*. By "filling in" missing data, you're not adding new information to the problem; rather, you're allowing yourself to use all the information that is in those components of the data vector that aren't missing. (EM theory provides the justification for that statement). E.g., if only one component of an input vector is missing, it's obviously wasteful to neglect what the other components of the input vector are telling you. And, indeed, if you neglect the whole vector, you will not end up with maximum likelihood estimates for the weights (nor in general will you get maximum likelihood estimates if you fill in a value with the unconditional mean of that variable). "Filling in" is not the only way to compute ML estimates for missing data problems, but its virtue is that it allows the use of the same learning algorithms as would be used for complete data (without incurring any bias, if the filling in is done correctly). The only downside is that even if the complete-data algorithm is one-pass (which the Boltzmann algorithm and mixture fitting are not) the "filling-in" approach is generally iterative, because the parameter estimates depend on the filled-in values which in turn depend on the parameter estimates. On the other hand, there are so-called "monotone" patterns of missing data for which the filling-in approach is not necessarily iterative. This monotone case might be of interest, because it is relevant for problems involving feedforward networks in which the input vectors are complete but some of the outputs are missing. (Note that even if all the output values for a case are missing, a ML algorithm will not throw the case out; there is statistical structure in the input vector that the algorithm must not neglect). Mike (See Ghahramani's message for references; particularly the Little and Rubin book). From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de> A few days ago, I posted some thoughts about how to represent missing input values to a neural network and asked for comments and further ideas. This message is a summary of the replies I received (some in my personal mail some in connectionists). 
I show the most significant comments and ideas and append versions of the messages that are trimmed to the most important parts (in case somebody wants to keep this discussion in his/her archive) This was my original message: ------------------------------------------------------------------------ From prechelt at ira.uka.de Wed Feb 2 03:58:56 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 09:58:56 +0100 Subject: Encoding missing values Message-ID: I am currently thinking about the problem of how to encode data with attributes for which some of the values are missing in the data set for neural network training and use. An example of such data is the 'heart-disease' dataset from the UCI machine learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory "/pub/machine-learning-databases"). There are 920 records altogether with 14 attributes each. Only 299 of the records are complete, the others have one or several missing attribute values. 11% of all values are missing. I consider only networks that handle arbitrary numbers of real-valued inputs here (e.g. all backpropagation-suited network types etc). I do NOT consider missing output values. In this setting, I can think of several ways how to encode such missing values that might be reasonable and depend on the kind of attribute and how it was encoded in the first place: 1. Nominal attributes (that have n different possible values) 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one being 1 all others 0. This encoding is very general, but has the disadvantage of producing networks with very many connections. Missing values can either be represented as 'all zero' or by simply treating 'is missing' as just another possible input value, resulting in a "1-of-(n+1)" encoding. 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a binary representation of the numbers 0...n-1 (or 1...n). Missing values can either be represented as just another possible input value (probably all-bits-zero is best) or by adding an additional network input which is 1 for 'is missing' and 0 for 'is present'. The original inputs should probably be all zero in the 'is missing' case. 2. continuous attributes (or attributes treated as continuous) 2.1 encoded as a single network input, perhaps using some monotone transformation to force the values into a certain distribution. Missing values are either encoded as a kind of 'best guess' (e.g. the average of the non-missing values for this attribute) or by using an additional network input being 0 for 'missing' and 1 for 'present' (or vice versa) and setting the original attribute input either to 0 or to the 'best guess'. (The 'best guess' variant also applies to nominal attributes above) 3. binary attributes (truth values) 3.1 encoded by one input: 0=false 1=true or vice versa Treat like (2.1) 3.2 encoded by one input: -1=false 1=true or vice versa In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. 3.3 treat like nominal attribute with 2 possible values 4. ordinal attributes (having n different possible values, which are ordered) 4.1 treat either like continuous or like nominal attribute. If (1.2) is chosen, a Gray-Code should be used. Continuous representation is risky unless a 'sensible' quantification of the possible values is available. So far to my considerations. Now to my questions. a) Can you think of other encoding methods that seem reasonable ? Which ? 
b) Do you have experience with some of these methods that is worth sharing ? c) Have you compared any of the alternatives directly ? ------------------------------------------------------------------------ SUMMARY: For a), the following ideas were mentioned: 1. use statistical techniques to compute replacement values from the rest of the data set 2. use a Boltzman machine to do this for you 3. use an autoencoder feed forward network to do this for you 4. randomize on the missing values (correct in the Bayesian sense) For b), some experience was reported. I don't know how to summarize that nicely, so I just don't summarize at all. For c), no explicit quantitative results were given directly. Some replies suggest that data is not always missing randomly. The biases are often known and should be taken into account (e.g. medical tests are not carried out (resulting in missing data) for moreless healthy persons more often than for ill persons). Many replies contained references to published work on this area, from NN, machine learning, and mathematical statistics. To ease searching for these references in the replies below, I have marked them with the string ##REF## (if you have a 'grep' program that extracts whole paragraphs, you can get them all out with one command). Thanks to all who answered. These are the trimmed versions of the replies: ------------------------------------------------------------------------ From: tgd at research.CS.ORST.EDU (Tom Dietterich) [...for nominal attributes:] An alternative here is to encode them as bit-strings in a error-correcting code, so that the hamming distance between any two bit strings is constant. This would probably be better than a dense binary encoding. The cost in additional inputs is small. I haven't tried this though. My guess is that distributed representations at the input are a bad idea. One must always determine WHY the value is missing. In the heart disease data, I believe the values were not measured because other features were believed to be sufficient in each case. In such cases, the network should learn to down-weight the importance of the feature (which can be accomplished by randomizing it---see below). In other cases, it may be more appropriate to treat a missing value as a separate value for the feature, e.g., in survey research, where a subject chooses not to answer a question. [...for continuous attributes:] Ross Quinlan suggests encoding missing values as the mean observed output value when the value is missing. He has tried this in his regression tree work. Another obvious approach is to randomize the missing values--on each presentation of the training example, choose a different, random, value for each missing input feature. This is the "right thing to do" in the bayesian sense. [...for binary attributes:] I'm skeptical of the -1,0,1 encoding, but I think there is more research to be done here. [...for ordinal attributes:] I would treat them as continuous. ------------------------------------------------------------------------ From: shavlik at cs.wisc.edu (Jude W. Shavlik) We looked at some of the methods you talked about in the following article in the journal Machine Learning. ##REF## %T Symbolic and Neural Network Learning Algorithms: An Experimental Comparison %A J. W. Shavlik %A R. J. Mooney %A G. G. 
Towell %J Machine Learning %V 6 %N 2 %P 111-143 %D 1991 ------------------------------------------------------------------------ From: hertz at nordita.dk (John Hertz) It seems to me that the most natural way to handle missing data is to leave them out. You can do this if you work with a recurrent network (fx Boltzmann machine) where the inputs are fed in by clamping the input units to the given input values and the rest of the net relaxes to a fixed point, after which the output is read off the output units. If some of the input values are missing, the corresponding input units are just left unclamped, free to relax to values most consistent with the known inputs. I have meant for a long time to try this on some medical prognosis data I was working on, but I never got around to it, so I would be happy to hear how it works if you try it. ------------------------------------------------------------------------ From: jozo at sequoia.WPI.EDU (Jozo Dujmovic) In the case of clustering benchmark programs I frequently have the the problem of estimation of missing data. A relatively simple SW that implements a heuristic algorithm generates estimates having the average error of 8%. NN will somehow "implicitly estimate" the missing data. The two approaches might even be in some sense equivalent (?). Jozo [ I suspect that they are not: When you generate values for the missing items and put them in the training set, the network loses the information that this data is only estimated. Since estimations are not as reliable as true input data, the network will weigh inputs that have lots of generated values as less important. If it gets the 'is missing' information explicitly, it can discriminate true values from estimations instead. ] ------------------------------------------------------------------------ From: guy at cs.uq.oz.au A final year student of mine worked on the problem of dealing with missing inputs, without much success. However, the student as not very good, so take the following opinions with a pinch of salt. We (very tentatively) came to the conclusion that if the inputs were redundant, the problem was easy; if the missing input contained vital information, the problem was pretty much impossible. We used the heart disease data. I don't recommend it for the missing inputs problem. All of the inputs are very good indicators of the correct result, so missing inputs were not important. Apparently there is a large literature in statistics on dealing with missing inputs. Anthony Adams (University of Tasmania) has published a technical report on this. His email address is "A.Adams at cs.utas.edu.au". ##REF## @techreport{kn:Vamplew-91, author = "P. Vamplew and A. Adams", address = {Hobart, Tasmania, Australia}, institution = {Department of Computer Science, University of Tasmania}, number = {R1-4}, title = {Real World Problems in Backpropagation: Missing Values and Generalisability}, year = {1991} } ------------------------------------------------------------------------ From: Mike Southcott ##REF## I wrote a paper for the Australian conference on neural networks in 1993. ``Classification of Incomplete Data using neural networks'' Southcott, Bogner. You may find it interesting. You may not be able to get the proceedings for this conference, but I am in the process of digging up a postscript copy for someone in the States, so when I do that, I will send you a copy. 
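[ To make the encodings in the original posting concrete, here is a minimal sketch of variant 2.1 (mean "best guess" plus an added 0/1 "is present" indicator input) and of the "1-of-(n+1)" form of variant 1.1. It assumes NumPy; the function and array names are made up for illustration, and missing entries are marked with NaN / None. ]

import numpy as np

def encode_continuous_with_indicator(col):
    # Variant 2.1: replace each missing entry (NaN) by the mean of the
    # observed values and append a 0/1 'is present' indicator input.
    col = np.asarray(col, dtype=float)
    present = ~np.isnan(col)
    best_guess = col[present].mean() if present.any() else 0.0
    filled = np.where(present, col, best_guess)
    return np.stack([filled, present.astype(float)], axis=1)

def encode_nominal_1_of_n_plus_1(col, values):
    # Variant 1.1: one input per possible value, with 'missing' (None)
    # treated as an extra, (n+1)-th value.
    symbols = list(values) + [None]
    out = np.zeros((len(col), len(symbols)))
    for i, v in enumerate(col):
        out[i, symbols.index(v if v in values else None)] = 1.0
    return out

# Usage: a continuous column and a nominal column, each with one missing value
print(encode_continuous_with_indicator([1.2, np.nan, 0.7]))
print(encode_nominal_1_of_n_plus_1(['red', None, 'blue'], ['red', 'green', 'blue']))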
------------------------------------------------------------------------ From: Eric Saund I have done some work on unsupervised learning of mulitple cause clusters in binary data, for which an appropriate encoding scheme is -1 = FALSE, 1 = TRUE, and 0 = NO DATA. This has worked well for me, but my paradigm is not your standard feedforward network and uses a different activiation function from the standard weighted sum followed by sigmoid squashing. I presented the paper on this work at NIPS: ##REF## Saund, Eric; 1994; "Unsupervised Learning of Mixtures of Multiple Causes in Binary Data," in Advances in Neural Information Processing Systems -6-, Cowan, J., Tesauro, G, and Alspector, J., eds. Morgan Kaufmann, San Francisco. ------------------------------------------------------------------------ From: Thierry.Denoeux at hds.univ-compiegne.fr In a recent mailing, Lutz Prechelt mentioned the interesting problem of how to encode attributes with missing values as inputs to a neural network. I have recently been faced to that problem while applying neural nets to rainfall prediction using weather radar images. The problem was to classify pairs of "echoes" -- defined as groups of connected pixels with reflectivity above some threshold -- taken from successive images as corresponding to the same rain cell or not. Each pair of echoes was discribed by a list of attributes. Some of these attributes, refering to the past of a sequence, were not defined for some instances. To encode these attributes with potentially missing values, we applied two different methods actually suggested by Lutz: - the replacement of the missing value by a "best-guess" value - the addition of a binary input indicating whether the corresponding attribute was present or absent. Significantly better results were obtained by the second method. This work was presented at ICANN'93 last september: ##REF## X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, p. 962-967. ------------------------------------------------------------------------ From: "N. Karunanithi" [...for nominal attributes:] Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. [...for continuous attributes:] This representation requires GUESS. A nominal transformation may not be a proper representation in some cases. Assume that the output values range over a large numerical interval. For example, from 0.0 to 10,000.0. If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then on theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had lot of problem with this approach. Maybe a log tranformation before scaling may not be a bad choice. If you use a closed scaling you may have problem whenever a future value exceeds the maximum value of the numerical intervel. For example, assume that the attribute is time, say in miliseconds. 
Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. [...for ordinal attributes:] I have compared Binary Encoding (1.2), Gray-Coded representation and straightforward scaling. Closed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find significant improvement in prediction accuracy. ###REF### N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp 563-574. From yong at cns.brown.edu Tue Feb 8 10:40:35 1994 From: yong at cns.brown.edu (Yong Liu) Date: Tue, 8 Feb 94 10:40:35 EST Subject: some questions on training neural nets Message-ID: <9402081540.AA15383@cns.brown.edu> On the discussion of the cross-validation method, Dr. Plutowski referred to his paper by writing > It proves that two versions of cross-validation > (one being the "hold-out set" version discussed above, and the other > being the "delete-1" version) provide unbiased and strongly consistent > estimates of IMSE This is statistical jargon meaning that, on > average, the estimate is accurate, (i.e., the expectation > of the estimate for given training set size equals the IMSE + a noise term) > and asymtotically precise (in that as the training set and test set > size grow large, the estimate converges to the IMSE within the > constant factor due to noise, with probability 1.) Comment: This comment is on the above result about the "delete-1" version of cross-validation. The result must have assumed that the training data set has no outliers (corruption in the Y component of a data point). Deleting a data point that is an outlier will cause a great change in the estimated neural net weights, and the squared prediction error on this outlier will be large. This will then eventually cause a biased estimate of the IMSE. Even if a robust algorithm is used to estimate the neural net weights in order to reduce the sensitivity to outliers in the estimation, the squared prediction error on the outlier will still be large. A possible correction would be to weight this outlier less in the cross-validation, or in other words, to pay less attention to this outlier when deleting it. A weighted cross-validation like this has been discussed briefly in Liu (1994). The weighting of a data point is calculated through an iteratively reweighted algorithm for robust regression. One interesting thing about this version of cross-validation is its asymptotic equivalence to Moody's criterion (Moody, 1992; Liu, 1993). References: Liu, Y. (1993) Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method. In C.L. Giles, S.J. Hanson, and J.D. Cowan, editors, {\em Advances in Neural Information Processing Systems}, volume 5, pages 599-606. Morgan Kaufmann, San Mateo, CA. Liu, Y. (1994) Robust Parameter Estimation and Model Selection for Neural Network Regression. To appear in J.D. Cowan, G. Tesauro and J. Alspector, editors, {\em Advances in Neural Information Processing Systems}, volume 6. Morgan Kaufmann, San Mateo, CA. Moody, J.E. (1992). The effective number of parameters: an analysis of generalization and regularization in nonlinear learning systems. In Moody, J.E., Hanson, S.J., and Lippmann, R.P., editors, {\em Advances in Neural Information Processing Systems 4}. Morgan Kaufmann.
---------------------------- Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 From pluto at cs.ucsd.edu Wed Feb 9 02:39:00 1994 From: pluto at cs.ucsd.edu (Mark Plutowski) Date: Tue, 08 Feb 1994 23:39:00 -0800 Subject: some questions on training neural nets Message-ID: <9402090739.AA07477@odin.ucsd.edu> ------- Previous Message: --------- From yong at cns.brown.edu Tue Feb 8 10:40:35 1994 From: yong at cns.brown.edu (Yong Liu) Date: Tue, 8 Feb 94 10:40:35 EST Subject: some questions on training neural nets Message-ID: <9402081540.AA15383@cns.brown.edu> On the discussion of cross-validation method, Dr. Plutowski referred to his paper by writing > It proves that two versions of cross-validation > (one being the "hold-out set" version discussed above, and the other > being the "delete-1" version) provide unbiased and strongly consistent > estimates of IMSE This is statistical jargon meaning that, on > average, the estimate is accurate, (i.e., the expectation > of the estimate for given training set size equals the IMSE + a noise term) > and asymtotically precise (in that as the training set and test set > size grow large, the estimate converges to the IMSE within the > constant factor due to noise, with probability 1.) Comment: This comment is on the above result about "delete-1" version cross-validation. The result must have assumed that the training data set have no outliers (corruption in Y component of a data point). Since deleting a data point that is outlier will cause a great change in the estimated neural net weights, and also the squared prediction error on this outliers will be large. This will then eventually cause a biased estimation of the IMSE. - ---------------------------- Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 ------- End of Previous Message ------ No, actually it turns out that delete-1 cross-validation delivers unbiased estimates of IMSE under fairly reasonable conditions. (More precisely, it delivers estimates of IMSE_N + \sigma^2, for training set size N and noise variance \sigma^2.) Roughly, the noise must have variance the same everywhere in input space, (or, "homoscedasticity" as the statisticians would say,) with examples selected independently from the same, fixed environment (i.e., "i.i.d.") the expectation of the squared-target must be finite (this just ensures that conditional expectations of the target and the noise exist everywhere) plus some conditions on the network to make it behave nicely. For these same conditions, the estimate is additionally "conservative," in that it does not, (asymptotically, anyway, as N grows large) underestimate the expected squared error of the network for optimal weights. (These results and the prerequisite assumptions are of course stated more precisely in the paper.) However, we did require an additional assumption to obtain the "strong" convergence result, in that the optimal weights must be unique. This is to ensure that the weights for each of the deleted subsets of N-1 examples converge to the weights obtained by training on all N examples. As an aside: This latter condition may seem strong, but it seems to be (intuitively) applicable to a particular variant of delete-1 cross-validation commonly employed to make its computation more feasible - (in which case the global optima are in a sense "locally" unique under the right conditions.) 
In this variant, the network is trained on the entire training set to obtain the "base" network. These weights are then "fine-tuned" upon each of the deleted subsets of size N-1 to obtain the N cross-validated weight vectors. This tends to distribute the fine-tuned weights within a local region that seens to get tighter as the training set size increases. It tends to work well in practice, under the right conditions. (Essentially, you need to ensure that the ratio of examples to weights is sufficiently large, and it is easy to detect when this is not the case.) A bit off the original subject, I suppose, but I hope these results help clarify what cross-validation is doing, at least in that wonderfully ideal place called "asymptopia." It (apparently) turns out that these conditions suffice to ensure that the detrimental effect of a malicious outlier becomes negligible as the size of the training set grows large, at least with respect to the estimation of this particular kind of generalization by cross-validation. = Mark Plutowski UCSD: INC and CS&E P.S. Thank you for the honorable salutation! Actually, I am (still) just a student here. 8-) 8-| From lange at ira.uka.de Wed Feb 9 14:19:22 1994 From: lange at ira.uka.de (lange@ira.uka.de) Date: Wed, 9 Feb 94 14:19:22 MET Subject: Methods for improving generalization (was Re: some questions on ...) Message-ID: <"iraun1.ira.337:09.01.94.13.22.32"@ira.uka.de> Dear Mr. Hicks, in your mail to Mr. Grossman you mentioned the "Soft Weight-Sharing" algorithm and stated, that this algorithm would do some adaption to the data. I don't think, that this is right: Soft Weight-Sharing is just a bit more complicated than Weight-Decay or other things (so some improvements have been made). But Soft Weight-Sharing does not really adapt to the data, because you have to tune the same parameters as in normal Weight-Decay: the parameters, that are used to handle the strength of the penalty-term. The article of Nowlan and Hinton "Simplifying Neural Networks by Soft Weight- Sharing" does not mention a method to do this automatically - so no "real" adaption to the data is made. Maybe the methods of MacKay ("Bayesian Interpolation", Neural Comp. 4 (1992), page 415-447) could be used to get a fully-automatic adaption. A combination of this method with Weight-Decay or Soft Weight-Sharing would perhaps be data-adaptive; but Soft Weight-Sharing alone has still a parameter, that is not adapted by the data. Yours, Frank Lange From sec at ai.univie.ac.at Wed Feb 9 08:53:36 1994 From: sec at ai.univie.ac.at (sec@ai.univie.ac.at) Date: Wed, 9 Feb 1994 14:53:36 +0100 Subject: No subject Message-ID: <199402091353.AA14535@prater.ai.univie.ac.at> * * * * * TWELFTH EUROPEAN MEETING * * ON * * CYBERNETICS AND SYSTEMS RESEARCH * * (EMCSR 1994) * April 5 - 8, 1994 UNIVERSITY OF VIENNA organized by the Austrian Society for Cybernetic Studies in cooperation with Dept.of Medical Cybernetics and Artificial Intelligence, Univ.of Vienna and International Federation for Systems Research Plenary lectures: ***************** MARGARET BODEN (United Kingdom): "Artificial Intelligence and Creativity" STEPHEN GROSSBERG (USA): "Neural Networks for Learning, Recognition, and Prediction" STUART A. 
UMPLEBY (USA): "Twenty Years of Second Order Cybernetics" 241 papers will be presented and discussed in the following symposia: ********************************************************************* GENERAL SYSTEMS METHODOLOGY G.J.Klir (USA) ADVANCES IN MATHEMATICAL SYSTEMS THEORY J.Miro (Spain), M.Peschel (Germany), F.Pichler (Austria) FUZZY SYSTEMS, APPROXIMATE REASONING AND KNOWLEDGE-BASED SYSTEMS C.Carlsson (Finland), K.-P.Adlassnig (Austria), E.P.Klement (Austria) DESIGNING AND SYSTEMS, AND THEIR EDUCATION B.Banathy (USA), W.Gasparski (Poland), G.Goldschmidt (Israel) HUMANITY, ARCHITECTURE AND CONCEPTUALIZATION G.Pask (United Kingdom), G.de Zeeuw (Netherlands) BIOCYBERNETICS AND MATHEMATICAL BIOLOGY L.M.Ricciardi (Italy) SYSTEMS AND ECOLOGY F.J.Radermacher (Germany), K.Fedra (Austria) CYBERNETICS AND INFORMATICS IN MEDICINE G.Gell (Austria), G.Porenta (Austria) CYBERNETICS OF SOCIO-ECONOMIC SYSTEMS K.Balkus (USA), O.Ladanyi (Austria) SYSTEMS, MANAGEMENT AND ORGANIZATION G.Broekstra (Netherlands), R.Hough (USA) CYBERNETICS OF COUNTRY DEVELOPMENT P.Ballonoff (USA), T.Koizumi (USA), S.A.Umpleby (USA) COMMUNICATION AND COMPUTERS A M.Tjoa (Austria) INTELLIGENT AUTONOMOUS SYSTEMS J.Rozenblit (USA), H.Praehofer (Austria) CYBERNETIC PRINCIPLES OF KNOWLEDGE DEVELOPMENT F.Heylighen (Belgium), S.A.Umpleby (USA) CYBERNETICS, SYSTEMS AND PSYCHOTHERAPY M.Okuyama (Japan), H.Koizumi (USA) ARTIFICIAL NEURAL NETWORKS AND ADAPTIVE SYSTEMS S.Grossberg (USA), G.Dorffner (Austria) ARTIFICIAL INTELLIGENCE AND COGNITIVE SCIENCE V.Marik (Czech Republic), R.Born (Austria) TUTORIALS: ********** A SYNTACTIC APPROACH TO HEURISTIC NETWORKS: LINGUISTIC GEOMETRY Prof.Boris Stilman, University of Colorado, Denver, USA FUZZY SETS AND IMPRECISE BUT RELEVANT DECISIONS Prof.Christer Carlsson, Abo Akademi University, Abo, Finland CONTEXTUAL SYSTEMS: A NEW TECHNOLOGY FOR KNOWLEDGE BASED SYSTEM DEVELOPMENT Dr.Irina V. Ezhkova, Russian Academy of Science, Moscow TWENTY YEARS OF SECOND ORDER CYBERNETICS Prof.Stuart A. Umpleby, George Washington University, Washington, D.C., USA PROCEEDINGS: ************ Trappl R.(ed.): CYBERNETICS AND SYSTEMS '94, 2 vols, 1911 pages, World Scientific Publishing, Singapore. FOR FURTHER INFORMATION PLEASE CONTACT: *************************************** EMCSR'94 Secretariat c/o Austrian Society for Cybernetic Studies Schottengasse 3 A-1010 Vienna Austria Phone: +43-1-53532810 Fax: +43-1-5320652 E-mail: sec at ai.univie.ac.at From gert at jhunix.hcf.jhu.edu Wed Feb 9 09:32:57 1994 From: gert at jhunix.hcf.jhu.edu (Gert Cauwenberghs) Date: Wed, 9 Feb 1994 09:32:57 -0500 Subject: "A Learning Analog Neural Network Chip..." Message-ID: <94Feb9.093258edt.70280-3@jhunix.hcf.jhu.edu> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/cauwenberghs.nips93.ps.Z A preprint of the paper: A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics, by Gert Cauwenberghs, 8 pages including figures, to appear in Advances in Neural Information Processing Systems, vol. 6, 1994, is available on the neuroprose repository, in compressed PostScript format: anonymous binary ftp to archive.cis.ohio-state.edu cd pub/neuroprose get cauwenberghs.nips93.ps.Z uncompress and print. The abstract follows below. --- Gert Cauwenberghs (gert at jhunix.hcf.jhu.edu) We present experimental results on supervised learning of dynamical features in an analog VLSI neural network chip. 
The recurrent network, containing six continuous-time analog neurons and 42 free parameters (connection strengths and thresholds), is trained to generate time-varying outputs approximating given periodic signals presented to the network. The chip implements a stochastic perturbative algorithm, which observes the error gradient along random directions in the parameter space for error-descent learning. In addition to the integrated learning functions and the generation of pseudo-random perturbations, the chip provides for teacher forcing and long-term storage of the volatile parameters. The network learns a 1 kHz circular trajectory in 100 sec. The chip occupies 2 X 2 mm in a 2 um CMOS process, and dissipates 1.2 mW. From yong at cns.brown.edu Wed Feb 9 14:42:14 1994 From: yong at cns.brown.edu (Yong Liu) Date: Wed, 9 Feb 94 14:42:14 EST Subject: some questions on training neural nets Message-ID: <9402091942.AA19342@cns.brown.edu> Plutowski (Tue, 08 Feb 1994) wrote >No, actually it turns out that delete-1 cross-validation delivers >unbiased estimates of IMSE under fairly reasonable conditions. >(More precisely, it delivers estimates of IMSE_N + \sigma^2, >for training set size N and noise variance \sigma^2.) >Roughly, the noise must have variance the same everywhere in input space, >(or, "homoscedasticity" as the statisticians would say,) with examples >selected independently from the same, fixed environment (i.e., "i.i.d.") >the expectation of the squared-target must be finite (this just ensures >that conditional expectations of the target and the noise exist everywhere) >plus some conditions on the network to make it behave nicely. >For these same conditions, the estimate is additionally "conservative," >in that it does not, (asymptotically, anyway, as N grows large) >underestimate the expected squared error of the network for optimal weights. Outliers are the data points that come in an "unexpected" way, both in the training data and in the future. For example, the data is collected so that a proportional of them are typos. So as the size of the data gets large, the number of outliers in them also gets large. Plutowski's assumption, as I understand it, is to assume the ratio of the number outliers over the size of data size is very small. One way to look at data set containing outliers is to assume noises are inhomoscedastic. Outlier data points have their noises with large variance, and good data points have their noises with small variance (Liu 1994). This is different from Plutowski's "homoscedasticity" assumption. Since we have no intention of predicting the value of outliers, robust estimation in both the parameters and the generalization error requires the "removal" of the outliers. These discussion, I hope, could convey the idea that when using cross-validation for the estimation of generalization error, some cautions should be taken as regards to the influence of Bad data in the training data set. ------------ Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 From pluto at cs.ucsd.edu Wed Feb 9 17:52:55 1994 From: pluto at cs.ucsd.edu (Mark Plutowski) Date: Wed, 9 Feb 94 14:52:55 -0800 Subject: Outliers (Was: "Some questions on training..") Message-ID: <9402092252.AA14771@beowulf> ------- previous message ------- Dr. Liu writes: Outliers are the data points that come in an "unexpected" way, both in the training data and in the future. For example, the data is collected so that a proportional of them are typos. 
So as the size of the data gets large, the number of outliers in them also gets large. Plutowski's assumption, as I understand it, is to assume the ratio of the number outliers over the size of data size is very small. One way to look at data set containing outliers is to assume noises are inhomoscedastic. Outlier data points have their noises with large variance, and good data points have their noises with small variance (Liu 1994). This is different from Plutowski's "homoscedasticity" assumption. Since we have no intention of predicting the value of outliers, robust estimation in both the parameters and the generalization error requires the "removal" of the outliers. These discussion, I hope, could convey the idea that when using cross-validation for the estimation of generalization error, some cautions should be taken as regards to the influence of Bad data in the training data set. ------------ Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 ------- end previous message ------- Dear Dr Liu, Yes, this points out the importance of examining the assumptions carefully to ensure that they apply to your particular learning task. As another example of where these results do not apply, note that the assumption of mean zero noise can be easily violated in discrimination tasks (often referred to as "classification" tasks) where the noise involves random misclassification of the target. It also points out an appealling definition of "outlier", My interpretation of this is the following: When the noise variance on the target can depends upon the input (in statistical jargon, referred to as "heteroscedasticity of the conditional variance of Y_i given X_i") there is the possibility that a plot of the conditional target variance over the input space could display discontinuous jumps, corresponding to where it is more likely to encounter targets that are much more "noisy" - as compared to targets for neighboring inputs. Is this accurate? I look forward to reading (Liu 94). Can you (or anyone else) point me to other references utilizing a similar definition of "outlier?" (IMHO) "outlier" is quite a value-laden term that I tend to avoid since I feel it has multiple and often ambiguous interpretations/definitions. I am currently doing work on detection of what I call "offliers" since I have a precise definition of what this means to me, and since I hesitate to use the term "outliers" for the reason stated above. = Mark PS: I would appreciate further opinions/references/examples of what "outlier" means (either in practice or in theory) which I will summarize and post to the mailing list. From mlsouth at cssip.levels.unisa.edu.au Wed Feb 9 21:00:23 1994 From: mlsouth at cssip.levels.unisa.edu.au (mlsouth@cssip.levels.unisa.edu.au) Date: Thu, 10 Feb 1994 12:30:23 +1030 (CST) Subject: Missing values Message-ID: <8610.9402100200@hotham.levels.unisa.edu.au> Connectionists, I did a short study on methods for classification of incomplete data 18 months ago. I compared the statistical methods of discrimination and classification and the EM algorithm to some neural methods. These methods could only be applied to an artificial data set due to the inavailability of a set of real data with missing values. Despite this, I believe that the conclusions are still sound. A copy of the paper ``Classification of incomplete data using neural networks'', M.L. Southcott, R.E. 
Bogner which was presented to the Fourth Australian Conference on Neural Networks (ACNN '93) is available via anonymous ftp from ftp.cssip.edu.au. The file is pub/users/michael/southcott.missing.ps Michael Southcott mlsouth at cssip.edu.au Centre for Sensor Signal and Information Processing SPRI Building, The Levels, Pooraka 5095, South Australia. From hicks at cs.titech.ac.jp Fri Feb 11 00:02:54 1994 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Fri, 11 Feb 94 00:02:54 JST Subject: Methods for improving generalization (was Re: some questions on ...) In-Reply-To: lange@ira.uka.de's message of Wed, 9 Feb 94 14:19:22 MET <"iraun1.ira.337:09.01.94.13.22.32"@ira.uka.de> Message-ID: <9402101503.AA16767@maruko.cs.titech.ac.jp> Dear Mr.
Frank Lange (lange at ira.uka.de), On Wed, 9 Feb 94 14:19:22 MET you wrote: >But Soft Weight-Sharing does not really adapt to the data, >because you have to tune the same parameters as in normal Weight-Decay: >the parameters, that are used to handle the strength of the penalty-term. >The article of Nowlan and Hinton "Simplifying Neural Networks by Soft Weight- >Sharing" does not mention a method to do this automatically - so no "real" >adaption to the data is made. I say "every model is adaptive, and no model is adaptive, but some are more adaptive than others". Every model has parameters which are adjusted during learning. Penalty functions, including soft weight sharing, affect the prior distribution of weights and so can be thought of as just providing different models. All of these models adapt to data. On the other hand, every model >must< make some assumptions about which it is adamant. If it didn't, there wouldn't be a model. These assumptions are non-adaptive to the data. (note1) You further wrote: >Maybe the methods of MacKay ("Bayesian Interpolation", Neural Comp. 4 (1992), >page 415-447) could be used to get a fully-automatic adaption. A combination >of this method with Weight-Decay or Soft Weight-Sharing would perhaps be >data-adaptive; but Soft Weight-Sharing alone has still a parameter, that is >not adapted by the data. The article was very enlightening. Figure 1 on page 417 shows the 2 main steps of modeling which involve Bayesian methods: (1) Fit each model to the data, (2) Assign preferences to the alternative models. The first step is the one we are all familiar with. The second one is the topic of the paper and consists of assigning objective preferences to each model: the probability of the data given the model is called the evidence for the model. Re your idea of "fully-automatic adaption". I will first review the parameters related to soft weight sharing: (a) the number of weight groups (b) the mean and variance of each group of weights. The weight penalty weighting is not arbitrary but determined by the variance of the squared error (which changes with time) divided by a factor (determined by cross-validation) to adjust to the number of free parameters. I think you mean by "fully-automatic adaption" that parameters (a) and (b) should be constant during stage (1), and after running the simulation a large number of times with different values for (a) and (b) we should select the best ones with stage (2) methods: i.e. weighing the evidence for each model. This would take a long time BUT we might get a different answer from the one obtained by choosing (a) and (b) in stage 1. However, as to which way is best called "automatic", I would personally favor the present stage (1) way, because it automatically (although maybe imperfectly) estimates the best parameters (a) and (b) implicitly during learning, leaving less labor for the later and harder stage (2). I realize I am getting into semantics here. (note1) MacKay does give a special example of a 100% data-adaptive model: the Sure Thing hypothesis, which is that the data set will be what it is (predicted of course before seeing the data, selected afterwards), but this hypothesis has very small a priori probability. Too bad for our universe. The other example is of course stock tips, (predicted of course before seeing the money, collected afterwards), but look what happened to Michael Milken! Respectfully Yours, Craig Hicks Craig Hicks hicks at cs.titech.ac.jp | Kore ya kono Yuku mo kaeru mo Ogawa Laboratory, Dept. of Computer Science | Wakarete wa Shiru mo shiranu mo Tokyo Institute of Technology, Tokyo, Japan | Ausaka no seki lab:03-3726-1111 ext.2190 home:03-3785-1974 | (from hyaku-nin-issyu) fax: +81(3)3729-0685 (from abroad) 03-3729-0685 (from Japan) From terry at salk.edu Thu Feb 10 12:45:15 1994 From: terry at salk.edu (Terry Sejnowski) Date: Thu, 10 Feb 94 09:45:15 PST Subject: robust statistics Message-ID: <9402101745.AA28545@salk.edu> One man's outlier is another man's data point. Another way to handle outliers is not to remove them but to model them explicitly. Geoff Hinton has pointed out that character recognition can be made more robust by including models for background noise such as postmarks. Steve Nowlan and I recently used mixtures of expert networks to separate multiple interpenetrating flow fields -- the transparency problem for visual motion. The gating network was used to select regions of the visual field that contained reliable estimates of local velocity for which there was coherent global support. There is evidence for such selection neurons in area MT of primate visual cortex, a region of cortex that specializes in the detection of coherent motion. Terry ----- From yong at cns.brown.edu Thu Feb 10 13:39:19 1994 From: yong at cns.brown.edu (Yong Liu) Date: Thu, 10 Feb 94 13:39:19 EST Subject: outlier, robust statistics Message-ID: <9402101839.AA21430@cns.brown.edu> Plutowski wrote (Wed, 9 Feb 94) >It also points out an appealling definition of "outlier", >My interpretation of this is the following: >When the noise variance on the target can depends upon the input >(in statistical jargon, referred to as "heteroscedasticity of >the conditional variance of Y_i given X_i") >there is the possibility that a plot of the conditional >target variance over the input space could display >discontinuous jumps, corresponding to where it is more likely >to encounter targets that are much more "noisy" - as compared >to targets for neighboring inputs. Is this accurate? Yes. It is the heuristic behind modelling the error as a mixture of normal distributions in (Liu 94). In simple words, the statistical formulation regards the error for each data point as coming from a normal distribution with a different variance, and regards the variances as missing observations. By using a prior on the variance and the EM algorithm, one can estimate the variance. It turns out that during the estimation, the EM algorithm looks for the data points that have larger variances and down-weights those data points. This way of modelling is in agreement with Dr. Sejnowski's view >One man's outlier is another man's data point. Another >way to handle outliers is not to remove them but to model them >explicitly. ... Plutowski also wrote (Wed, 9 Feb 94) >I look forward to reading (Liu 94). Can you (or anyone else) >point me to other references utilizing a similar definition >of "outlier?" (IMHO) "outlier" is quite a value-laden term >that I tend to avoid since I feel it has multiple and >often ambiguous interpretations/definitions. Box and Tiao (1968) hold similar views. Outliers are generated from a distribution that is a perturbation to the underlying distribution, for example, a small amount of noise with an ever-changing distribution in the background. Huber's (1981) book is often referred to as an excellent reference. Anyway, no matter what an outlier is, what one really wants is to use a model/method that is not sensitive to them and predicts the relevant information. References Box, G.E.P.
and Tiao, G.C.(1968) A Bayesian approach to some outlier problem. Biometrika, 55, 119-129 Huber (1981) Robust Statistics. John Wiley & Sons, Inc.. BTW. I will be a Phd only three month later. ------- Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 From zl at venezia.rockefeller.edu Thu Feb 10 20:54:42 1994 From: zl at venezia.rockefeller.edu (Zhaoping Li) Date: Thu, 10 Feb 94 20:54:42 -0500 Subject: Paper announcement on neuroprose Message-ID: <9402110154.AA00738@venezia.rockefeller.edu> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/li-zhaoping.stereocoding.ps.Z The file li-zhaoping.stereocoding.ps.Z is now available for copying from the Neuroprose archive. This is a 16 page paper plus 6 figures, to be published in Network: Computation in Neural Systems. --------------------------------------------------------------------------- Efficient Stereo Coding in the Multiscale Representation Zhaoping Li and Joseph J. Atick The Rockefeller University 1230 York Avenue New York, NY 10021, USA Abstract: Stereo images are highly redundant; the left and right frames of typical scenes are very similar. We explore the consequences of the hypothesis that cortical cells --- in addition to their multiscale coding strategies (Li and Atick 1994a) --- are concerned with reducing binocular redundancy due to correlations between the two eyes. We derive the most efficient coding strategies that achieve binocular decorrelation. It is shown that multiscale coding combined with a binocular decorrelation strategy leads to a rich diversity of cell types. In particular, the theory predicts monocular/binocular cells as well as a family of disparity selective cells, among which one can identify cells that are tuned-zero-excitatory, near, far, and tuned inhibitory. The theory also predicts correlations between ocular dominance, cell size, orientation, and disparity selectivities. Consequences on cortical ocular dominance column formation from abnormal developmental conditions such as strabismus and monocular eye closure are also predicted. These findings are compared with physiological measurements. Please address correspondence to Zhaoping Li ---------------------------------------------------------------------------- To obtain a copy: ftp archive.cis.ohio-state.edu login: anonymous password: cd pub/neuroprose binary get li-zhaoping.stereocoding.ps.Z quit Then at your system: uncompress li-zhaoping.stereocoding.ps lpr -P li-zhaoping.stereocoding.ps Zhaoping Li Box 272 Rockefeller University 1230 York Ave New York, NY 10021 phone: 212-327-7423 fax: 212-327-7422 zl at rockvax.rockefeller.edu From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de> [ My attempt to forward Lutz Prechelt's summary of the missing values discussion was twice foiled by technical problems. Note to future posters: do not attempt to transmit lines containing nothing but a period and a carriage return. It confuses our FTP software. Here is my final attempt to transmit the entire summary. If this fails, Lutz will just have to dump it to neuroprose and let people access it via FTP. Sorry about the repeated postings. 
-- Dave Touretzky, CONNECTIONISTS moderator ] ================================================================ From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de> A few days ago, I posted some thoughts about how to represent missing input values to a neural network and asked for comments and further ideas. This message is a summary of the replies I received (some in my personal mail some in connectionists). I show the most significant comments and ideas and append versions of the messages that are trimmed to the most important parts (in case somebody wants to keep this discussion in his/her archive) This was my original message: ------------------------------------------------------------------------ From prechelt at ira.uka.de Wed Feb 2 03:58:56 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 09:58:56 +0100 Subject: Encoding missing values Message-ID: I am currently thinking about the problem of how to encode data with attributes for which some of the values are missing in the data set for neural network training and use. An example of such data is the 'heart-disease' dataset from the UCI machine learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory "/pub/machine-learning-databases"). There are 920 records altogether with 14 attributes each. Only 299 of the records are complete, the others have one or several missing attribute values. 11% of all values are missing. I consider only networks that handle arbitrary numbers of real-valued inputs here (e.g. all backpropagation-suited network types etc). I do NOT consider missing output values. In this setting, I can think of several ways how to encode such missing values that might be reasonable and depend on the kind of attribute and how it was encoded in the first place: 1. Nominal attributes (that have n different possible values) 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one being 1 all others 0. This encoding is very general, but has the disadvantage of producing networks with very many connections. Missing values can either be represented as 'all zero' or by simply treating 'is missing' as just another possible input value, resulting in a "1-of-(n+1)" encoding. 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a binary representation of the numbers 0...n-1 (or 1...n). Missing values can either be represented as just another possible input value (probably all-bits-zero is best) or by adding an additional network input which is 1 for 'is missing' and 0 for 'is present'. The original inputs should probably be all zero in the 'is missing' case. 2. continuous attributes (or attributes treated as continuous) 2.1 encoded as a single network input, perhaps using some monotone transformation to force the values into a certain distribution. Missing values are either encoded as a kind of 'best guess' (e.g. the average of the non-missing values for this attribute) or by using an additional network input being 0 for 'missing' and 1 for 'present' (or vice versa) and setting the original attribute input either to 0 or to the 'best guess'. (The 'best guess' variant also applies to nominal attributes above) 3. 
binary attributes (truth values) 3.1 encoded by one input: 0=false 1=true or vice versa Treat like (2.1) 3.2 encoded by one input: -1=false 1=true or vice versa In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. 3.3 treat like nominal attribute with 2 possible values 4. ordinal attributes (having n different possible values, which are ordered) 4.1 treat either like continuous or like nominal attribute. If (1.2) is chosen, a Gray-Code should be used. Continuous representation is risky unless a 'sensible' quantification of the possible values is available. So far to my considerations. Now to my questions. a) Can you think of other encoding methods that seem reasonable ? Which ? b) Do you have experience with some of these methods that is worth sharing ? c) Have you compared any of the alternatives directly ? ------------------------------------------------------------------------ SUMMARY: For a), the following ideas were mentioned: 1. use statistical techniques to compute replacement values from the rest of the data set 2. use a Boltzman machine to do this for you 3. use an autoencoder feed forward network to do this for you 4. randomize on the missing values (correct in the Bayesian sense) For b), some experience was reported. I don't know how to summarize that nicely, so I just don't summarize at all. For c), no explicit quantitative results were given directly. Some replies suggest that data is not always missing randomly. The biases are often known and should be taken into account (e.g. medical tests are not carried out (resulting in missing data) for moreless healthy persons more often than for ill persons). Many replies contained references to published work on this area, from NN, machine learning, and mathematical statistics. To ease searching for these references in the replies below, I have marked them with the string ##REF## (if you have a 'grep' program that extracts whole paragraphs, you can get them all out with one command). Thanks to all who answered. These are the trimmed versions of the replies: ------------------------------------------------------------------------ From: tgd at research.CS.ORST.EDU (Tom Dietterich) [...for nominal attributes:] An alternative here is to encode them as bit-strings in a error-correcting code, so that the hamming distance between any two bit strings is constant. This would probably be better than a dense binary encoding. The cost in additional inputs is small. I haven't tried this though. My guess is that distributed representations at the input are a bad idea. One must always determine WHY the value is missing. In the heart disease data, I believe the values were not measured because other features were believed to be sufficient in each case. In such cases, the network should learn to down-weight the importance of the feature (which can be accomplished by randomizing it---see below). In other cases, it may be more appropriate to treat a missing value as a separate value for the feature, e.g., in survey research, where a subject chooses not to answer a question. [...for continuous attributes:] Ross Quinlan suggests encoding missing values as the mean observed output value when the value is missing. He has tried this in his regression tree work. Another obvious approach is to randomize the missing values--on each presentation of the training example, choose a different, random, value for each missing input feature. This is the "right thing to do" in the bayesian sense. 
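[ A minimal sketch of this randomization idea, assuming NumPy. The argument observed_values is a hypothetical per-feature list of the non-missing training values, missing entries are marked with NaN, and the fill-in is redrawn on every presentation of the example. ]

import numpy as np

rng = np.random.default_rng(0)

def randomize_missing(x, observed_values):
    # On each presentation, fill every missing (NaN) feature with a value
    # drawn at random from the values observed for that feature elsewhere
    # in the training set.
    x = np.array(x, dtype=float)
    for j in np.flatnonzero(np.isnan(x)):
        x[j] = rng.choice(observed_values[j])
    return x

# Usage: feature 1 is missing and gets a fresh random fill-in on each call
observed_values = [np.array([0.2, 0.5, 0.9]), np.array([1.0, 2.0, 3.0])]
print(randomize_missing([0.4, np.nan], observed_values))
print(randomize_missing([0.4, np.nan], observed_values))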
[...for binary attributes:] I'm skeptical of the -1,0,1 encoding, but I think there is more research to be done here. [...for ordinal attributes:] I would treat them as continuous. ------------------------------------------------------------------------ From: shavlik at cs.wisc.edu (Jude W. Shavlik) We looked at some of the methods you talked about in the following article in the journal Machine Learning. ##REF## %T Symbolic and Neural Network Learning Algorithms: An Experimental Comparison %A J. W. Shavlik %A R. J. Mooney %A G. G. Towell %J Machine Learning %V 6 %N 2 %P 111-143 %D 1991 ------------------------------------------------------------------------ From: hertz at nordita.dk (John Hertz) It seems to me that the most natural way to handle missing data is to leave them out. You can do this if you work with a recurrent network (fx Boltzmann machine) where the inputs are fed in by clamping the input units to the given input values and the rest of the net relaxes to a fixed point, after which the output is read off the output units. If some of the input values are missing, the corresponding input units are just left unclamped, free to relax to values most consistent with the known inputs. I have meant for a long time to try this on some medical prognosis data I was working on, but I never got around to it, so I would be happy to hear how it works if you try it. ------------------------------------------------------------------------ From: jozo at sequoia.WPI.EDU (Jozo Dujmovic) In the case of clustering benchmark programs I frequently have the the problem of estimation of missing data. A relatively simple SW that implements a heuristic algorithm generates estimates having the average error of 8%. NN will somehow "implicitly estimate" the missing data. The two approaches might even be in some sense equivalent (?). Jozo [ I suspect that they are not: When you generate values for the missing items and put them in the training set, the network loses the information that this data is only estimated. Since estimations are not as reliable as true input data, the network will weigh inputs that have lots of generated values as less important. If it gets the 'is missing' information explicitly, it can discriminate true values from estimations instead. ] ------------------------------------------------------------------------ From: guy at cs.uq.oz.au A final year student of mine worked on the problem of dealing with missing inputs, without much success. However, the student as not very good, so take the following opinions with a pinch of salt. We (very tentatively) came to the conclusion that if the inputs were redundant, the problem was easy; if the missing input contained vital information, the problem was pretty much impossible. We used the heart disease data. I don't recommend it for the missing inputs problem. All of the inputs are very good indicators of the correct result, so missing inputs were not important. Apparently there is a large literature in statistics on dealing with missing inputs. Anthony Adams (University of Tasmania) has published a technical report on this. His email address is "A.Adams at cs.utas.edu.au". ##REF## @techreport{kn:Vamplew-91, author = "P. Vamplew and A. 
Adams", address = {Hobart, Tasmania, Australia}, institution = {Department of Computer Science, University of Tasmania}, number = {R1-4}, title = {Real World Problems in Backpropagation: Missing Values and Generalisability}, year = {1991} } ------------------------------------------------------------------------ From: Mike Southcott ##REF## I wrote a paper for the Australian conference on neural networks in 1993. ``Classification of Incomplete Data using neural networks'' Southcott, Bogner. You may find it interesting. You may not be able to get the proceedings for this conference, but I am in the process of digging up a postscript copy for someone in the States, so when I do that, I will send you a copy. ------------------------------------------------------------------------ From: Eric Saund I have done some work on unsupervised learning of mulitple cause clusters in binary data, for which an appropriate encoding scheme is -1 = FALSE, 1 = TRUE, and 0 = NO DATA. This has worked well for me, but my paradigm is not your standard feedforward network and uses a different activiation function from the standard weighted sum followed by sigmoid squashing. I presented the paper on this work at NIPS: ##REF## Saund, Eric; 1994; "Unsupervised Learning of Mixtures of Multiple Causes in Binary Data," in Advances in Neural Information Processing Systems -6-, Cowan, J., Tesauro, G, and Alspector, J., eds. Morgan Kaufmann, San Francisco. ------------------------------------------------------------------------ From: Thierry.Denoeux at hds.univ-compiegne.fr In a recent mailing, Lutz Prechelt mentioned the interesting problem of how to encode attributes with missing values as inputs to a neural network. I have recently been faced to that problem while applying neural nets to rainfall prediction using weather radar images. The problem was to classify pairs of "echoes" -- defined as groups of connected pixels with reflectivity above some threshold -- taken from successive images as corresponding to the same rain cell or not. Each pair of echoes was discribed by a list of attributes. Some of these attributes, refering to the past of a sequence, were not defined for some instances. To encode these attributes with potentially missing values, we applied two different methods actually suggested by Lutz: - the replacement of the missing value by a "best-guess" value - the addition of a binary input indicating whether the corresponding attribute was present or absent. Significantly better results were obtained by the second method. This work was presented at ICANN'93 last september: ##REF## X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, p. 962-967. ------------------------------------------------------------------------ From: "N. Karunanithi" [...for nominal attributes:] Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. [...for continuous attributes:] This representation requires GUESS. A nominal transformation may not be a proper representation in some cases. 
Assume that the output values range over a large numerical interval. For example, from 0.0 to 10,000.0. If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then on theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had lot of problem with this approach. Maybe a log tranformation before scaling may not be a bad choice. If you use a closed scaling you may have problem whenever a future value exceeds the maximum value of the numerical intervel. For example, assume that the attribute is time, say in miliseconds. Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. [...for ordinal attributes:] I have compared Binary Encoding (1.2), Gray-Coded representation and straighforward scaling. Colsed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find significant improvement in prediction accuracy. ###REF### N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp 563-574. N. Karunanithi and Y. K. Malaiya, "The Scaling Problem in Neural Networks for Software Reliability Prediction", Proc. IEEE Int. Symposium on Rel. Eng., Oct. 1992, pp. 776-82. I have not found a simple solution that is general. I think representation in general and the missing information in specific are open problems within connectionist research. I am not sure we will have a magic bullet for all problems. The best approach is to come up with a specific solution for a given problem. ------------------------------------------------------------------------ From: Bill Skaggs There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. ------------------------------------------------------------------------ From: "Scott E. Fahlman" [Follow-up to Bill Skaggs:] Good point, but perhaps in need of clarification for some readers: There are two ways of training a Boltzmann machine. In one (the original form), there is no distinction between input and output units. During training we alternate between an instruction phase, in which all of the externally visible units are clamped to some pattern, and a normalization phase, in which the whole network is allow to run free. The idea is to modify the weights so that, when running free, the external units assume the various pattern values in the training set in their proper frequencies. If only some subset of the externally visible units are clamped to certain values, the net will produce compatible completions in the other units, again with frequencies that match this part of the training set. 
A net trained in this way will (in principle -- it might take a *very* long time for anything complicated) do what you suggest: complete an "input" pattern and produce a compatible output at the same time. This works even if the input is *totally* missing. I believe it was Geoff Hinton who realized that a Boltzmann machine could be trained more efficiently if you do make a distinction between input and output units, and don't waste any of the training effort learning to reconstruct the input. In this model, the instruction phase clamps both input and output units to some pattern, while the normalization phase clamps only the input units. Since the input units are correct in both cases, all of the network's learning power (such as it is) goes into producing correct patterns on the output units. A net trained in this way will not do input-completion. I bring this up because I think many people will only have seen the latter kind of Boltzmann training, and will therefore misunderstand your observation.

By the way, one alternative method I have seen proposed for reconstructing missing input values is to first train an auto-encoder (with some degree of bottleneck to get generalization) on the training set, and then feed the output of this auto-encoder into the classification net. The auto-encoder should be able to replace any missing values with some degree of accuracy. I haven't played with this myself, but it does sound plausible. If anyone can point to a good study of this method, please post it here or send me E-mail.

------------------------------------------------------------------------
From: "David G. Stork"

##REF## There is a provably optimal method for performing classification with missing inputs, described in Chapter 2 of "Pattern Classification and Scene Analysis" (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, which avoids the ad-hoc heuristics that have been described by others. Those interested in obtaining Chapter 2 via ftp should contact me.

------------------------------------------------------------------------
From: Wray Buntine

This missing value problem is of course shared amongst all the learning communities, artificial intelligence, statistics, pattern recognition, etc., not just neural networks. A classic study in this area, which includes most suggestions I've read here so far, is

##REF## @inproceedings{quinlan:ml6, AUTHOR = "J.R. Quinlan", TITLE = "Unknown Attribute Values in Induction", YEAR = 1989, BOOKTITLE = "Proceedings of the Sixth International Machine Learning Workshop", PUBLISHER = "Morgan Kaufmann", ADDRESS = "Cornell, New York"}

The most frequently cited methods I've seen -- and they're so common amongst the different communities that it's hard to assign credit -- are:
1) replace missing values by some best guess
2) fracture the example into a set of fractional examples, each with the missing value filled in somehow
3) call the missing value another input value
3 is a good thing to do if the values are missing "informatively", i.e. if someone leaves the entry "telephone number" blank in a questionnaire, then maybe they don't have a telephone; but it is probably not good otherwise, unless you have loads of data and don't mind all the extra example types generated (as already mentioned). 1 is a quick and dirty hack at 2. How good it is depends on your application. 2 is an approximation to the "correct" approach for handling "non-informative" missing values according to the standard "mixture model".
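A rough sketch of option 2 for a single nominal attribute, assuming the fractional weights are taken from the attribute's observed marginal frequencies; the toy data, that particular weighting choice, and the per-example weight mechanism are assumptions of this illustration rather than anything prescribed above:

import numpy as np

# Observed training values of one nominal attribute (three categories: 0, 1, 2).
observed = np.array([0, 1, 1, 2, 1, 0])
values, counts = np.unique(observed, return_counts=True)
freq = counts / counts.sum()          # marginal frequency of each category

# One incomplete example: the second attribute is missing (None).
example = [0.7, None]

# Fracture it into one weighted copy per candidate value of the missing attribute.
fractional = [([example[0], int(v)], float(w)) for v, w in zip(values, freq)]

for filled, weight in fractional:
    print(filled, "weight =", weight)
# A learner that accepts per-example weights can train on these copies; the
# weights sum to 1, so the fractured example still counts as a single instance.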
The mathematics for this is general and applies to virtually any learning algorithm -- trees, feed-forward nets, linear regression, whatever. We do it for feed-forward nets in

##REF## @article{buntine.weigend:bbp, AUTHOR = "W.L. Buntine and A.S. Weigend", TITLE = "Bayesian Back-Propagation", JOURNAL = "Complex Systems", Volume = 5, PAGES = "603--643", Number = 1, YEAR = "1991" }

and see Tresp, Ahmad & Neuneier in NIPS'94 for an implementation. But no doubt someone published the general idea back in the 50's. I certainly wouldn't call missing values an open problem. Rather, "efficient implementations of the standard approaches" is, in some cases, an open problem.

------------------------------------------------------------------------
From: Volker Tresp

In general, the solution to the missing-data problem depends on the missing-data mechanism. For example, if you sample the income of a population and rich people tend to refuse to answer, the mean of your sample is biased. To obtain an unbiased solution you would have to take into account the missing-data mechanism. The missing-data mechanism can be ignored if it is independent of the input and the output (in the example: the likelihood that a person refuses to answer is independent of the person's income). Most approaches assume that the missing-data mechanism can be ignored.

There exist a number of ad hoc solutions to the missing-data problem, but it is also possible to approach the problem from a statistical point of view. In our paper (which will be published in the upcoming NIPS volume and which will be available on neuroprose shortly) we discuss a systematic likelihood-based approach. NN-regression can be framed as a maximum likelihood learning problem if we assume the standard signal plus Gaussian noise model

P(x, y) = P(x) P(y|x) \propto P(x) exp(-1/(2 \sigma^2) (y - NN(x))^2).

By deriving the probability density function for a pattern with missing features we can formulate a likelihood function including patterns with complete and incomplete features. The solution requires an integration over the missing input. In practice, the integral is approximated numerically. For networks of Gaussian basis functions, it is possible to obtain closed-form solutions (by extending the EM algorithm). Our paper also discusses why and when ad hoc solutions -- such as substituting the mean for an unknown input -- are harmful. For example, if the mapping is approximately linear, substituting the mean might work quite well. In general, though, it introduces bias.

Training with missing and noisy input data is described in:

##REF## ``Training Neural Networks with Deficient Data,'' V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G., and Alspector, J. (eds.), {\em Advances in Neural Information Processing Systems 6}, Morgan Kaufmann, 1994.

A related paper by Zoubin Ghahramani and Michael Jordan will also appear in the upcoming NIPS volume. Recall with missing and noisy data is discussed in (available in neuroprose as ahmad.missing.ps.Z):

``Some Solutions to the Missing Feature Problem in Vision,'' S. Ahmad and V. Tresp, in {\em Advances in Neural Information Processing Systems 5,} S. J. Hanson, J. D. Cowan, and C. L. Giles, eds., San Mateo, CA, Morgan Kaufmann, 1993.
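A crude sketch of the kind of numerical approximation mentioned above: the expected output for a pattern with one missing feature is estimated by averaging the network output over samples of that feature drawn from an assumed input density. The toy network, the Gaussian fitted to the observed values of the missing feature, and the sample size are all assumptions of this illustration, not the construction used in the paper:

import numpy as np

rng = np.random.default_rng(1)

def nn(x):
    # Stand-in for a trained network: a fixed 2-input, 1-output toy mapping.
    w1 = np.array([[1.0, -1.0], [0.5, 2.0]])
    w2 = np.array([1.0, -0.5])
    return float(w2 @ np.tanh(w1 @ x))

# Observed training values of feature 1, used to fit a simple density P(x1).
x1_train = np.array([0.1, 0.4, -0.2, 0.3, 0.0])
mu, sigma = x1_train.mean(), x1_train.std() + 1e-6

x0 = 0.8                                    # feature 0 is observed
samples = rng.normal(mu, sigma, size=500)   # feature 1 is missing: sample it
outputs = [nn(np.array([x0, x1])) for x1 in samples]

print("E[y | x0] is roughly", np.mean(outputs))  # Monte Carlo estimate of the integral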
------------------------------------------------------------------------
From: Subhash Kak

Missing values in feedback networks raise interesting questions: Should these values be considered "don't know" values, or should they be generated in some "maximum likelihood" fashion? These issues are discussed in the following paper:

##REF## S.C. Kak, "Feedback neural networks: new characteristics and a generalization", Circuits, Systems, Signal Processing, vol. 12, no. 2, 1993, pp. 263-278.

------------------------------------------------------------------------
From: Zoubin Ghahramani

I have also been looking into the issue of encoding and learning from missing values in a neural network. The issue of handling missing values has been addressed extensively in the statistics literature for obvious reasons. To learn despite the missing values, the data has to be filled in, or the missing values integrated over. The basic question is how to fill in the missing data. There are many different methods for doing this in stats (mean imputation, regression imputation, Bayesian methods, EM, etc.). For good reviews see (Little and Rubin, 1987; Little, 1992).

I do not in general recommend encoding "missing" as yet another value to be learned over. Missing means something in a statistical sense -- that the input could be any of the values with some probability distribution. You could, for example, augment the original data, filling in different values for the missing data points according to a prior distribution. Then the training would assign different weights to the artificially filled-in data points depending on how well they predict the output (their posterior probability). This is essentially the method proposed by Buntine and Weigend (1991). Other approaches have been proposed by Tresp et al. (1993) and Ahmad and Tresp (1993).

I have just written a paper on the topic of learning from incomplete data. In this paper I bring a statistical algorithm for learning from incomplete data, called EM, into the framework of nonlinear function approximation and classification with missing values. This approach fits the data iteratively with a mixture model and uses that same mixture model to effectively fill in any missing input or output values at each step.

You can obtain the preprint by ftp:
  ftp psyche.mit.edu
  login: anonymous
  cd pub
  get zoubin.nips93.ps
To obtain code for the algorithm please contact me directly.

##REF##
Ahmad, S and Tresp, V (1993) "Some Solutions to the Missing Feature Problem in Vision." In Hanson, S.J., Cowan, J.D., and Giles, C.L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA.
Buntine, WL, and Weigend, AS (1991) "Bayesian back-propagation." Complex Systems, Vol. 5, No. 6, pp. 603-643.
Ghahramani, Z and Jordan, MI (1994) "Supervised learning from incomplete data via an EM approach." To appear in Cowan, J.D., Tesauro, G., and Alspector, J. (eds.), Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers, San Francisco, CA, 1994.
Little, RJA (1992) "Regression With Missing X's: A Review." Journal of the American Statistical Association, Volume 87, Number 420, pp. 1227-1237.
Little, RJA and Rubin, DB (1987). Statistical Analysis with Missing Data. Wiley, New York.
Tresp, V, Hollatz, J, and Ahmad, S (1993) "Network structuring and training using rule-based knowledge." In Hanson, S.J., Cowan, J.D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA.
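A much-simplified sketch of the iterative fill-in idea just described, using a single multivariate Gaussian instead of a full mixture and omitting the covariance correction a proper EM step would include; those simplifications, the toy data, and NumPy are assumptions of this illustration, not a description of the algorithm in the preprint:

import numpy as np

X = np.array([[1.0, 2.0, np.nan],
              [0.5, np.nan, 1.5],
              [1.5, 2.5, 2.0],
              [np.nan, 1.8, 1.0]])
miss = np.isnan(X)
Xf = np.where(miss, np.nanmean(X, axis=0), X)     # start from mean imputation

for iteration in range(50):
    mu = Xf.mean(axis=0)
    cov = np.cov(Xf, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    for i in range(X.shape[0]):
        m = miss[i]
        if not m.any():
            continue
        o = ~m
        # Simplified E-step: conditional mean of the missing entries
        # given the observed ones under the current Gaussian model.
        reg = cov[np.ix_(m, o)] @ np.linalg.inv(cov[np.ix_(o, o)])
        Xf[i, m] = mu[m] + reg @ (Xf[i, o] - mu[o])

print(Xf)   # missing entries replaced by model-based estimates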
------------------------------------------------------------------------ That's it. Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; 76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From n.burgess at ucl.ac.uk Fri Feb 11 05:00:20 1994 From: n.burgess at ucl.ac.uk (Neil Burgess) Date: Fri, 11 Feb 94 10:00:20 +0000 Subject: pre-print in neuroprose Message-ID: <141927.9402111000@link-1.ts.bcc.ac.uk> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/burgess.hipmod.ps.Z *****do not forward to other groups***** Dear connectionists, the following preprint has been put on neuroprose, contact n.burgess at ucl.ac.uk with any retrieval problems, --Neil `A model of hippocampal function' Neil Burgess, Michael Recce and John O'Keefe Dept. of Anatomy, University College, London WC1E 6BT, U.K. The firing rate maps of hippocampal place cells recorded in a freely moving rat are viewed as a set of approximate radial basis functions over the (2-D) environment of the rat. It is proposed that these firing fields are constructed during exploration from `sensory inputs' (tuning curve responses to the distance of cues from the rat) and used by cells downstream to construct firing rate maps that approximate any desired surface over the environment. It is shown that, when a rat moves freely in an open field, the phase of firing of a place cell (with respect to the EEG $\theta$ rhythm) contains information as to the relative position of its firing field from the rat. A model of hippocampal function is presented in which the firing rate maps of cells downstream of the hippocampus provide a `population vector' encoding the instantaneous direction of the rat from a previously encountered reward site, enabling navigation to it. A neuronal simulation, involving reinforcement only at the goal location, provides good agreement with single cell recording from the hippocampal region, and can navigate to reward sites in open fields using sensory input from environmental cues. The system requires only brief exploration, performs latent learning, and can return to a goal location after encountering it only once. Neural Networks, to be published. 26 pages, 2Mbytes uncompressed. From eric at research.nj.nec.com Fri Feb 11 11:11:29 1994 From: eric at research.nj.nec.com (Eric B. Baum) Date: Fri, 11 Feb 94 11:11:29 EST Subject: No subject Message-ID: <9402111611.AA00562@yin> Fifth Annual NEC Research Symposium NATURAL AND ARTIFICIAL PARALLEL COMPUTATION PRINCETON, NJ MAY 4 - 5, 1994 NEC Research Institute is pleased to announce that the Fifth Annual NEC Research Symposium will be held at the Hyatt Regency Hotel in Princeton, New Jersey on May 4 and 5, 1994. The title of this year's symposium is Natural and Artificial Parallel Computation. The conference will feature ten invited talks. 
The speakers are:
- Larry Abbott, Brandeis University, "Activity-Dependent Modulation of Intrinsic Neuronal Properties"
- Catherine Carr, University of Maryland, "Time Coding in the Central Nervous System"
- Bill Dally, MIT, "Bandwidth, Granularity, and Mechanisms: Key Issues in the Design of Parallel Computers"
- Amiram Grinvald, Weizmann Institute, "Architecture and Dynamics of Cell Assemblies in the Visual Cortex; New Perspectives From Fast and Slow Optical Imaging"
- Akihiko Konagaya, NEC C&C Research Labs, "Knowledge Discovery in Genetic Sequences"
- Chris Langton, Santa Fe Institute, "SWARM: An Agent Based Simulation System for Research in Complex Systems"
- Thomas Ray, University of Delaware and ATR, "Evolution and Ecology of Digital Organisms"
- Shuichi Sakai, Real World Computing Partnership, "RWC Massively Parallel Computer Project"
- Shigeru Tanaka, NEC Fundamental Research Labs, "A Mathematical Theory for the Experience-Dependent Development of Visual Cortex"
- Leslie Valiant, Harvard University and NECI, "A Computational Model for Cognition"

There will be no contributed papers. Registration is free of charge, but space is limited. Registrations will be accepted on a first come, first served basis. YOU MUST PREREGISTER. There will be no on-site registration. To preregister by e-mail, send a request to: symposium at research.nj.nec.com. Registrants will receive an acknowledgment, space allowing. A request for preregistration is also possible by regular mail to Mrs. Irene Parker, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540.

Registrants will also be invited to an Open House/Poster Session and Reception at NEC Research Institute on Tuesday, May 3. The Open House will begin at 3:30 PM and the Reception will begin at 5:30 PM. In order to estimate headcount, please indicate in your preregistration request whether you plan to attend the Open House on May 3.

Registrants are expected to make their own arrangements for accommodations. Provided below is a list of hotels in the area together with daily room rates. Please ask for the NEC Corporate Rate when reserving a room. Sessions will start at 8:15 AM Wednesday, May 4 and will be scheduled to finish at approximately 3:30 PM on Thursday, May 5.

Red Roof Inn, South Brunswick (908)821-8800 $37.99
Novotel Hotel, Princeton (609)520-1200 $68.00 ($74.00 w/breakfast)
Palmer Inn, Princeton (609)452-2500 $73.00
Marriott Residence Inn, Princeton (908)329-9600 $85.00 w/continental breakfast
Summerfield Suites, Princeton (609)951-0009 $92.00
Hyatt Regency, Princeton (609)987-1234 $105.00
Marriott Hotel, Princeton (609)452-7900 $125.00

- - - - - - - - - - - - - - - - - - - - - - - - - -
PLEASE RESPOND BY E-MAIL TO: symposium at research.nj.nec.com
I would like to attend: _____ Open House _____ Symposium
Name: ____________________________
Organization: ____________________________
E-mail address: ____________________________
Phone number: ____________________________

From bishopc at helios.aston.ac.uk Fri Feb 11 09:59:33 1994 From: bishopc at helios.aston.ac.uk (bishopc) Date: Fri, 11 Feb 94 14:59:33 GMT Subject: Postdoctoral Fellowships Message-ID: <27570.9402111459@sun.aston.ac.uk>

-------------------------------------------------------------------
Aston University Neural Computing Research Group

TWO POSTDOCTORAL RESEARCH FELLOWSHIPS:
--------------------------------------
FUNDAMENTAL RESEARCH IN NEURAL NETWORKS

Two postdoctoral fellowships, each with a duration of 3 years, will be funded by the U.K.
Science and Engineering Research Council, and are to commence on or after 1 April 1994. These posts are part of a major project to be undertaken within the Neural Computing Research Group at Aston, and will involve close collaboration with Professors Chris Bishop and David Lowe, with additional input from Professor David Bounds. This interdisciplinary program requires researchers capable of extending theoretical concepts, and developing algorithmic and proof-of-principle demonstrations through software simulation. The two Research Fellows will work on distinct, though closely related, areas as follows: 1. Generalization in Neural Networks The usual approach to complexity optimisation and model order selection in neural networks makes use of computationally intensive cross-validation techniques. This project will build on recent developments in the use of Bayesian methods and the description length formalism to develop systematic techniques for model optimization in feedforward neural networks from a principled statistical perspective. In its later stages, the project will demonstrate the practical utility of the techniques which emerge, in the context of a wide range of real-world applications. 2. Dynamic Neural Networks Current embodiments of neural networks, when applied to `dynamic' events such as time series forecasting, are successful only if the underlying `generator' of the data is stationary. If the underlying generator is slowly varying in time then we do not have a principled basis for designing effective neural network structures, though ad hoc procedures do exist. This program will address some of the key issues in this area using techniques from statistical pattern processing and dynamical systems theory. In addition, application studies will be conducted which will focus on time series problems and tracking in non-stationary noise. If you wish to be considered for these positions, please send a CV and publications list, together with the names of 3 referees, to: Professor Chris M Bishop Neural Computing Research Group Aston University Birmingham B4 7ET, U.K. Tel: 021 359 3611 ext. 4270 Fax: 021 333 6215 e-mail: c.m.bishop at aston.ac.uk From ahmad at interval.com Fri Feb 11 12:04:37 1994 From: ahmad at interval.com (ahmad@interval.com) Date: Fri, 11 Feb 94 09:04:37 -0800 Subject: Computing visual feature correspondences Message-ID: <9402111704.AA28021@iris10.interval.com> The following paper is available for anonymous ftp on archive.cis.ohio-state.edu (128.146.8.52), in directory pub/neuroprose, as file "ahmad.correspondence.ps.Z": Feature Densities are Required for Computing Feature Correspondences Subutai Ahmad Interval Research Corporation 1801-C Page Mill Road, Palo Alto, CA 94304 E-mail: ahmad at interval.com Abstract The feature correspondence problem is a classic hurdle in visual object-recognition concerned with determining the correct mapping between the features measured from the image and the features expected by the model. In this paper we show that determining good correspondences requires information about the joint probability density over the image features. We propose "likelihood based correspondence matching" as a general principle for selecting optimal correspondences. The approach is applicable to non-rigid models, allows nonlinear perspective transformations, and can optimally deal with occlusions and missing features. Experiments with rigid and non-rigid 3D hand gesture recognition support the theory. 
The likelihood based techniques show almost no decrease in classification performance when compared to performance with perfect correspondence knowledge. To appear in: Cowan, J.D., Tesauro, G., and Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. San Francisco CA: Morgan Kaufmann, 1994. From ahmad at interval.com Fri Feb 11 13:03:31 1994 From: ahmad at interval.com (ahmad@interval.com) Date: Fri, 11 Feb 94 10:03:31 -0800 Subject: Training NN's with missing or noisy data Message-ID: <9402111803.AA28794@iris10.interval.com> The following paper is available for anonymous ftp on archive.cis.ohio-state.edu (128.146.8.52), in directory pub/neuroprose, as file "tresp.deficient.ps.Z". (The companion paper, "Some Solutions to the Missing Feature Problem in Vision" is available as "ahmad.missing.ps.Z") Training Neural Networks with Deficient Data Volker Tresp Subutai Ahmad Siemens AG Interval Research Corporation Central Research 1801-C Page Mill Rd. 81730 Muenchen, Germany Palo Alto, CA 94304 tresp at zfe.siemens.de ahmad at interval.com Ralph Neuneier Siemens AG Central Research Otto-Hahn-Ring 6 81730 Muenchen, Germany ralph at zfe.siemens.de Abstract: We analyze how data with uncertain or missing input features can be incorporated into the training of a neural network. The general solution requires a weighted integration over the unknown or uncertain input although computationally cheaper closed-form solutions can be found for certain Gaussian Basis Function (GBF) networks. We also discuss cases in which heuristical solutions such as substituting the mean of an unknown input can be harmful. The paper will appear in: Cowan, J.D., Tesauro, G., and Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. San Francisco CA: Morgan Kaufmann, 1994. Subutai Ahmad Interval Research Corporation Phone: 415-354-3639 1801-C Page Mill Rd. Fax: 415-354-0872 Palo Alto, CA 94304 E-mail: ahmad at interval.com From mel at klab.caltech.edu Fri Feb 11 15:05:47 1994 From: mel at klab.caltech.edu (Bartlett Mel) Date: Fri, 11 Feb 94 12:05:47 PST Subject: NIPS*94 Call for Papers Message-ID: <9402112005.AA10791@plato.klab.caltech.edu> ********* PLEASE NOTE NEW SUBMISSIONS FORMAT FOR 1994 ********* CALL FOR PAPERS Neural Information Processing Systems -Natural and Synthetic- Monday, November 28 - Saturday, December 3, 1994 Denver, Colorado This is the eighth meeting of an interdisciplinary conference which brings together neuroscientists, engineers, computer scientists, cognitive scientists, physicists, and mathematicians interested in all aspects of neural processing and computation. The conference will include invited talks, and oral and poster presentations of refereed papers. There will be no parallel sessions. There will also be one day of tutorial presentations (Nov 28) preceding the regular session, and two days of focused workshops will follow at a nearby ski area (Dec 2-3). Major categories for paper submission, and examples of keywords within categories, are the following: Neuroscience: systems physiology, cellular physiology, signal and noise analysis, oscillations, synchronization, inhibition, neuromodulation, synaptic plasticity, computational models. Theory: computational learning theory, complexity theory, dynamical systems, statistical mechanics, probability and statistics, approximation theory. Implementations: VLSI, optical, parallel processors, software simulators, implementation languages. 
Algorithms and Architectures: learning algorithms, constructive/pruning algorithms, localized basis functions, decision trees, recurrent networks, genetic algorithms, combinatorial optimization, performance comparisons. Visual Processing: image recognition, coding and classification, stereopsis, motion detection, visual psychophysics. Speech, Handwriting and Signal Processing: speech recognition, coding and synthesis, handwriting recognition, adaptive equalization, nonlinear noise removal. Applications: time-series prediction, medical diagnosis, financial analysis, DNA/protein sequence analysis, music processing, expert systems. Cognitive Science & AI: natural language, human learning and memory, perception and psychophysics, symbolic reasoning. Control, Navigation, and Planning: robotic motor control, process control, navigation, path planning, exploration, dynamic programming. Review Criteria: All submitted papers will be thoroughly refereed on the basis of technical quality, novelty, significance and clarity. Submissions should contain new results that have not been published previously. Authors are encouraged to submit their most recent work, as there will be an opportunity after the meeting to revise accepted manuscripts before submitting final camera-ready copy. ********** PLEASE NOTE NEW SUBMISSIONS FORMAT FOR 1994 ********** Paper Format: Submitted papers may be up to eight pages in length. The page limit will be strictly enforced, and any submission exceeding eight pages will not be considered. Authors are encouraged (but not required) to use the NIPS style files obtainable by anonymous FTP at the sites given below. Papers must include physical and e-mail addresses of all authors, and must indicate one of the nine major categories listed above, keyword information if appropriate, and preference for oral or poster presentation. Unless otherwise indicated, correspondence will be sent to the first author. Submission Instructions: Send six copies of submitted papers to the address given below; electronic or FAX submission is not acceptable. Include one additional copy of the abstract only, to be used for preparation of the abstracts booklet distributed at the meeting. Submissions mailed first-class within the US or Canada must be postmarked by May 21, 1994. Submissions from other places must be received by this date. Mail submissions to: David Touretzky NIPS*94 Program Chair Computer Science Department Carnegie Mellon University 5000 Forbes Avenue Pittsburgh PA 15213-3890 USA Mail general inquiries/requests for registration material to: NIPS*94 Conference NIPS Foundation PO Box 60035 Pasadena, CA 91116-6035 USA (e-mail: nips94 at caltech.edu) FTP sites for LaTex style files "nips.tex" and "nips.sty": helper.systems.caltech.edu (131.215.68.12) in /pub/nips b.gp.cs.cmu.edu (128.2.242.8) in /usr/dst/public/nips NIPS*94 Organizing Committee: General Chair, Gerry Tesauro, IBM; Program Chair, David Touretzky, CMU; Publications Chair, Joshua Alspector, Bellcore; Publicity Chair, Bartlett Mel, Caltech; Workshops Chair, Todd Leen, OGI; Treasurer, Rodney Goodman, Caltech; Local Arrangements, Lori Pratt, Colorado School of Mines; Tutorials Chairs, Steve Hanson, Siemens and Gerry Tesauro, IBM; Contracts, Steve Hanson, Siemens and Scott Kirkpatrick, IBM; Government & Corporate Liaison, John Moody, OGI; Overseas Liaisons: Marwan Jabri, Sydney Univ., Mitsuo Kawato, ATR, Alan Murray, Univ. of Edinburgh, Joachim Buhmann, Univ. of Bonn, Andreas Meier, Simon Bolivar Univ. 
DEADLINE FOR SUBMISSIONS IS MAY 21, 1994 (POSTMARKED) -please post- From yamauchi at alpha.ces.cwru.edu Fri Feb 11 17:24:43 1994 From: yamauchi at alpha.ces.cwru.edu (Brian Yamauchi) Date: Fri, 11 Feb 94 17:24:43 -0500 Subject: Preprints Available Message-ID: <9402112224.AA03791@yuggoth.CES.CWRU.Edu> The following papers are available via anonymous ftp from yuggoth.ces.cwru.edu: ---------------------------------------------------------------------- Sequential Behavior and Learning in Evolved Dynamical Neural Networks Brian Yamauchi(1) and Randall Beer(1,2) Department of Computer Engineering and Science(1) Department of Biology(2) Case Western Reserve University Cleveland, OH 44106 Case Western Reserve University Technical Report CES-93-25 This paper will be appearing in Adaptive Behavior. Abstract This paper explores the use of a real-valued modular genetic algorithm to evolve continuous-time recurrent neural networks capable of sequential behavior and learning. We evolve networks that can generate a fixed sequence of outputs in response to an external trigger occurring at varying intervals of time. We also evolve networks that can learn to generate one of a set of possible sequences based upon reinforcement from the environment. Finally, we utilize concepts from dynamical systems theory to understand the operation of some of these evolved networks. A novel feature of our approach is that we assume neither an a priori discretization of states or time nor an a priori learning algorithm that explicitly modifies network parameters during learning. Rather, we merely expose dynamical neural networks to tasks that require sequential behavior and learning and allow the genetic algorithm to evolve network dynamics capable of accomplishing these tasks. Files: /pub/agents/yamauchi/seqlearn.ps.Z Article Text (73K) /pub/agents/yamauchi/seqlearn-fig.ps.Z Figures (654K) ---------------------------------------------------------------------- Integrating Reactive, Sequential, and Learning Behavior Using Dynamical Neural Networks Brian Yamauchi(1,3) and Randall Beer(1,2) Department of Computer Engineering and Science(1) Department of Biology(2) Case Western Reserve University Cleveland, OH 44106 Navy Center for Applied Research in Artificial Intelligence(3) Naval Research Laboratory Washington, DC 20375-5000 This paper has been submitted to the Third International Conference on Simulation of Adaptive Behavior. Abstract This paper explores the use of dynamical neural networks to control autonomous agents in tasks requiring reactive, sequential, and learning behavior. We use a genetic algorithm to evolve networks that can solve these tasks. These networks provide a mechanism for integrating these different types of behavior in a smooth, continuous manner. We applied this approach to three different task domains: landmark recognition using sonar on a real mobile robot, one-dimensional navigation using a simulated agent, and reinforcement-based sequence learning. For the landmark recognition task, we evolved networks capable of differentiating between two different landmarks based on the spatiotemporal information in a sequence of sonar readings obtained as the robot circled the landmark. For the navigation task, we evolved networks capable of associating the location of a landmark with a corresponding goal location and directing the agent to that goal. For the sequence learning task, we evolved networks that can learn to generate one of a set of possible sequences based upon reinforcement from the environment. 
A novel feature of the learning aspects of our approach is that we assume neither an a priori discretization of states or time nor an a priori learning algorithm that explicitly modifies network parameters during learning. Instead, we expose dynamical neural networks to tasks that require learning and allow the genetic algorithm to evolve network dynamics capable of accomplishing these tasks. Files: /pub/agents/yamauchi/integ.ps.Z Complete Article (233K) If your printer has problems printing the complete document as a single file, try printing the following two files: /pub/agents/yamauchi/integ-part1.ps.Z Pages 1-8 (77K) /pub/agents/yamauchi/integ-part2.ps.Z Pages 9-11 (147K) ---------------------------------------------------------------------- On the Dynamics of a Continuous Hopfield Neuron with Self-Connection Randall Beer Department of Computer Engineering and Science Department of Biology Case Western Reserve University Cleveland, OH 44106 Case Western Reserve University Technical Report CES-94-1 This paper has been submitted to Neural Computation. Continuous-time recurrent neural networks are being applied to a wide variety of problems. As a first step toward a comprehensive understanding of the dynamics of such networks, this paper studies the dynamical behavior of their basic building block: a continuous Hopfield neuron with self-connection. Specifically, we characterize the equilibria of this model neuron and the dependence of those equilibria on the parameters. We also describe the bifurcations of this model and derive very accurate approximate expressions for its bifurcation set. Finally, we indicate how the basic theory developed in this paper generalizes to a larger class of related model neurons. File: /pub/agents/beer/CTRNNDynamics1.ps.Z Complete Article (233K) ---------------------------------------------------------------------- FTP instructions: To retrieve and print a file (for example: seqlearn.ps), use the following commands: unix> ftp yuggoth.ces.cwru.edu Name: anonymous Password: (your email address) ftp> binary ftp> cd /pub/agents/yamauchi (or cd /pub/agents/beer for CTRNNDynamics1.ps.Z) ftp> get seqlearn.ps.Z ftp> quit unix> uncompress seqlearn.ps.Z unix> lpr seqlearn.ps (ls doesn't currently work properly on our ftp server. This will be fixed soon, but in the meantime, these files can still be copied, even though they don't appear in the directory listing.) _______________________________________________________________________________ Brian Yamauchi Case Western Reserve University yamauchi at alpha.ces.cwru.edu Department of Computer Engineering and Science _______________________________________________________________________________ From isabelle at neural.att.com Fri Feb 11 20:51:16 1994 From: isabelle at neural.att.com (Isabelle Guyon) Date: Fri, 11 Feb 94 20:51:16 EST Subject: robust statistics Message-ID: <9402120151.AA21483@neural> I would like to bring more arguments to Terry's remarks: > One man's outlyer is another man's data point. If the data is perfectly clean, outlyers are very valuable patterns. From mmoller at daimi.aau.dk Mon Feb 14 02:15:18 1994 From: mmoller at daimi.aau.dk (Martin Fodslette M|ller) Date: Mon, 14 Feb 1994 08:15:18 +0100 Subject: Thesis available. Message-ID: <199402140715.AA18638@titan.daimi.aau.dk> /******************* PLEASE DO NOT FORWARD ***********************/ I finally finished up my thesis: Efficient Training of Feed-Forward Neural Networks The thesis has the following content: Chapter 1. 
Resume in Danish (should anyone need that (-:)
Chapter 2. Notation and basic definitions.
Chapter 3. Training Methods: An Overview
Chapter 4. Calculation of Hessian Information
Chapter 5. Different Error Functions.
Appendix A. A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning.
Appendix B. Supervised Learning on Large Redundant Training Sets.
Appendix C. Exact Calculation of the Product of the Hessian Matrix and a Vector in O(N) time.
Appendix D. Adaptive Preconditioning of the Hessian Matrix.
Appendix E. Improving Network Solutions.

The appendices concern my own work (original contributions), while the chapters provide an overview. The thesis is now available in a limited number of hard copies. People interested in a copy should send me an email with their address.

Best Regards
-martin
---------------------------------------------------------------- Martin Moller email: mmoller at daimi.aau.dk Computer Science Dept. Fax: +45 8942 3255 Aarhus University Phone: +45 8942 3371 Ny Munkegade, Build. 540, DK-8000 Aarhus C, Denmark ----------------------------------------------------------------

From edelman at wisdom.weizmann.ac.il Mon Feb 14 02:39:27 1994 From: edelman at wisdom.weizmann.ac.il (Edelman Shimon) Date: Mon, 14 Feb 1994 09:39:27 +0200 Subject: TR available: Representation of similarity in 3D ... Message-ID: <199402140739.JAA00503@eris.wisdom.weizmann.ac.il>

FTP-host: eris.wisdom.weizmann.ac.il
FTP-filename: /pub/tr-94-02.ps.Z
URL: http://eris.wisdom.weizmann.ac.il/
Uncompressed size: 2.6 Mb. Preliminary version; comments welcome.

Representation of similarity in 3D object discrimination
Shimon Edelman

\begin{abstract} How does the brain represent visual objects? In simple perceptual generalization tasks, the human visual system performs as if it represents the stimuli in a low-dimensional metric psychological space \cite{Shepard87}. In theories of 3D shape recognition, the role of feature-space representations (as opposed to structural \cite{Biederman87} or pictorial \cite{Ullman89} descriptions) has for a long time been a major point of contention. If shapes are indeed represented as points in a feature space, patterns of perceived similarity among different objects must reflect the structure of this space. The feature space hypothesis can then be tested by presenting subjects with complex parameterized 3D shapes, and by relating the similarities among subjective representations, as revealed in the response data by multidimensional scaling \cite{Shepard80}, to the objective parameterization of the stimuli. The results of four such tests, reported below, support the notion that discrimination among 3D objects may rely on a low-dimensional feature space representation, and suggest that this space may be spanned by explicitly encoded class prototypes. \end{abstract}

From grumbach at inf.enst.fr Mon Feb 14 03:51:22 1994 From: grumbach at inf.enst.fr (grumbach@inf.enst.fr) Date: Mon, 14 Feb 94 09:51:22 +0100 Subject: papers on time and neural networks Message-ID: <9402140851.AA10372@enst.enst.fr>

As guest editors of a special issue of the Sigart Bulletin on Time and Neural Networks, we are looking for 4 articles of about 10 pages each. Sigart is a quarterly publication of the Association for Computing Machinery (ACM) special interest group on Artificial Intelligence. The paper may either deal with approaches to time processing using traditional connectionist architectures, or with more specific models that integrate time into their underlying principles.
If you are interested, and if you can submit a paper (not already published) within a short time (about a month and a half), please send a draft (if possible a Word file):
- preferably by giving ftp access to it (information via e-mail)
- or by sending it as an attached file by e-mail
- or by posting a paper copy of it.
Drafts should be received before April 1. Notification of acceptance will be sent before April 20.

grumbach at enst.fr or chaps at enst.fr
Alain Grumbach and Cedric Chappelier
ENST dept INF 46 rue Barrault 75634 Paris Cedex 13 France

From P.Refenes at cs.ucl.ac.uk Mon Feb 14 09:13:12 1994 From: P.Refenes at cs.ucl.ac.uk (P.Refenes@cs.ucl.ac.uk) Date: Mon, 14 Feb 94 14:13:12 +0000 Subject: robust statistics In-Reply-To: Your message of "Thu, 10 Feb 94 09:45:15 PST." <9402101745.AA28545@salk.edu> Message-ID:

The term "outliers" does not mean that they are not part of the joint data probability distribution or that they contain no information for estimating the regression surface; it means rather that outliers are too small a fraction of the observations to be allowed to dominate the small-sample behaviour of the statistics to be calculated. With parametric regression modelling techniques it is easy to quantify this by simply computing the effect that each data point has on the regression surface. This is not a trivial problem in non-parametric modelling, but the statistics literature is full of methods to deal with it.

Paul Refenes

From rsun at cs.ua.edu Mon Feb 14 12:22:20 1994 From: rsun at cs.ua.edu (Ron Sun) Date: Mon, 14 Feb 1994 11:22:20 -0600 Subject: No subject Message-ID: <9402141722.AA28238@athos.cs.ua.edu>

A monograph on connectionist models is available from John Wiley and Sons, Inc.

Title: Integrating Rules and Connectionism for Robust Commonsense Reasoning
ISBN 0-471-59324-9
Author: Ron Sun, Assistant Professor, Department of Computer Science, The University of Alabama, Tuscaloosa, AL 35487

Contact John Wiley and Sons, Inc. at 1-800-call-wiley, or: John Wiley and Sons, Inc. 605 Third Ave. New York, NY 10158-0012 USA (212) 850-6589 FAX: (212) 850-6088

------------------------------------------------------------------
A brief description is as follows: One of the outstanding problems for artificial intelligence is the problem of better modeling commonsense reasoning and alleviating the brittleness of traditional symbolic rule-based models. This work tackles this problem by trying to combine rules with connectionist models in an integrated framework. This idea leads to the development of a connectionist architecture with dual representation combining symbolic and subsymbolic (feature-based) processing for evidential robust reasoning: {\sc CONSYDERR}.
Reasoning data are analyzed based on the notions of {\it rules} and {\it similarity} and modeled by the architecture which carries out rule application and similarity matching through interaction of the two levels; formal analyses are performed to understand rule encoding in connectionist models, in order to prove that it handles a superset of Horn clause logic and a nonmonotonic logic; the notion of causality is explored for the purpose of clarifying how the proposed architecture can better capture commonsense reasoning, and it is shown that causal knowledge can be well represented by {\sc CONSYDERR} and utilized in reasoning, which further justifies the design of the architecture; the variable binding problem is addressed, and a solution is proposed within this architecture and is shown to surpass existing ones; several aspects of the architecture are discussed to demonstrate how connectionist models can supplement, enhance, and integrate symbolic rule-based reasoning; large-scale application-oriented systems are prototyped. This architecture utilizes the synergy resulting from the interaction of the two different types of representation and processing, and is therefore capable of handling a large number of difficult issues in one integrated framework, such as partial and inexact information, cumulative evidential combination, lack of exact match, similarity-based inference, inheritance, and representational interactions, all of which are proven to be crucial elements of commonsense reasoning. The results show that connectionism coupled with symbolic processing capabilities can be effective and efficient models of reasoning for both theoretical and practical purposes. Table of Content 1 Introduction 1.1 Overview 1.2 Commonsense Reasoning 1.3 The Problem of Common Reasoning Patterns 1.4 What is the Point? 
1.5 Some Clarifications 1.6 The Organization of the Book 1.7 Summary 2 Accounting for Commonsense Reasoning: A Framework with Rules and Similarities 2.1 Overview 2.2 Examples of Reasoning 2.3 Patterns of Reasoning 2.4 Brittleness of Rule-Based Reasoning 2.5 Towards a Solution 2.6 Some Reflections on Rules and Connectionism 2.7 Summary 3 A Connectionist Architecture for Commonsense Reasoning 3.1 Overview 3.2 A Generic Architecture 3.3 Fine-Tuning --- from Constraints to Specifications 3.4 Summary 3.5 Appendix 4 Evaluations and Experiments 4.1 Overview 4.2 Accounting for the Reasoning Examples 4.3 Evaluations of the Architecture 4.4 Systematic Experiments 4.5 Choice, Focus and Context 4.6 Reasoning with Geographical Knowledge 4.7 Applications to Other Domains 4.8 Summary 4.9 Appendix: Determining Similarities and CD representations 5 More on the Architecture: Logic and Causality 5.1 Overview 5.2 Causality in General 5.3 Shoham's Causal Theory 5.4 Defining FEL 5.5 Accounting for Commonsense Causal Reasoning 5.6 Determining Weights 5.7 Summary 5.8 Appendix: Proofs For Theorems 6 More on the Architecture: Beyond Logic 6.1 Overview 6.2 Further Analysis of Inheritance 6.3 Analysis of Interaction in Representation 6.4 Knowledge Acquisition, Learning, and Adaptation 6.5 Summary 7 An Extension: Variables and Bindings 7.1 Overview 7.2 The Variable Binding Problem 7.3 First-Order FEL 7.4 Representing Variables 7.5 A Formal Treatment 7.6 Dealing with Difficult Issues 7.7 Compilation 7.8 Correctness 7.9 Summary 7.10 Appendix 8 Reviews and Comparisons 8.1 Overview 8.2 Rule-Based Reasoning 8.3 Case-Based Reasoning 8.4 Connectionism 8.5 Summary 9 Conclusions 9.1 Overview 9.2 Some Accomplishments 9.3 Lessons Learned 9.4 Existing Limitations 9.5 Future Directions 9.6 Summary References From trevor at white.Stanford.EDU Mon Feb 14 17:37:50 1994 From: trevor at white.Stanford.EDU (Trevor Darrell) Date: Mon, 14 Feb 94 14:37:50 PST Subject: outlier, robust statistics In-Reply-To: Terry Sejnowski's message of Thu, 10 Feb 94 09:45:15 PST <9402101745.AA28545@salk.edu> Message-ID: <9402142237.AA24561@white.Stanford.EDU> [terry at salk.edu] One man's outlier is another man's data point. Another way to handle outliers is not to remove them but to model them explicitly. Geoff Hinton has pointed out that character recognition can be made more robust by including models for background noise such as postmarks. Explicitly modeling an occluding or transparently combined "outlier" process is a powerful way to build a robust estimator. As mentioned in other replies to this post, estimators which use a mixture model (either implicitly or explicitly), such as the EM algorithm, are promising methods to implement this type of strategy. One issue which often complicates matters is how to decide how many objects or processes there are in the signal, e.g. determine K in the EM estimator. I would like to ask if anyone has a pointer to work on estimating K in the context of an EM estimator or similar methods? Often the appropriate cardinality of the model is not easily known a priori. Steve Nowlan and I recently used mixtures of expert networks to separate multiple interpenetrating flow fields -- the transparency problem for visual motion. The gating network was used to select regions of the visual field that contained reliable estimates of local velocity for which there was coherent global support. 
There is evidence for such selection neurons in area MT of primate visual cortex, a region of cortex that specializes in the detection of coherent motion. I'd also like to add a pointer to some related work Sandy Pentland, Eero Simoncelli and I have done in this domain developing a strategy for robust estimation ("outlier exclusion") based on minimum description length theory. Our method effectively implements a clustering method to find how many processes there are (e.g. estimate K), and then iteratively refine estimates of the parameters and "support" (segmentation) of those processes. We have developed versions of this method for range and motion segmentation, both for occluded and transparently combined processes. [pluto at cs.ucsd.edu:] >I look forward to reading (Liu 94). Can you (or anyone else) >point me to other references utilizing a similar definition >of "outlier?" (IMHO) "outlier" is quite a value-laden term >that I tend to avoid since I feel it has multiple and >often ambiguous interpretations/definitions. Here are some references to conference papers on our work. A longer journal paper that combines these is in the works, email me if you would like a preprint when it becomes available. Darrell, Sclaroff and Pentland, "Segmentation by Minimal Description", Proc. 3rd Intl. Conf. Computer Vision, Osaka, Japan, 1990 (also avail. as MIT Media Lab Percom TR-163.) Darrell and Pentland, "Robust Estimation of a Multi-Layer Motion Representation", Proc. IEEE Workshop on Visual Motion, Princeton, October 1991 Darrell and Pentland, "Against Edges: Function Approximation with Multiple Support Maps", NIPS 4, 1992 Darrell and Simoncelli, "Separation of Transparent Motion into Layers using Velocity-tuned Mechanisms", Assn. for Resarch in Vision and Opthm. (ARVO) 1993, also available as MIT Media Lab Percom TR-244. (Percom TR's can be anon. ftp'ed from whitechapel.media.mit.edu) --trevor From jagota at next1.msci.memst.edu Mon Feb 14 20:18:56 1994 From: jagota at next1.msci.memst.edu (Arun Jagota) Date: Mon, 14 Feb 1994 19:18:56 -0600 Subject: DIMACS Challenge neural net papers Message-ID: <199402150118.AA02676@next1> Dear Connectionists: Expanded versions of two neural net papers presented at the DIMACS Challenge on Cliques, Coloring, and Satisfiability are now available via anonymous ftp (see below). First an excerpt from the Challenge announcement back in 1993: ---------------------- The purpose of this Challenge is to encourage high quality empirical research on difficult problems. The problems chosen are known to be difficult to solve in theory. How difficult are they to solve in practice? ---------------------- ftp ftp.cs.buffalo.edu (or 128.205.32.9 subject-to-change) Name : anonymous > cd users/jagota > binary > get DIMACS_Grossman.ps.Z > get DIMACS_Jagota.ps.Z > quit > uncompress *.Z Sorry, no hard copies. Copies may be requested by electronic mail to me (jagota at next1.msci.memst.edu) for those without access to ftp or for whom ftp fails. Please use as last resort. Applying The INN Model to the MaxClique Problem Tal Grossman, email: tal at goshawk.lanl.gov Complex Systems Group, T-13, and Center for Non Linear Studies MS B213, Los Alamos National Laboratory Los Alamos, NM 87545 Los Alamos Tech Report: LA-UR-93-3082 A neural network model, the INN (Inverted Neurons Network), is applied to the Maximum Clique problem. First, I describe the INN model and how it implements a given graph instance. 
The model has a threshold parameter $t$, which determines the character of the network stable states. As shown in an earlier work (Grossman-Jagota), the stable states of the network correspond to the $t$-codegree sets of its underlying graph, and, in the case of $t<1$, to its maximal cliques. These results are briefly reviewed. In this work I concentrate on improving the deterministic dynamics called $t$-annealing. The main issue is the initialization procedure and the choice of parameters. Adaptive procedures for choosing the initial state of the network and setting the threshold are presented. The result is the ``Adaptive t-Annealing" algorithm (AtA). This algorithm is tested on many benchmark problems and found to be more efficient than steepest descent or the simple t-annealing procedure. Approximately Solving Maximum Clique using Neural Network and Related Heuristics * Arun Jagota Laura Sanchis Memphis State University Colgate University Ravikanth Ganesan State University of New York at Buffalo We explore neural network and related heuristic methods for the fast approximate solution of the Maximum Clique problem. One of these algorithms, {\em Mean Field Annealing}, is implemented on the Connection Machine CM-5 and a fast annealing schedule is experimentally evaluated on random graphs, as well as on several benchmark graphs. The other algorithms, which perform certain randomized local search operations, are evaluated on the same benchmark graphs, and on {\bf Sanchis} graphs. One of our algorithms adjusts its internal parameters as its computation evolves. On {\bf Sanchis} graphs, it finds significantly larger cliques than the other algorithms do. Another algorithm, GSD$(\emptyset)$, works best overall, but is slower than the others. All our algorithms obtain significantly larger cliques than other simpler heuristics but run slightly slower; they obtain significantly smaller cliques on average than exact algorithms or more sophisticated heuristics but run considerably faster. All our algorithms are simple and inherently parallel. * - 24 pages in length (twice as long as its previous version). Arun Jagota From terry at salk.edu Tue Feb 15 02:56:04 1994 From: terry at salk.edu (Terry Sejnowski) Date: Mon, 14 Feb 94 23:56:04 PST Subject: outlier, robust statistics Message-ID: <9402150756.AA17907@salk.edu> I have received many requests for a reference to the motion model I mentioned recently in the context of robust statistics. An early version can be found in: Nowlan, S. J. and Sejnowski, T. J., Filter selection model for generating visual motion signals, In: C. L. Giles, S. J. Hanson and J. D. Cowan (Eds.) Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufman Publishers, 369-376 (1993). Two longer papers on the computational theory and the biological consequences are in review. Darrell and Pentland have an interesting iterative approach in which multiple hypotheses compete to include motion samples within their regions of support. A relaxation scheme must decide on the number of objects and the correct velocity assignments. Our approach to motion estimation is simpler in that hypotheses do not correspond to objects, but to distinct velocities, and the number of hypotheses is always fixed. This allows the selection of regions of support to be performed non-iteratively. The architecture of the model is feedforward with soft-max within layers, so it is quite fast. Mixtures of experts was used to optimize the weights in the network. 
Terry

-----

From schmidhu at informatik.tu-muenchen.de Tue Feb 15 04:06:19 1994 From: schmidhu at informatik.tu-muenchen.de (Juergen Schmidhuber) Date: Tue, 15 Feb 1994 10:06:19 +0100 Subject: postdoctoral thesis Message-ID: <94Feb15.100623met.42337@papa.informatik.tu-muenchen.de>

---------------- postdoctoral thesis ----------------
Juergen Schmidhuber
Technische Universitaet Muenchen
(submitted April 1993, accepted October 1993)
-----------------------------------------------------
NETZWERKARCHITEKTUREN, ZIELFUNKTIONEN UND KETTENREGEL
(Network Architectures, Objective Functions, and the Chain Rule)

There is a relatively new class of artificial neural networks (ANNs), based on feedback, whose capabilities go considerably beyond simple pattern association. In principle, these ANNs allow the implementation of arbitrary functions computable on a conventional, sequentially operating digital computer. In contrast to conventional computers, however, the quality of the outputs (formally specified by a meaningful objective function) can be differentiated mathematically with respect to the ``software'' (in the case of ANNs, the weight matrix), which makes it possible to apply the chain rule to derive gradient-based algorithms for modifying that software. The thesis illustrates this through the formal derivation of a number of novel learning algorithms in the following areas: (1) supervised learning of sequential input/output behaviour with cyclic and acyclic architectures, (2) ``reinforcement learning'' and subgoal generation without an informed teacher, and (3) unsupervised learning for the extraction of redundancy from inputs and input streams. Numerous experiments demonstrate the possibilities and the limits of these learning algorithms. Finally, a ``self-referential'' neural network is presented which can, in theory, learn to modify its own software-modification algorithm.
-----------------------------------------------------

The postdoctoral thesis above is now available (in unrevised form) via ftp. To obtain a copy, follow the instructions at the end of this message. Here is additional information for those who are interested but don't understand German (or are unfamiliar with Germany's academic system): The postdoctoral thesis is part of a process called ``Habilitation'' which is seen as a qualification for tenure. The thesis is about learning algorithms derived by the chain rule. It addresses supervised sequence learning, variants of reinforcement learning, and unsupervised learning (for redundancy reduction). Unlike some previous papers of mine, it contains lots of experiments and lots of figures.

Here is a very brief summary based on pointers to recent English publications upon which the thesis elaborates: Chapters 2 and 3 are on supervised sequence learning and extend publications [1] and [4]. Chapter 4 is on variants of learning with a ``distal teacher'' and extends publication [7] (robot experiments in chapter 4 were conducted by Eldracher and Baginski, see e.g. [9]). Chapters 5, 6 and 7 describe unsupervised learning algorithms based on detection of redundant information in input patterns and pattern sequences: Chapter 5 elaborates on publication [5], and chapter 6 extends publication [3]. Chapter 6 includes a result by Peter Dayan, Richard Zemel and A. Pouget (Salk Institute) who demonstrated that equation (4.3) in [3] with $\beta = 0, \alpha = \gamma = 1$ is essentially equivalent to equation (5.1).
Chapter 6 also includes experiments conducted by Stefanie Lindstaedt who successfully applied the method in [3] to redundant images of letters presented according to the probabilities of English language, see [10]. Chapter 7 extends publications [2] and [8]. Experiments show how sequence processing neural nets using algorithms for redundancy reduction can learn to bridge time lags (between correlated events) of more than 1000 discrete time steps. Other experiments use neural nets for text compression and compare them to standard data compression algorithms. Finally, chapter 8 elaborates on publication [6]. -------------------------- References ------------------------------- [1] J. H. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243--248, 1992. [2] J. H. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234--242, 1992. [3] J. H. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863--879, 1992. [4] J. H. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131--139, 1992. [5] J. H. Schmidhuber and D. Prelinger. Discovering predictable classifications. Neural Computation, 5(4):625--635, 1993. [6] J. H. Schmidhuber. A self-referential weight matrix. In Proc. of the Int. Conf. on Artificial Neural Networks, Amsterdam, pages 446--451. Springer, 1993. [7] J. H. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd Int. Conf. on Simulation of Adaptive Behavior, pages 196--202. MIT Press, 1992. [8] J. H. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Huening, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87--95. Augustinus, 1993. [9] M. Eldracher and B. Baginski. Neural subgoal generation using backpropagation. In George G. Lendaris, Stephen Grossberg and Bart Kosko, editors, Proc. of WCNN'93, Lawrence Erlbaum Associates, Inc., Hillsdale, pages = III-145--III-148, 1993. [10] S. Lindstaedt. Comparison of unsupervised neural networks for redundancy reduction. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman and A. S. Weigend, editors, Proc. of the 1993 Connectionist Models Summer School, pages 308-315. Hillsdale, NJ: Erlbaum Associates, 1993. ---------------------------------------------------------------------- The thesis comes in three parts. To obtain a copy, do: unix> ftp 131.159.8.35 Name: anonymous Password: (your email address, please) ftp> binary ftp> cd pub/fki ftp> get schmidhuber.habil.1.ps.Z ftp> get schmidhuber.habil.2.ps.Z ftp> get schmidhuber.habil.3.ps.Z ftp> bye unix> uncompress schmidhuber.habil.1.ps.Z unix> lpr schmidhuber.habil.1.ps . . . Note: The layout is designed for conventional European DINA4 format. Expect 145 pages. ---------------------------------------------------------------------- Dr. habil. J. H. 
Schmidhuber, Fakultaet fuer Informatik, Technische Universitaet Muenchen, 80290 Muenchen, Germany schmidhu at informatik.tu-muenchen.de --------- postdoctoral thesis (unrevised) ----------- NETZWERKARCHITEKTUREN, ZIELFUNKTIONEN UND KETTENREGEL Juergen Schmidhuber, TUM From Petri.Myllymaki at cs.Helsinki.FI Tue Feb 15 04:52:42 1994 From: Petri.Myllymaki at cs.Helsinki.FI (Petri Myllymaki) Date: Tue, 15 Feb 1994 11:52:42 +0200 Subject: Thesis in neuroprose Message-ID: <199402150952.LAA01783@keos.Helsinki.FI> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/myllymaki.thesis.ps.Z The following report has been placed in the neuroprose archive. ----------------------------------------------------------------------- Bayesian Reasoning by Stochastic Neural Networks Petri Myllymaki Ph.Lic. Thesis Department of Computer Science, University of Helsinki Report C-1993-67, Helsinki, December 1993 78 pages This work has been motivated by problems in several research areas: expert system design, uncertain reasoning, optimization theory, and neural network research. From the expert system design point of view, our goal was to develop a generic expert system shell capable of handling uncertain data. The theoretical framework used here for handling uncertainty is probabilistic reasoning, in particular the theory of Bayesian belief network representations. The probabilistic reasoning task we are interested in is, given a Bayesian network representation of a probability distribution on a set of discrete random variables, to find a globally maximal probability state consistent with given initial constraints. To solve this NP-hard problem approximatively, we use an iterative stochastic method, Gibbs sampling. As this method can be quite inefficient when implemented on a conventional sequential computer, we show how to construct a Gibbs sampling process for a given Bayesian network on a massively parallel architecture, a harmony neural network, which is a special case of the Boltzmann machine architecture. To empirically test the method developed, we implemented a hybrid neural-symbolic expert system shell, NEULA. The symbolic part of the system consists of a high-level conceptual description language and a compiler, which can be used for constructing Bayesian networks and providing them with the corresponding parameters (conditional probabilities). As the number of parameters needed for a given network may generally be quite large, we restrict ourselves to Bayesian networks having a special hierarchical structure. The neural part of the system consists of a neural network simulator which performs massively parallel Gibbs sampling. The performance of the NEULA system was empirically tested by using a small artificial test example. Computing Reviews (1991) Categories and Subject Descriptors: G.3 [Probability and statistics]: Probabilistic algorithms F.1.1 [Models of computation]: Neural networks G.1.6 [Optimization]: Constrained optimization I.2.5 [Programming languages and software]: Expert system tools and techniques General Terms: Algorithms, Theory. 
Additional Key Words and Phrases: Monte Carlo algorithms, Gibbs sampling, simulated annealing, Bayesian belief networks, connectionism, massive parallelism ----------------------------------------------------------------------- To obtain a copy: ftp archive.cis.ohio-state.edu login: anonymous password: cd pub/neuroprose/Thesis binary get myllymaki.thesis.ps.Z quit Then at your system: uncompress myllymaki.thesis.ps.Z lpr myllymaki.thesis.ps ----------------------------------------------------------------------- Petri Myllymaki Petri.Myllymaki at cs.Helsinki.FI Department of Computer Science Int.+358 0 708 4212 (tel.) P.O.Box 26 (Teollisuuskatu 23) Int.+358 0 708 4441 (fax) FIN-00014 University of Helsinki, Finland ----------------------------------------------------------------------- From thrun at uran.cs.bonn.edu Tue Feb 15 08:25:02 1994 From: thrun at uran.cs.bonn.edu (Sebastian Thrun) Date: Tue, 15 Feb 1994 14:25:02 +0100 Subject: 2 papers on robot learning Message-ID: <199402151325.OAA17317@carbon.informatik.uni-bonn.de> This is to announce two recent papers in the connectionists' archive. Both papers deal with robot learning issues. The first paper describes two learning approaches (EBNN with reinforcement learning, COLUMBUS), and the second paper gives some empirical results for learning robot navigation using reinforcement learning and EBNN. Both approaches have been evaluated using real robot hardware. Enjoy reading! Sebastian ------------------------------------------------------------------------ LIFELONG ROBOT LEARNING Sebastian Thrun Tom Mitchell University of Bonn Carnegie Mellon University Learning provides a useful tool for the automatic design of autonomous robots. Recent research on learning robot control has predominantly focussed on learning single tasks that were studied in isolation. If robots encounter a multitude of control learning tasks over their entire lifetime, however, there is an opportunity to transfer knowledge between them. In order to do so, robots may learn the invariants of the individual tasks and environments. This task-independent knowledge can be employed to bias generalization when learning control, which reduces the need for real-world experimentation. We argue that knowledge transfer is essential if robots are to learn control with moderate learning times in complex scenarios. Two approaches to lifelong robot learning which both capture invariant knowledge about the robot and its environments are reviewed. Both approaches have been evaluated using a HERO-2000 mobile robot. Learning tasks included navigation in unknown indoor environments and a simple find-and-fetch task. (Technical Report IAI-TR-93-7, Univ. of Bonn, CS Dept.) ------------------------------------------------------------------------ AN APPROACH TO LEARNING ROBOT NAVIGATION Sebastian Thrun. Univ. of Bonn Designing robots that can learn by themselves to perform complex real-world tasks is still an open challenge for the fields of Robotics and Artificial Intelligence. In this paper we describe an approach to learning indoor robot navigation through trial-and-error. A mobile robot, equipped with visual, ultrasonic and infrared sensors, learns to navigate to a designated target object. In less than 10 minutes operation time, the robot is able to learn to navigate to a marked target object in an office environment. The underlying learning mechanism is the explanation-based neural network (EBNN) learning algorithm. 
EBNN initially learns functions from scratch using neural network representations. With increasing experience, EBNN employs domain knowledge to explain and to analyze training data in order to generalize in a knowledgeable way. (to appear in: Proceedings of the IEEE Conference on Intelligent Robots and Systems 1994) ------------------------------------------------------------------------ Postscript versions of both papers may be retrieved from Jordan Pollack's neuroprose archive by following the instructions below. unix> ftp archive.cis.ohio-state.edu ftp login name> anonymous ftp password> xxx at yyy.zzz ftp> cd pub/neuroprose ftp> bin ftp> get thrun.lifelong-learning.ps.Z ftp> get thrun.learning-robot-navg.ps.Z ftp> bye unix> uncompress thrun.lifelong-learning.ps.Z unix> uncompress thrun.learning-robot-navg.ps.Z unix> lpr thrun.lifelong-learning.ps unix> lpr thrun.learning-robot-navg.ps From chaps at inf.enst.fr Tue Feb 15 09:22:03 1994 From: chaps at inf.enst.fr (Cedric Chappelier) Date: Tue, 15 Feb 94 15:22:03 +0100 Subject: papers on time and neural networks (Correction) Message-ID: <9402151422.AA03059@ulysse.enst.fr.enst.fr> Yesterday we sent the following announcement. We want to make a little correction: the format of the paper can either be a Word file (as mentioned in the first mail) OR A LATEX FILE. > > As guest editors of a special issue of the Sigart Bulletin about : > > Time and Neural Networks > > we are looking for 4 articles about 10 pages each. > > Sigart is a quarterly publication of the Association for Computing > Machinery (ACM) special interest group on Artificial Intelligence. > > The paper may either deal with approaches to time processing using > traditional connectionist architectures, or with more specific models > integrating time in their basis. > > If you are interested, and if you can submit a paper (not already > published) within a short delay (about 1 month and a half), please send a > draft (if possible a Word file) : ^^^^^^^^^^^^^^^^^^^^^^^ OR A LATEX FILE > - preferably by giving ftp access to it (information via e-mail) > - or sending it as "attached file" on e-mail > - or posting a paper copy of it. > > Drafts should be received before April 1. > Notification of acceptance will be sent before April 20. > > grumbach at enst.fr or chaps at enst.fr > > Alain Grumbach and Cedric Chappelier > ENST dept INF > 46 rue Barrault > 75634 Paris Cedex 13 > France > > Sorry for the negligence. --- E-mail: chaps at inf.enst.fr || Cedric.Chappelier at enst.fr P-mail: Telecom Paris 46, rue Barrault - 75634 Paris cedex 13 From COTTRLL at FRMOP22.CNUSC.FR Tue Feb 15 18:42:00 1994 From: COTTRLL at FRMOP22.CNUSC.FR (COTTRELL) Date: Tue, 15 Feb 94 18:42 Subject: Available paper : Kohonen algorithm Message-ID: <"94-02-15-18:42:21.90*COTTRLL"@FRMOP22.CNUSC.FR> The following paper is available from anonymous ftp on archive.cis.ohio-state.edu (128.146.8.52) in directory pub/neuroprose as file cottrell.things.ps "Two or three things that we know about the Kohonen algorithm" 10 pages by Marie Cottrell, Jean-Claude Fort, Gilles Pages SAMOS, Universite Paris 1 90, rue de Tolbiac 75634 PARIS Cedex 13 FRANCE ABSTRACT Many theoretical papers have been published about the Kohonen algorithm. It is not easy to understand what exactly is proved, because of the great variety of mathematical methods. Despite all these efforts, many problems remain without solution. In this small review paper, we intend to sum up the situation.
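For readers who have not seen the algorithm being analyzed, the sketch below gives a minimal NumPy version of the standard one-dimensional Kohonen self-organizing map update: find the best-matching unit for each input, then pull that unit and its map neighbors towards the input with a shrinking neighborhood and a decaying learning rate. It is purely illustrative and not code from the paper; the map size, learning-rate and neighborhood schedules are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)

n_units, dim, n_steps = 20, 2, 5000
W = rng.random((n_units, dim))            # weight vector of each map unit
grid = np.arange(n_units)                 # positions on a 1-D map

for t in range(n_steps):
    x = rng.random(dim)                                   # training input
    winner = np.argmin(np.sum((W - x) ** 2, axis=1))      # best-matching unit
    eps = 0.5 * (1.0 - t / n_steps)                       # decaying learning rate
    sigma = 1.0 + 5.0 * (1.0 - t / n_steps)               # shrinking neighborhood width
    h = np.exp(-0.5 * ((grid - winner) / sigma) ** 2)     # neighborhood function
    W += eps * h[:, None] * (x - W)                       # move winner and neighbors towards x

print(np.round(W, 2))

After training, units that are neighbors on the map should have nearby weight vectors; proving when and how this self-organization and its convergence actually occur is the kind of question the review surveys.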
To appear in the Proceedings of ESANN 94, Bruxelles To retrieve >ftp archive.cis.ohio-state.edu name : anonymous password: (use your e-mail address) ftp> cd pub/neuroprose ftp> get cottrell.things.ps ftp> quit From platt at synaptics.com Tue Feb 15 20:13:14 1994 From: platt at synaptics.com (John Platt) Date: Tue, 15 Feb 94 17:13:14 PST Subject: Neuroprose paper available Message-ID: <9402160113.AA18442@synaptx.synaptics.com> ****** PAPER AVAILABLE VIA NEUROPROSE *************************************** ****** AVAILABLE VIA FTP ONLY *********************************************** ****** PLEASE DO NOT FORWARD TO OTHER MAILING LISTS OR BOARDS. ************** FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/wolf.address-block.ps.Z The following paper has been placed in the Neuroprose archives at Ohio State. The file is wolf.address-block.ps.Z . Only the electronic version of this paper is available. This paper is 8 pages in length. NOTE: The uncompressed postscript file is approximately 2.7 megabytes in length, so it may take a while to print out. Also, you may have to tell the lpr program to use a symbolic link to copy into the spool directory (lpr -s under SunOS). ----------------------------------------------------------------------------- Postal Address Block Location Using A Convolutional Locator Network Ralph Wolf and John C. Platt Synaptics, Inc. 2698 Orchard Parkway San Jose, CA 95134 ABSTRACT: This paper describes the use of a convolutional neural network to perform address block location on machine-printed mail pieces. Locating the address block is a difficult object recognition problem because there is often a large amount of extraneous printing on a mail piece and because address blocks vary dramatically in size and shape. We used a convolutional locator network with four outputs, each trained to find a different corner of the address block. A simple set of rules was used to generate ABL candidates from the network output. The system performs very well: when allowed five guesses, the network will tightly bound the address delivery information in 98.2% of the cases. ----------------------------------------------------------------------------- John Platt platt at synaptics.com From terry at salk.edu Tue Feb 15 22:44:00 1994 From: terry at salk.edu (Terry Sejnowski) Date: Tue, 15 Feb 94 19:44:00 PST Subject: Telluride Workshops Message-ID: <9402160344.AA25170@salk.edu> CALL FOR PARTICIPATION IN TWO WORKSHOPS ON "NEUROMORPHIC ENGINEERING" JULY 3 - 9, 1994 AND JULY 10 - 16, 1994 TELLURIDE, COLORADO Christof Koch (Caltech) and Terry Sejnowski (Salk Institute/UCSD) invite applications for two different workshops that will be held in Telluride, Colorado in July 1994. Travel and housing expenses will be provided for ten to twenty active researchers for each workshop. Deadline for application is March 10, 1994. GOALS: Carver Mead has introduced the term "Neuromorphic Engineering" for a new field based on the design and fabrication of artificial neural systems, such as vision systems, head-eye systems, and roving robots, whose architecture and design principles are based on those of biological nervous systems. The goal of these workshops is to bring together young investigators and more established researchers from academia with their counterparts in industry and national laboratories, working on both neurobiological as well as engineering aspects of sensory systems and sensory-motor integration. 
The focus of the workshop will be on ``active" participation, with demonstration systems and hands-on-experience for all participants. Neuromorphic engineering has a wide range of applications from nonlinear adaptive control of complex systems to the design of smart sensors. Many of the fundamental principles in this field, such as the use of learning methods and the design of parallel hardware, are inspired by biological systems. However, existing applications are modest and the challenge of scaling up from small artificial neural networks and designing completely autonomous systems at the levels achieved by biological systems lies ahead. The assumption underlying these workshops is that the next generation of neuromorphic systems would benefit from closer attention to the principles found through experimental and theoretical studies of brain systems. WORKSHOPS: NEUROMORPHIC ANALOG VLSI SYSTEMS Sunday, July 3 to Saturday, July 9, 1994 Organized by Rodney Douglas (Oxford), Misha Mahowald (Oxford) and Stephen Lisberger (UCSF). The goal of this week is to bring together biologists and engineers who are interested in exploring neuromorphic systems through the medium of analog VLSI. The workshop will cover methods for the design and fabrication of multi-chip neuromorphic systems. This framework is suitable both for creating analogs of specific biological systems, which can serve as a modeling environment for biologists, and as a tool for engineers to create cooperative circuits based on biological principles. The workshop will provide the community with a common formal language for describing neuromorphic systems. Equipment will be present for participants to evaluate existing neuromorphic chips (including silicon retina, silicon neurons, oculomotor system). SYSTEMS LEVEL MODELS OF VISUAL BEHAVIOR Sunday, July 10 to Saturday, July 16, 1994 Organized by Dana Ballard (Rochester) and Richard Andersen (Caltech). The goal of this week is to bring together biologists and engineers who are interested in systems level modeling of visual behaviors and their interactions with the motor systems. Sessions will cover issues of sensory-motor integration in the mammalian brain. Special emphasis will be placed on understanding neural algorithms used by the brain which can provide insights into constructing electrical circuits which can accomplish similar tasks. Issues to be covered will include spatial localization and constancy, attention, motor planning, eye movements, and the use of visual motion information for motor control. Two or three prominent neuroscientists will be invited to give lectures on the above subjects. These researchers will also be asked to bring their own demonstrations, classroom experiments, and software for computer models. Demonstrations include recording eye movements and simple eye movement psychophysical experiments, neural network models for coordinate transformations and the representation of space, visual attention psychophysical experiments. Participants can conduct their own experiments using the Virtual Reality equipment. FORMAT: Time in both workshops will be divided between planned presentation, free interaction, and contributed material. Each day will consist of a lecture in the morning that covers the theory behind the hands-on investigation in the afternoon. Following each lecture, there will be a demonstration that introduces participants to the equipment that will be available in the afternoon session. 
Participants will be free to explore and play with whatever they choose in the afternoon. Participants are encouraged to bring their own material to share with others. After dinner, time for participants to provide an informal lecture/demonstration is reserved. LOCATION AND ARRANGEMENTS: The two workshops will take place at the "Telluride Summer Research Center," located in the small town of Telluride, 9000 feet high in Southwest Colorado, about 6 hours away from Denver (350 miles) and 4 hours from Aspen. Continental and United Airlines provide many daily flights directly into Telluride. Participants will be housed in shared condominiums, within walking distance of the Center. The workshop is intended to be very informal and hands-on. Participants are not required to have had previous experience in analog VLSI circuit design, computational or machine vision, systems level neurophysiology or modeling the brain at the systems level. However, we strongly encourage active researchers with relevant backgrounds from academia, industry and national laboratories to apply, in particular if they are prepared to talk about their work or to bring demonstrators to Telluride (e.g. robots, chips, software). We expect to be able to pay for shipping necessary equipment to Telluride and will have at least three technical staff present throughout both workshops to assist us with software and hardware problems. We will have a network of SUN workstations running UNIX and connected to the Internet at the Center available to us. All domestic travel and housing expenses will be provided. Participants are expected to pay for food and incidental expenses. HOW TO APPLY: The deadline for receipt of applications is March 10, 1994. Applicants should be at the level of graduate students or above (i.e. postdoctoral fellows, faculty, research and engineering staff and the equivalent positions in industry and national laboratories). We actively encourage qualified women and minority candidates to apply. Each participant can apply for only one workshop and the application should include: 1. Name, address, telephone, e-mail, FAX, and minority status (optional). 2. Resume. 3. One page summary of background and interests relevant to the workshop. 4. Description of special equipment needed for demonstrations. 5. Two letters of recommendation. Complete applications should be sent to: Prof. Terrence Sejnowski The Salk Institute Post Office Box 85800 San Diego, CA 92186-5800 Applicants will be notified by April 15, 1994. From venu at pixel.mipg.upenn.edu Wed Feb 16 17:28:00 1994 From: venu at pixel.mipg.upenn.edu (Venugopal) Date: Wed, 16 Feb 94 17:28:00 EST Subject: Paper available on ftp Message-ID: <9402162228.AA00373@pixel.mipg.upenn.edu> *** PLEASE DO NOT FORWARD TO OTHER GROUPS *** Preprint of the following paper (to appear in Circuits, Systems and Signal Processing) is available via ftp from the neuroprose archive: AN IMPROVED SCHEME FOR THE DIRECT ADAPTIVE CONTROL OF DYNAMICAL SYSTEMS USING BACKPROPAGATION NEURAL NETWORKS K. P. Venugopal, R. Sudhakar and A. S. Pandya Department of Electrical Eng. Department of Computer Science and Eng. Florida Atlantic University Abstract: This paper presents an improved direct control architecture for the on-line learning control of dynamical systems using backpropagation neural networks. The proposed architecture is compared with other direct control schemes.
In the present scheme, the neural network interconnection strengths are updated based on the output error of the dynamical system directly, rather than using a transformed version of the error employed in other schemes. The ill effects of the controlled dynamics on the on-line updating of the network weights are moderated by including a compensating gain layer. An error feedback is introduced to improve the dynamic response of the control system. Simulation studies are performed using the nonlinear dynamics of an underwater vehicle and the promising results support the effectiveness of the proposed scheme. ----------------------------------------- The file at archive.cis.ohio-state.edu is venugopal.css.ps.Z (34 pages) To ftp the file: unix> ftp archive.cis.ohio-state.edu Name (archive.cis.ohio-state.edu:xxxxx): anonymous Password: your address ftp> cd pub/neuroprose ftp> binary ftp> get venugopal.css.ps.Z Uncompress the file after transferring to your machine. unix> uncompress venugopal.css.ps.Z ________________________________________________________________ K. P. Venugopal Medical Image Processing Group University of Pennsylvania 423 Blockley Hall Philadelphia, PA 19104 (venu at pixel.mipg.upenn.edu) From anandan at sarnoff.com Wed Feb 16 09:22:51 1994 From: anandan at sarnoff.com (P. Anandan x3249) Date: Wed, 16 Feb 94 09:22:51 EST Subject: outlier, robust statistics In-Reply-To: <9402150756.AA17907@salk.edu> (message from Terry Sejnowski on Mon, 14 Feb 94 23:56:04 PST) Message-ID: <9402161422.AA13890@peanut.sarnoff.com> Hi Terry, It may be worth mentioning that a simple extension of your "fixed velocity" formulation leads to something quite powerful and is a decent approximation for many real situations. This is to formulate the hypothesis space as 2-D affine transforms of the image plane. Most of the references below have not used robust estimators but have focussed on the layered representation problem. However, recent extensions of all these algorithms at Sarnoff have included several different types of robust estimators as options. One noteworthy omission (simply because I have not yet updated my bib file) is the paper by Black and Jepson, CVPR93. I also did not include the paper by Wang and Adelson at CVPR93, because that can be viewed as falling into either category (affine hypotheses or object hypotheses). In general, when you use a parametric motion model (translation, affine, 8-parameter quadratic for planar surface motion), you have the choice of working with motion parameters as hypotheses or the objects as hypotheses. But if you are working with non-parametric motion fields (e.g., smooth flow), it is not obvious how to work with motion parameters as hypotheses. Last but not least, I should mention a recent paper that we have written, currently under review, that goes beyond parametric layers to include residual flow to fully account for the scene motion. This is an alternative approach to the standard formulation of the spatial-coherence assumption as a "smoothness" constraint (e.g., minimum quadratic variation, etc.). This paper also describes a computational framework that identifies the critical choice points for layered motion estimation and shows how different algorithms fit into that framework. I should be in a position to send you a copy of the paper in a couple of weeks or so. -- anandan @article{Irani-Peleg:IJCV, author = {M. Irani and S.
Peleg}, title = {Computing Occluding and Transparent Motions}, journal = IJCV, year = {accepted for publication, 1993}, } @inproceedings{Bergen-etal:AICV91, author = {J.R. Bergen and P.J. Burt and K. Hanna and R. Hingorani and P. Jeanne and S. Peleg}, title = {Dynamic Multiple-Motion Computation}, booktitle = {Artificial Intelligence and Computer Vision: Proceedings of the Israeli Conference}, publisher = {Elsevier}, editor = {Y.A. Feldman and A. Bruckstein}, year = {1991}, pages = {147--156} } @inproceedings{Burt-etal:WVM89, title = {Object tracking with a moving camera, an application of dynamic motion analysis}, author ={P.J. Burt and J.R. Bergen and R. Hingorani and R. Kolczynski and W.A. Lee and A. Leung and J. Lubin and H. Shvaytser}, booktitle = WVM, address = {Irvine, CA}, month = {March}, year = {1989}, pages = {2--12} } @article{Bergen-etal:PAMI92, author = {J.R. Bergen and P.J. Burt and R. Hingorani and S. Peleg}, title = {A Three Frame Algorithm for Estimating Two-Component Image Motion}, journal = PAMI, month = {September}, year = {1992}, volume = {14}, pages = {886--896} } From M.Cooke at dcs.shef.ac.uk Wed Feb 16 09:22:17 1994 From: M.Cooke at dcs.shef.ac.uk (Martin Cooke) Date: Wed, 16 Feb 94 14:22:17 GMT Subject: missing values Message-ID: <9402161427.AA10510@dcs.shef.ac.uk> I've only just seen the discussion on missing values, so forgive this late response. The issue of training the Kohonen self-organising feature map with partial data is covered in Samad & Harp (1992) Self-organisation with partial data Network, 3, 205-212. Essentially, weight changes are restricted to the subspace of available data. Samad & Harp report three experiments using partial training data, and demonstrate that performance is essentially unchanged up to about 60% missing data. This is presumably due to the n -> 2 dimensionality reduction. We recently applied this result to training a speech recogniser on partial data, and got similar results [tech. rep. in preparation]. We're coming at this from the field of auditory scene analysis, where the result of source segregation is an inherently partial description of one or other source. I'd be happy to supply further details on request. Martin Cooke Computer Science Sheffield University UK From mmoller at daimi.aau.dk Wed Feb 16 11:10:00 1994 From: mmoller at daimi.aau.dk (Martin Fodslette M|ller) Date: Wed, 16 Feb 1994 17:10:00 +0100 Subject: copy of thesis. Message-ID: <199402161610.AA28147@titan.daimi.aau.dk> To all that have requested a copy of my thesis (and apologies to those that did not for sending this message). Thank you all for your interest in my thesis. Since so many have requested a copy (about 200), I will not be able to answer you all separately right now. Please accept my apologies. You will all receive a copy of the thesis in a few weeks. Best Regards -martin ---------------------------------------------------------------- Martin Moller email: mmoller at daimi.aau.dk Computer Science Dept. Fax: +45 8942 3255 Aarhus University Phone: +45 8942 3371 Ny Munkegade, Build. 540, DK-8000 Aarhus C, Denmark ---------------------------------------------------------------- From venu at pixel.mipg.upenn.edu Wed Feb 16 17:15:31 1994 From: venu at pixel.mipg.upenn.edu (Venugopal) Date: Wed, 16 Feb 94 17:15:31 EST Subject: Thesis available on ftp Message-ID: <9402162215.AA00370@pixel.mipg.upenn.edu> The following thesis is available on ftp from neuroprose archive: LEARNING IN CONNECTIONIST NETWORKS USING THE ALOPEX ALGORITHM K. P. 
Venugopal Florida Atlantic University Abstract: The ALOPEX algorithm is presented as a `universal' learning algorithm for connectionist models. It is shown that the ALOPEX procedure can be used efficiently as a supervised learning algorithm for such models. The algorithm is demonstrated successfully on a variety of network architectures. Such architectures include multi-layered perceptrons, time-delay models, asymmetric fully recurrent networks and memory neurons. The learning performance and generalization capability of the ALOPEX algorithm are compared with those of the backpropagation procedure on a number of benchmark problems, and it is shown that ALOPEX has specific advantages. Results on the MONKS problems are the best reported ones so far. Two new architectures are proposed for the on-line, direct adaptive control of dynamical systems using neural networks. The proposed schemes are shown to provide better response and tracking characteristics than other existing direct control schemes. A velocity reference scheme is introduced to improve the dynamic response of on-line learning controllers. The proposed learning algorithm and architectures are also studied on three practical problems: (i) classification of handwritten digits using Fourier descriptors, (ii) recognition of underwater targets from sonar returns, considering temporal dependencies of consecutive returns, and (iii) on-line learning control of autonomous underwater vehicles, starting from random initial conditions. Detailed studies are conducted on the learning control applications. Also, the ability of the neural network controllers to adapt to slowly and suddenly varying parameter disturbances and measurement noise is studied in detail. --------------------- Some of the related papers: K. P. Venugopal, A. S. Pandya and R. Sudhakar, 'A recurrent neural network controller and learning algorithm for the on-line learning control of autonomous underwater vehicles', to appear in Neural Networks (1994) K. P. Venugopal, R. Sudhakar and A. S. Pandya, 'On-line learning control of autonomous underwater vehicles using feedforward neural networks', IEEE Journal of Oceanic Engineering, vol. 17 (1992) K. P. Venugopal, R. Sudhakar and A. S. Pandya, 'An improved scheme for the direct adaptive control of dynamical systems using backpropagation neural networks', to appear in Circuits, Systems and Signal Processing (1994) K. P. Venugopal and S. M. Smith, 'Improving the dynamic response of neural network controllers using velocity reference feedback', IEEE Trans. on Neural Networks, vol. 4 (1993) K. P. Unnikrishnan and K. P. Venugopal, 'Alopex: a correlation based learning algorithm for feedforward and feedback neural networks', to appear in Neural Computation, vol. 6 (1994) A. S. Pandya and K. P. Venugopal, 'A stochastic parallel algorithm for learning in neural networks', to appear in IEICE Transactions on Information Processing (1994) ----------------------------------------- The files at archive.cis.ohio-state.edu are venugopal.thesis1.ps.Z venugopal.thesis2.ps.Z venugopal.thesis3.ps.Z venugopal.thesis4.ps.Z venugopal.thesis5.ps.Z venugopal.thesis6.ps.Z venugopal.thesis7.ps.Z (total 200 pages) To ftp the files: unix> ftp archive.cis.ohio-state.edu Name (archive.cis.ohio-state.edu:xxxxx): anonymous Password: your address ftp> cd pub/neuroprose/Thesis ftp> binary ftp> mget venugopal.thesis* Uncompress the files after transferring to your machine.
unix> uncompress venugopal* ------------------------------------------------- K. P. Venugopal Medical Image Processing Group University of Pennsylvania 423 Blockley Hall Philadelphia, PA 19104 (venu at pixel.mipg.upenn.edu) From minton at ptolemy.arc.nasa.gov Wed Feb 16 21:03:21 1994 From: minton at ptolemy.arc.nasa.gov (Steve Minton) Date: Wed, 16 Feb 94 18:03:21 PST Subject: JAIR article Message-ID: <9402170203.AA27856@ptolemy.arc.nasa.gov> Readers of this newsgroup may be interested in the following article, which was recently published in the Journal of Artificial Intelligence Research: Ling, C.X. (1994) "Learning the Past Tense of English Verbs: The Symbolic Pattern Associator vs. Connectionist Models", Volume 1, pages 209-229 Postscript: volume1/ling94a.ps (247K) Online Appendix: volume1/ling-appendix.Z (109K) data file, compressed Abstract: Learning the past tense of English verbs - a seemingly minor aspect of language acquisition - has generated heated debates since 1986, and has become a landmark task for testing the adequacy of cognitive modeling. Several artificial neural networks (ANNs) have been implemented, and a challenge for better symbolic models has been posed. In this paper, we present a general-purpose Symbolic Pattern Associator (SPA) based upon the decision-tree learning algorithm ID3. We conduct extensive head-to-head comparisons on the generalization ability between ANN models and the SPA under different representations. We conclude that the SPA generalizes the past tense of unseen verbs better than ANN models by a wide margin, and we offer insights as to why this should be the case. We also discuss a new default strategy for decision-tree learning algorithms. JAIR's server can be accessed by WWW, FTP, gopher, or automated email. For further information, check out our WWW server (URL is gopher://p.gp.cs.cmu.edu/) or one of our FTP sites (/usr/jair/pub at p.gp.cs.cmu.edu), or send email to jair at cs.cmu.edu with the subject AUTORESPOND and the message body HELP. From COTTRLL at FRMOP22.CNUSC.FR Thu Feb 17 10:04:00 1994 From: COTTRLL at FRMOP22.CNUSC.FR (COTTRELL) Date: Thu, 17 Feb 94 10:04 Subject: Paper available Message-ID: <"94-02-17-10:04:06.72*COTTRLL"@FRMOP22.CNUSC.FR> Dear connectionists, Some people report that they cannot retrieve the paper cottrell.things.ps that I put in the neuroprose archive some days ago. I will try to solve the problem as soon as possible. Please wait a little before trying again. Yours sincerely Marie Cottrell SAMOS Universite Paris 1 90, rue de Tolbiac F-75634 PARIS 13 FRANCE E-mail : cottrll at frmop22.cnusc.fr From COTTRLL at FRMOP22.CNUSC.FR Thu Feb 17 19:54:00 1994 From: COTTRLL at FRMOP22.CNUSC.FR (COTTRELL) Date: Thu, 17 Feb 94 19:54 Subject: Paper available : Kohonen algorithm Message-ID: <"94-02-17-19:54:08.03*COTTRLL"@FRMOP22.CNUSC.FR> Dear connectionists, The problem that some of you encounter in retrieving the paper "Two or three..." (file cottrell.things.ps in the neuroprose repository) comes from a change in its name: it is now cottrell.things.ps.Z in pub/neuroprose on archive.cis.ohio-state.edu. It has been compressed. Sorry for the delay. Yours sincerely Marie Cottrell From reza at ai.mit.edu Thu Feb 17 09:03:53 1994 From: reza at ai.mit.edu (Reza Shadmehr) Date: Thu, 17 Feb 94 09:03:53 EST Subject: Tech reports from CBCL at MIT Message-ID: <9402171403.AA02835@corpus-callosum> Hello, Following is a list of recent technical reports from the Center for Biological and Computational Learning at M.I.T.
These reports are available via anonymous ftp. (see end of this message for details) -------------------------------- :CBCL Paper #78/AI Memo #1405 :author Amnon Shashua :title On Geometric and Algebraic Aspects of 3D Affine and Projective Structures from Perspective 2D Views :date July 1993 :pages 14 :keywords visual recognition, structure from motion, projective geometry, 3D reconstruction We investigate the differences --- conceptually and algorithmically --- between affine and projective frameworks for the tasks of visual recognition and reconstruction from perspective views. It is shown that an affine invariant exists between any view and a fixed view chosen as a reference view. This implies that for tasks for which a reference view can be chosen, such as in alignment schemes for visual recognition, projective invariants are not really necessary. We then use the affine invariant to derive new algebraic connections between perspective views. It is shown that three perspective views of an object are connected by certain algebraic functions of image coordinates alone (no structure or camera geometry needs to be involved). -------------- :CBCL Paper #79/AI Memo #1390 :author Jose L. Marroquin and Federico Girosi :title Some Extensions of the K-Means Algorithm for Image Segmentation and Pattern Classification :date January 1993 :pages 21 :keywords K-means, clustering, vector quantization, segmentation, classification We present some extensions to the k-means algorithm for vector quantization that permit its efficient use in image segmentation and pattern classification tasks. We show that by introducing a certain set of state variables it is possible to find the representative centers of the lower dimensional manifolds that define the boundaries between classes; this permits one, for example, to find class boundaries directly from sparse data or to efficiently place centers for pattern classification. The same state variables can be used to determine adaptively the optimal number of centers for clouds of data with space-varying density. Some examples of the application of these extensions are also given. -------------- :CBCL Paper #80/AI Memo #1431 :title Example-Based Image Analysis and Synthesis :author David Beymer, Amnon Shashua and Tomaso Poggio :date November, 1993 :pages 21 :keywords computer graphics, networks, computer vision, teleconferencing, image compression, computer interfaces Image analysis and graphics synthesis can be achieved with learning techniques using directly image examples without physically-based, 3D models. In our technique: 1) the mapping from novel images to a vector of ``pose'' and ``expression'' parameters can be learned from a small set of example images using a function approximation technique that we call an analysis network; 2) the inverse mapping from input ``pose'' and ``expression'' parameters to output images can be synthesized from a small set of example images and used to produce new images using a similar synthesis network. The techniques described here have several applications in computer graphics, special effects, interactive multimedia and very low bandwidth teleconferencing. -------------- :CBCL Paper #81/AI Memo #1432 :title Conditions for Viewpoint Dependent Face Recognition :author Philippe G. Schyns and Heinrich H. 
B\"ulthoff :date August 1993 :pages 6 :keywords face recognition, RBF networks, symmetry Face recognition stands out as a singular case of object recognition: although most faces are very much alike, people discriminate between many different faces with outstanding efficiency. Even though little is known about the mechanisms of face recognition, viewpoint dependence, a recurrent characteristic of much research on faces, could inform algorithms and representations. Poggio and Vetter's symmetry argument predicts that learning only one view of a face may be sufficient for recognition, if this view allows the computation of a symmetric, "virtual," view. More specifically, as faces are roughly bilaterally symmetric objects, learning a side-view---which always has a symmetric view---should give rise to better generalization performance than learning the frontal view. It is also predicted that among all new views, a virtual view should be best recognized. We ran two psychophysical experiments to test these predictions. Stimuli were views of 3D models of laser-scanned faces. Only shape was available for recognition; all other face cues---texture, color, hair, etc.---were removed from the stimuli. The first experiment tested which single views of a face give rise to the best generalization performance. The results were compatible with the symmetry argument: face recognition from a single view is always better when the learned view allows the computation of a symmetric view. -------------- :CBCL Paper #82/AI Memo #1437 :author Reza Shadmehr and Ferdinando A. Mussa-Ivaldi :title Geometric Structure of the Adaptive Controller of the Human Arm :date July 1993 :pages 34 :keywords Motor learning, reaching movements, internal models, force fields, virtual environments, generalization, motor control The objects with which the hand interacts may significantly change the dynamics of the arm. How does the brain adapt control of arm movements to this new dynamics? We show that adaptation is via composition of a model of the task's dynamics. By exploring the generalization capabilities of this adaptation we infer some of the properties of the computational elements with which the brain formed this model: the elements have broad receptive fields and encode the learned dynamics as a map structured in an intrinsic coordinate system closely related to the geometry of the skeletomusculature. The low-level nature of these elements suggests that they may represent a set of primitives with which movements are represented in the CNS. -------------- :CBCL Paper #83/AI Memo #1440 :author Michael I. Jordan and Robert A. Jacobs :title Hierarchical Mixtures of Experts and the EM Algorithm :date August 1993 :pages 29 :keywords supervised learning, statistics, decision trees, neural networks We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM's). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an on-line learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
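To make the EM procedure concrete, the sketch below implements a single-level (non-hierarchical) mixture of linear-Gaussian experts with a softmax gate in NumPy; it is not the authors' code, the hierarchical case and the full GLIM generality are omitted, the gate is updated with one gradient step per iteration instead of an inner IRLS loop, and all sizes, step sizes and variable names are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

# toy data: two linear regimes, selected by the sign of x
N, K = 400, 2
x = rng.uniform(-1, 1, size=N)
y = np.where(x < 0, 2.0 * x + 1.0, -1.5 * x) + 0.05 * rng.standard_normal(N)
X = np.column_stack([x, np.ones(N)])          # inputs with a bias column

W = rng.standard_normal((K, 2))               # expert regression weights
V = np.zeros((K, 2))                          # gating network weights
s2 = np.ones(K)                               # expert noise variances

for it in range(50):
    # E-step: posterior responsibility of each expert for each data point
    G = np.exp(X @ V.T); G /= G.sum(1, keepdims=True)               # gate outputs
    M = X @ W.T                                                     # expert means
    lik = np.exp(-0.5 * (y[:, None] - M) ** 2 / s2) / np.sqrt(2 * np.pi * s2)
    H = G * lik
    H /= H.sum(1, keepdims=True)

    # M-step: responsibility-weighted least squares and noise update per expert
    for k in range(K):
        D = H[:, k]
        W[k] = np.linalg.solve(X.T @ (D[:, None] * X), X.T @ (D * y))
        s2[k] = np.sum(D * (y - X @ W[k]) ** 2) / D.sum()

    # (generalized) M-step for the gate: one gradient step on the expected log-likelihood
    V += 0.5 * ((H - G).T @ X) / N

print("expert weights (slope, intercept):", np.round(W, 2))

With data like this, the two experts should converge towards the two underlying linear regimes and the gate should learn to switch between them as a function of the input.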
-------------- :CBCL Paper #84/AI Memo #1441 :title On the Convergence of Stochastic Iterative Dynamic Programming Algorithms :author Tommi Jaakkola, Michael I. Jordan and Satinder P. Singh :date August 1993 :pages 15 :keywords reinforcement learning, stochastic approximation, convergence, dynamic programming Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong. -------------- :CBCL Paper #86/AI Memo #1449 :title Formalizing Triggers: A Learning Model for Finite Spaces :author Partha Niyogi and Robert Berwick :pages 14 :keywords language learning, parameter systems, Markov chains, convergence times, computational learning theory :date November 1993 In a recent seminal paper, Gibson and Wexler (1993) take important steps toward formalizing the notion of language learning in a (finite) space whose grammars are characterized by a finite number of {\it parameters\/}. They introduce the Triggering Learning Algorithm (TLA) and show that even in finite space convergence may be a problem due to local maxima. In this paper we explicitly formalize learning in finite parameter space as a Markov structure whose states are parameter settings. We show that this captures the dynamics of TLA completely and allows us to explicitly compute the rates of convergence for TLA and other variants of TLA, e.g. random walk. Also included in the paper are a corrected version of GW's central convergence proof, a list of ``problem states'' in addition to local maxima, and batch and PAC-style learning bounds for the model. -------------- :CBCL Paper #87/AI Memo #1458 :title Convergence Results for the EM Approach to Mixtures of Experts Architectures :author Michael Jordan and Lei Xu :pages 33 :date September 1993 The Expectation-Maximization (EM) algorithm is an iterative approach to maximum likelihood parameter estimation. Jordan and Jacobs (1993) recently proposed an EM algorithm for the mixture of experts architecture of Jacobs, Jordan, Nowlan and Hinton (1991) and the hierarchical mixture of experts architecture of Jordan and Jacobs (1992). They showed empirically that the EM algorithm for these architectures yields significantly faster convergence than gradient ascent. In the current paper we provide a theoretical analysis of this algorithm. We show that the algorithm can be regarded as a variable metric algorithm with its search direction having a positive projection on the gradient of the log likelihood. We also analyze the convergence of the algorithm and provide an explicit expression for the convergence rate. In addition, we describe an acceleration technique that yields a significant speedup in simulation experiments. -------------- :CBCL Paper #89/AI Memo #1461 :title Face Recognition under Varying Pose :author David J.
Beymer :pages 14 :date December 1993 :keywords computer vision, face recognition, facial feature detection, template matching While researchers in computer vision and pattern recognition have worked on automatic techniques for recognizing faces for the last 20 years, most systems specialize in frontal views of the face. We present a face recognizer that works under varying pose, the difficult part of which is to handle face rotations in depth. Building on successful template-based systems, our basic approach is to represent faces with templates from multiple model views that cover different poses from the viewing sphere. Our system has achieved a recognition rate of 98% on a data base of 62 people containing 10 testing and 15 modelling views per person. -------------- :CBCL Paper #90/AI Memo #1452 :title Algebraic Functions for Recognition :author Amnon Shashua :pages 11 :date January 1994 In the general case, a trilinear relationship between three perspective views is shown to exist. The trilinearity result is shown to be of much practical use in visual recognition by alignment --- yielding a direct method that cuts through the computations of camera transformation, scene structure and epipolar geometry. The proof of the central result may be of further interest as it demonstrates certain regularities across homographies of the plane and introduces new view invariants. Experiments on simulated and real image data were conducted, including a comparative analysis with epipolar intersection and the linear combination methods, with results indicating a greater degree of robustness in practice and a higher level of performance in re-projection tasks. ============================ How to get a copy of a report: The files are in compressed postscript format and are named by their AI memo number. They are put in a directory named after the year in which the paper was written. Here is the procedure for ftp-ing: unix> ftp publications.ai.mit.edu (128.52.32.22, log-in as anonymous) ftp> cd ai-publications/1993 ftp> binary ftp> get AIM-number.ps.Z ftp> quit unix> zcat AIM-number.ps.Z | lpr Best wishes, Reza Shadmehr Center for Biological and Computational Learning M. I. T. Cambridge, MA 02139 From mel at klab.caltech.edu Thu Feb 17 21:00:32 1994 From: mel at klab.caltech.edu (Bartlett Mel) Date: Thu, 17 Feb 94 18:00:32 PST Subject: NIPS*94 Call for Workshops Message-ID: <9402180200.AA20549@plato.klab.caltech.edu> CALL FOR PROPOSALS NIPS*94 Post-Conference Workshops December 2 and 3, 1994 Vail, Colorado Following the regular program of the Neural Information Processing Systems 1994 conference, workshops on current topics in neural information processing will be held on December 2 and 3, 1994, in Vail, Colorado. Proposals by qualified individuals interested in chairing one of these workshops are solicited. Past topics have included: active learning and control, architectural issues, attention, Bayesian analysis, benchmarking neural network applications, computational complexity issues, computational neuroscience, fast training techniques, genetic algorithms, music, neural network dynamics, optimization, recurrent nets, rules and connectionist models, self-organization, sensory biophysics, speech, time series prediction, vision and audition, implementations, and grammars. The goal of the workshops is to provide an informal forum for researchers to discuss important issues of current interest.
Sessions will meet in the morning and in the afternoon of both days, with free time in between for ongoing individual exchange or outdoor activities. Concrete open and/or controversial issues are encouraged and preferred as workshop topics. Representation of alternative viewpoints and panel-style discussions are particularly encouraged. Individuals proposing to chair a workshop will have responsibilities including: 1) arranging short informal presentations by experts working on the topic, 2) moderating or leading the discussion and reporting its high points, findings, and conclusions to the group during evening plenary sessions (the ``gong show''), and 3) writing a brief summary. Submission Procedure: Interested parties should submit a short proposal for a workshop of interest postmarked by May 21, 1994. (Express mail is not necessary. Submissions by electronic mail will also be accepted.) Proposals should include a title, a description of what the workshop is to address and accomplish, the proposed length of the workshop (one day or two days), and the planned format. It should motivate why the topic is of interest or controversial, why it should be discussed and what the targeted group of participants is. In addition, please send a brief resume of the prospective workshop chair, a list of publications and evidence of scholarship in the field of interest. Mail submissions to: Todd K. Leen, NIPS*94 Workshops Chair Department of Computer Science and Engineering Oregon Graduate Institute of Science and Technology P.O. Box 91000 Portland Oregon 97291-1000 USA (e-mail: tleen at cse.ogi.edu) Name, mailing address, phone number, fax number, and e-mail net address should be on all submissions. PROPOSALS MUST BE POSTMARKED BY MAY 21, 1994 Please Post From scheler at informatik.tu-muenchen.de Fri Feb 18 11:10:21 1994 From: scheler at informatik.tu-muenchen.de (Gabriele Scheler) Date: Fri, 18 Feb 1994 17:10:21 +0100 Subject: TR announcement: Adaptive Distance Measures Message-ID: <94Feb18.171027met.42273@papa.informatik.tu-muenchen.de> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/scheler.adaptive.ps.Z The file scheler.adaptive.ps.Z is now available for copying from the Neuroprose repository: Pattern Classification with Adaptive Distance Measures Gabriele Scheler Technische Universit"at M"unchen (25 pages) also available as Report FKI-188-94 from Institut f"ur Informatik TU M"unchen D 80290 M"unchen ftp-host: flop.informatik.tu-muenchen.de ftp-file: pub/fki/fki-188-94.ps.gz ABSTRACT: In this paper, we want to explore the notion of learning the classification of patterns from examples by synthesizing distance functions. A working implementation of a distance classifier is presented. Its operation is illustrated with the problem of classification according to parity (highly non-linear) and a classification of feature vectors which involves dimension reduction (a linear problem). A solution to these problems is sought in two steps: (a) a parametrized distance function (called a `distance function scheme') is chosen, (b) setting parameters to values according to the classification of training patterns results in a specific distance function. This induces a classification on all remaining patterns. The general idea of this approach is to find restricted functional shapes in order to model certain cognitive functions of classification exactly, i.e. 
performing classifications that occur as well as excluding classifications that do not naturally occur and may even be experimentally proven to be excluded from learnability by a living organism. There are also certain technical advantages in using restricted function shapes and simple learning rules, such as reducing learning time, generating training sets and individual patterns to set certain parameters, determining the learnability of a specific problem with a given function scheme or providing additions to functions for individual exceptions, while retaining the general shape for generalization. From soller at asylum.cs.utah.edu Fri Feb 18 19:13:34 1994 From: soller at asylum.cs.utah.edu (Jerome Soller) Date: Fri, 18 Feb 94 17:13:34 -0700 Subject: 2nd An. Utah Workshop on the Applicat. of Intelligent and Adap. Systems Message-ID: <9402190013.AA09689@asylum.cs.utah.edu> ------------------------------------------------ 2nd Annual Utah Workshop on: "Applications of Intelligent and Adaptive Systems" Sponsored by: The University of Utah Cognitive Science Industrial Advisory Board and The Joint Services Software Technology Conference '94 -------------------------------------------------- Date: April 15, 1994 Time: 8:00 a.m.-2:30 p.m. Cost: contact Jerome Soller or Dale Sanders for the cost for non-conference attendees, free for conference attendees Location: Salt Lake City Marriott, Salon E, 75 South and West Temple -------------------------------------------------- Talk 1: "The Use of Genetic Algorithms and Neural Networks in the Automatic Interpretation of Medical Images", Dr. Charles Rosenberg Research Investigator, VA Geriatric, Research, Education, and Clinical Center and Adjunct Assistant Professor, Department of Psychology, University of Utah (crr at cogsci.psych.utah.edu) ((801) 582-1565, x-2458) -------------------------------------------------- Talk 2: "A Hybrid On-line Handwriting Recognition System" Dr. Nicholas S. Flann. Assistant Professor, Computer Science Department, Utah State University. (flann at nick.cs.usu.edu) ((801) 750-2451) -------------------------------------------------- Talk 3: "Prototyping Activities in Robotics, Control, and Manufacturing" Dr. Tarek M. Sobh Research Assistant Professor Computer Science Department University of Utah (sobh at wingate.cs.utah.edu) ((801) 585-5047) -------------------------------------------------- Talk 4: "Software Architecture and Unmanned Ground Vehicles" Dr. David Morgenthaler Program Manager Sarcos Research Corporation Salt Lake City, UT (David_Morgenthaler at ced.utah.edu) ((801) 581-0155) -------------------------------------------------- Lunch Break: 11:45 a.m.-12:45 p.m. -------------------------------------------------- Talk 5: "Use of Decision Support in a Hospital Information System" Dr. Allan Pryor Professor of Medical Informatics University of Utah and Assistant Vice President of Informatics Intermountain Health Care Salt Lake City UT (tapryor at cc.utah.edu) ((801) 321-2128) -------------------------------------------------- Talk 6: "Applications of Neural Networks in Critical Care Monitoring" Dr. 
Joe Orr Research Instructor Department of Anesthesiology University of Utah (jorr at soma.med.utah.edu) ((801) 581-6393) -------------------------------------------------- Pre-registration required; For registration, copies of the abstracts, or references for publications relating to these talks, please contact: Jerome Soller, Veterans Affairs Medical Center and University of Utah Computer Science (801) 582-1565, ext 2469; (801) 581-7977 soller at cs.utah.edu or Dale Sanders, TRW Inc., Ogden Engineering Services (801) 625-8343 dale_sanders at oz.bmd.trw.com -------------------------------------------------- We wish to thank the following for their support of this workshop: Applied Information and Management Systems, Inc.; Intermountain Health Care; The Joint Services Software Technology Conference; Salt Lake Veterans Affairs Geriatric Research, Education, and Clinical Center; Sarcos Corporation; 3M Health Information Systems; TRW Systems Integration Group; University of Utah Departments of Computer Science, Medical Informatics, and Physiology; Utah Information Technology Association From judd at scr.siemens.com Fri Feb 18 21:31:24 1994 From: judd at scr.siemens.com (Stephen Judd) Date: Fri, 18 Feb 1994 21:31:24 -0500 Subject: Optimal Stopping Time paper Message-ID: <199402190231.VAA27524@tern.siemens.com> ***Do not forward to other bboards*** FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/wang.optistop.ps.Z The file wang.optistop.ps.Z is now available for copying from the Neuroprose repository: Optimal Stopping and Effective Machine Complexity in Learning Changfeng Wang U.Penn Santosh S. Venkatesh U.Penn J. Stephen Judd Siemens Abstract: We study the problem of when to stop training a class of feedforward networks -- networks with fixed input weights, one hidden layer, and a linear output -- when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown analytically that there are, in general, three distinct phases in the generalization performance in the learning process. In particular, the network has better generalization performance when learning is stopped at a certain time before the global minimum of the empirical error is reached. A notion of "effective size" of a machine is defined and used to explain the trade-off between the complexity of the machine and the training error in the learning process. The study leads naturally to a network size selection criterion, which turns out to be a generalization of Akaike's Information Criterion for the learning process. It is shown that stopping learning before the global minimum of the empirical error has the effect of network size selection. (8 pages) To appear in NIPS-6- (1993) sj Stephen Judd Siemens Corporate Research, (609) 734-6573 755 College Rd. East, fax (609) 734-6565 Princeton, judd at learning.scr.siemens.com NJ usa 08540 From mjolsness-eric at CS.YALE.EDU Mon Feb 21 10:58:26 1994 From: mjolsness-eric at CS.YALE.EDU (Eric Mjolsness) Date: Mon, 21 Feb 94 10:58:26 EST Subject: clustering & matching papers Message-ID: <199402211558.AA05604@NEBULA.SYSTEMSZ.CS.YALE.EDU> ****** PLEASE DO NOT FORWARD TO OTHER MAILING LISTS OR BOARDS. 
************** ****** PAPER AVAILABLE VIA NEUROPROSE *************************************** FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/gold.object-clustering.ps.Z FTP-filename: /pub/neuroprose/lu.object-matching.ps.Z The following two NIPS papers have been placed in the Neuroprose archive at Ohio State. The files are "gold.object-clustering.ps.Z" and "lu.object-matching.ps.Z". Each is 8 pages in length. The uncompressed postscript file for the second paper, "lu.object-matching.ps.Z", contains images and is 4.3 megabytes long. So you may need to use a symbolic link in printing it: "lpr -s" under SunOS. ----------------------------------------------------------------------------- Clustering with a Domain-Specific Distance Measure Stephen Gold, Eric Mjolsness and Anand Rangarajan Yale Computer Science Department With a point matching distance measure which is invariant under translation, rotation and permutation, we learn 2-D point-set objects, by clustering noisy point-set images. Unlike traditional clustering methods which use distance measures that operate on feature vectors - a representation common to most problem domains - this object-based clustering technique employs a distance measure specific to a type of object within a problem domain. Formulating the clustering problem as two nested objective functions, we derive optimization dynamics similar to the Expectation-Maximization algorithm used in mixture models. ----------------------------------------------------------------------------- Two-Dimensional Object Localization by Coarse-to-Fine Correlation Matching Chien-Ping Lu and Eric Mjolsness Yale Computer Science Department We present a Mean Field Theory method for locating two-dimensional objects that have undergone rigid transformations. The resulting algorithm is a coarse-to-fine correlation matching. We first consider problems of matching synthetic point data, and derive a point matching objective function. A tractable line segment matching objective function is derived by considering each line segment as a dense collection of points, and approximating it by a sum of Gaussians. The algorithm is tested on real images from which line segments are extracted and matched. ----------------------------------------------------------------------------- - Eric Mjolsness mjolsness at cs.yale.edu ------- From pkso at castle.ed.ac.uk Tue Feb 22 13:54:42 1994 From: pkso at castle.ed.ac.uk (P Sollich) Date: Tue, 22 Feb 94 18:54:42 GMT Subject: Preprint on query learning in Neuroprose archive Message-ID: <9402221854.aa28409@uk.ac.ed.castle> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/sollich.queries.ps.Z The file sollich.queries.ps.Z (16 pages) is now available via anonymous ftp from the Neuroprose archive. Title and abstract are given below. We regret that hardcopies are not available. --------------------------------------------------------------------------- Query Construction, Entropy and Generalization in Neural Network Models Peter Sollich Department of Physics, University of Edinburgh, Kings Buildings, Mayfield Road, Edinburgh EH9 3JZ, U.K. (To appear in Physical Review E) Abstract We study query construction algorithms, which aim at improving the generalization ability of systems that learn from examples by choosing optimal, non-redundant training sets. 
We set up a general probabilistic framework for deriving such algorithms from the requirement of optimizing a suitable objective function; specifically, we consider the objective functions entropy (or information gain) and generalization error. For two learning scenarios, the high-low game and the linear perceptron, we evaluate the generalization performance obtained by applying the corresponding query construction algorithms and compare it to training on random examples. We find qualitative differences between the two scenarios due to the different structure of the underlying rules (nonlinear and `non-invertible' vs.linear); in particular, for the linear perceptron, random examples lead to the same generalization ability as a sequence of queries in the limit of an infinite number of examples. We also investigate learning algorithms which are ill-matched to the learning environment and find that in this case, minimum entropy queries can in fact yield a lower generalization ability than random examples. Finally, we study the efficiency of single queries and its dependence on the learning history, i.e. on whether the previous training examples were generated randomly or by querying, and the difference between globally and locally optimal query construction. --------------------------------------------------------------------------- Peter Sollich Dept. of Physics University of Edinburgh e-mail: P.Sollich at ed.ac.uk Kings Buildings Tel. +44-31-650 5236 Mayfield Road Edinburgh EH9 3JZ, U.K. --------------------------------------------------------------------------- From B344DSL at UTARLG.UTA.EDU Tue Feb 22 22:18:10 1994 From: B344DSL at UTARLG.UTA.EDU (B344DSL@UTARLG.UTA.EDU) Date: Tue, 22 Feb 1994 21:18:10 -0600 (CST) Subject: Conference announcement Message-ID: <01H9786W7CBM0004O8@UTARLG.UTA.EDU> ANNOUNCEMENT AND CALL FOR ABSTRACTS Conference on Oscillations in Neural Systems, Sponsored by the Metroplex Institute for Neural Dynamics (MIND) and the University of Texas at Arlington. To be held Thursday through Saturday, MAY 5-7, 1994 Location: UNIVERSITY OF TEXAS AT ARLINGTON MAIN LIBRARY, 6TH FLOOR PARLOR Official Conference Motel: Park Inn 703 Benge Drive Arlington, TX 76013 1-800-777-0100 or 817-860-2323 A block of rooms has been reserved at the Park Inn for $35 a night (single or double). Room sharing arrangements are possible. Reservations should be made directly through the motel. Official Conference Travel Agent: Airline reservations to Dallas-Fort Worth airport should be made through Dan Dipert travel in Arlington, 1-800-443-5335. For those who wish to fly on American Airlines, a Star File account has been set up for a 5% discount off lowest available fares (two week advance, staying over Saturday night) or 10% off regular coach fare; arrangements for Star File reservations should be made through Dan Dipert. Please let the conference organizers know (by e-mail or telephone) when you plan to arrive: some people can be met at the airport (about 30 minutes from Arlington), others can call Super Shuttle at 817-329-2000 upon arrival for transportation to the Park Inn (about $14-$16 per person). Registration for the conference is $25 for students, $65 for non- student oral or poster presenters, $85 for others. MIND members will have $20 (or $10 for students) deducted from the registration. A registration form is attached to this announcement. Registrants will receive the MIND monthly newsletter (on e-mail when possible) for the remainder of 1994. 
Invited speakers: Bill Baird (University of California, Berkeley) Adi Bulsara (Naval Research Laboratories, San Diego) Alianna Maren (Accurate Automation Corporation) George Mpitsos (Oregon State University) Martin Stemmler (California Institute of Technology) Roger Traub (IBM, Tarrytown, New York) Robert Wong (Downstate Medical Center, Brooklyn) Geoffrey Yuen (Northwestern University) Those interested in presenting are invited to submit abstracts (1-2 paragraphs) of any work related to the theme of the conference, any time between now and March 15, 1994. The topic of neural oscillation is currently of great interest to psychologists and neuroscientists alike. Recently it has been observed that neurons in separate areas of the brain will oscillate in synchrony in response to certain stimuli. One hypothesized function for such synchronized oscillations is to solve the "binding problem," that is, how is it that disparate features of objects (e.g., a person's face and their voice) are tied together into a single unitary whole. Some bold speculators (such as Francis Crick in his recent book, The Astonishing Hypothesis) even argue that synchronized neural oscillations form the basis for consciousness. Talks will be 1 hour for invited speakers and 45 minutes for contributed speakers, including questions. There will be no parallel sessions. Contributors whose work is considered worthy of presentation but who cannot be fit into the schedule will be invited to present posters. Presenters will not be required to write complete papers. After the conference is over, we will attempt to obtain a contract with a publisher for a book based on the conference. Oral and poster presenters will be invited to submit chapters to this book, although it is not a precondition for being a speaker. Two books based on previous MIND conferences (Motivation, Emotion, and Goal Direction in Neural Networks and Neural Networks for Knowledge Representation and Inference) have been published by Lawrence Erlbaum Associates, and a book based on our last conference (Optimality in Biological and Artificial Networks?) is now in progress, under contract with Erlbaum as part of their joint series with INNS. Abstracts should be submitted, by e-mail, snail mail, or fax, to: Professor Daniel S. Levine Department of Mathematics, University of Texas at Arlington 411 S. Nedderman Drive Arlington, TX 76019-0408 Office telephone: 817-273-3598, fax: 817-794-5802 e-mail: b344dsl at utarlg.uta.edu Further inquiries about the conference can be addressed to Professor Levine or to the other two conference organizers: Professor Vincent Brown Mr. Timothy Shirey 817-273-3247 214-495-3500 or 214-422-4570 b096vrb at utarlg.uta.edu 73353.3524 at compuserve.com Please distribute this announcement to anyone you think may be interested in the conference.
REGISTRATION FOR MIND/INNS CONFERENCE ON OSCILLATIONS IN NEURAL SYSTEMS, UNIVERSITY OF TEXAS AT ARLINGTON, MAY 5-7, 1994 Name ______________________________________________________________ Address ___________________________________________________________ ___________________________________________________________ ___________________________________________________________ ____________________________________________________________ E-Mail __________________________________________________________ Telephone _________________________________________________________ Registration fee enclosed: _____ $15 Student, member of MIND _____ $25 Student _____ $65 Non-student oral or poster presenter _____ $65 Non-student member of MIND _____ $85 All others Will you be staying at the Park Inn? ____ Yes ____ No Are you planning to share a room with someone you know? ____ Yes ____ No If so, please list that person's name __________________________ If not, would you be interested in sharing a room with another conference attendee to be assigned? ____ Yes ____ No PLEASE REMEMBER TO CALL THE PARK INN DIRECTLY FOR YOUR RESERVATION (WHETHER SINGLE OR DOUBLE) AT 1-800-777-0100 OR 817-860-2323. From fellous at selforg.usc.edu Tue Feb 22 23:31:06 1994 From: fellous at selforg.usc.edu (Jean-Marc Fellous) Date: Tue, 22 Feb 94 20:31:06 PST Subject: Research Associate Message-ID: <9402230431.AA00747@selforg.usc.edu> Could you please post this announcement? Thanks, Jean-Marc >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< TENNESSEE STATE UNIVERSITY CENTER FOR NEURAL ENGINEERING RESEARCH ASSOCIATE Applications are invited for a research associate position, for a unique consortium involving a medical school, an engineering college, Oak Ridge National Laboratory and a private high-tech industry. A Ph.D. in Biomedical/Electrical Engineering (or related fields) with strong interest in artificial and biological neural networks is required, in the areas of auditory system modeling and sensory motor control. This position will be supported for at least two years and possibly longer. Teaching of a graduate or an undergraduate course is optional. Send resume to: Dr. Mohan J. Malkani Director, Center for Neural Engineering Tennessee State University 3500 John Merritt Blvd. Nashville, TN 37209-1561 (615)320-3550 Fax: (615)320-3554 e-mail: malkani at harpo.tnstate.edu From sbh at eng.cam.ac.uk Tue Feb 22 12:00:33 1994 From: sbh at eng.cam.ac.uk (S.B. Holden) Date: Tue, 22 Feb 94 17:00:33 GMT Subject: PhD dissertation available by anonymous ftp Message-ID: <5730.199402221700@tw700.eng.cam.ac.uk> The following PhD dissertation is available by anonymous ftp from the archive of the Speech, Vision and Robotics Group at the Cambridge University Engineering Department. On the Theory of Generalization and Self-Structuring in Linearly Weighted Connectionist Networks Sean B. Holden Technical Report CUED/F-INFENG/TR161 Cambridge University Engineering Department Trumpington Street Cambridge CB2 1PZ England Abstract The study of connectionist networks has often been criticized for an overall lack of rigour, and for being based on excessively ad hoc techniques. Even though connectionist networks have now been the subject of several decades of study, the available body of research is characterized by the existence of a significant body of experimental results, and a large number of different techniques, with relatively little supporting, explanatory theory.
This dissertation addresses the theory of {\em generalization performance\/} and {\em architecture selection\/} for a specific class of connectionist networks; a subsidiary aim is to compare these networks with the well-known class of multilayer perceptrons. After discussing in general terms the motivation for our study, we introduce and review the class of networks of interest, which we call {\em $\Phi$-networks\/}, along with the relevant supervised training algorithms. In particular, we argue that $\Phi$-networks can in general be trained significantly faster than multilayer perceptrons, and we demonstrate that many standard networks are specific examples of $\Phi$-networks. Chapters 3, 4 and 5 consider generalization performance by presenting an analysis based on tools from computational learning theory. In chapter 3 we introduce and review the theoretical apparatus required, which is drawn from {\em Probably Approximately Correct (PAC) learning theory\/}. In chapter 4 we investigate the {\em growth function\/} and {\em VC dimension\/} for general and specific $\Phi$-networks, obtaining several new results. We also introduce a technique which allows us to use the relevant PAC learning formalism to gain some insight into the effect of training algorithms which adapt architecture as well as weights (we call these {\em self-structuring training algorithms\/}). We then use our results to provide a theoretical explanation for the observation that $\Phi$-networks can in practice require a relatively large number of weights when compared with multilayer perceptrons. In chapter 5 we derive new necessary and sufficient conditions on the number of training examples required when training a $\Phi$-network such that we can expect a particular generalization performance. We compare our results with those derived elsewhere for feedforward networks of Linear Threshold Elements, and we extend one of our results to take into account the effect of using a self-structuring training algorithm. In chapter 6 we consider in detail the problem of designing a good self-structuring training algorithm for $\Phi$-networks. We discuss the best way in which to define an optimum architecture, and we then use various ideas from linear algebra to derive an algorithm, which we test experimentally. Our initial analysis allows us to show that the well-known {\em weight decay\/} approach to self-structuring is not guaranteed to provide a network which has an architecture close to the optimum one. We also extend our theoretical work in order to provide a basis for the derivation of an improved version of our algorithm. Finally, chapter 7 provides conclusions and suggestions for future research. ************************ How to obtain a copy ************************ a) Via FTP: unix> ftp svr-ftp.eng.cam.ac.uk Name: anonymous Password: (type your email address) ftp> cd reports ftp> binary ftp> get holden_tr161.ps.Z ftp> quit unix> uncompress holden_tr161.ps.Z unix> lpr holden_tr161.ps (or however you print PostScript) b) Via postal mail: Request a hardcopy from Dr. Sean B. Holden, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England. 
or email me: sbh at eng.cam.ac.uk From viola at salk.edu Wed Feb 23 14:17:52 1994 From: viola at salk.edu (Paul Viola) Date: Wed, 23 Feb 94 11:17:52 PST Subject: Heinous Patent Message-ID: <9402231917.AA24448@salk.edu> From: Vision-List moderator Phil Kahn VISION-LIST Digest Tue Feb 22 11:26:42 PDT 94 Volume 13 : Issue 8 Date: Thu, 17 Feb 1994 22:23:00 GMT From: eledavis at ubvms.cc.buffalo.edu (Elliot Davis) Organization: University at Buffalo Subject: Error Reduction I would greatly appreciate your thoughts on the: ERROR TEMPLATE TECHNIQUE The "Error Template" technique (patent 4,802,231) provides an alternative method for reducing false alarms in pattern recognition systems. In this approach, a pattern representing a mismatched pattern is stored in the reference lexicon. It is a reference pattern to an error rather then to what is desired. THIS IS DONE WITH THE EXPECTATION THAT IF THE ERROR PATTERN OR A VARIATION OF IT IS REPEATED IT WILL TEND TO BE CLOSER TO ITSELF THEN TO THE PATTERN THAT IT FALSED OUT TO. ... Unless this patent is very old, I find it terrifying. It is a concept that is clearly part of the pattern recognition literature of the 70's. Essentially pattern classification works by finding clusters that represent classes. These clusters along with a measurement model define a probability density over the pattern space. All this technique is doing is adding an additional cluster which represents a particular type of measurement error sensing a class. Pattern classification theory tells us that this should be done whenever there is a particular measurement error that is not modeled well by our measurement model. You add a cluster when the distribution of data is different from the probability density predicted by the model -- i.e. a particular measurement error is more common than your model predicts. You can add these clusters by hand, as the patent suggests, or you can let a density estimation scheme discover them for you (a mixture of gaussians model trained with EM works nicely). End of story. So remember, anytime someone adds another cluster to a pattern classification model, they owe the owner of this patent money. I wonder what the date of this fine patent is?? Paul Viola From cohn at psyche.mit.edu Wed Feb 23 18:15:17 1994 From: cohn at psyche.mit.edu (David Cohn) Date: Wed, 23 Feb 94 18:15:17 EST Subject: Paper available: Exploration using optimal experiment design Message-ID: <9402232315.AA21110@psyche.mit.edu> Those who find Peter Sollich's paper on query construction of interest may also wish to look at the following paper, now available by anonymous ftp. This is a slightly revised version of the paper that is to appear in Advances in Neural Information Processing Systems 6, but includes a correction to Equation 2 that was made too late to be included in the NIPS volume. ##################################################################### Neural Network Exploration Using Optimal Experiment Design David A. Cohn Dept. of Brain and Cognitive Sciences Massachusetts Inst.\ of Technology Cambridge, MA 02139 Consider the problem of learning input/output mappings through exploration, e.g. learning the kinematics or dynamics of a robotic manipulator. If actions are expensive and computation is cheap, then we should explore by selecting a trajectory through the input space which gives us the most amount of information in the fewest number of steps. 
I discuss how results from the field of optimal experiment design may be used to guide such exploration, and demonstrate its use on a simple kinematics problem. ##################################################################### The paper may be retrieved by anonymous ftp to "psyche.mit.edu" using the following protocol: unix> ftp psyche.mit.edu Name (psyche.mit.edu:joebob): anonymous <- use "anonymous" here 331 Guest login ok, send ident as password. Password: joebob at machine.univ.edu <- use your email address here 230 Guest login ok, access restrictions apply. ftp> cd pub/cohn <- go to the directory 250 CWD command successful. ftp> binary <- change to binary transfer 200 Type set to I. ftp> get cohn.explore.ps.Z <- get the file 200 PORT command successful. 150 Binary data connection for cohn.explore.ps.Z ... 226 Binary Transfer complete. local: cohn.explore.ps.Z remote: cohn.explore.ps.Z 301099 bytes received in 2.8 seconds (1e+02 Kbytes/s) ftp> quit <- all done 221 Goodbye. From terry at salk.edu Thu Feb 24 05:49:35 1994 From: terry at salk.edu (Terry Sejnowski) Date: Thu, 24 Feb 94 02:49:35 PST Subject: Shakespeare and Neural Nets Message-ID: <9402241049.AA02725@salk.edu> from New Scientist 22 january 1994 p. 23 In an interesting article on the use of statistical measures to assess the attribution of texts to authors, Robert Matthews and Tom Merrriam report that: "Applying our neural network to disputed works such as 'The Two Noble Kinsman' has produced some interesting results and helped to settle some bitter arguments over authorship of controversial texts. ... "The first task was to train the network. This we did by exposing it to data extracted from a large number of samples of Shakespeare's undisputed work, together with that of his successor with The King's Men [a theater], John Fletcher. ... We then set the network loose on 'The Two Noble Kinsman'. Drawing on a wide variety of essentially subjective evidence, scholars have claimed that Shakespeare's hand dominates Acts I and V, with much of the rest appearing to be by Fletcher. In March last year, our neural network agreed with these attributions -- and proferred the extra opinion that Fletcher may have received considerable help from Shakespeare in Act IV. In short, our neural network quantitatively supports the subjective view of its much more sophisticated human counterparts that 'The Two Noble Kinsman' is a genuine collaboration between Shakespeare and one of his contemporaries." These results will appear in the journal 'Literary and Linguistic Computing'. A similar approach might be used to determine the contributions of coauthors to scientific papers. Terry ----- From efiesler at maya.idiap.ch Fri Feb 25 09:16:09 1994 From: efiesler at maya.idiap.ch (E. Fiesler) Date: Fri, 25 Feb 94 15:16:09 +0100 Subject: NN Formalization paper available by ftp. Message-ID: <9402251416.AA04305@maya.idiap.ch> PLEASE POST ----------- The following paper is available via anonymous ftp from the neuroprose archive. It counts 13 A4-size PostScript pages, and replaces a shorter preliminary ver- sion. Instructions for retrieval follow the abstract. NEURAL NETWORK CLASSIFICATION AND FORMALIZATION E. Fiesler IDIAP c.p. 609 CH-1920 Martigny Switzerland This paper has been accepted for publication in the special issue on Neural Network Standards of "Computer Standards & Interfaces", volume 16, edited by J. Fulcher. Elsevier Science Publishers, Amsterdam, 1994. 
ABSTRACT In order to assist the field of neural networks in maturing, a formalization and a solid foundation are essential. Additionally, to permit the introduction of formal proofs, it is essential to have an all-encompassing formal mathematical definition of a neural network. This publication offers a neural network formalization consisting of a topological taxonomy, a uniform nomenclature, and an accompanying consistent mnemonic notation. Supported by this formalization, a flexible mathematical definition is presented. ------------------------------ To obtain a copy of this paper, please follow these FTP instructions: unix> ftp archive.cis.ohio-state.edu (or: ftp 128.146.8.52) login: anonymous password: ftp> cd pub/neuroprose ftp> binary ftp> get fiesler.formalization.ps.Z ftp> bye unix> zcat fiesler.formalization.ps.Z | lpr (or however you uncompress and print postscript) For convenience of those outside the US, the paper has also been placed on the IDIAP ftp site: unix> ftp Maya.IDIAP.CH (or: ftp 192.33.221.1) login: anonymous password: ftp> cd pub/papers/neural ftp> binary ftp> get fiesler.formalization.ps.Z (OR get fiesler.formalization.ps) ftp> bye unix> zcat fiesler.formalization.ps.Z | lpr OR unix> lpr fiesler.formalization.ps (Hard copies of the paper are unfortunately not available.) P.S. Thanks for the update, Jordan ! From giles at research.nj.nec.com Fri Feb 25 18:28:59 1994 From: giles at research.nj.nec.com (Lee Giles) Date: Fri, 25 Feb 94 18:28:59 EST Subject: Available Message-ID: <9402252328.AA28936@fuzzy> ******************************************************************************** Reprint: USING RECURRENT NEURAL NETWORKS TO LEARN THE STRUCTURE OF INTERCONNECTION NETWORKS The following reprint is available via the University of Maryland Department of Computer Science Technical Report archive: ________________________________________________________________________________ "Using Recurrent Neural Networks to Learn the Structure of Interconnection Networks" UNIVERSITY OF MARYLAND TECHNICAL REPORT UMIACS-TR-94-20 AND CS-TR-3226 G.W. Goudreau(a) and C.L. Giles(b,c) goudreau at cs.ucf.edu, giles at research.nj.nec.com (a) Department of Computer Science, U. of Central Florida, Orlando, FL 32816 (b) NEC Research Inst., 4 Independence Way, Princeton, NJ 08540 (c) Inst. for Advanced Computer Studies, U. of Maryland, College Park, MD 20742 A modified Recurrent Neural Network (RNN) is used to learn a Self-Routing Interconnection Network (SRIN) from a set of routing examples. The RNN is modified so that it has several distinct initial states. This is equivalent to a single RNN learning multiple different synchronous sequential machines. We define such a sequential machine structure as "augmented" and show that a SRIN is essentially an Augmented Synchronous Sequential Machine (ASSM). As an example, we learn a small six-switch SRIN. After training we extract the network's internal representation of the ASSM and corresponding SRIN. -------------------------------------------------------------------------------- FTP INSTRUCTIONS unix> ftp cs.umd.edu (128.8.128.8) Name: anonymous Password: (your_userid at your_site) ftp> cd pub/pub/papers/TRs ftp> binary ftp> get 3226.ps.Z ftp> quit unix> uncompress 3226.ps.Z --------------------------------------------------------------------------------- -- C.
Lee Giles / NEC Research Institute / 4 Independence Way Princeton, NJ 08540 / 609-951-2642 / Fax 2482 == From terry at salk.edu Fri Feb 25 12:59:53 1994 From: terry at salk.edu (Terry Sejnowski) Date: Fri, 25 Feb 94 09:59:53 PST Subject: NEURAL COMPUTATION 6:2 Message-ID: <9402251759.AA18225@salk.edu> Neural Computation March 1994 Volume 6 Issue 2 Article: Hierarchical Mixtures of Experts and the EM Algorithm Michael I. Jordan and Robert A. Jacobs Notes: TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play Gerald Tesauro Correlated Attractors from Uncorrelated Stimuli L.F. Cugliandolo Letters: Learning of Phase-lags in Coupled Neural Oscillators Bard Ermentrout and Nancy Kopell A Mechanism for Neuronal Gain Control by Descending Pathways Mark E. Nelson The Role of Weight Normalization in Competitive Learning Geoffrey J. Goodhill and Harry G. Barrow A Probabilistic Resource Allocating Network for Novelty Detection Stephen Roberts and Lionel Tarassenko Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistance to Local Minima William Finnoff Relating Real-time Backpropagation and Back-propagation Through Time: An Application of Flow Graph Interreciprocity Francoise Beaufays and Eric A. Wan Smooth On-line Learning Algorithms for Hidden Markov Models Pierre Baldi and Yves Chauvin On Functional Approximation with Normalized Gaussian Units Michel Benaim Statistical Physics, Mixtures of Distributions and the EM Algorithm Yuille, A.L., Stolorz, P., and Utans, J. ----- SUBSCRIPTIONS - 1994 - VOLUME 6 - BIMONTHLY (6 issues) ______ $40 Student and Retired ______ $65 Individual ______ $166 Institution Add $22 for postage and handling outside USA (+7% GST for Canada). (Back issues from Volumes 1-5 are regularly available for $28 each to institutions and $14 each for individuals Add $5 for postage per issue outside USA (+7% GST for Canada) MIT Press Journals, 55 Hayward Street, Cambridge, MA 02142. Tel: (617) 253-2889 FAX: (617) 258-6779 e-mail: hiscox at mitvma.mit.edu ----- From heger at Informatik.Uni-Bremen.DE Mon Feb 28 07:27:12 1994 From: heger at Informatik.Uni-Bremen.DE (Matthias Heger) Date: Mon, 28 Feb 94 13:27:12 +0100 Subject: paper available Message-ID: <9402281227.AA06748@Informatik.Uni-Bremen.DE> FTP-host: ftp.gmd.de FTP-filename: /Learning/rl/papers/heger.consider-risk.ps.Z The file heger.consider-risk.ps.Z is now available for copying from the RL papers repository: *************************************************** * Consideration of Risk in Reinforcement Learning * *************************************************** (Revised submission to the 11th International Conference on Machine Learning (ML94), 15 pages) Abstract -------- Most Reinforcement Learning (RL) work supposes policies for sequential decision tasks to be optimal that minimize the expected total discounted cost (e.g. Q-Learning [Wat 89], AHC [Bar Sut And 83]). On the other hand, it is well known that it is not always reliable and can be treacherous to use the expected value as a decision criterion [Tha 87]. A lot of alter- native decision criteria have been suggested in decision theory to get a more sophisticated consideration of risk but most RL researchers have not concerned themselves with this subject until now. The purpose of this paper is to draw the reader's attention to the problems of the expected value criterion in Markov Decision Processes and to give Dynamic Pro- gramming algorithms for an alternative criterion, namely the Minimax cri- terion. 
A counterpart to Watkins' Q-Learning related to the Minimax cri- terion is presented. The new algorithm, called Q^-Learning (Q-hat-Learning), finds policies that minimize the >>worst-case<< total discounted costs. Most mathematical details aren't presented here but can be found in [Heg 94]. ---------------------------------------------------------------------------- Here is an example of retrieving and printing the file: -> ftp ftp.gmd.de Connected to gmdzi.gmd.de. 220 gmdzi FTP server (Version 5.72 Fri Nov 20 20:35:05 MET 1992) ready. Name (ftp.gmd.de:heger): anonymous 331 Guest login ok, send your email-address as password. Password: 230-This is an experimental FTP Server. See /README for details. This site is in Germany, Europe. Please restrict downloads to our non-working hours (i.e outside of 08:00-18:00 MET, Mo-Fr) *** Local time is 12:25:22 MET 230 Guest login ok, access restrictions apply. ftp> cd Learning/rl/papers 250 CWD command successful. ftp> binary 200 Type set to I. ftp> get heger.consider-risk.ps.Z 200 PORT command successful. 150 Opening BINARY mode data connection for heger.consider-risk.ps.Z (100477 bytes). 226 Transfer complete. local: heger.consider-risk.ps.Z remote: heger.consider-risk.ps.Z 100477 bytes received in 3.2e+02 seconds (0.3 Kbytes/s) ftp> quit 221 Goodbye. -> uncompress heger.consider-risk.ps.Z -> lpr heger.consider-risk.ps ------------------------------------------------------------------------------- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + Matthias Heger + + Zentrum fuer Kognitionswissenschaften, Universitaet Bremen, + + Postfach 330 440 + + D-28334 Bremen, Germany + + + + email: heger at informatik.uni-bremen.de + + Tel.: +49 (0) 421 218 4659 + + Fax: +49 (0) 421 218 3054 + +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ From gerda at ai.univie.ac.at Mon Feb 28 10:42:04 1994 From: gerda at ai.univie.ac.at (Gerda Helscher) Date: Mon, 28 Feb 1994 16:42:04 +0100 Subject: EMCSR'94 Message-ID: <199402281542.AA23377@anif.ai.univie.ac.at> After the general info which appeared in this mailing list recently about the T W E L F T H E U R O P E A N M E E T I N G O N C Y B E R N E T I C S A N D S Y S T E M S R E S E A R C H ( E M C S R ' 9 4 ) here is the detailed programme of Neural Network-related events: Plenary Lecture by S t e p h e n G r o s s b e r g : "Neural Networks for Learning, Recognition and Prediction" Wednesday, April 6, 9:00 a.m., University of Vienna, Main Building, Room 47 Symposium A r t i f i c i a l N e u r a l N e t w o r k s a n d A d a p t i v e S y s t e m s Chairpersons: S.Grossberg, USA, and G.Dorffner, Austria Tuesday, April 5, and Wednesday, April 6, Univ. 
of Vienna, Main Building, Room 47 Tuesday, April 5: 14.00-14.30: Synchronization in a Large Neural Network of Phase Oscillators with the Central Element Y.Kazanovich, Russian Academy of Sciences, Moscow, Russia 14.30-15.00: Synchronization in a Neural Network Model with Time Delayed Coupling T.B.Luzyanina, Russian Academy of Sciences, Moscow, Russia 15.00-15.30: Reinforcement Learning in a Network Model of the Basal Ganglia R.M.Borisyuk, J.R.Wickens, R.Koetter, University of Otago, New Zealand Wednesday, April 6: 11.00-11.30: Adaptive High Performance Classifier Based on Random Threshold Neurons E.M.Kussul, T.N.Baidyk, V.V.Lukovich, D.A.Rachkovskij, Ukrainian Academy of Science, Kiev, Ukraine 11.30-12.00: Dynamics of Ordering for One-dimensional Topological Mappings R.Folk, A.Kartashov, University of Linz, Austria 12.00-12.30: Informational Properties of Willshaw-like Neural Networks Capable of Autoassociative Learning A.Kartashov, R.Folk, A.Goltsev, A.Frolov, University of Linz, Austria 12.30-13.00: Relaxing the Hyperplane Assumption in the Analysis and Modification of Back-propagation Neural Networks L.Y.Pratt, A.N.Christensen, Colorado School of Mines, Golden, CO, USA 14.00-14.30: Improving Discriminability Based Transfer by Modifying the IM Metric to Use Sigmoidal Activations L.Y.Pratt, V.I.Gough, Colorado School of Mines, Golden, CO, USA 14.30-15.00: Order-theoretic View of Families of Neural Network Architectures M.Holena, University of Paderborn, Germany 15.00-15.30: A New Class of Neural Networks: Recognition Invariant to Arbitrary Transformation Groups A.Kartashov, K.Erman, University of Linz, Austria 16.00-16.30: Neural Assembly Architecture for Texture Recognition A.Goltsev, A.Kartashov, R.Folk, University of Linz, Austria 16.30-17.00: A Neural System for Character Recognition on Isovalue Maps E.P.L.Passos, L.E.S.Varella, M.A.Santos, R.L.de Araujo, Engineering Military Institute, Rio de Janeiro, Brazil 17.00-17.30: Neurocomputing Model Inference for Nonlinear Signal Processing Z.Zografski, T.Durrani, University of Strathclyde, Glasgow, United Kingdom 17.30-18.00: Learning from Examples and VLSI Implementation of Neural Networks V.Beiu, J.A.Peperstraete, J.Vandewalle, R.Lauwereins, Catholic University of Leuven, Heverlee, Belgium For more information please contact: sec at ai.univie.ac.at From ZECCHINA at to.infn.it Mon Feb 28 13:22:01 1994 From: ZECCHINA at to.infn.it (Riccardo Zecchina - tel.11-5647358, fax. 11-5647399) Date: Mon, 28 Feb 1994 19:22:01 +0100 (WET) Subject: role of response functions in ANN's. Message-ID: <940228192201.20800db9@to.infn.it> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/zecchina.response.ps.Z The file zecchina.response.ps.Z is available for copying from the Neuroprose repository: "Response Functions Improving Performance in Analog Attractor Neural Networks" N .Brunel, R. Zecchina (13 pages, to appear in Phys. Rev. E Rapid Comm.) ABSTRACT: In the context of attractor neural networks, we study how the equilibrium analog neural activities, reached by the network dynamics during memory retrieval, may improve storage performance by reducing the interferences between the recalled pattern and the other stored ones. We determine a simple dynamics that stabilizes network states which are highly correlated with the retrieved pattern, for a number of stored memories that does not exceed $\alpha_{\star} N$, where $\alpha_{\star}\in[0,0.41]$ depends on the global activity level in the network and $N$ is the number of neurons.  
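As a rough illustration of the setting described in this abstract (not the dynamics or the optimized response functions derived by Brunel and Zecchina), the short NumPy sketch below stores random binary patterns in a Hebbian coupling matrix, relaxes a noisy cue under a generic saturating analog response function, and reports the overlap with the cued memory together with the residual interference from the other memories. The network size, storage load, gain and the tanh nonlinearity are arbitrary choices made only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    N, P = 500, 25                                # neurons and stored patterns (arbitrary)
    xi = rng.choice([-1.0, 1.0], size=(P, N))     # random binary memories
    J = (xi.T @ xi) / N                           # Hebbian couplings
    np.fill_diagonal(J, 0.0)

    def g(h, gain=2.0):
        # generic saturating analog response function (not the paper's choice)
        return np.tanh(gain * h)

    s = xi[0] + 0.3 * rng.standard_normal(N)      # noisy cue of pattern 0
    for _ in range(50):                           # iterate to an analog fixed point
        s = g(J @ s)

    overlaps = xi @ s / N                         # correlation with each stored pattern
    print("overlap with cued pattern:", overlaps[0])
    print("mean |interference| from the others:", np.abs(overlaps[1:]).mean())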
From andre at physics.uottawa.ca Mon Feb 28 12:13:53 1994 From: andre at physics.uottawa.ca (Andre Longtin) Date: Mon, 28 Feb 94 12:13:53 EST Subject: Hebb Symposium Message-ID: <9402281713.AA23088@miro.physics.uottawa.ca.physics.uottawa.ca> ******* Preliminary Announcement ******* THE FIELDS INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES HEBB SYMPOSIUM ON NEURONS AND BIOLOGICAL DYNAMICS Sunday, May 15 to Friday May 20, 1994 Koffler Pharmaceutical Center University of Toronto D.O. Hebb's classic, "The Organization of Behavior" published in 1949, sketched out how behavior might emerge from the properties of nerve cells and assemblies of nerve cells. This book was a landmark achievement in neurophysiological psychology. The modifiable synapse, discussed at length by Hebb and now known as the "Hebb synapse", was a lasting contribution. Hebb was from Nova Scotia and spent most of his professional life at McGill in the Psychology Department. We are having this symposium in his honor. Topics will range from cellular level to systems level, with an eye towards interesting dynamics and connections between dynamics and functions. We will bring together physiological and mathematical researchers with some didactic and research talks oriented towards graduate students and postdoctoral fellows. SCIENTIFIC PROGRAM: Lectures will be presented by Nancy Kopell (Boston University) and David Mumford (Harvard) in the Institute's Distinguished Lecture Series. Invited talks by Larry Abbott (Brandeis), *Moshe Abeles (Hebrew U., Jerusalem), Harold Atwood (U. Toronto), David Brillinger (Berkeley), Jos Eggermont (U. Calgary), Bard Ermentrout (U. Pittsburg), Leon Glass (McGill), Ilona Kovacs (Rutgers), Gilles Laurent (Caltech), Andre Longtin (U. Ottawa), Leonard Maler (U. Ottawa), Karl Pribram (Radford U.), Paul Rapp (Med. Coll. Penn.), John Rinzel (NIH), Mike Shadlin (Stanford), Matt Wilson (Tucson), Martin Wojtowicz (U. Toronto), Steve Zucker (McGill). Invited Attendees: Jose Segundo (UCLA), Alessandro Villa (Lausanne) The meeting will emphasize poster sessions as well as discussion groups where participants can give short oral presentations of their work. 
(*=tentative) TOPICS Larry Abbott: Population vectors and Hebbian learning Moshe Abeles: Information processing of synchronized activity Harold Atwood: Synaptic transmission and plasticity David Brillinger: Statistical analysis of neurophysiological data Jos Eggermont: Spatial and temporal interactions in auditory cortex Bard Ermentrout: Patterns in visual cortex Leon Glass: Nonlinear dynamics of neural networks Ilona Kovacs: Visual psychophysics/perceptual organization Gilles Laurent: Oscillations in olfaction Andre Longtin: Stochastic nonlinear dynamics of sensory transduction Leonard Maler: Bursting and recurrent feedback in electroreception Karl Pribram: Behavioral neurodynamics Paul Rapp: Dynamical characterization of neurological data John Rinzel: Thalamic rhythmogenesis in sleep and epilepsy Mike Shadlin: Analysis of visual motion Matt Wilson: Behaviorally induced changes in hippocampal connectivity Martin Wojtowicz: Membranes, channels and synapses Steve Zucker: Neural networks and visual computations IMPORTANT DATES: Monday April 11: Last date to return questionnaire Friday April 22: Cut-off for registrations and Deadline for hotel/residence booking Sunday May 15: Arrival and registration (9 am - 12 noon) Sunday May 15 to Friday May 20 Scientific program (ending Friday noon) INFORMATION ON SCIENTIFIC PROGRAM: David Brillinger (brill at stat.berkeley.edu) Andre Longtin (andre at physics.uottawa.ca) REGISTRATION AND ORGANIZATIONAL INFORMATION: To receive registration information, please fill out the questionnaire below and return it to: Sheri Albers The Fields Institute 185 Columbia St. W. Waterloo, Ontario, Canada N2L 5Z5 Phone: (519) 725-0096 Fax: (519) 725-0704 e-mail: hebb at fields.uwaterloo.ca ------------------------------------------------------------- ******* Questionnaire ******* TO BE COMPLETED BY ANYONE WISHING TO ATTEND THE HEBB SYMPOSIUM ON NEURONS AND BIOLOGICAL DYNAMICS Name: Institution: Department: Address: Phone: Fax: E-mail: I plan to attend: Yes ( ) No ( ) Maybe ( ) I plan to participate in the discussion groups: Yes ( ) No ( ) Maybe ( ) I plan to present a poster: Yes ( ) No ( ) Maybe ( ) Topic or tentative title: Arrival and departure dates (if other than May 14-20): FAX TO: (519)725-0704 or e-mail: hebb at fields.uwaterloo.ca
From prechelt at ira.uka.de Wed Feb 2 04:12:48 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 10:12:48 +0100 Subject: Techreport on CuPit available Message-ID: <"irafs2.ira.960:02.01.94.09.13.09"@ira.uka.de> The technical report Lutz Prechelt: "CuPit --- A Parallel Language for Neural Algorithms: Language Reference and Tutorial" is now available for anonymous ftp from ftp.ira.uka.de /pub/uni-karlsruhe/papers/cupit.ps.gz (154 Kb, 75 pages) It is NOT on neuroprose, because its topic does not quite fit into neuroprose's scope. Abstract: ---------- CuPit is a parallel programming language with two main design goals: 1. to allow the simple, problem-adequate formulation of learning algorithms for neural networks with focus on algorithms that change the topology of the underlying neural network during the learning process and 2. to allow the generation of efficient code for massively parallel machines from a completely machine-independent program description, in particular to maximize both data locality and load balancing even for irregular neural networks. The idea to achieve these goals lies in the programming model: CuPit programs are object-centered, with connections and nodes of a graph (which is the neural network) being the objects. Algorithms are based on parallel local computations in the nodes and connections and communication along the connections (plus broadcast and reduction operations). This report describes the design considerations and the resulting language definition and discusses in detail a tutorial example program. ---------- Remember to use 'binary' mode for ftp. To uncompress the Postscript file, you need to have the GNU gzip utility.
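For readers unfamiliar with the object-centered style the abstract describes, here is a rough sequential Python cartoon of the idea: nodes and connections are the program objects, computation is local to them, and values travel only along connections before being reduced at the receiving node. It is emphatically not CuPit code, and every name in it is invented for illustration.

    import math

    class Connection:
        def __init__(self, source, weight):
            self.source = source          # node object at the other end
            self.weight = weight
        def message(self):
            # local computation in the connection
            return self.weight * self.source.output

    class Node:
        def __init__(self):
            self.incoming = []            # connections owned by this node
            self.output = 0.0
        def compute(self):
            # reduction over incoming connections, then a local activation
            net = sum(c.message() for c in self.incoming)
            self.output = math.tanh(net)

    # tiny example: two input nodes feeding one output node
    a, b, out = Node(), Node(), Node()
    a.output, b.output = 0.5, -1.0
    out.incoming = [Connection(a, 0.8), Connection(b, -0.3)]
    out.compute()
    print(out.output)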
Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; D-76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From prechelt at ira.uka.de Wed Feb 2 03:58:56 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 09:58:56 +0100 Subject: Encoding missing values Message-ID: <"irafs2.ira.708:02.01.94.08.59.37"@ira.uka.de> I am currently thinking about the problem of how to encode data with attributes for which some of the values are missing in the data set for neural network training and use. An example of such data is the 'heart-disease' dataset from the UCI machine learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory "/pub/machine-learning-databases"). There are 920 records altogether with 14 attributes each. Only 299 of the records are complete, the others have one or several missing attribute values. 11% of all values are missing. I consider only networks that handle arbitrary numbers of real-valued inputs here (e.g. all backpropagation-suited network types etc). I do NOT consider missing output values. In this setting, I can think of several ways how to encode such missing values that might be reasonable and depend on the kind of attribute and how it was encoded in the first place: 1. Nominal attributes (that have n different possible values) 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one being 1 all others 0. This encoding is very general, but has the disadvantage of producing networks with very many connections. Missing values can either be represented as 'all zero' or by simply treating 'is missing' as just another possible input value, resulting in a "1-of-(n+1)" encoding. 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a binary representation of the numbers 0...n-1 (or 1...n). Missing values can either be represented as just another possible input value (probably all-bits-zero is best) or by adding an additional network input which is 1 for 'is missing' and 0 for 'is present'. The original inputs should probably be all zero in the 'is missing' case. 2. continuous attributes (or attributes treated as continuous) 2.1 encoded as a single network input, perhaps using some monotone transformation to force the values into a certain distribution. Missing values are either encoded as a kind of 'best guess' (e.g. the average of the non-missing values for this attribute) or by using an additional network input being 0 for 'missing' and 1 for 'present' (or vice versa) and setting the original attribute input either to 0 or to the 'best guess'. (The 'best guess' variant also applies to nominal attributes above) 3. binary attributes (truth values) 3.1 encoded by one input: 0=false 1=true or vice versa Treat like (2.1) 3.2 encoded by one input: -1=false 1=true or vice versa In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. 3.3 treat like nominal attribute with 2 possible values 4. ordinal attributes (having n different possible values, which are ordered) 4.1 treat either like continuous or like nominal attribute. If (1.2) is chosen, a Gray-Code should be used. Continuous representation is risky unless a 'sensible' quantification of the possible values is available. So far to my considerations. Now to my questions. a) Can you think of other encoding methods that seem reasonable ? Which ? 
b) Do you have experience with some of these methods that is worth sharing ? c) Have you compared any of the alternatives directly ? Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; 76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From marshall at cs.unc.edu Wed Feb 2 12:41:49 1994 From: marshall at cs.unc.edu (Jonathan A. Marshall) Date: Wed, 2 Feb 94 12:41:49 -0500 Subject: Papers on visual occlusion and neural networks Message-ID: <9402021741.AA17887@marshall.cs.unc.edu> Dear Colleagues, Below I list two new papers that I have added to the Neuroprose archives (thanks to Jordan Pollack!). In addition, I list two of my older papers in Neuroprose. You can retrieve a copy of these papers -- follow the instructions at the end of this message. --Jonathan ---------------------------------------------------------------------------- marshall.occlusion.ps.Z (5 pages) A SELF-ORGANIZING NEURAL NETWORK THAT LEARNS TO DETECT AND REPRESENT VISUAL DEPTH FROM OCCLUSION EVENTS JONATHAN A. MARSHALL and RICHARD K. ALLEY Department of Computer Science, CB 3175, Sitterson Hall University of North Carolina, Chapel Hill, NC 27599-3175, U.S.A. marshall at cs.unc.edu, alley at cs.unc.edu Visual occlusion events constitute a major source of depth information. We have developed a neural network model that learns to detect and represent depth relations, after a period of exposure to motion sequences containing occlusion and disocclusion events. The network's learning is governed by a new set of learning and activation rules. The network develops two parallel opponent channels or "chains" of lateral excitatory connections for every resolvable motion trajectory. One channel, the "On" chain or "visible" chain, is activated when a moving stimulus is visible. The other channel, the "Off" chain or "invisible" chain, is activated when a formerly visible stimulus becomes invisible due to occlusion. The On chain carries a predictive modal representation of the visible stimulus. The Off chain carries a persistent, amodal representation that predicts the motion of the invisible stimulus. The new learning rule uses disinhibitory signals emitted from the On chain to trigger learning in the Off chain. The Off chain neurons learn to interact reciprocally with other neurons that indicate the presence of occluders. The interactions let the network predict the disappearance and reappearance of stimuli moving behind occluders, and they let the unexpected disappearance or appearance of stimuli excite the representation of an inferred occluder at that location. Two results that have emerged from this research suggest how visual systems may learn to represent visual depth information. First, a visual system can learn a nonmetric representation of the depth relations arising from occlusion events. Second, parallel opponent On and Off channels that represent both modal and amodal stimuli can also be learned through the same process. [In Bowyer KW & Hall L (Eds.), Proceedings of the AAAI Fall Symposium on Machine Learning and Computer Vision, Research Triangle Park, NC, October 1993, 70-74.] ---------------------------------------------------------------------------- marshall.context.ps.Z (46 pages) ADAPTIVE PERCEPTUAL PATTERN RECOGNITION BY SELF-ORGANIZING NEURAL NETWORKS: CONTEXT, UNCERTAINTY, MULTIPLICITY, AND SCALE JONATHAN A. 
MARSHALL Department of Computer Science, CB 3175, Sitterson Hall University of North Carolina, Chapel Hill, NC 27599-3175, U.S.A. marshall at cs.unc.edu A new context-sensitive neural network, called an "EXIN" (excitatory+ inhibitory) network, is described. EXIN networks self-organize in complex perceptual environments, in the presence of multiple superimposed patterns, multiple scales, and uncertainty. The networks use a new inhibitory learning rule, in addition to an excitatory learning rule, to allow superposition of multiple simultaneous neural activations (multiple winners), under strictly regulated circumstances, instead of forcing winner-take-all pattern classifications. The multiple activations represent uncertainty or multiplicity in perception and pattern recognition. Perceptual scission (breaking of linkages) between independent category groupings thus arises and allows effective global context-sensitive segmentation and constraint satisfaction. A Weber Law neuron-growth rule lets the network learn and classify input patterns despite variations in their spatial scale. Applications of the new techniques include segmentation of superimposed auditory or biosonar signals, segmentation of visual regions, and representation of visual transparency. [Submitted for publication.] ---------------------------------------------------------------------------- marshall.steering.ps.Z (16 pages) CHALLENGES OF VISION THEORY: SELF-ORGANIZATION OF NEURAL MECHANISMS FOR STABLE STEERING OF OBJECT-GROUPING DATA IN VISUAL MOTION PERCEPTION JONATHAN A. MARSHALL [Invited paper, in Chen S-S (Ed.), Stochastic and Neural Methods in Signal Processing, Image Processing, and Computer Vision, Proceedings of the SPIE 1569, San Diego, July 1991, 200-215.] ---------------------------------------------------------------------------- martin.unsmearing.ps.Z (8 pages) UNSMEARING VISUAL MOTION: DEVELOPMENT OF LONG-RANGE HORIZONTAL INTRINSIC CONNECTIONS KEVIN E. MARTIN and JONATHAN A. MARSHALL [In Hanson SJ, Cowan JD, & Giles CL, Eds., Advances in Neural Information Processing Systems, 5. San Mateo, CA: Morgan Kaufmann Publishers, 1993, 417-424.] ---------------------------------------------------------------------------- RETRIEVAL INSTRUCTIONS % ftp archive.cis.ohio-state.edu Name (cheops.cis.ohio-state.edu:yourname): anonymous Password: (use your email address) ftp> cd pub/neuroprose ftp> binary ftp> get marshall.occlusion.ps.Z ftp> get marshall.context.ps.Z ftp> get marshall.steering.ps.Z ftp> get martin.unsmearing.ps.Z ftp> quit % uncompress marshall.occlusion.ps.Z ; lpr marshall.occlusion.ps % uncompress marshall.context.ps.Z ; lpr marshall.context.ps % uncompress marshall.steering.ps.Z ; lpr marshall.steering.ps % uncompress martin.unsmearing.ps.Z ; lpr martin.unsmearing.ps From tgd at chert.CS.ORST.EDU Wed Feb 2 13:02:30 1994 From: tgd at chert.CS.ORST.EDU (Tom Dietterich) Date: Wed, 2 Feb 94 10:02:30 PST Subject: some questions on training neural nets... In-Reply-To: "Charles X. Ling"'s message of Tue, 1 Feb 94 03:37:10 EST <9402010837.AA01695@godel.csd.uwo.ca> Message-ID: <9402021802.AA00565@curie.CS.ORST.EDU> From: "Charles X. Ling" Date: Tue, 1 Feb 94 03:37:10 EST Hi neural net experts, I am using backprop (and variations of it) quite often although I have not followed neural net (NN) research as well as I wanted. Some rather basic issues in training NN still puzzle me a lot, and I hope to get advice and help from the experts in the area. Sorry for being ignorant. 
Say we are learning a function F (such as a Boolean function of n vars). The training set (TR) and testing set (TS) are drawn randomly according to the same probability distribution, with no noise added in. 1. Is it true that, since there is no noise, the smaller the training error on TR, the better it would predict in general on TS? That is, stopping training earlier is not needed (so cross-validation is not needed). No, this is not true. Even in the noise-free case, the bias/variance tradeoff is operating and it is possible to overfit the training data. Consider for example an algorithm that just memorized the training set and guessed "false" on all unseen examples. It has obviously overfit, and it will obviously do poorly even in the absence of noise. 2. Is it true that, to get reliable prediction (good or bad), we should always choose net architecture with a minimum number of hidden units (or weights via weight decaying)? Will cross-validation help if we have too much freedom in the net (could results on the validation set be coincident)? There are many ways to manage the bias/variance tradeoff. I would say that there is nothing approaching complete agreement on the best approaches (and more fundamentally, the best approach varies from one application to another, since this is really a form of prior). The approaches can be summarized as * early stopping * error function penalties * size optimization - growing - pruning - other Early stopping usually employs cross-validation to decide when to stop training. (see below). In my experience, training an overlarge network with early stopping gives better performance than trying to find the minimum network size. It has the disadvantage that training costs are very high. Error function penalties such as weight decay and soft weight-sharing have been very effective in some applications. In my experience, they introduce additional training problems, because the error surface can develop more local minima. A solution to this is to gradually increase the penalties during training, but this requires more hands-on work than I have patience for. Size optimization attempts to find the optimal number of units and/or number of weights. Cascade-correlation and related algorithms grow the network, optimal brain damage and optimal brain surgeon prune the network, and then of course one can use cross-validation and just generate-and-test different network sizes. An advantage of "right-sizing" is that training time can be considerably reduced (at least the time per epoch). A problem with right-sizing, I believe, is that simply counting units or weights is not necessarily a good measure of network size. The work by Weigend (see 1993 summer school proceedings) suggests that early stopping provides a better method for modulating the effective number of parameters in the network. The OBD/OBS methods do not "just count weights", but instead assess the significance of the weights, so even non-zero weights that are useless can be removed. 3. If, for some reason, cross-validation is needed, and TR is split to TR1 (for training) and TR2 (for validation), what would be the proper ways to do cross-validation? Training on TR1 uses only partial information in TR, but training TR1 to find right parameters and then training on TR1+TR2 may require parameters different from the estimation of training TR1. I use the TR1+TR2 approach. On large data sets, this works well. 
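To make the TR1+TR2 recipe concrete, here is a minimal sketch in Python (illustrative only -- scikit-learn's MLPRegressor and the toy noise-free target below are stand-ins for whatever simulator and data you actually use):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)

# Toy noise-free data: the target is a smooth function of 10 inputs.
X = rng.uniform(-1.0, 1.0, size=(600, 10))
y = np.sin(X @ rng.randn(10))

X_tr1, y_tr1 = X[:400], y[:400]        # TR1: used for the weight updates
X_tr2, y_tr2 = X[400:500], y[400:500]  # TR2: validation set, decides when to stop
X_ts,  y_ts  = X[500:], y[500:]        # independent test set

def make_net():
    return MLPRegressor(hidden_layer_sizes=(30,), learning_rate_init=0.01,
                        random_state=0)

# Train on TR1, monitor squared error per example on TR2, remember the best epoch.
net = make_net()
best_err, best_epoch = np.inf, 0
for epoch in range(1, 2001):
    net.partial_fit(X_tr1, y_tr1)                     # one pass over TR1
    err = np.mean((net.predict(X_tr2) - y_tr2) ** 2)  # validation error per example
    if err < best_err:
        best_err, best_epoch = err, epoch
    elif epoch - best_epoch > 100:                    # patience: stop once TR2 error stalls
        break

# Retrain a fresh net on TR1+TR2 for the number of epochs chosen above.
final = make_net()
X_all, y_all = np.vstack([X_tr1, X_tr2]), np.concatenate([y_tr1, y_tr2])
for _ in range(best_epoch):
    final.partial_fit(X_all, y_all)

print("stopped at epoch", best_epoch,
      " test MSE:", np.mean((final.predict(X_ts) - y_ts) ** 2))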
On small data sets, the cross-validation estimates themselves are very noisy, so I have not found it to be as successful. I compute the stopping point using the sum squared error per training example, so that it scales. I think it is an open research problem to know whether this is the right thing to do. On a large speech recognition data set, after doing cross-validation training, we later checked to see if we had stopped at the right point (by monitoring using the test set). The cross-validation point was nearly exactly right. This was a case with a large data set. 4. In case the net has too much freedom (even different random seeds produce very different predictive accuracies), how can we effectively reduce the variations? Weight decaying seems to be a powerful tool, any others? What kind of "simple" functions weight decaying is biased to? Thanks very much for help Charles --Tom From karun at faline.bellcore.com Thu Feb 3 10:15:55 1994 From: karun at faline.bellcore.com (N. Karunanithi) Date: Thu, 3 Feb 1994 10:15:55 -0500 Subject: Encoding missing values Message-ID: <199402031515.KAA29100@faline.bellcore.com> > I am currently thinking about the problem of how to encode data with > attributes for which some of the values are missing in the data set for > neural network training and use. I am also having the same problem. I would like to get a copy of the responses. >1. Nominal attributes (that have n different possible values) > 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one > being 1 all others 0. > This encoding is very general, but has the disadvantage of producing > networks with very many connections. > Missing values can either be represented as 'all zero' or by simply > treating 'is missing' as just another possible input value, resulting > in a "1-of-(n+1)" encoding. > 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a > binary representation of the numbers 0...n-1 (or 1...n). > Missing values can either be represented as just another possible input > value (probably all-bits-zero is best) or by adding an additional network > input which is 1 for 'is missing' and 0 for 'is present'. The original > inputs should probably be all zero in the 'is missing' case. > Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. >2. continuous attributes (or attributes treated as continuous) > 2.1 encoded as a single network input, perhaps using some monotone transformation > to force the values into a certain distribution. > Missing values are either encoded as a kind of 'best guess' (e.g. the > average of the non-missing values for this attribute) or by using > an additional network input being 0 for 'missing' and 1 for 'present' > (or vice versa) and setting the original attribute input either to 0 > or to the 'best guess'. (The 'best guess' variant also applies to > nominal attributes above) This representation requires a GUESS. A nominal transformation may not be a proper representation in some cases. Assume that the output values range over a large numerical interval. For example, from 0.0 to 10,000.0.
If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then in theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had a lot of problems with this approach. Maybe a log transformation before scaling would not be a bad choice. If you use a closed scaling you may have a problem whenever a future value exceeds the maximum value of the numerical interval. For example, assume that the attribute is time, say in milliseconds. Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. > 3. binary attributes (truth values) > 3.1 encoded by one input: 0=false 1=true or vice versa > Treat like (2.1) > 3.2 encoded by one input: -1=false 1=true or vice versa > In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. > 3.3 treat like nominal attribute with 2 possible values No comments. > 4. ordinal attributes (having n different possible values, which are ordered) > 4.1 treat either like continuous or like nominal attribute. > If (1.2) is chosen, a Gray-Code should be used. > Continuous representation is risky unless a 'sensible' quantification > of the possible values is available. I have compared Binary Encoding (1.2), Gray-Coded representation and straightforward scaling. Closed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find significant improvement in prediction accuracy. (Refer to: N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp. 563-574. N. Karunanithi and Y. K. Malaiya, "The Scaling Problem in Neural Networks for Software Reliability Prediction", Proc. IEEE Int. Symposium on Rel. Eng., Oct. 1992, pp. 776-82.) > So far to my considerations. Now to my questions. > > a) Can you think of other encoding methods that seem reasonable ? Which ? > > b) Do you have experience with some of these methods that is worth sharing ? > > c) Have you compared any of the alternatives directly ? > > Lutz I have not found a simple solution that is general. I think representation in general and missing information in particular are open problems within connectionist research. I am not sure we will have a magic bullet for all problems. The best approach is to come up with a specific solution for a given problem. -Karun From Thierry.Denoeux at hds.univ-compiegne.fr Thu Feb 3 03:36:47 1994 From: Thierry.Denoeux at hds.univ-compiegne.fr (Thierry.Denoeux@hds.univ-compiegne.fr) Date: Thu, 3 Feb 1994 09:36:47 +0100 Subject: Encoding missing values Message-ID: <199402030836.AA29123@kaa.hds.univ-compiegne.fr> Dear Lutz, dear connectionists, In a recent mailing, Lutz Prechelt mentioned the interesting problem of how to encode attributes with missing values as inputs to a neural network. I have recently been faced with that problem while applying neural nets to rainfall prediction using weather radar images. The problem was to classify pairs of "echoes" -- defined as groups of connected pixels with reflectivity above some threshold -- taken from successive images as corresponding to the same rain cell or not. Each pair of echoes was described by a list of attributes. Some of these attributes, referring to the past of a sequence, were not defined for some instances.
To encode these attributes with potentially missing values, we applied two different methods actually suggested by Lutz: - the replacement of the missing value by a "best-guess" value - the addition of a binary input indicating whether the corresponding attribute was present or absent. Significantly better results were obtained by the second method. This work was presented at ICANN'93 last September: X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, p. 962-967. Thierry Denoeux +------------------------------------------------------------------------+ | tdenoeux at hds.univ-compiegne.fr Thierry DENOEUX | | Departement de Genie Informatique | | Centre de Recherches de Royallieu | | tel (+33) 44 23 44 96 Universite de Technologie de Compiegne | | fax (+33) 44 23 44 77 B.P. 649 | | 60206 COMPIEGNE CEDEX | | France | +------------------------------------------------------------------------+ From rreilly at nova.ucd.ie Thu Feb 3 10:38:08 1994 From: rreilly at nova.ucd.ie (Ronan Reilly) Date: Thu, 3 Feb 1994 15:38:08 +0000 Subject: Fourth Irish Neural Networks Conference - INNC'94 Message-ID: FOURTH IRISH NEURAL NETWORK CONFERENCE - INNC'94 University College Dublin, Ireland September 12-13, 1994 FIRST CALL FOR PAPERS Papers are solicited for the Fourth Irish Neural Network Conference (INNC'94). They can be in any area of theoretical or applied neural networks. A non-exhaustive list of topic headings includes: Learning algorithms Cognitive modelling Neurobiology Natural language processing Vision Signal processing Time series analysis Hardware implementations An extended abstract of not more than 500 words should be sent, preferably by e-mail, to: Ronan Reilly - INNC'94 Dept. of Computer Science University College Dublin Belfield Dublin 4 IRELAND e-mail: rreilly at nova.ucd.ie The deadline for receipt of abstracts is March 31, 1994. Authors will be contacted regarding acceptance by April 30, 1994. Full papers will be required by August 31, 1994. From finnoff at predict.com Thu Feb 3 11:40:51 1994 From: finnoff at predict.com (William Finnoff) Date: Thu, 3 Feb 94 09:40:51 MST Subject: some questions on training neural nets... Message-ID: <9402031640.AA01243@predict.com> Charles X. Ling writes: > Hi neural net experts, > > I am using backprop (and variations of it) quite often although I have > not followed neural net (NN) research as well as I wanted. Some rather > basic issues in training NN still puzzle me a lot, and I hope to get advice > and help from the experts in the area. Sorry for being ignorant.... In addition to Tom's pertinent comments (tgd at chert.cs.orst.edu, Thu Feb 3), I would suggest consulting the following references which contain discussions of various issues pertaining to model selection/overfitting/stopped training/complexity control/the bias-variance dilemma. (This list is by no means complete). References 2), 4), 13), 15) and 17) are particularly relevant to the questions raised. 1) Baldi, P. and Chauvin, Y. (1991). Temporal evolution of generalization during learning in linear networks, {\it Neural Computation} 3, 589-603. 2) Finnoff, W., Hergert, F. and Zimmermann, H.G., Improving generalization performance by nonconvergent model selection methods, {\it Neural Networks}, vol.6, nr.6, pp. 771-783, 1993. 3) Finnoff, W. and Zimmermann, H.G. (1991). Detecting structure in small datasets by network fitting under complexity constraints. To appear in {\it Proc. of 2nd Ann.
Workshop on Computational Learning Theory and Natural Learning Systems}, Berkley. 4) Geman, S., Bienenstock, E. and Doursat R., (1992). Neural networks and the bias/variance dilemma, {\it Neural Computation} 4, 1-58. 5) Guyon, I., Vapnik, V., Boser, B., Bottou, L. and Solla, S. (1992). Structural risk minimization for character recognition. In J. Moody, J. Hanson and R. Lippmann (Eds.), {\it Advances in Neural Information Processing Systems IV} (pp. 471-479). San Mateo: Morgan Kaufman. 6) Hanson, S. J., and Pratt, L. Y. (1989). Comparing biases for minimal network construction with back-propagation, In D. S. Touretzky, (Ed.), {\it Advances in Neural Information Processing I} (pp.177-185). San Mateo: Morgan Kaufman. 7) Hergert, F., Finnoff, W. and Zimmermann, H.G. (1992). A comparison of weight elimination methods for reducing complexity in neural networks. {\it Proc. Int. Joint Conf. on Neural Networks}, Baltimore. 8) Hergert, F., Zimmermann, H.G., Kramer, U., and Finnoff, W. (1992). Domain independent testing and performance comparisons for neural networks. In I. Aleksander and J. Taylor (Eds.) {\it Artificial Neural Networks II} (pp.1071-1076). London: North Holland. 9) Le Cun, Y., Denker J. and Solla, S. (1990). Optimal Brain Damage. In D. Touretzky (Ed.) {\it Advances in Neural Information Processing Systems II} (pp.598-605). San Mateo: Morgan Kaufman. 10) MacKay, D. (1991). {\it Bayesian Modelling and Neural Networks}, Dissertation, Computational and Neural Systems, California Inst. of Tech. 139-74, Pasadena. 11) Moody, J. (1992). Generalization, weight decay and architecture selection for nonlinear learning systems. In J. Moody, J. Hanson and R. Lippmann (Eds.), {\it Advances in Neural Information Processing Systems IV} (pp. 471-479). San Mateo: Morgan Kaufman. 12) Morgan, N. and Bourlard, H. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. In D. Touretzky (Ed.) {\it Advances in Neural Information Processing Systems II} (pp.598-605). San Mateo: Morgan Kaufman. 13) Sj\"oberg, J. and Ljung, L. (1992). Overtraining, regularization and searching for minimum in neural networks, {Report LiTH-ISY-I-1297, Dep. of Electrical Engineering}, Link\"oping University, S-581 83 Link\"oping, Sweden. 14) Stone, C.J. (1977). Cross-validation: A review. {\it Math. Operations res. Statist. Ser.}, 9, 1-51. 15) Vapnik, V. (1992). Principles of risk minimization for learning theory. In J. Moody, J. Hanson and R. Lippmann (Eds.), {\it Advances in Neural Information Processing Systems IV} (pp. 831-838 ). San Mateo: Morgan Kaufman. 16) Weigend, A. and Rumelhart, D. (1991). The effective dimension of the space of hidden units, in {\it Proc. Int. Joint Conf. on Neural Networks}, Singapore. 17) Weigend, A., Rumelhart, D., and Huberman, B. (1991). Generalization by weight elimination with application to forecasting. In R. Lippman, J. Moody and D. Touretzy (Eds.), {\it Advances in Neural Information Processing III} (pp.875-882). San Mateo: Morgan Kaufman. 18) White, H. (1989). Learning in artificial neural networks: A statistical perspective, {\it Neural Computation} 1, 425-464. -William %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% William Finnoff Prediction Co. 320 Aztec St., Suite B Santa Fe, NM, 87501, USA Tel.: (505)-984-3123 Fax: (505)-983-0571 e-mail: finnoff at predict.com From jlm at crab.psy.cmu.edu Thu Feb 3 11:27:41 1994 From: jlm at crab.psy.cmu.edu (James L. 
McClelland) Date: Thu, 3 Feb 94 11:27:41 EST Subject: CMU-Pitt Center for the Neural Basis of Cognition Message-ID: <9402031627.AA08304@crab.psy.cmu.edu.psy.cmu.edu> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Carnegie Mellon University and the University of Pittsburgh Announce the Creation of the Center for the Neural Basis of Cognition The Center is dedicated to the study of the neural basis of cognitive processes, including learning and memory, language and thought, perception, attention, and planning; to the study of the development of the neural substrate of these processes; to the study of disorders of these processes and their underlying neuropathology; and to the promotion of applications of the results of these studies to artificial intelligence, technology, and medicine. The Center will synthesize the disciplines of basic and clinical neuroscience, cognitive psychology, and computer science, combining neurobiological, behavioral, computa- tional and brain imaging methods. +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Faculty Openings in the Center The Center seeks faculty and research scientists whose work relates to the mission stated above. Recruiting is beginning immediately, and will continue for several years. Appointments can be at any level and will be coordinated with one or more departments at either university. Coordinating departments include Biological Sciences, Computer Science, and Psychology at Carnegie Mellon and the departments of Behavioral Neuroscience, Neurobiology, Neurology, Psychiatry and Psychology at the University of Pittsburgh. Other affiliations may be possible. Candidates should send an application to either of the Co-Directors of the Center, listed below. The application should include a statement of interest indicating how the candidate's work fits the mission of the center and suggesting possible departmental affiliations, as well as a CV, copies of publications, and three letters of reference. Both uni- versities are EEO/AA Employers. James L. McClelland Robert Y. Moore Department of Psychology Center for Neuroscience Baker Hall 345-F Biomedical Science Tower 1656 Carnegie Mellon University University of Pittsburgh Pittsburgh, PA 15213 Pittsburgh, PA 15261 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ From wahba at stat.wisc.edu Thu Feb 3 20:42:28 1994 From: wahba at stat.wisc.edu (Grace Wahba) Date: Thu, 3 Feb 94 19:42:28 -0600 Subject: nips6 paper on ss-anova in archive Message-ID: <9402040142.AA06981@hera.stat.wisc.edu> Dear Colleagues Our paper for the 1993 Neural Information Processing Society (NIPS) Proceedings is in the neuroprose archive under wahba.nips6.ps.Z Title: Structured Machine Learning For `Soft' Classification with Smoothing Spline ANOVA and Stacked Tuning, Testing and Evaluation. Authors: G. Wahba, Y. Wang, C. Gu, R. Klein and B. Klein Summary We describe the use of smoothing spline analysis of variance (SS-ANOVA) in the penalized log likelihood context, for learning (estimating) the probability $p$ of a `$1$' outcome, given a training set with attribute vectors and 0-1 outcomes. $p$ is of the form $p(t) = e^{f(t)}/(1+e^{f(t)})$, where, if $t$ is a vector of attributes, $f$ is learned as a sum of smooth functions of one attribute plus a sum of smooth functions of two attributes, etc. The smoothing parameters governing $f$ are obtained by an iterative unbiased risk or iterative GCV method. Confidence intervals for these estimates are available. 
The method is applied to estimate the risk of progression of diabetic retinopathy given predictor variables of age, body mass index and glycosylated hemoglobin. RETRIEVAL INSTRUCTIONS for NEUROPROSE ARCHIVE % ftp archive.cis.ohio-state.edu Name (cheops.cis.ohio-state.edu:yourname): anonymous Password: (use your email address) ftp> cd pub/neuroprose ftp> binary ftp> get wahba.nips6.ps.Z ftp> quit % uncompress wahba.nips6.ps.Z % lpr wahba.nips6.ps Some other papers of yours truly, friends and students, and an idiosyncratic bibliography of possible interest to connectionists are available by ftp. Get the (ascii) file Contents to see what's there. RETRIEVAL INSTRUCTIONS for WAHBA's public directory % ftp ftp.stat.wisc.edu Name (ftp.stat.wisc.edu:yournamehere): anonymous Password: (use your email address) ftp> binary ftp> cd pub/wahba ftp> get Contents ... read Contents and retrieve files of interest From pollack at cis.ohio-state.edu Thu Feb 3 17:17:14 1994 From: pollack at cis.ohio-state.edu (Jordan B Pollack) Date: Thu, 3 Feb 1994 17:17:14 -0500 Subject: new neuroprose/Thesis subdirectory Message-ID: <199402032217.RAA01292@dendrite.cis.ohio-state.edu> *** do not forward ** The filesystem on which neuroprose resides has overflowed. A set of very large files (all the files with *thesis* in their filename), have been moved to a new subdirectory. jordan From bill at nsma.arizona.edu Thu Feb 3 23:53:26 1994 From: bill at nsma.arizona.edu (Bill Skaggs) Date: Thu, 03 Feb 1994 21:53:26 -0700 (MST) Subject: Encoding missing values Message-ID: <9402040453.AA24599@nsma.arizona.edu> There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. -- Bill From tal at goshawk.lanl.gov Fri Feb 4 10:22:12 1994 From: tal at goshawk.lanl.gov (Tal Grossman) Date: Fri, 4 Feb 1994 08:22:12 -0700 Subject: some questions on training neural nets... Message-ID: <199402041522.IAA22945@goshawk.lanl.gov> Dear Charles X. Ling, You say: "Some rather basic issues in training NN still puzzle me a lot, and I hope to get advice and help from the experts in the area." Well... the questions you have asked still puzzle the experts as well, and good answers, where they exist, are very much case dependent. As Tom Dietterich wrote, in general "Even in the noise-free case, the bias/variance tradeoff is operating and it is possible to overfit the training data", therefore you can not expect just any large net to generalize well. It was also observed recently that... When having a large enough set of examples (so one can have a good enough sample for the training and the validation set), you can obtain better generalization with larger nets by using cross validation to decide when to stop training, as is demonstrated in the paper of A. Weigend : Weigend A.S. (1994), in the {\em Proc. of the 1993 Connectionist Models Summer School}, edited by M.C. Mozer, P. Smolensky, D.S. Touretzky, J.L. Elman and A.S. Weigend, pp. 335-342 (Erlbaum Associates, Hillsdale NJ, 1994). 
Rich Caruana has presented similar results in the "Complexity Issues" workshop in the last NIPS post-conference. But... Larger networks can generalize as good as, or even better than small networks even without cross-validation. A simple experiment that demonstrates that was presented in : T. Grossman, R. Meir and E. Domany, Learning by choice of Internal Representations, Complex Systems 2, 555-575 (1988). In that experiment, networks with different number of hidden units were trained to perform the symmetry task by using a fraction of the possible examples as the training set, training the net to 100% performance on the TR set and testing the performance on the rest (off training set generalization). No early stopping, no cross validation. The symmetry problem can be solved by 2 hidden units - so this is the minimal architecture required for this specific function. However, it was found that it is NOT the best generalizing architecture. The generalization rates of all the architectures (H=2..N, the size of the input) were similar, with the larger networks somewhat better. Now, this is a special case. One can explain it by observing that the symmetry problem can also be solved by a network of N hidden units, with smaller weights, and not only by effectively "zeroing" the contributions of all but two units (see an example in Minsky and Papert's Perceptrons). Probably by all the other architectures as well. So, considering the mapping from weight space to function space, it is very likely that training a large network on partial data will take you closer (in function space) to your target function F (symmetry in that case) than training a small one. The picture can be different in other cases... One has to remember that the training/generalization problem (including the bias/variance tradeoff problem) is, in general, a complex interaction between three entities: 1. The target function (or the task). 2. The learning model, and what is the class of functions that is realizable by this model (and its associated learning algorithm). 3. The training set, and how well it represents the task. Even the simple question: is my training set large enough (or good enough) ? is not simple at all. One might think that it should be larger than, say, twice the number of free parameters (weights) in your model/network architecture. It turns out that not even this is enough in general. Allow me to advertise here the paper presented by A.Lapedes and myself at the last NIPS where we present a method to test a "general" classification algorithm (i.e. any classifier such as a neural net, a decision tree, etc. and its learning algorithm, which may include pruning or net construction) by a method we call "noise sensitivity signature" NSS (see abstract below). In addition to introducing this new model selection method, which we believe can be a good alternative to cross-validation in data limited cases, we present the following experiment: the target function is a network with 20:5:1 architecture (weights chosen at random). The training set is provided by choosing M random input patterns and classifying them by the teacher net. we then train other nets with various architectures, ranging from 1 to 8 hidden units on the training set (without controlled stopping, but with tolerance in the error function). A different (and large) set of classified examples is used to determine the generalization performance of the trained nets (averaged over several realizations with different initial weights). 
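A rough sketch of such a teacher/student experiment is given below. This is illustrative code only, not the code used for the results that follow; scikit-learn's MLPClassifier, the tanh teacher and all of the specific settings are my own assumptions, with only the 20:5:1 teacher, the student sizes 1..8 and the training set sizes taken from the description above.

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.RandomState(1)

# A fixed, randomly weighted 20:5:1 "teacher" network provides the labels.
W1, b1 = rng.randn(20, 5), rng.randn(5)
W2, b2 = rng.randn(5, 1), rng.randn(1)

def teacher_labels(X):
    h = np.tanh(X @ W1 + b1)
    return (np.tanh(h @ W2 + b2).ravel() > 0).astype(int)

def make_set(n):
    X = rng.choice([-1.0, 1.0], size=(n, 20))   # random binary input patterns
    return X, teacher_labels(X)

X_test, y_test = make_set(10000)                # large independent test set

for m in (400, 700, 1000):                      # training set sizes
    X_train, y_train = make_set(m)
    for h in range(1, 9):                       # student nets with 1..8 hidden units
        scores = []
        for seed in range(5):                   # average over weight initializations
            student = MLPClassifier(hidden_layer_sizes=(h,), max_iter=2000,
                                    random_state=seed)
            student.fit(X_train, y_train)
            scores.append(student.score(X_test, y_test))
        print("M=%d  hidden=%d  generalization=%.3f" % (m, h, np.mean(scores)))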
Some of the results are: 1. With different training set sizes M=400, 700, 1000, the optimal architecture is different. A smaller training set yields a smaller optimal network, according to the independent test set measure. 2. Even with M=1000 (much more than twice the number of weights), the optimal learning net is still smaller than the original teacher net. 3. There are differences of up to a few percent in generalization performance of the different learning nets for all training set sizes. In particular, nets that are larger than the optimal do worse as their size increases. Depending on your problem, a few percent can be insignificant or they can make a real difference. In some real applications, 1-2 % can be the difference between a contract and a paper... In such cases you would like to tune your model (i.e., to identify the optimal architecture) as best you can. 4. Using the NSS it was possible to recognize the optimal architectures for each training set, without using extra data. Some conclusions are: 1. If one uses a validation set to choose the architecture (not for stopping) - for example by using the extra 1000 examples - then the architecture that will be picked up when using the 700 training set is going to be smaller (and worse) than the one picked up when using the 1000 training set. In other words, suppose your data is just 1000 examples and you devote 300 of them to be your validation set. Then even if those 300 give a good estimate of the generalization of the trained net, when you choose the model according to this test set you end up with the optimal model for 700 training examples, which is less good than the optimal model that you could obtain when training with all 1000 examples. It means that in many cases you need more examples than one might expect in order to obtain a well tuned model, especially if you are using a considerable fraction of them as a validation set. 2. Using NSS one would find the right architecture for the total number of examples you have - paying a factor of about 30 in training effort. 3. You can use "set 1 aside" cross validation in order to select your model. This will probably overcome the bias caused by giving up a large fraction of the examples. However, in order to obtain a reliable estimate of the performance the training process will have to be repeated many times, probably more than what is needed in order to calculate the NSS. It is important to emphasize again: the above results were obtained for that specific experiment. We have obtained similar results with different tasks (e.g. DNA structure classification) and with different learning machines (e.g. decision trees), but still, these results prove nothing "in general", except maybe that life is complicated and full of uncertainty... A more careful comparison with cross validation as a stopping method, and using NSS in other scenarios (like function fitting), is under investigation. If anyone is interested in using the NSS method in combination with pruning methods (e.g. to test the stopping criteria), I will be glad to help. I will be grateful for any other information/references about similar experiments. I hope all the above did not add too much to your puzzlement. Good luck with your training, Tal ------------------------------------------------ The paper I mentioned above is: Learning Theory seminar: Thursday Feb.10. 15:15. CNLS Conference room. title: Use of Bad Training Data For Better Predictions.
by : Tal Grossman and Alan Lapedes (Complex Systems group, LANL) Abstract: We present a method for calculating the ``noise sensitivity signature'' of a learning algorithm which is based on scrambling the output classes of various fractions of the training data. This signature can be used to indicate a good (or bad) match between the complexity of the classifier and the complexity of the data and hence to improve the predictive accuracy of a classification algorithm. Use of noise sensitivity signatures is distinctly different from other schemes to avoid overtraining, such as cross-validation, which uses only part of the training data, or various penalty functions, which are not data-adaptive. Noise sensitivity signature methods use all of the training data and are manifestly data-adaptive and non-parametric. They are well suited for situations with limited training data It is going to appear in the Proc. of NIPS 6. An expanded version of it will (hopefully) be placed in the neuroprose archive within a week or two. Until then I can send a ps file of it to the interested. From sef+ at cs.cmu.edu Fri Feb 4 10:25:51 1994 From: sef+ at cs.cmu.edu (Scott E. Fahlman) Date: Fri, 04 Feb 94 10:25:51 EST Subject: Encoding missing values In-Reply-To: Your message of Thu, 03 Feb 94 21:53:26 -0700. <9402040453.AA24599@nsma.arizona.edu> Message-ID: There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. Good point, but perhaps in need of clarification for some readers: There are two ways of training a Boltzmann machine. In one (the original form), there is no distinction between input and output units. During training we alternate between an instruction phase, in which all of the externally visible units are clamped to some pattern, and a normalization phase, in which the whole network is allow to run free. The idea is to modify the weights so that, when running free, the external units assume the various pattern values in the training set in their proper frequencies. If only some subset of the externally visible units are clamped to certain values, the net will produce compatible completions in the other units, again with frequencies that match this part of the training set. A net trained in this way will (in principle -- it might take a *very* long time for anything complicated) do what you suggest: Complete an "input" pattern and produce a compatible output at the same time. This works even if the input is *totally* missing. I believe it was Geoff Hinton who realized that a Boltzmann machine could be trained more efficiently if you do make a distinction between input and output units, and don't waste any of the training effort learning to reconstruct the input. In this model, the instruction phase clamps both input and output units to some pattern, while the normalization phase clamps only the input units. 
Since the input units are correct in both cases, all of the networks learning power (such as it is) goes into producing correct patterns on the output units. A net trained in this way will not do input-completion. I bring this up because I think many people will only have seen the latter kind of Boltzmann training, and will therefore misunderstand your observation. By the way, one alternative method I have seen proposed for reconstructing missing input values is to first train an auto-encoder (with some degree of bottleneck to get generalization) on the training set, and then feed the output of this auto-encoder into the classification net. The auto-encoder should be able to replace any missing values with some degree of accuracy. I haven't played with this myself, but it does sound plausible. If anyone can point to a good study of this method, please post it here or send me E-mail. -- Scott =========================================================================== Scott E. Fahlman Internet: sef+ at cs.cmu.edu Senior Research Scientist Phone: 412 268-2575 School of Computer Science Fax: 412 681-5739 Carnegie Mellon University Latitude: 40:26:33 N 5000 Forbes Avenue Longitude: 79:56:48 W Pittsburgh, PA 15213 =========================================================================== From zoubin at psyche.mit.edu Fri Feb 4 11:04:32 1994 From: zoubin at psyche.mit.edu (Zoubin Ghahramani) Date: Fri, 4 Feb 94 11:04:32 EST Subject: Encoding missing values Message-ID: <9402041604.AA28037@psyche.mit.edu> Dear Lutz, Thierry, Karun, and connectionists, I have also been looking into the issue of encoding and learning from missing values in a neural network. The issue of handling missing values has been addressed extensively in the statistics literature for obvious reasons. To learn despite the missing values the data has to be filled in, or the missing values integrated over. The basic question is how to fill in the missing data. There are many different methods for doing this in stats (mean imputation, regression imputation, Bayesian methods, EM, etc.). For good reviews see (Little and Rubin 1987; Little, 1992). I do not in general recommend encoding "missing" as yet another value to be learned over. Missing means something in a statistical sense -- that the input could be any of the values with some probability distribution. You could, for example, augment the original data filling in different values for the missing data points according to a prior distribution. Then the training would assign different weights to the artificially filled-in data points depending on how well they predict the output (their posterior probability). This is essentially the method proposed by Buntine and Weigand (1991). Other approaches have been proposed by Tresp et al. (1993) and Ahmad and Tresp (1993). I have just written a paper on the topic of learning from incomplete data. In this paper I bring a statistical algorithm for learning from incomplete data, called EM, into the framework of nonlinear function approximation and classification with missing values. This approach fits the data iteratively with a mixture model and uses that same mixture model to effectively fill in any missing input or output values at each step. You can obtain the preprint by ftp psyche.mit.edu login: anonymous cd pub get zoubin.nips93.ps To obtain code for the algorithm please contact me directly. 
Zoubin Ghahramani zoubin at psyche.mit.edu ----------------------------------------------------------------------- Ahmad, S and Tresp, V (1993) "Some Solutions to the Missing Feature Problem in Vision." In Hanson, S.J., Cowan, J.D., and Giles, C.L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. Buntine, WL, and Weigand, AS (1991) "Bayesian back-propagation." Complex Systems. Vol 5 no 6 pp 603-43 Ghahramani, Z and Jordan MI (1994) "Supervised learning from incomplete data via an EM approach" To appear in Cowan, J.D., Tesauro, G., and Alspector,J. (eds.). Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers, San Francisco, CA, 1994. Little, RJA (1992) "Regression With Missing X's: A Review." Journal of the American Statistical Association. Volume 87, Number 420. pp. 1227-1237 Little, RJA. and Rubin, DB (1987). Statistical Analysis with Missing Data. Wiley, New York. Tresp, V, Hollatz J, Ahmad S (1993) "Network structuring and training using rule-based knowledge." In Hanson, S.J., Cowan, J.D., and Giles, C.~L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. From Volker.Tresp at zfe.siemens.de Fri Feb 4 13:09:46 1994 From: Volker.Tresp at zfe.siemens.de (Volker Tresp) Date: Fri, 4 Feb 1994 19:09:46 +0100 Subject: missing data Message-ID: <199402041809.AA14305@inf21.zfe.siemens.de> In response to the questions raised by Lutz Prechelt concerning the missing data problem: In general, the solution to the missing-data problem depends on the missing-data mechanism. For example, if you sample the income of a population and rich people tend to refuse the answer the mean of your sample is biased. To obtain an unbiased solution you would have to take into account the missing-data mechanism. The missing-data mechanism can be ignored if it is independent of the input and the output (in the example: the likelihood that a person refuses to answer is independent of the person's income). Most approaches assume that the missing-data mechanism can be ignored. There exist a number of ad hoc solutions to the missing-data problem but it is also possible to approach the problem from a statistical point of view. In our paper (which will be published in the upcoming NIPS-volume and which will be available on neuroprose shortly) we discuss a systematic likelihood-based approach. NN-regression can be framed as a maximum likelihood learning problem if we assume the standard signal plus Gaussian noise model P(x, y) = P(x) P(y|x) \propto P(x) exp(-1/(2 \sigma^2) (y - NN(x))^2). By deriving the probability density function for a pattern with missing features we can formulate a likelihood function including patterns with complete and incomplete features. The solution requires an integration over the missing input. In practice, the integral is approximated using a numerical approximation. For networks of Gaussian basis functions, it is possible to obtain closed-form solutions (by extending the EM algorithm). Our paper also discusses why and when ad hoc solutions --such as substituting the mean for an unknown input-- are harmful. For example, if the mapping is approximately linear substituting the mean might work quite well. In general, although, it introduces bias. Training with missing and noisy input data is described in: ``Training Neural Networks with Deficient Data,'' V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G., and Alspector, J. 
(eds.), {\em Advances in Neural Information Processing Systems 6}, Morgan Kaufmann, 1994. A related paper by Zoubin Ghahramani and Michael Jordan will also appear in the upcoming NIPS-volume. Recall with missing and noisy data is discussed in (available in neuroprose as ahmad.missing.ps.Z): ``Some Solutions to the Missing Feature Problem in Vision,'' S. Ahmad and V. Tresp, in {\em Advances in Neural Information Processing Systems 5,} S. J. Hanson, J. D. Cowan, and C. L. Giles eds., San Mateo, CA, Morgan Kaufman, 1993. Volker Tresp Subutai Ahmad Ralph Neuneier tresp at zfe.siemens.de ahmad at interval.com ralph at zfe.siemens.de From wray at ptolemy-ethernet.arc.nasa.gov Fri Feb 4 15:19:44 1994 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Fri, 4 Feb 94 12:19:44 PST Subject: Encoding missing values In-Reply-To: <199402031515.KAA29100@faline.bellcore.com> (karun@faline.bellcore.com) Message-ID: <9402042019.AA05621@ptolemy.arc.nasa.gov> regarding this missing value question raised thusly .... by Thierry Denoeux, Lutz Prechelt, and others >>>>>>>>>>>>>>> > So far to my considerations. Now to my questions. > > a) Can you think of other encoding methods that seem reasonable ? Which ? > > b) Do you have experience with some of these methods that is worth sharing ? > > c) Have you compared any of the alternatives directly ? > > Lutz + > I have not found a simple solution that is general. I think > representation in general and the missing information in specific > are open problems within connectionist research. I am not sure we will > have a magic bullet for all problems. The best approach is to come up > with a specific solution for a given problem. -> Karun >>>>>>>>>> This missing value problem is of course shared amongst all the learning communities, artificial intelligence, statistics, pattern recognition, etc., not just neural networks. A classic study in this area, which includes most suggestions I've read here so far, is inproceedings{quinlan:ml6, AUTHOR = "J.R. Quinlan", TITLE = "Unknown Attribute Values in Induction", YEAR = 1989, BOOKTITLE = "Proceedings of the Sixth International Machine Learning Workshop", PUBLISHER = "Morgan Kaufmann", ADDRESS = "Cornell, New York"} The most frequently cited methods I've seen, and they're so common amongst the different communities its hard to lay credit: 1) replace missings by their some best guess 2) fracture the example into a set of fractional examples each with the missing value filled in somehow 3) call the missing value another input value 3 is a good thing to do if they are "informative" missing, i.e. if someone leaves the entry "telephone number" blank in a questionaire, then maybe they don't have a telephone, but probably not good otherwise unless you have loads of data and don't mind all the extra example types generated (as already mentioned) 1 is a quick and dirty hack at 2. How good depends on your application. 2 is an approximation to the "correct" approach for handling "non-informative" missing values according to the standard "mixture model". The mathematics for this is general and applies to virtually any learning algorithm trees, feed-forward nets, linear regression, whatever. We do it for feed-forward nets in @article{buntine.weigend:bbp, AUTHOR = "W.L. Buntine and A.S. Weigend", TITLE = "Bayesian Back-Propagation", JOURNAL = "Complex Systems", Volume = 5, PAGES = "603--643", Number = 1, YEAR = "1991" } and see Tresp, Ahmad & Neuneier in NIPS'94 for an implementation. 
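For anyone implementing them, methods 1 and 3 amount to very little code. Here is an illustrative sketch (not taken from any of the references above; all names and values are made up), using NaN to mark a missing continuous entry and an extra 1-of-(n+1) slot for a missing nominal value:

import numpy as np

# A continuous attribute with missing entries marked as NaN.
x = np.array([0.7, np.nan, 1.3, 2.0, np.nan, 0.9])

# Method 1: replace missings by a best guess (here the mean of the observed
# values), optionally adding a 0/1 'is present' indicator as an extra input.
guess = np.nanmean(x)
x_filled = np.where(np.isnan(x), guess, x)
present = (~np.isnan(x)).astype(float)
continuous_inputs = np.column_stack([x_filled, present])

# Method 3: for a nominal attribute, treat 'missing' as one more value,
# i.e. a 1-of-(n+1) encoding (here n=3 colours plus one 'missing' slot).
values = ["red", "green", "blue", None, "green"]
categories = ["red", "green", "blue", "missing"]
nominal_inputs = np.zeros((len(values), len(categories)))
for i, v in enumerate(values):
    nominal_inputs[i, categories.index(v if v is not None else "missing")] = 1.0

print(continuous_inputs)
print(nominal_inputs)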
But no doubt someone probably published the general idea back in the 50's. I certainly wouldn't call missing values an open problem. Rather, "efficient implementations of the standard approaches" is, in some cases, an open problem. Wray Buntine NASA Ames Research Center phone: (415) 604 3389 Mail Stop 269-2 fax: (415) 604 3594 Moffett Field, CA, 94035-1000 email: wray at kronos.arc.nasa.gov From stork at cache.crc.ricoh.com Fri Feb 4 11:57:37 1994 From: stork at cache.crc.ricoh.com (David G. Stork) Date: Fri, 4 Feb 94 08:57:37 -0800 Subject: Missing features... Message-ID: <9402041657.AA12260@neva.crc.ricoh.com> There is a provably optimal method for performing classification with missing inputs, described in Chapter 2 of "Pattern Classification and Scene Analysis" (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, which avoids the ad-hoc heuristics that have been described by others. Those interested in obtaining Chapter two via ftp should contact me. Dr. David G. Stork Chief Scientist and Head, Machine Learning and Perception Ricoh California Research Center 2882 Sand Hill Road Suite 115 Menlo Park, CA 94025-7022 USA 415-496-5720 (w) 415-854-8740 (fax) stork at crc.ricoh.com From wray at ptolemy-ethernet.arc.nasa.gov Fri Feb 4 15:47:25 1994 From: wray at ptolemy-ethernet.arc.nasa.gov (Wray Buntine) Date: Fri, 4 Feb 94 12:47:25 PST Subject: some questions on training neural nets... In-Reply-To: <9402031640.AA01243@predict.com> (message from William Finnoff on Thu, 3 Feb 94 09:40:51 MST) Message-ID: <9402042047.AA06120@ptolemy.arc.nasa.gov> Tom Dietterich and William Finnof covered a lot of issues. I'd just like to highlight two points: * this is a contentious area * there are several opposing factors at play that confuse our understanding of this ================ detail Basically, this comment below is SO true. > There are many ways to manage the bias/variance tradeoff. I would say > that there is nothing approaching complete agreement on the best > approaches (and more fundamentally, the best approach varies from one > application to another, since this is really a form of prior). The > approaches can be summarized as The bias/variance tradeoff lies at the heart of almost all disagreements between different learning philosophies such as classical, Bayesian, minimum description length, resampling schemes (now often viewed as empirical Bayesian), statistical physics approaches, and the various "implementation" schemes. One thing to note is that there are several quite separate forces in operation here: computational and search issues: (e.g. maybe early stopping works better because its a more efficient way of searching the space of smaller networks ?) prior issues: (e.g. have you thrown in 20 attributes you happen to think might apply, but probably 15 are irrelevant; OR did a medical specialist carefully pick all 10 attributes and assures you every one is important, OR is a medical specialist able to solve the task blind, just be reading the 20 attribute values (without seeing the patient), etc.) (e.g. are 30 hidden units adequate for the structure of the task? ) asking the right question: (e.g. sometimes the question: what's the "best" network is a bit silly when you have a small amount of data, perhaps you should be trying to find 10 reasonable alternative networks and pool their results (ala. Michael Perrone's NIPS'93 workshop) understanding your representation: (e.g. 
with rule based systems, each rule has a good interpretation so the question of how to prune, etc., is something you can understand well BUT with a large feed-forward network, understanding the structure of the space is more involved, e.g. if I set these 2 weights to zero what the hell happens to my proposed solution) (e.g. this confuses the problem of designing good regularizes/priors/network-encodings). Problem is that theory people tend to focus on one, maybe two of these, whereas application people tend to confuse them together. Wray Buntine NASA Ames Research Center phone: (415) 604 3389 Mail Stop 269-2 fax: (415) 604 3594 Moffett Field, CA, 94035-1000 email: wray at kronos.arc.nasa.gov From kak at gate.ee.lsu.edu Fri Feb 4 17:24:34 1994 From: kak at gate.ee.lsu.edu (Subhash Kak) Date: Fri, 4 Feb 94 16:24:34 CST Subject: Encoding missing values Message-ID: <9402042224.AA23849@gate.ee.lsu.edu> Missing values in feedback networks raise interesting questions: Should these values be considered "don't know" values or should these be generated in some "most likelihood" fashion? These issues are discussed in the following paper: S.C. Kak, "Feedback neural networks: new characteristics and a generalization", Circuits, Systems, Signal Processing, vol. 12, no. 2, 1993, pp. 263-278. -Subhash Kak From moody at chianti.cse.ogi.edu Fri Feb 4 18:50:07 1994 From: moody at chianti.cse.ogi.edu (John Moody) Date: Fri, 4 Feb 94 15:50:07 -0800 Subject: PhD and Masters Programs at the Oregon Graduate Institute Message-ID: <9402042350.AA19148@chianti.cse.ogi.edu> Fellow Connectionists: The Oregon Graduate Institute of Science and Technology (OGI) has openings for a few outstanding students in its Computer Science and Electrical Engineering Masters and Ph.D programs in the areas of Neural Networks, Learning, Signal Processing, Time Series, Control, Speech, Language, and Vision. Faculty and postdocs in these areas include Etienne Barnard, Ron Cole, Mark Fanty, Dan Hammerstrom, Hynek Hermansky, Todd Leen, Uzi Levin, John Moody, David Novick, Misha Pavel, Joachim Utans, Eric Wan, and Lizhong Wu. Short descriptions of our research interests are appended below. OGI is a young, but rapidly growing, private research institute located in the Portland area. OGI offers Masters and PhD programs in Computer Science and Engineering, Applied Physics, Electrical Engineering, Biology, Chemistry, Materials Science and Engineering, and Environmental Science and Engineering. Inquiries about the Masters and PhD programs and admissions for either Computer Science or Electrical Engineering should be addressed to: Margaret Day, Director Office of Admissions and Records Oregon Graduate Institute PO Box 91000 Portland, OR 97291 Phone: (503)690-1028 Email: margday at admin.ogi.edu The final deadline for receipt of all applications materials for the Ph.D. programs is March 1, 1994, so it's not too late to apply! Masters program applications are accepted continuously. +++++++++++++++++++++++++++++++++++++++++++++++++++++++ Oregon Graduate Institute of Science & Technology Department of Computer Science and Engineering & Department of Electrical Engineering and Applied Physics Research Interests of Faculty in Adaptive & Interactive Systems (Neural Networks, Signal Processing, Control, Speech, Language, and Vision) Etienne Barnard (Assistant Professor): Etienne Barnard is interested in the theory, design and implementation of pattern-recognition systems, classifiers, and neural networks. 
He is also interested in adaptive control systems -- specifically, the design of near-optimal controllers for real- world problems such as robotics. Ron Cole (Professor): Ron Cole is director of the Center for Spoken Language Understanding at OGI. Research in the Center currently focuses on speaker- independent recognition of continuous speech over the telephone and automatic language identification for English and ten other languages. The approach combines knowledge of hearing, speech perception, acoustic phonetics, prosody and linguistics with neural networks to produce systems that work in the real world. Mark Fanty (Research Assistant Professor): Mark Fanty's research interests include continuous speech recognition for the telephone; natural language and dialog for spoken language systems; neural networks for speech recognition; and voice control of computers. Dan Hammerstrom (Associate Professor): Based on research performed at the Institute, Dan Hammerstrom and several of his students have spun out a company, Adaptive Solutions Inc., which is creating massively parallel computer hardware for the acceleration of neural network and pattern recognition applications. There are close ties between OGI and Adaptive Solutions. Dan is still on the faculty of the Oregon Graduate Institute and continues to study next generation VLSI neurocomputer architectures. Hynek Hermansky (Associate Professor); Hynek Hermansky is interested in speech processing by humans and machines with engineering applications in speech and speaker recognition, speech coding, enhancement, and synthesis. His main research interest is in practical engineering models of human information processing. Todd K. Leen (Associate Professor): Todd Leen's research spans theory of neural network models, architecture and algorithm design and applications to speech recognition. His theoretical work is currently focused on the foundations of stochastic learning, while his work on Algorithm design is focused on fast algorithms for non-linear data modeling. Uzi Levin (Senior Research Scientist): Uzi Levin's research interests include neural networks, learning systems, decision dynamics in distributed and hierarchical environments, dynamical systems, Markov decision processes, and the application of neural networks to the analysis of financial markets. John Moody (Associate Professor): John Moody does research on the design and analysis of learning algorithms, statistical learning theory (including generalization and model selection), optimization methods (both deterministic and stochastic), and applications to signal processing, time series, and finance. David Novick (Assistant Professor): David Novick conducts research in interactive systems, including computational models of conversation, technologically mediated communication, and human-computer interaction. A central theme of this research is the role of meta-acts in the control of interaction. Current projects include dialogue models for telephone-based information systems. Misha Pavel (Associate Professor): Misha Pavel does mathematical and neural modeling of adaptive behaviors including visual processing, pattern recognition, visually guided motor control, categorization, and decision making. He is also interested in the application of these models to sensor fusion, visually guided vehicular control, and human-computer interfaces. 
Joachim Utans (Post-Doctoral Research Associate): Joachim Utans's research interests include computer vision and image processing, model based object recognition, neural network learning algorithms and optimization methods, model selection and generalization, with applications in handwritten character recognition and financial analysis. Lizhong Wu (Post-Doctoral Research Associate): Lizhong Wu's research interests include neural network theory and modeling, time series analysis and prediction, pattern classification and recognition, signal processing, vector quantization, source coding and data compression. He is now working on the application of neural networks and nonparametric statistical paradigms to finance. Eric A. Wan (Assistant Professor): Eric Wan's research interests include learning algorithms and architectures for neural networks and adaptive signal processing. He is particularly interested in neural applications to time series prediction, adaptive control, active noise cancellation, and telecommunications. From hicks at cs.titech.ac.jp Sun Feb 6 17:22:17 1994 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Sun, 6 Feb 94 17:22:17 JST Subject: Methods for improving generalization (was Re: some questions on ...) Message-ID: <9402060822.AA11860@maruko.cs.titech.ac.jp> Dear Mr. Grossman, I read with great interest your analysis of overlearning and about your research into achieving better generalization with less data. However, I only want to point out an ommision in your background despcription. In the abstract of your paper "Use of Bad Training Data For Better Predictions" you write: >Use of noise sensitivity signatures is distinctly different from other schemes >to avoid overtraining, such as cross-validation, which uses only part of the >training data, or various penalty functions, which are not data-adaptive. >Noise sensitivity signature methods use all of the training data and >are manifestly data-adaptive and non-parametric. When you say penalty functions the first thing which comes to mind is a penalty on the sum of squared weights. This method is indeed not data-adaptive. However, an interesting article in Neural Computation 4, pp. 473-493, "Simplifying Neural Networks by Soft Weight-Sharing" proposes a weight penalty method which is adaptive. Basically, the weights are grouped together in Gaussian clusters whose mean and variance are allowed to adapt to the data. The experimental results they published show improvement over both cross-validation and weight decay. I am looking forward to reading your paper when it is available. Yours Respectfully, Craig Hicks Craig Hicks hicks at cs.titech.ac.jp | Kore ya kono Yuku mo kaeru mo Ogawa Laboratory, Dept. of Computer Science | Wakarete wa Shiru mo shiranu mo Tokyo Institute of Technology, Tokyo, Japan | Ausaka no seki lab:03-3726-1111 ext.2190 home:03-3785-1974 | (from hyaku-nin-issyu) fax: +81(3)3729-0685 (from abroad) 03-3729-0685 (from Japan) From pluto at cs.ucsd.edu Fri Feb 4 17:01:47 1994 From: pluto at cs.ucsd.edu (Mark Plutowski) Date: Fri, 04 Feb 1994 14:01:47 -0800 Subject: some questions on training neural nets... 
Message-ID: <9402042201.AA16326@odin.ucsd.edu> I have another reference to add that may be helpful to those interested in the cross-validation issue raised in the following discussion, which I have edited in what follows to focus on the particular issue this reference addresses: ------- Forwarded Message From tgd at chert.CS.ORST.EDU Wed Feb 2 13:02:30 1994 From: tgd at chert.CS.ORST.EDU (Tom Dietterich) Date: Wed, 2 Feb 94 10:02:30 PST Subject: some questions on training neural nets... In-Reply-To: "Charles X. Ling"'s message of Tue, 1 Feb 94 03:37:10 EST <9402010837.AA01695@godel.csd.uwo.ca> Message-ID: <9402021802.AA00565@curie.CS.ORST.EDU> In answer to the following: From: "Charles X. Ling" Date: Tue, 1 Feb 94 03:37:10 EST Hi neural net experts, Will cross-validation help ? [...] (could results on the validation set be coincident)? Tom Dietterich replies: [stuff deleted] There are many ways to manage the bias/variance tradeoff. I would say that there is nothing approaching complete agreement on the best approaches (and more fundamentally, the best approach varies from one application to another, since this is really a form of prior). The approaches can be summarized as * early stopping * error function penalties * size optimization - growing - pruning - other Early stopping usually employs cross-validation to decide when to stop training. (see below). In my experience, training an overlarge network with early stopping gives better performance than trying to find the minimum network size. It has the disadvantage that training costs are very high. [stuff deleted] 3. If, for some reason, cross-validation is needed, and TR is split to TR1 (for training) and TR2 (for validation), what would be the proper ways to do cross-validation? Training on TR1 uses only partial information in TR, but training TR1 to find right parameters and then training on TR1+TR2 may require parameters different from the estimation of training TR1. I use the TR1+TR2 approach. On large data sets, this works well. On small data sets, the cross-validation estimates themselves are very noisy, so I have not found it to be as successful. I compute the stopping point using the sum squared error per training example, so that it scales. I think it is an open research problem to know whether this is the right thing to do. [the reply continues..] ------- End of Forwarded Message In response to the last point, I supply a reference that provides theoretical guidance from a statistical perspective. It proves that cross-validation estimates Integrated Mean Squared Error (IMSE) within a constant due to noise. What this means: IMSE is a version of the mean squared error that accounts for the finite size of the training set. Think of it as the expected squared error obtained by training a network on random training sets of a particular size. It is an ideal (i.e., in general, unobservable) measure of generalization. IMSE embodies the bias and variance tradeoff. It can be decomposed into the sum of two terms, which directly quantify the bias + variance. Therefore, if IMSE embodies the measure of generalization that is relevant to you, (which will depend on your learning task) then, least-squares cross-validation provides a realizable estimate of generalization. 
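[ In symbols -- my own notation, not taken from the paper: write f for the target function, \hat{f}(.;D_N) for the network trained on a random training set D_N of size N, P for the input distribution, and \sigma^2 for the noise variance. Then

    IMSE_N = E_{D_N} \int [ f(x) - \hat{f}(x;D_N) ]^2 dP(x)
           = \int [ f(x) - E_{D_N}\hat{f}(x;D_N) ]^2 dP(x)                       (integrated squared bias)
           + \int E_{D_N} [ \hat{f}(x;D_N) - E_{D_N}\hat{f}(x;D_N) ]^2 dP(x)     (integrated variance)

and the hold-out cross-validation estimate on an independent test set {(x_j, y_j)}, j = 1..M, with y_j = f(x_j) + noise, is

    CV = (1/M) \sum_{j=1}^{M} [ y_j - \hat{f}(x_j;D_N) ]^2,   with   E[CV] = IMSE_N + \sigma^2. ]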
Summary of the main results of the paper: It proves that two versions of cross-validation (one being the "hold-out set" version discussed above, and the other being the "delete-1" version) provide unbiased and strongly consistent estimates of IMSE. This is statistical jargon meaning that, on average, the estimate is accurate (i.e., the expectation of the estimate for given training set size equals the IMSE + a noise term) and asymptotically precise (in that as the training set and test set size grow large, the estimate converges to the IMSE within the constant factor due to noise, with probability 1.) Note that it does not say anything about the rate at which the variance of the estimate converges to the truth; therefore, it is possible that other IMSE-approximate measures may excel for small training set sizes (e.g., resampling methods such as bootstrap and jackknife.) However, it is the first result generally applicable to nonlinear regression that the authors are aware of, extending the well-known (in the statistical and econometric literature) work by C.J. Stone and others that proves similar results for particular learning tasks or for particular models. The statement of the results will appear in NIPS 6. I will post the soon-to-be-completed extended version to Neuroprose if anyone wants to see it sooner, or needs access to the proofs. I hope this is helpful, = Mark Plutowski Institute for Neural Computation, and Department of Computer Science and Engineering University of California, San Diego La Jolla, California. USA. Here is the reference: Plutowski, Mark~E., Shinichi Sakata, and Halbert White. (1994). ``Cross-validation estimates IMSE.'' Cowan, J.D., Tesauro, G., and Alspector, J. (eds.), {\em Advances in Neural Information Processing Systems 6}, San Mateo, CA: Morgan Kaufmann Publishers. From esann at dice.ucl.ac.be Sun Feb 6 15:19:56 1994 From: esann at dice.ucl.ac.be (esann@dice.ucl.ac.be) Date: Sun, 6 Feb 94 21:19:56 +0100 Subject: ESANN'94: European Symposium on ANNs Message-ID: <9402062019.AA07827@ns1.dice.ucl.ac.be> ****************************************************************** * European Symposium * * on Artificial Neural Networks * * * * Brussels (Belgium) - April 20-21-22, 1994 * * * * Preliminary Program and registration form * ****************************************************************** Foreword ******** Recent developments in the field of artificial neural networks mark a watershed in its relatively young history. Far from the blind passion for disparate applications of some years ago, the tendency is now towards an objective assessment of this emerging technology, with a better knowledge of the basic concepts, and more appropriate comparisons and links with classical methods of computing. Neural networks are not restricted to the use of back-propagation and multi-layer perceptrons. Self-organization, adaptive signal processing, vector quantization, classification, statistics, image and speech processing are some of the domains where neural network techniques may be successfully used; but their beneficial use requires an in-depth examination of both the theoretical basis of the neural techniques and the standard methods commonly used in the specified domain. ESANN'94 is the second symposium covering these aspects of neural network computing.
After a successful edition in 1993, ESANN'94 will open new perspectives by focusing on theoretical and mathematical aspects of neural networks, biologically-inspired models, statistical aspects, and relations between neural networks and both information and signal processing (classification, vector quantization, self-organization, approximation of functions, image and speech processing,...). The steering and program committees of ESANN'94 are pleased to invite you to participate in this symposium. More than a formal conference presenting the latest developments in the field, ESANN'94 will also be a forum for open discussions, round tables and opportunities for future collaborations. We hope to have the pleasure of meeting you in April, in the splendid town of Brussels, and that your stay in Belgium will be as scientifically beneficial as agreeable. Symposium information ********************* Registration fees for symposium ------------------------------- registration before registration after 18th March 1994 18th March 1994 Universities BEF 14500 BEF 15500 Industries BEF 18500 BEF 19500 Registration fees include attendance at all sessions, the ESANN'94 banquet, a copy of the conference proceedings, daily lunches (20-22 April '94), and coffee breaks twice a day during the symposium. Advance registration is mandatory. Young researchers may apply for grants offered by the European Community (restricted to citizens or residents of a Western European country or, tentatively, a Central or Eastern European country - deadline for applications: March 11th, 1994 - please write to the conference secretariat for details). Advance payments (see registration form) must be made to the conference secretariat by bank transfer in Belgian Francs (free of charges) or by sending a cheque (add BEF 500 for processing fees). Language -------- The official language of the conference is English. It will be used for all printed material, presentations and discussions. Proceedings ----------- A copy of the proceedings will be provided to all Conference Registrants. All technical papers will be included in the proceedings. Additional copies of the proceedings (ESANN'93 and ESANN'94) may be purchased at the following rates: ESANN'94 proceedings: BEF 2000 ESANN'93 proceedings: BEF 1500. Add BEF 500 to any order for p.&p. and/or bank charges. Please write to the conference secretariat for ordering proceedings. Conference dinner ----------------- A banquet will be offered on Thursday 21st to all conference registrants in a famous and typical place in Brussels. Additional vouchers for the banquet may be purchased on Wednesday 20th at the conference. Cancellation ------------ If cancellation is received by 25th March 1994, 50% of the registration fees will be returned. Cancellations received after this date will not be entitled to any refund. General information ******************* Brussels, Belgium ----------------- Brussels is not only the host city of the European Commission and of hundreds of multinational companies; it is also a marvelous historical town, with typical quarters, famous monuments known throughout the world, and the splendid "Grand-Place". It is a cultural and artistic center, with numerous museums. Night life in Brussels is considerable. There are a lot of restaurants and pubs open late into the night, where typical Belgian dishes can be tasted with one of the more than 1000 different beers.
Hotel accommodation ------------------- Special rates for participants to ESANN'94 have been arranged at the MAYFAIR HOTEL, a De Luxe 4 stars hotel with 99 fully air conditioned guest rooms, tastefully decorated to the highest standards of luxury and comfort. The hotel includes two restaurants, a bar and private parking. Public transportation (trams n93 & 94) goes directly from the hotel to the conference center (Parc stop) Single room BEF 2800 Double room or twin room BEF 3500 Prices include breakfast, taxes and service. Rooms can only be confirmed upon receipt of booking form (see at the end of this booklet) and deposit. Located on the elegant Avenue Louise, the exclusive Hotel Mayfair is a short walk from the "uppertown" luxurious shopping district. Also nearby is the 14th century Cistercian abbey and the magnificent "Bois de la Cambre" park with its open-air cafes - ideal for a leisurely stroll at the end of a busy day. HOTEL MAYFAIR tel: +32 2 649 98 00 381 av. Louise fax: +32 2 649 22 49 1050 Brussels - Belgium Conference location ------------------- The conference will be held at the "Chancellerie" of the Generale de Banque. A map is included in the printed programme. Generale de Banque - Chancellerie 1 rue de la Chancellerie 1000 Brussels - Belgium Conference secretariat D facto conference services tel: + 32 2 245 43 63 45 rue Masui fax: + 32 2 245 46 94 B-1210 Brussels - Belgium E-mail: esann at dice.ucl.ac.be PROGRAM OF THE CONFERENCE ************************* Wednesday 20th April 1994 ------------------------- 9H30 Registration 10H00 Opening session Session 1: Neural networks and chaos Chairman: M. Hasler (Ecole Polytechnique Fdrale de Lausanne, Switzerland) 10H10 "Concerning the formation of chaotic behaviour in recurrent neural networks" T. Kolb, K. Berns Forschungszentrum Informatik Karlsruhe (Germany) 10H30 "Stability and bifurcation in an autoassociative memory model" W.G. Gibson, J. Robinson, C.M. Thomas University of Sidney (Australia) 10H50 Coffee break Session 2: Theoretical aspects 1 Chairman: C. Jutten (Institut National Polytechnique de Grenoble, France) 11H30 "Capabilities of a structured neural network. Learning and comparison with classical techniques" J. Codina, J. C. Aguado, J.M. Fuertes Universitat Politecnica de Catalunya (Spain) 11H50 "Projection learning: alternative approaches to the computation of the projection" K. Weigl, M. Berthod INRIA Sophia Antipolis (France) 12H10 "Stability bounds of momentum coefficient and learning rate in backpropagation algorithm"" Z. Mao, T.C. Hsia University of California at Davis (USA) 12H30 Lunch Session 3: Links between neural networks and statistics Chairman: J.C. Fort (Universit Nancy I, France) 14H00 "Model selection for neural networks: comparing MDL and NIC"" G. te Brake*, J.N. Kok*, P.M.B. Vitanyi** *Utrecht University, **Centre for Mathematics and Computer Science, Amsterdam (Netherlands) 14H20 "Estimation of performance bounds in supervised classification" P. Comon*, J.L. Voz**, M. Verleysen** *Thomson-Sintra Sophia Antipolis (France), **Universit Catholique de Louvain, Louvain-la-Neuve (Belgium) 14H40 "Input Parameters' estimation via neural networks" I.V. Tetko, A.I. Luik Institute of Bioorganic & Petroleum Chemistry, Kiev (Ukraine) 15H00 "Combining multi-layer perceptrons in classification problems" E. Filippi, M. Costa, E. Pasero Politecnico di Torino (Italy) 15H20 Coffee break Session 4: Algorithms 1 Chairman: J. 
Hrault (Institut National Polytechnique de Grenoble, France) 16H00 "Diluted neural networks with binary couplings: a replica symmetry breaking calculation of the storage capacity" J. Iwanski, J. Schietse Limburgs Universitair Centrum (Belgium) 16H20 "Storage capacity of the reversed wedge perceptron with binary connections" G.J. Bex, R. Serneels Limburgs Universitair Centrum (Belgium) 16H40 "A general model for higher order neurons" F.J. Lopez-Aligue, M.A. Jaramillo-Moran, I. Acedevo-Sotoca, M.G. Valle Universidad de Extremadura, Badajoz (Spain) 17H00 "A discriminative HCNN modeling" B. Petek University of Ljubljana (Slovenia) Thursday 21th April 1994 ------------------------ Session 5: Biological models Chairman: P. Lansky (Academy of Science of the Czech Republic) 9H00 "Biologically plausible hybrid network design and motor control" G.R. Mulhauser University of Edinburgh (Scotland) 9H20 "Analysis of critical effects in a stochastic neural model" W. Mommaerts, E.C. van der Meulen, T.S. Turova K.U. Leuven (Belgium) 9H40 "Stochastic model of odor intensity coding in first-order olfactory neurons" J.P. Rospars*, P. Lansky** *INRA Versailles (France), **Academy of Sciences, Prague (Czech Republic) 10H00 "Memory, learning and neuromediators" A.S. Mikhailov Fritz-Haber-Institut der MPG, Berlin (Germany), and Russian Academy of Sciences, Moscow (Russia) 10H20 "An explicit comparison of spike dynamics and firing rate dynamics in neural network modeling" F. Chapeau-Blondeau, N. Chambet Universit d'Angers (France) 10H40 Coffee break Session 6: Algorithms 2 Chairman: T. Denoeux (Universit Technologique de Compigne, France) 11H10 "A stop criterion for the Boltzmann machine learning algorithm" B. Ruf Carleton University (Canada) 11H30 "High-order Boltzmann machines applied to the Monk's problems" M. Grana, V. Lavin, A. D'Anjou, F.X. Albizuri, J.A. Lozano UPV/EHU, San Sebastian (Spain) 11H50 "A constructive training algorithm for feedforward neural networks with ternary weights" F. Aviolat, E. Mayoraz Ecole Polytechnique Fdrale de Lausanne (Switzerland) 12H10 "Synchronization in a neural network of phase oscillators with time delayed coupling" T.B. Luzyanina Russian Academy of Sciences, Moscow (Russia) 12H30 Lunch Session 7: Evolutive and incremental learning Chairman: T.J. Stonham (Brunel University, UK) - to be confirmed 14H00 "Reinforcement learning and neural reinforcement learning" S. Sehad, C. Touzet Ecole pour les Etudes et la Recherche en Informatique et Electronique, Nmes (France) 14H20 "Improving piecewise linear separation incremental algorithms using complexity reduction methods" J.M. Moreno, F. Castillo, J. Cabestany Universitat Politecnica de Catalunya (Spain) 14H40 "A comparison of two weight pruning methods" O. Fambon, C. Jutten Institut National Polytechnique de Grenoble (France) 15H00 "Extending immediate reinforcement learning on neural networks to multiple actions" C. Touzet Ecole pour les Etudes et la Recherche en Informatique et Electronique, Nmes (France) 15H20 "Incremental increased complexity training" J. Ludik, I. Cloete University of Stellenbosch (South Africa) 15H40 Coffee break Session 8: Function approximation Chairman: E. Filippi (Politecnico di Torino, Italy) - to be confirmed 16H20 "Approximation of continuous functions by RBF and KBF networks" V. Kurkova, K. Hlavackova Academy of Sciences of the Czech Republic 16H40 "An optimized RBF network for approximation of functions" M. Verleysen*, K. 
Hlavackova** *Universit Catholique de Louvain, Louvain-la-Neuve (Belgium), **Academy of Science of the Czech Republic 17H00 "VLSI complexity reduction by piece-wise approximation of the sigmoid function" V. Beiu, J.A. Peperstraete, J. Vandewalle, R. Lauwereins K.U. Leuven (Belgium) 20H00 Conference dinner Friday 22th April 1994 ---------------------- Session 9: Algorithms 3 Chairman: J. Vandewalle (K.U. Leuven, Belgium) - to be confirmed 9H00 "Dynamic pattern selection for faster learning and controlled generalization of neural networks" A. Rbel Technische Universitt Berlin (Germany) 9H20 "Noise reduction by multi-target learning" J.A. Bullinaria Edinburgh University (Scotland) 9H40 "Variable binding in a neural network using a distributed representation" A. Browne, J. Pilkington South Bank University, London (UK) 10H00 "A comparison of neural networks, linear controllers, genetic algorithms and simulated annealing for real time control" M. Chiaberge*, J.J. Merelo**, L.M. Reyneri*, A. Prieto**, L. Zocca* *Politecnico di Torino (Italy), **Universidad de Granada (Spain) 10H20 "Visualizing the learning process for neural networks" R. Rojas Freie Universitt Berlin (Germany) 10H40 Coffee break Session 10: Theoretical aspects 2 Chairman: M. Cottrell (Universit Paris I, France) 11H20 "Stability analysis of diagonal recurrent neural networks" Y. Tan, M. Loccufier, R. De Keyser, E. Noldus University of Gent (Belgium) 11H40 "Stochastics of on-line back-propagation" T. Heskes University of Illinois at Urbana-Champaign (USA) 12H00 "A lateral contribution learning algorithm for multi MLP architecture" N. Pican*, J.C. Fort**, F. Alexandre* *INRIA Lorraine, **Universit Nancy I (France) 12H20 Lunch Session 11: Self-organization Chairman: F. Blayo (EERIE Nmes, France) 14H00 "Two or three things that we know about the Kohonen algorithm" M. Cottrell*, J.C. Fort**, G. Pags*** Universits *Paris 1, **Nancy 1, ***Paris 6 (France) 14H20 "Decoding functions for Kohonen maps" M. Alvarez, A. Varfis CEC Joint Research Center, Ispra (Italy) 14H40 "Improvement of learning results of the selforganizing map by calculating fractal dimensions" H. Speckmann, G. Raddatz, W. Rosenstiel University of Tbingen (Germany) 15H00 Coffee break Session 11 (continued): Self-organization Chairman: F. Blayo (EERIE Nmes, France) 15H40 "A non linear Kohonen algorithm" J.-C. Fort*, G. Pags** *Universit Nancy 1, **Universits Pierre et Marie Curie, et Paris 12 (France) 16H00 "Self-organizing maps based on differential equations" A. Kanstein, K. Goser Universitt Dortmund (Germany) 16H20 "Instabilities in self-organized feature maps with short neighbourhood range" R. Der, M. Herrmann Universitt Leipzig (Germany) ESANN'94 Registration and Hotel Booking Form ******************************************** Registration fees ----------------- registration before registration after 18th March 1994 18th March 1994 Universities BEF 14500 BEF 15500 Industries BEF 18500 BEF 19500 University fees are applicable to members and students of academic and teaching institutions. Each registration will be confirmed by an acknowledgment of receipt, which must be given to the registration desk of the conference to get entry badge, proceedings and all materials. Registration fees include attendance to all sessions, the ESANN'94 banquet, a copy of the conference proceedings, daily lunches (20-22 April '94), and coffee breaks twice a day during the symposium. Advance registration is mandatory. 
Students and young researchers from European countries may apply for European Community grants. Hotel booking ------------- Hotel MAYFAIR (4 stars) - 381 av. Louise - 1050 Brussels Single room : BEF 2800 Double room (large bed) : BEF 3500 Twin room (2 beds) : BEF 3500 Prices include breakfast, service and taxes. A deposit corresponding to the first night is mandatory. Registration to ESANN'94 (please give full address and tick appropriate) ------------------------------------------------------------------------ Ms., Mr., Dr., Prof.:............................................... Name:............................................................... First Name:......................................................... Institution:........................................................ ................................................................... Address:............................................................ ................................................................... ZIP:................................................................ Town:............................................................... Country:............................................................ Tel:................................................................ Fax:................................................................ E-mail:............................................................. VAT n:............................................................. Universities: O registration before 18th March 1994: BEF 14500 O registration after 18th March 1994: BEF 15500 Industries: O registration before 18th March 1994: BEF 18500 O registration after 18th March 1994: BEF 19500 Hotel Mayfair booking (please tick appropriate) O single room deposit: BEF 2800 O double room (large bed) deposit: BEF 3500 O twin room (twin beds) deposit: BEF 3500 Arrival date: ..../..../1994 Departure date: ..../..../1994 O Additional payment if fees are paid through bank abroad check: BEF 500 Total BEF ____ Payment (please tick): O Bank transfer, stating name of participant, made payable to: Gnrale de Banque ch. de Waterloo 1341 A B-1180 Brussels - Belgium Acc.no: 210-0468648-93 of D facto (45 rue Masui, B-1210 Brussels) Bank transfers must be free of charges. EVENTUAL CHARGES MUST BE PAID BY THE PARTICIPANT. O Cheques/Postal Money Orders made payable to: D facto 45 rue Masui B-1210 Brussels - Belgium A SUPPLEMENTARY FEE OF BEF 500 MUST BE ADDED if the payment is made through bank abroad cheque or postal money order. Only registrations accompanied by a cheque, a postal money order or the proof of bank transfer will be considered. Registration and hotel booking form, together with payment, must be send as soon as possible, and in no case later than 8th April 1994, to the conference secretariat: &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& & D facto conference services - ESANN'94 & & 45, rue Masui - B-1210 Brussels - Belgium & &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&& Support ******* ESANN'94 is organized with the support of: - Commission of the European Communities (DG XII, Human Capital and Mobility programme) - IEEE Region 8 - IFIP WG 10.6 on neural networks - Region of Brussels-Capital - EERIE (Ecole pour les Etudes et la Recherche en Informatique et Electronique - Nmes) - UCL (Universit Catholique de Louvain - Louvain-la-Neuve) - REGARDS (Research Group on Algorithmic, Related Devices and Systems - UCL) Steering committee ****************** Franois Blayo EERIE, Nmes (F) Marie Cottrell Univ. 
Paris I (F) Nicolas Franceschini CNRS Marseille (F) Jeanny Hrault INPG Grenoble (F) Michel Verleysen UCL Louvain-la-Neuve (B) Scientific committee ******************** Luis Almeida INESC - Lisboa (P) Jorge Barreto UCL Louvain-en-Woluwe (B) Herv Bourlard L. & H. Speech Products (B) Joan Cabestany Univ. Polit. de Catalunya (E) Dave Cliff University of Sussex (UK) Pierre Comon Thomson-Sintra Sophia (F) Holk Cruse Universitt Bielefeld (D) Dante Del Corso Politecnico di Torino (I) Marc Duranton Philips / LEP (F) Jean-Claude Fort Universit Nancy I (F) Karl Goser Universitt Dortmund (D) Martin Hasler EPFL Lausanne (CH) Philip Husbands University of Sussex (UK) Christian Jutten INPG Grenoble (F) Petr Lansky Acad. of Science of the Czech Rep. (CZ) Jean-Didier Legat UCL Louvain-la-Neuve (B) Jean Arcady Meyer Ecole Normale Suprieure - Paris (F) Erkki Oja Helsinky University of Technology (SF) Guy Orban KU Leuven (B) Gilles Pags Universit Paris I (F) Alberto Prieto Universitad de Granada (E) Pierre Puget LETI Grenoble (F) Ronan Reilly University College Dublin (IRE) Tamas Roska Hungarian Academy of Science (H) Jean-Pierre Rospars INRA Versailles (F) Jean-Pierre Royet Universit Lyon 1 (F) John Stonham Brunel University (UK) Lionel Tarassenko University of Oxford (UK) John Taylor King's College London (UK) Vincent Torre Universita di Genova (I) Claude Touzet EERIE Nmes (F) Joos Vandewalle KUL Leuven (B) Eric Vittoz CSEM Neuchtel (CH) Christian Wellekens Eurecom Sophia-Antipolis (F) _____________________________ Michel Verleysen D facto conference services 45 rue Masui 1210 Brussels Belgium tel: +32 2 245 43 63 fax: +32 2 245 46 94 E-mail: esann at dice.ucl.ac.be _____________________________ From lba at ilusion.inesc.pt Mon Feb 7 04:57:07 1994 From: lba at ilusion.inesc.pt (Luis B. Almeida) Date: Mon, 7 Feb 94 10:57:07 +0100 Subject: Encoding missing values Message-ID: <9402070957.AA18932@ilusion.inesc.pt> Bill Skaggs writes: There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required. The same effect, of trying to guess the missing inputs, can also be obtained with a recurrent multilayer perceptron, trained with recurrent backprop. This is the reason why the pattern completion results that I described in my 1987 ICNN paper (ref. below) were rather good. L. B. Almeida, "A learning rule for asynchronous perceptrons with feedback in a combinatorial environment", Proc IEEE First International Conference on Neural Networks, San Diego, Ca., 1987. Luis B. 
Almeida INESC Phone: +351-1-544607, +351-1-3100246 Apartado 10105 Fax: +351-1-525843 P-1017 Lisboa Codex Portugal lba at inesc.pt ----------------------------------------------------------------------------- *** Indonesians are killing innocent people in East Timor *** From jordan at psyche.mit.edu Mon Feb 7 20:47:09 1994 From: jordan at psyche.mit.edu (Michael Jordan) Date: Mon, 7 Feb 94 20:47:09 EST Subject: Encoding missing values Message-ID: > There is at least one kind of network that has no problem (in > principle) with missing inputs, namely a Boltzmann machine. > You just refrain from clamping the input node whose value is > missing, and treat it like an output node or hidden unit. > > This may seem to be irrelevant to anything other than Boltzmann > machines, but I think it could be argued that nothing very much > simpler is capable of dealing with the problem. The above is a nice observation that is worth emphasizing; I agree with all of it except the comment about being irrelevant to anything else. The Boltzmann machine is actually relevant to everything else. What the Boltzmann algorithm is doing with the missing value is essentially the same as what the EM algorithm for mixtures (that Ghahramani and Tresp referred to) is doing, and epitomizes the general case of an iterative "filling in" algorithm. The Boltzmann machine learning algorithm is a generalized EM (GEM) algorithm. During the E step the system computes the conditional correlation function for the nodes under the Boltzmann distribution, where the conditioning variables are the known data (the values of the clamped units) and the current values of the parameters (weights). This "fills in" the relevant statistic (the correlation function) and allows it to be used in the generalized M step (the contrastive Hebb rule). Moreover, despite the fancy terminology, these algorithms are nothing more (nor less) than maximum likelihood estimation, where the likelihood function is the likelihood of the parameters *given the data that was actually observed*. By "filling in" missing data, you're not adding new information to the problem; rather, you're allowing yourself to use all the information that is in those components of the data vector that aren't missing. (EM theory provides the justification for that statement). E.g., if only one component of an input vector is missing, it's obviously wasteful to neglect what the other components of the input vector are telling you. And, indeed, if you neglect the whole vector, you will not end up with maximum likelihood estimates for the weights (nor in general will you get maximum likelihood estimates if you fill in a value with the unconditional mean of that variable). "Filling in" is not the only way to compute ML estimates for missing data problems, but its virtue is that it allows the use of the same learning algorithms as would be used for complete data (without incurring any bias, if the filling in is done correctly). The only downside is that even if the complete-data algorithm is one-pass (which the Boltzmann algorithm and mixture fitting are not) the "filling-in" approach is generally iterative, because the parameter estimates depend on the filled-in values which in turn depend on the parameter estimates. On the other hand, there are so-called "monotone" patterns of missing data for which the filling-in approach is not necessarily iterative. 
This monotone case might be of interest, because it is relevant for problems involving feedforward networks in which the input vectors are complete but some of the outputs are missing. (Note that even if all the output values for a case are missing, a ML algorithm will not throw the case out; there is statistical structure in the input vector that the algorithm must not neglect). Mike (See Ghahramani's message for references; particularly the Little and Rubin book). From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de> A few days ago, I posted some thoughts about how to represent missing input values to a neural network and asked for comments and further ideas. This message is a summary of the replies I received (some in my personal mail some in connectionists). I show the most significant comments and ideas and append versions of the messages that are trimmed to the most important parts (in case somebody wants to keep this discussion in his/her archive) This was my original message: ------------------------------------------------------------------------ From prechelt at ira.uka.de Wed Feb 2 03:58:56 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 09:58:56 +0100 Subject: Encoding missing values Message-ID: I am currently thinking about the problem of how to encode data with attributes for which some of the values are missing in the data set for neural network training and use. An example of such data is the 'heart-disease' dataset from the UCI machine learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory "/pub/machine-learning-databases"). There are 920 records altogether with 14 attributes each. Only 299 of the records are complete, the others have one or several missing attribute values. 11% of all values are missing. I consider only networks that handle arbitrary numbers of real-valued inputs here (e.g. all backpropagation-suited network types etc). I do NOT consider missing output values. In this setting, I can think of several ways how to encode such missing values that might be reasonable and depend on the kind of attribute and how it was encoded in the first place: 1. Nominal attributes (that have n different possible values) 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one being 1 all others 0. This encoding is very general, but has the disadvantage of producing networks with very many connections. Missing values can either be represented as 'all zero' or by simply treating 'is missing' as just another possible input value, resulting in a "1-of-(n+1)" encoding. 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a binary representation of the numbers 0...n-1 (or 1...n). Missing values can either be represented as just another possible input value (probably all-bits-zero is best) or by adding an additional network input which is 1 for 'is missing' and 0 for 'is present'. The original inputs should probably be all zero in the 'is missing' case. 2. continuous attributes (or attributes treated as continuous) 2.1 encoded as a single network input, perhaps using some monotone transformation to force the values into a certain distribution. Missing values are either encoded as a kind of 'best guess' (e.g. 
the average of the non-missing values for this attribute) or by using an additional network input being 0 for 'missing' and 1 for 'present' (or vice versa) and setting the original attribute input either to 0 or to the 'best guess'. (The 'best guess' variant also applies to nominal attributes above) 3. binary attributes (truth values) 3.1 encoded by one input: 0=false 1=true or vice versa Treat like (2.1) 3.2 encoded by one input: -1=false 1=true or vice versa In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. 3.3 treat like nominal attribute with 2 possible values 4. ordinal attributes (having n different possible values, which are ordered) 4.1 treat either like continuous or like nominal attribute. If (1.2) is chosen, a Gray code should be used. Continuous representation is risky unless a 'sensible' quantification of the possible values is available. So much for my considerations. Now to my questions. a) Can you think of other encoding methods that seem reasonable? Which? b) Do you have experience with some of these methods that is worth sharing? c) Have you compared any of the alternatives directly? ------------------------------------------------------------------------ SUMMARY: For a), the following ideas were mentioned: 1. use statistical techniques to compute replacement values from the rest of the data set 2. use a Boltzmann machine to do this for you 3. use an autoencoder feedforward network to do this for you 4. randomize on the missing values (correct in the Bayesian sense) For b), some experience was reported. I don't know how to summarize that nicely, so I just don't summarize at all. For c), no explicit quantitative results were given directly. Some replies suggest that data is not always missing randomly. The biases are often known and should be taken into account (e.g. medical tests are not carried out (resulting in missing data) for more or less healthy persons more often than for ill persons). Many replies contained references to published work in this area, from NN, machine learning, and mathematical statistics. To ease searching for these references in the replies below, I have marked them with the string ##REF## (if you have a 'grep' program that extracts whole paragraphs, you can get them all out with one command). Thanks to all who answered. These are the trimmed versions of the replies: ------------------------------------------------------------------------ From: tgd at research.CS.ORST.EDU (Tom Dietterich) [...for nominal attributes:] An alternative here is to encode them as bit-strings in an error-correcting code, so that the Hamming distance between any two bit strings is constant. This would probably be better than a dense binary encoding. The cost in additional inputs is small. I haven't tried this though. My guess is that distributed representations at the input are a bad idea. One must always determine WHY the value is missing. In the heart disease data, I believe the values were not measured because other features were believed to be sufficient in each case. In such cases, the network should learn to down-weight the importance of the feature (which can be accomplished by randomizing it---see below). In other cases, it may be more appropriate to treat a missing value as a separate value for the feature, e.g., in survey research, where a subject chooses not to answer a question. [...for continuous attributes:] Ross Quinlan suggests encoding missing values as the mean observed output value when the value is missing.
He has tried this in his regression tree work. Another obvious approach is to randomize the missing values--on each presentation of the training example, choose a different, random, value for each missing input feature. This is the "right thing to do" in the Bayesian sense. [...for binary attributes:] I'm skeptical of the -1,0,1 encoding, but I think there is more research to be done here. [...for ordinal attributes:] I would treat them as continuous. ------------------------------------------------------------------------ From: shavlik at cs.wisc.edu (Jude W. Shavlik) We looked at some of the methods you talked about in the following article in the journal Machine Learning. ##REF## %T Symbolic and Neural Network Learning Algorithms: An Experimental Comparison %A J. W. Shavlik %A R. J. Mooney %A G. G. Towell %J Machine Learning %V 6 %N 2 %P 111-143 %D 1991 ------------------------------------------------------------------------ From: hertz at nordita.dk (John Hertz) It seems to me that the most natural way to handle missing data is to leave them out. You can do this if you work with a recurrent network (e.g. a Boltzmann machine) where the inputs are fed in by clamping the input units to the given input values and the rest of the net relaxes to a fixed point, after which the output is read off the output units. If some of the input values are missing, the corresponding input units are just left unclamped, free to relax to the values most consistent with the known inputs. I have meant for a long time to try this on some medical prognosis data I was working on, but I never got around to it, so I would be happy to hear how it works if you try it. ------------------------------------------------------------------------ From: jozo at sequoia.WPI.EDU (Jozo Dujmovic) In the case of clustering benchmark programs I frequently have the problem of estimating missing data. A relatively simple SW that implements a heuristic algorithm generates estimates having an average error of 8%. NN will somehow "implicitly estimate" the missing data. The two approaches might even be in some sense equivalent (?). Jozo [ I suspect that they are not: When you generate values for the missing items and put them in the training set, the network loses the information that this data is only estimated. Since estimations are not as reliable as true input data, the network will weight inputs that have lots of generated values as less important. If it gets the 'is missing' information explicitly, it can discriminate true values from estimations instead. ] ------------------------------------------------------------------------ From: guy at cs.uq.oz.au A final-year student of mine worked on the problem of dealing with missing inputs, without much success. However, the student was not very good, so take the following opinions with a pinch of salt. We (very tentatively) came to the conclusion that if the inputs were redundant, the problem was easy; if the missing input contained vital information, the problem was pretty much impossible. We used the heart disease data. I don't recommend it for the missing inputs problem. All of the inputs are very good indicators of the correct result, so missing inputs were not important. Apparently there is a large literature in statistics on dealing with missing inputs. Anthony Adams (University of Tasmania) has published a technical report on this. His email address is "A.Adams at cs.utas.edu.au". ##REF## @techreport{kn:Vamplew-91, author = "P. Vamplew and A.
Adams", address = {Hobart, Tasmania, Australia}, institution = {Department of Computer Science, University of Tasmania}, number = {R1-4}, title = {Real World Problems in Backpropagation: Missing Values and Generalisability}, year = {1991} } ------------------------------------------------------------------------ From: Mike Southcott ##REF## I wrote a paper for the Australian conference on neural networks in 1993. ``Classification of Incomplete Data using neural networks'' Southcott, Bogner. You may find it interesting. You may not be able to get the proceedings for this conference, but I am in the process of digging up a postscript copy for someone in the States, so when I do that, I will send you a copy. ------------------------------------------------------------------------ From: Eric Saund I have done some work on unsupervised learning of mulitple cause clusters in binary data, for which an appropriate encoding scheme is -1 = FALSE, 1 = TRUE, and 0 = NO DATA. This has worked well for me, but my paradigm is not your standard feedforward network and uses a different activiation function from the standard weighted sum followed by sigmoid squashing. I presented the paper on this work at NIPS: ##REF## Saund, Eric; 1994; "Unsupervised Learning of Mixtures of Multiple Causes in Binary Data," in Advances in Neural Information Processing Systems -6-, Cowan, J., Tesauro, G, and Alspector, J., eds. Morgan Kaufmann, San Francisco. ------------------------------------------------------------------------ From: Thierry.Denoeux at hds.univ-compiegne.fr In a recent mailing, Lutz Prechelt mentioned the interesting problem of how to encode attributes with missing values as inputs to a neural network. I have recently been faced to that problem while applying neural nets to rainfall prediction using weather radar images. The problem was to classify pairs of "echoes" -- defined as groups of connected pixels with reflectivity above some threshold -- taken from successive images as corresponding to the same rain cell or not. Each pair of echoes was discribed by a list of attributes. Some of these attributes, refering to the past of a sequence, were not defined for some instances. To encode these attributes with potentially missing values, we applied two different methods actually suggested by Lutz: - the replacement of the missing value by a "best-guess" value - the addition of a binary input indicating whether the corresponding attribute was present or absent. Significantly better results were obtained by the second method. This work was presented at ICANN'93 last september: ##REF## X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, p. 962-967. ------------------------------------------------------------------------ From: "N. Karunanithi" [...for nominal attributes:] Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. [...for continuous attributes:] This representation requires GUESS. A nominal transformation may not be a proper representation in some cases. 
Assume that the output values range over a large numerical interval, for example from 0.0 to 10,000.0. If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then in theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had a lot of problems with this approach. Maybe a log transformation before scaling would not be a bad choice. If you use a closed scaling you may have a problem whenever a future value exceeds the maximum value of the numerical interval. For example, assume that the attribute is time, say in milliseconds. Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. [...for ordinal attributes:] I have compared Binary Encoding (1.2), Gray-Coded representation and straightforward scaling. Closed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find significant improvement in prediction accuracy. ###REF### N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp. 563-574. From yong at cns.brown.edu Tue Feb 8 10:40:35 1994 From: yong at cns.brown.edu (Yong Liu) Date: Tue, 8 Feb 94 10:40:35 EST Subject: some questions on training neural nets Message-ID: <9402081540.AA15383@cns.brown.edu> On the discussion of the cross-validation method, Dr. Plutowski referred to his paper by writing > It proves that two versions of cross-validation > (one being the "hold-out set" version discussed above, and the other > being the "delete-1" version) provide unbiased and strongly consistent > estimates of IMSE This is statistical jargon meaning that, on > average, the estimate is accurate, (i.e., the expectation > of the estimate for given training set size equals the IMSE + a noise term) > and asymtotically precise (in that as the training set and test set > size grow large, the estimate converges to the IMSE within the > constant factor due to noise, with probability 1.) Comment: This comment is on the above result about the "delete-1" version of cross-validation. The result must have assumed that the training data set has no outliers (corruption in the Y component of a data point). Since deleting a data point that is an outlier will cause a great change in the estimated neural net weights, and the squared prediction error on this outlier will be large, this will eventually cause a biased estimate of the IMSE. Even if a robust algorithm is used to estimate the neural net weights in order to reduce the sensitivity of the estimation to outliers, the squared prediction error on the outlier will still be large. A possible correction would be to weight this outlier less in the cross-validation, or in other words, to pay less attention to this outlier when deleting it. A weighted cross-validation like this has been discussed briefly in Liu (1994). The weighting of a data point is calculated through an iteratively reweighted algorithm for robust regression. One interesting thing about this version of cross-validation is its asymptotic equivalence to Moody's criterion (Moody, 1992; Liu, 1993). References: Liu, Y. (1993) Neural Network Model Selection Using Asymptotic Jackknife Estimator and Cross-Validation Method. In C.L. Giles, S.J. Hanson, and J.D. Cowan, editors, {\em Advances in Neural Information Processing Systems}, volume 5, pages 599-606.
Morgan Kaufmann, San Mateo, CA. Liu, Y.(1994) Robust Parameter Estimation and Model Selection for Neural Network Regression. To Appear in Jack D. Cowan, Gerald Tesauro and Joshua Alspector editors, {\em Advances in neural information processing system}, volume 6. Morgan Kaufmann, San Mateo, CA. Moody, J.E. (1992).The effective number of parameters, an analysis of generalization and regularization in nonlinear learning system. In Moody, J.E., Hanson, S.J., and Lippmann, R.P., editors, {\em Advances in Neural Information Processing System 4}. Morgan Kaufmann Publication. ---------------------------- Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 From pluto at cs.ucsd.edu Wed Feb 9 02:39:00 1994 From: pluto at cs.ucsd.edu (Mark Plutowski) Date: Tue, 08 Feb 1994 23:39:00 -0800 Subject: some questions on training neural nets Message-ID: <9402090739.AA07477@odin.ucsd.edu> ------- Previous Message: --------- From yong at cns.brown.edu Tue Feb 8 10:40:35 1994 From: yong at cns.brown.edu (Yong Liu) Date: Tue, 8 Feb 94 10:40:35 EST Subject: some questions on training neural nets Message-ID: <9402081540.AA15383@cns.brown.edu> On the discussion of cross-validation method, Dr. Plutowski referred to his paper by writing > It proves that two versions of cross-validation > (one being the "hold-out set" version discussed above, and the other > being the "delete-1" version) provide unbiased and strongly consistent > estimates of IMSE This is statistical jargon meaning that, on > average, the estimate is accurate, (i.e., the expectation > of the estimate for given training set size equals the IMSE + a noise term) > and asymtotically precise (in that as the training set and test set > size grow large, the estimate converges to the IMSE within the > constant factor due to noise, with probability 1.) Comment: This comment is on the above result about "delete-1" version cross-validation. The result must have assumed that the training data set have no outliers (corruption in Y component of a data point). Since deleting a data point that is outlier will cause a great change in the estimated neural net weights, and also the squared prediction error on this outliers will be large. This will then eventually cause a biased estimation of the IMSE. - ---------------------------- Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 ------- End of Previous Message ------ No, actually it turns out that delete-1 cross-validation delivers unbiased estimates of IMSE under fairly reasonable conditions. (More precisely, it delivers estimates of IMSE_N + \sigma^2, for training set size N and noise variance \sigma^2.) Roughly, the noise must have variance the same everywhere in input space, (or, "homoscedasticity" as the statisticians would say,) with examples selected independently from the same, fixed environment (i.e., "i.i.d.") the expectation of the squared-target must be finite (this just ensures that conditional expectations of the target and the noise exist everywhere) plus some conditions on the network to make it behave nicely. For these same conditions, the estimate is additionally "conservative," in that it does not, (asymptotically, anyway, as N grows large) underestimate the expected squared error of the network for optimal weights. (These results and the prerequisite assumptions are of course stated more precisely in the paper.) 
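[ For readers who want something concrete: below is a rough, self-contained sketch of delete-1 cross-validation, including the warm-start ("fine-tuning") variant described in the aside that follows. It is illustrative only -- a tiny tanh network trained by plain gradient descent, with made-up learning rates and epoch counts -- and is not code from the paper. ]

import numpy as np

def init(n_in, n_hid, rng):
    # one-hidden-layer tanh network: parameters [W1, b1, w2, b2]
    return [rng.normal(0, 0.5, (n_in, n_hid)), np.zeros(n_hid),
            rng.normal(0, 0.5, n_hid), 0.0]

def forward(w, X):
    W1, b1, w2, b2 = w
    return np.tanh(X @ W1 + b1) @ w2 + b2

def train(w, X, y, lr=0.05, epochs=500):
    # batch gradient descent on mean squared error, starting from a copy of w
    W1, b1, w2, b2 = [p.copy() if hasattr(p, 'copy') else p for p in w]
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)
        err = H @ w2 + b2 - y                    # residuals
        dpre = np.outer(err, w2) * (1 - H**2)    # backprop through tanh
        W1 -= lr * X.T @ dpre / len(y)
        b1 -= lr * dpre.mean(0)
        w2 -= lr * H.T @ err / len(y)
        b2 -= lr * err.mean()
    return [W1, b1, w2, b2]

def delete1_cv(X, y, rng, fine_tune=True):
    base = train(init(X.shape[1], 5, rng), X, y)   # "base" network trained on all N examples
    sq_errs = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        start = base if fine_tune else init(X.shape[1], 5, rng)
        w_i = train(start, X[keep], y[keep], epochs=100 if fine_tune else 500)
        sq_errs.append((y[i] - forward(w_i, X[i:i+1]))[0] ** 2)
    # estimates the IMSE of training on N-1 examples, plus the noise floor sigma^2
    return np.mean(sq_errs)

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, (40, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(0, 0.1, 40)   # noise variance sigma^2 = 0.01
print("delete-1 CV estimate:", delete1_cv(X, y, rng))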
However, we did require an additional assumption to obtain the "strong" convergence result, in that the optimal weights must be unique. This is to ensure that the weights for each of the deleted subsets of N-1 examples converge to the weights obtained by training on all N examples. As an aside: This latter condition may seem strong, but it seems to be (intuitively) applicable to a particular variant of delete-1 cross-validation commonly employed to make its computation more feasible (in which case the global optima are in a sense "locally" unique under the right conditions.) In this variant, the network is trained on the entire training set to obtain the "base" network. These weights are then "fine-tuned" upon each of the deleted subsets of size N-1 to obtain the N cross-validated weight vectors. This tends to distribute the fine-tuned weights within a local region that seems to get tighter as the training set size increases. It tends to work well in practice, under the right conditions. (Essentially, you need to ensure that the ratio of examples to weights is sufficiently large, and it is easy to detect when this is not the case.) A bit off the original subject, I suppose, but I hope these results help clarify what cross-validation is doing, at least in that wonderfully ideal place called "asymptopia." It (apparently) turns out that these conditions suffice to ensure that the detrimental effect of a malicious outlier becomes negligible as the size of the training set grows large, at least with respect to the estimation of this particular kind of generalization by cross-validation. = Mark Plutowski UCSD: INC and CS&E P.S. Thank you for the honorable salutation! Actually, I am (still) just a student here. 8-) 8-| From lange at ira.uka.de Wed Feb 9 14:19:22 1994 From: lange at ira.uka.de (lange@ira.uka.de) Date: Wed, 9 Feb 94 14:19:22 MET Subject: Methods for improving generalization (was Re: some questions on ...) Message-ID: <"iraun1.ira.337:09.01.94.13.22.32"@ira.uka.de> Dear Mr. Hicks, in your mail to Mr. Grossman you mentioned the "Soft Weight-Sharing" algorithm and stated that this algorithm would do some adaptation to the data. I don't think that this is right: Soft Weight-Sharing is just a bit more complicated than Weight-Decay or other things (so some improvements have been made). But Soft Weight-Sharing does not really adapt to the data, because you have to tune the same parameters as in normal Weight-Decay: the parameters that are used to control the strength of the penalty term. The article by Nowlan and Hinton, "Simplifying Neural Networks by Soft Weight-Sharing", does not mention a method to do this automatically - so no "real" adaptation to the data is made. Maybe the methods of MacKay ("Bayesian Interpolation", Neural Comp. 4 (1992), pages 415-447) could be used to get a fully automatic adaptation. A combination of this method with Weight-Decay or Soft Weight-Sharing would perhaps be data-adaptive; but Soft Weight-Sharing alone still has a parameter that is not adapted by the data.
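[ For concreteness, here is a rough sketch of the penalty being discussed -- illustrative notation only, not code from Nowlan and Hinton's paper: the mixing proportions, means and variances of the Gaussian clusters adapt to the current weights, while the overall penalty strength `lam` is still a hand-set parameter, which is the point being made above. ]

import numpy as np

def soft_weight_sharing_penalty(w, pi, mu, var, lam):
    # penalty = -lam * sum_i log sum_j pi_j * N(w_i | mu_j, var_j)
    d2 = (w[:, None] - mu[None, :]) ** 2
    comp = pi * np.exp(-0.5 * d2 / var) / np.sqrt(2 * np.pi * var)
    return -lam * np.sum(np.log(comp.sum(axis=1) + 1e-12))

def update_mixture(w, pi, mu, var):
    # one EM-style update of the mixture, given the current weights
    d2 = (w[:, None] - mu[None, :]) ** 2
    r = pi * np.exp(-0.5 * d2 / var) / np.sqrt(2 * np.pi * var)
    r /= r.sum(axis=1, keepdims=True)          # responsibility of each cluster for each weight
    nk = r.sum(axis=0)
    pi = nk / len(w)
    mu = (r * w[:, None]).sum(axis=0) / nk
    var = (r * (w[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

# toy usage: weights clustered near 0 and near 0.8
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0, 0.05, 50), rng.normal(0.8, 0.05, 20)])
pi, mu, var = np.ones(2) / 2, np.array([0.0, 1.0]), np.ones(2) * 0.1
for _ in range(20):
    pi, mu, var = update_mixture(w, pi, mu, var)
print(pi, mu, var, soft_weight_sharing_penalty(w, pi, mu, var, lam=0.1))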
Yours, Frank Lange From sec at ai.univie.ac.at Wed Feb 9 08:53:36 1994 From: sec at ai.univie.ac.at (sec@ai.univie.ac.at) Date: Wed, 9 Feb 1994 14:53:36 +0100 Subject: No subject Message-ID: <199402091353.AA14535@prater.ai.univie.ac.at> * * * * * TWELFTH EUROPEAN MEETING * * ON * * CYBERNETICS AND SYSTEMS RESEARCH * * (EMCSR 1994) * April 5 - 8, 1994 UNIVERSITY OF VIENNA organized by the Austrian Society for Cybernetic Studies in cooperation with Dept.of Medical Cybernetics and Artificial Intelligence, Univ.of Vienna and International Federation for Systems Research Plenary lectures: ***************** MARGARET BODEN (United Kingdom): "Artificial Intelligence and Creativity" STEPHEN GROSSBERG (USA): "Neural Networks for Learning, Recognition, and Prediction" STUART A. UMPLEBY (USA): "Twenty Years of Second Order Cybernetics" 241 papers will be presented and discussed in the following symposia: ********************************************************************* GENERAL SYSTEMS METHODOLOGY G.J.Klir (USA) ADVANCES IN MATHEMATICAL SYSTEMS THEORY J.Miro (Spain), M.Peschel (Germany), F.Pichler (Austria) FUZZY SYSTEMS, APPROXIMATE REASONING AND KNOWLEDGE-BASED SYSTEMS C.Carlsson (Finland), K.-P.Adlassnig (Austria), E.P.Klement (Austria) DESIGNING AND SYSTEMS, AND THEIR EDUCATION B.Banathy (USA), W.Gasparski (Poland), G.Goldschmidt (Israel) HUMANITY, ARCHITECTURE AND CONCEPTUALIZATION G.Pask (United Kingdom), G.de Zeeuw (Netherlands) BIOCYBERNETICS AND MATHEMATICAL BIOLOGY L.M.Ricciardi (Italy) SYSTEMS AND ECOLOGY F.J.Radermacher (Germany), K.Fedra (Austria) CYBERNETICS AND INFORMATICS IN MEDICINE G.Gell (Austria), G.Porenta (Austria) CYBERNETICS OF SOCIO-ECONOMIC SYSTEMS K.Balkus (USA), O.Ladanyi (Austria) SYSTEMS, MANAGEMENT AND ORGANIZATION G.Broekstra (Netherlands), R.Hough (USA) CYBERNETICS OF COUNTRY DEVELOPMENT P.Ballonoff (USA), T.Koizumi (USA), S.A.Umpleby (USA) COMMUNICATION AND COMPUTERS A M.Tjoa (Austria) INTELLIGENT AUTONOMOUS SYSTEMS J.Rozenblit (USA), H.Praehofer (Austria) CYBERNETIC PRINCIPLES OF KNOWLEDGE DEVELOPMENT F.Heylighen (Belgium), S.A.Umpleby (USA) CYBERNETICS, SYSTEMS AND PSYCHOTHERAPY M.Okuyama (Japan), H.Koizumi (USA) ARTIFICIAL NEURAL NETWORKS AND ADAPTIVE SYSTEMS S.Grossberg (USA), G.Dorffner (Austria) ARTIFICIAL INTELLIGENCE AND COGNITIVE SCIENCE V.Marik (Czech Republic), R.Born (Austria) TUTORIALS: ********** A SYNTACTIC APPROACH TO HEURISTIC NETWORKS: LINGUISTIC GEOMETRY Prof.Boris Stilman, University of Colorado, Denver, USA FUZZY SETS AND IMPRECISE BUT RELEVANT DECISIONS Prof.Christer Carlsson, Abo Akademi University, Abo, Finland CONTEXTUAL SYSTEMS: A NEW TECHNOLOGY FOR KNOWLEDGE BASED SYSTEM DEVELOPMENT Dr.Irina V. Ezhkova, Russian Academy of Science, Moscow TWENTY YEARS OF SECOND ORDER CYBERNETICS Prof.Stuart A. Umpleby, George Washington University, Washington, D.C., USA PROCEEDINGS: ************ Trappl R.(ed.): CYBERNETICS AND SYSTEMS '94, 2 vols, 1911 pages, World Scientific Publishing, Singapore. FOR FURTHER INFORMATION PLEASE CONTACT: *************************************** EMCSR'94 Secretariat c/o Austrian Society for Cybernetic Studies Schottengasse 3 A-1010 Vienna Austria Phone: +43-1-53532810 Fax: +43-1-5320652 E-mail: sec at ai.univie.ac.at From gert at jhunix.hcf.jhu.edu Wed Feb 9 09:32:57 1994 From: gert at jhunix.hcf.jhu.edu (Gert Cauwenberghs) Date: Wed, 9 Feb 1994 09:32:57 -0500 Subject: "A Learning Analog Neural Network Chip..." 
Message-ID: <94Feb9.093258edt.70280-3@jhunix.hcf.jhu.edu> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/cauwenberghs.nips93.ps.Z

A preprint of the paper: A Learning Analog Neural Network Chip with Continuous-Time Recurrent Dynamics, by Gert Cauwenberghs, 8 pages including figures, to appear in Advances in Neural Information Processing Systems, vol. 6, 1994, is available on the neuroprose repository, in compressed PostScript format: anonymous binary ftp to archive.cis.ohio-state.edu cd pub/neuroprose get cauwenberghs.nips93.ps.Z uncompress and print. The abstract follows below. --- Gert Cauwenberghs (gert at jhunix.hcf.jhu.edu)

We present experimental results on supervised learning of dynamical features in an analog VLSI neural network chip. The recurrent network, containing six continuous-time analog neurons and 42 free parameters (connection strengths and thresholds), is trained to generate time-varying outputs approximating given periodic signals presented to the network. The chip implements a stochastic perturbative algorithm, which observes the error gradient along random directions in the parameter space for error-descent learning. In addition to the integrated learning functions and the generation of pseudo-random perturbations, the chip provides for teacher forcing and long-term storage of the volatile parameters. The network learns a 1 kHz circular trajectory in 100 sec. The chip occupies 2 X 2 mm in a 2 um CMOS process, and dissipates 1.2 mW.

From yong at cns.brown.edu Wed Feb 9 14:42:14 1994 From: yong at cns.brown.edu (Yong Liu) Date: Wed, 9 Feb 94 14:42:14 EST Subject: some questions on training neural nets Message-ID: <9402091942.AA19342@cns.brown.edu>

Plutowski (Tue, 08 Feb 1994) wrote >No, actually it turns out that delete-1 cross-validation delivers >unbiased estimates of IMSE under fairly reasonable conditions. >(More precisely, it delivers estimates of IMSE_N + \sigma^2, >for training set size N and noise variance \sigma^2.) >Roughly, the noise must have variance the same everywhere in input space, >(or, "homoscedasticity" as the statisticians would say,) with examples >selected independently from the same, fixed environment (i.e., "i.i.d.") >the expectation of the squared-target must be finite (this just ensures >that conditional expectations of the target and the noise exist everywhere) >plus some conditions on the network to make it behave nicely. >For these same conditions, the estimate is additionally "conservative," >in that it does not, (asymptotically, anyway, as N grows large) >underestimate the expected squared error of the network for optimal weights.

Outliers are the data points that come in an "unexpected" way, both in the training data and in the future. For example, the data may be collected so that a proportion of them are typos. So as the size of the data set gets large, the number of outliers in it also gets large. Plutowski's assumption, as I understand it, is that the ratio of the number of outliers to the size of the data set is very small. One way to look at a data set containing outliers is to assume the noise is inhomoscedastic. Outlier data points have noise with large variance, and good data points have noise with small variance (Liu 1994). This is different from Plutowski's "homoscedasticity" assumption. Since we have no intention of predicting the value of outliers, robust estimation of both the parameters and the generalization error requires the "removal" of the outliers.
These discussions, I hope, convey the idea that when using cross-validation to estimate generalization error, some caution should be taken regarding the influence of bad data in the training set. ------------ Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912

From pluto at cs.ucsd.edu Wed Feb 9 17:52:55 1994 From: pluto at cs.ucsd.edu (Mark Plutowski) Date: Wed, 9 Feb 94 14:52:55 -0800 Subject: Outliers (Was: "Some questions on training..") Message-ID: <9402092252.AA14771@beowulf>

------- previous message ------- Dr. Liu writes: Outliers are the data points that come in an "unexpected" way, both in the training data and in the future. For example, the data may be collected so that a proportion of them are typos. So as the size of the data set gets large, the number of outliers in it also gets large. Plutowski's assumption, as I understand it, is that the ratio of the number of outliers to the size of the data set is very small. One way to look at a data set containing outliers is to assume the noise is inhomoscedastic. Outlier data points have noise with large variance, and good data points have noise with small variance (Liu 1994). This is different from Plutowski's "homoscedasticity" assumption. Since we have no intention of predicting the value of outliers, robust estimation of both the parameters and the generalization error requires the "removal" of the outliers. These discussions, I hope, convey the idea that when using cross-validation to estimate generalization error, some caution should be taken regarding the influence of bad data in the training set. ------------ Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 ------- end previous message -------

Dear Dr Liu, Yes, this points out the importance of examining the assumptions carefully to ensure that they apply to your particular learning task. As another example of where these results do not apply, note that the assumption of mean zero noise can be easily violated in discrimination tasks (often referred to as "classification" tasks) where the noise involves random misclassification of the target.

It also points out an appealing definition of "outlier". My interpretation of it is the following: When the noise variance on the target can depend upon the input (in statistical jargon, referred to as "heteroscedasticity of the conditional variance of Y_i given X_i") there is the possibility that a plot of the conditional target variance over the input space could display discontinuous jumps, corresponding to where it is more likely to encounter targets that are much more "noisy" - as compared to targets for neighboring inputs. Is this accurate?

I look forward to reading (Liu 94). Can you (or anyone else) point me to other references utilizing a similar definition of "outlier?" (IMHO) "outlier" is quite a value-laden term that I tend to avoid since I feel it has multiple and often ambiguous interpretations/definitions. I am currently doing work on detection of what I call "offliers" since I have a precise definition of what this means to me, and since I hesitate to use the term "outliers" for the reason stated above.

= Mark

PS: I would appreciate further opinions/references/examples of what "outlier" means (either in practice or in theory) which I will summarize and post to the mailing list.
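[ To make the mixture-noise view of outliers discussed above concrete, here is a
  minimal NumPy sketch -- an illustration only, not taken from (Liu 94) or any of
  the postings. It fits a line under a two-component Gaussian noise model (small
  variance for "good" points, large variance for "outliers") and lets an EM-style
  reweighting down-weight the high-variance points. The toy data, the two noise
  standard deviations, and the mixing proportion are assumptions chosen for the
  example; the component parameters are held fixed for simplicity. ]

import numpy as np

# Toy data: a linear trend with small noise, contaminated by a few gross errors.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 0.1 * rng.standard_normal(50)
y[::10] += 3.0                                  # every 10th point becomes an "outlier"

sig_good, sig_out, prior_good = 0.1, 1.0, 0.9   # assumed mixture-noise parameters
w = np.ones_like(x)                             # per-point weight = P(point is "good")
A = np.vstack([x, np.ones_like(x)]).T           # design matrix for a straight line

for _ in range(20):
    # M-step: weighted least-squares fit of the line.
    W = np.diag(w)
    coef = np.linalg.solve(A.T @ W @ A, A.T @ W @ y)
    r = y - A @ coef                            # residuals under the current fit
    # E-step: posterior probability that each residual came from the "good" component.
    p_good = prior_good * np.exp(-0.5 * (r / sig_good) ** 2) / sig_good
    p_out = (1.0 - prior_good) * np.exp(-0.5 * (r / sig_out) ** 2) / sig_out
    w = p_good / (p_good + p_out)

print("slope, intercept:", coef)                # close to (2, 0); the gross errors
print("smallest weights:", np.sort(w)[:5])      # are assigned weights near zero

The contaminated points end up with weights near zero, which is the "down-weighting" behaviour described in the follow-up postings below; an ordinary homoscedastic least-squares fit, by contrast, would be pulled toward them.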
From mlsouth at cssip.levels.unisa.edu.au Wed Feb 9 21:00:23 1994 From: mlsouth at cssip.levels.unisa.edu.au (mlsouth@cssip.levels.unisa.edu.au) Date: Thu, 10 Feb 1994 12:30:23 +1030 (CST) Subject: Missing values Message-ID: <8610.9402100200@hotham.levels.unisa.edu.au>

Connectionists, I did a short study on methods for classification of incomplete data 18 months ago. I compared the statistical methods of discrimination and classification and the EM algorithm to some neural methods. These methods could only be applied to an artificial data set due to the unavailability of a set of real data with missing values. Despite this, I believe that the conclusions are still sound. A copy of the paper ``Classification of incomplete data using neural networks'', M.L. Southcott, R.E. Bogner which was presented to the Fourth Australian Conference on Neural Networks (ACNN '93) is available via anonymous ftp from ftp.cssip.edu.au. The file is pub/users/michael/southcott.missing.ps

Michael Southcott mlsouth at cssip.edu.au Centre for Sensor Signal and Information Processing SPRI Building, The Levels, Pooraka 5095, South Australia.

From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de>

[ Due to a transmission error at our end, Lutz Prechelt's 28 Kbyte summary of the missing values discussion got truncated at about 16 Kbytes. Here is the second half of his summary. Sorry for any inconvenience. -- Dave Touretzky, CONNECTIONISTS moderator ]

------------------------------------------------------------------------ From: "N. Karunanithi" [...for nominal attributes:] Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. [...for continuous attributes:] This representation requires a GUESS. A nominal transformation may not be a proper representation in some cases. Assume that the output values range over a large numerical interval. For example, from 0.0 to 10,000.0. If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then in theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had a lot of problems with this approach. Maybe a log transformation before scaling would not be a bad choice. If you use a closed scaling you may have problems whenever a future value exceeds the maximum value of the numerical interval. For example, assume that the attribute is time, say in milliseconds. Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. [...for ordinal attributes:] I have compared Binary Encoding (1.2), Gray-Coded representation and straightforward scaling. Closed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find significant improvement in prediction accuracy. ###REF### N. Karunanithi, D. Whitley and Y. K.
Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp 563-574. From hicks at cs.titech.ac.jp Fri Feb 11 00:02:54 1994 From: hicks at cs.titech.ac.jp (hicks@cs.titech.ac.jp) Date: Fri, 11 Feb 94 00:02:54 JST Subject: Methods for improving generalization (was Re: some questions on ...) In-Reply-To: lange@ira.uka.de's message of Wed, 9 Feb 94 14:19:22 MET <"iraun1.ira.337:09.01.94.13.22.32"@ira.uka.de> Message-ID: <9402101503.AA16767@maruko.cs.titech.ac.jp> Dear Mr. Franke Lange (lange at ira.uka.de), On Wed, 9 Feb 94 14:19:22 MET you wrote: >But Soft Weight-Sharing does not really adapt to the data, >because you have to tune the same parameters as in normal Weight-Decay: >the parameters, that are used to handle the strength of the penalty-term. >The article of Nowlan and Hinton "Simplifying Neural Networks by Soft Weight- >Sharing" does not mention a method to do this automatically - so no "real" >adaption to the data is made. I say "every model is adaptive, and no model is adaptive, but some are more adaptive than others". Every model has parameters which are adjusted during learning. Penalty functions, including soft weight sharing, affects the prior distribution of weights and so can be thought of as just providing different models. All of these models adapt to data. On the other hand, every model >must< make some assumptions about which it is adamant. If it didn't there wouldn't be a model. These assumptions are non-adaptive to the data. (note1) You further wrote: >Maybe the methods of MacKay ("Bayesian Interpolation", Neural Comp. 4 (1992), >page 415-447) could be used to get a fully-automatic adaption. A combination >of this method with Weight-Decay or Soft Weight-Sharing would perhaps be >data-adaptive; but Soft Weight-Sharing alone has still a parameter, that is >not adapted by the data. The article was very enlighenting. Figure 1 on page 417 shows the 2 main steps of modeling which involve Baysian methods: (1) Fit each model to the data, (2) Assign preferences to the alternative models. The first step is the one we are all familiar with. The second one is the topic of the paper and consists of assigning objective preferences to each model: the probability of the data given the model is called the evidence for the model. Re your idea of "fully-automatic adaption". I will first review the parameters related to soft weight sharing: (a) the number of weight groups (b) the mean and variance of each group of weights. The weight penalty weighting is not arbitrary but determined by the variance of the squared error (which changes with time) divided by a factor (determined by cross-validation) to adjust to the number of free parameters. I think you mean by "fully-automatic adaption" that parameters (a) and (b) should be constant during stage (1), and after running the simulation for a large number of times with different values for (a) and (b) we should select the best ones with stage (2) methods: i.e. weighing the evidence for each model. This would take a long time BUT we might get a different answer from the one obtained by choosing (a) and (b) in stage 1. However, as to which way is best called "automatic", I would personaly favor the present stage (1) way, because it automatically (although maybe imperfectly) estimates the best parameters (a) and (b) implicitly during learning, leaving less labor for the later and harder stage (2). I realize I am getting semantic here. 
(note1) MacKay does give a special example of a 100% data-adaptive model: the Sure Thing hypothesis, which is that the data set will be what it is (predicted of course before seeing the data, selected afterwards), but this hypothesis has very small a priori probability. Too bad for our universe. The other example is of course stock tips, (predicted of course before seeing the money, collected afterwards), but look what happened to Michael Milken!

Respectfully Yours, Craig Hicks

Craig Hicks hicks at cs.titech.ac.jp | Kore ya kono Yuku mo kaeru mo Ogawa Laboratory, Dept. of Computer Science | Wakarete wa Shiru mo shiranu mo Tokyo Institute of Technology, Tokyo, Japan | Ausaka no seki lab:03-3726-1111 ext.2190 home:03-3785-1974 | (from hyaku-nin-issyu) fax: +81(3)3729-0685 (from abroad) 03-3729-0685 (from Japan)

From terry at salk.edu Thu Feb 10 12:45:15 1994 From: terry at salk.edu (Terry Sejnowski) Date: Thu, 10 Feb 94 09:45:15 PST Subject: robust statistics Message-ID: <9402101745.AA28545@salk.edu>

One man's outlier is another man's data point. Another way to handle outliers is not to remove them but to model them explicitly. Geoff Hinton has pointed out that character recognition can be made more robust by including models for background noise such as postmarks. Steve Nowlan and I recently used mixtures of expert networks to separate multiple interpenetrating flow fields -- the transparency problem for visual motion. The gating network was used to select regions of the visual field that contained reliable estimates of local velocity for which there was coherent global support. There is evidence for such selection neurons in area MT of primate visual cortex, a region of cortex that specializes in the detection of coherent motion. Terry -----

From yong at cns.brown.edu Thu Feb 10 13:39:19 1994 From: yong at cns.brown.edu (Yong Liu) Date: Thu, 10 Feb 94 13:39:19 EST Subject: outlier, robust statistics Message-ID: <9402101839.AA21430@cns.brown.edu>

Plutowski wrote (Wed, 9 Feb 94) >It also points out an appealing definition of "outlier". >My interpretation of it is the following: >When the noise variance on the target can depend upon the input >(in statistical jargon, referred to as "heteroscedasticity of >the conditional variance of Y_i given X_i") >there is the possibility that a plot of the conditional >target variance over the input space could display >discontinuous jumps, corresponding to where it is more likely >to encounter targets that are much more "noisy" - as compared >to targets for neighboring inputs. Is this accurate?

Yes. It is the heuristic behind modelling the error as a mixture of normal distributions in (Liu 94). In simple words, the statistical formulation regards the error for each data point as coming from a normal distribution with its own variance, and regards the variances as missing observations. By using a prior on the variance and the EM algorithm, one can estimate the variances. It turns out that during the estimation, the EM algorithm looks for the data points that have larger variances and down-weights those data points. This way of modelling is in agreement with Dr. Sejnowski's view >One man's outlier is another man's data point. Another >way to handle outliers is not to remove them but to model them >explicitly. ...

Plutowski also wrote (Wed, 9 Feb 94) >I look forward to reading (Liu 94). Can you (or anyone else) >point me to other references utilizing a similar definition >of "outlier?"
(IMHO) "outlier" is quite a value-laden term >that I tend to avoid since I feel it has multiple and >often ambiguous interpretations/definitions. Box and Tiao (1968) hold similar views. Outlier are generated from a distribution that is a perturbation to the underlying distribution, for example, a small amount of noise with ever changing distribution in the background. Huber's (1981) book is referred as a excellent reference. Anyway, no matter what outlier is, what one really want is to use a model/method that is not sensitive to them and predict the relevant information. References Box, G.E.P. and Tiao, G.C.(1968) A Bayesian approach to some outlier problem. Biometrika, 55, 119-129 Huber (1981) Robust Statistics. John Wiley & Sons, Inc.. BTW. I will be a Phd only three month later. ------- Yong Liu Box 1843 Department of Physics Institute for Brain and Neural Systems Brown University Providence, RI 02912 From zl at venezia.rockefeller.edu Thu Feb 10 20:54:42 1994 From: zl at venezia.rockefeller.edu (Zhaoping Li) Date: Thu, 10 Feb 94 20:54:42 -0500 Subject: Paper announcement on neuroprose Message-ID: <9402110154.AA00738@venezia.rockefeller.edu> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/li-zhaoping.stereocoding.ps.Z The file li-zhaoping.stereocoding.ps.Z is now available for copying from the Neuroprose archive. This is a 16 page paper plus 6 figures, to be published in Network: Computation in Neural Systems. --------------------------------------------------------------------------- Efficient Stereo Coding in the Multiscale Representation Zhaoping Li and Joseph J. Atick The Rockefeller University 1230 York Avenue New York, NY 10021, USA Abstract: Stereo images are highly redundant; the left and right frames of typical scenes are very similar. We explore the consequences of the hypothesis that cortical cells --- in addition to their multiscale coding strategies (Li and Atick 1994a) --- are concerned with reducing binocular redundancy due to correlations between the two eyes. We derive the most efficient coding strategies that achieve binocular decorrelation. It is shown that multiscale coding combined with a binocular decorrelation strategy leads to a rich diversity of cell types. In particular, the theory predicts monocular/binocular cells as well as a family of disparity selective cells, among which one can identify cells that are tuned-zero-excitatory, near, far, and tuned inhibitory. The theory also predicts correlations between ocular dominance, cell size, orientation, and disparity selectivities. Consequences on cortical ocular dominance column formation from abnormal developmental conditions such as strabismus and monocular eye closure are also predicted. These findings are compared with physiological measurements. 
Please address correspondence to Zhaoping Li ---------------------------------------------------------------------------- To obtain a copy: ftp archive.cis.ohio-state.edu login: anonymous password: cd pub/neuroprose binary get li-zhaoping.stereocoding.ps.Z quit Then at your system: uncompress li-zhaoping.stereocoding.ps lpr -P li-zhaoping.stereocoding.ps Zhaoping Li Box 272 Rockefeller University 1230 York Ave New York, NY 10021 phone: 212-327-7423 fax: 212-327-7422 zl at rockvax.rockefeller.edu From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de> [ My attempt to forward Lutz Prechelt's summary of the missing values discussion was twice foiled by technical problems. Note to future posters: do not attempt to transmit lines containing nothing but a period and a carriage return. It confuses our FTP software. Here is my final attempt to transmit the entire summary. If this fails, Lutz will just have to dump it to neuroprose and let people access it via FTP. Sorry about the repeated postings. -- Dave Touretzky, CONNECTIONISTS moderator ] ================================================================ From prechelt at ira.uka.de Tue Feb 8 07:19:16 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Tue, 08 Feb 1994 13:19:16 +0100 Subject: SUMMARY: encoding missing values Message-ID: <"irafs2.ira.957:08.01.94.12.19.58"@ira.uka.de> A few days ago, I posted some thoughts about how to represent missing input values to a neural network and asked for comments and further ideas. This message is a summary of the replies I received (some in my personal mail some in connectionists). I show the most significant comments and ideas and append versions of the messages that are trimmed to the most important parts (in case somebody wants to keep this discussion in his/her archive) This was my original message: ------------------------------------------------------------------------ From prechelt at ira.uka.de Wed Feb 2 03:58:56 1994 From: prechelt at ira.uka.de (Lutz Prechelt) Date: Wed, 02 Feb 1994 09:58:56 +0100 Subject: Encoding missing values Message-ID: I am currently thinking about the problem of how to encode data with attributes for which some of the values are missing in the data set for neural network training and use. An example of such data is the 'heart-disease' dataset from the UCI machine learning database (anonymous FTP on "ics.uci.edu" [128.195.1.1], directory "/pub/machine-learning-databases"). There are 920 records altogether with 14 attributes each. Only 299 of the records are complete, the others have one or several missing attribute values. 11% of all values are missing. I consider only networks that handle arbitrary numbers of real-valued inputs here (e.g. all backpropagation-suited network types etc). I do NOT consider missing output values. In this setting, I can think of several ways how to encode such missing values that might be reasonable and depend on the kind of attribute and how it was encoded in the first place: 1. Nominal attributes (that have n different possible values) 1.1 encoded "1-of-n", i.e., one network input per possible value, the relevant one being 1 all others 0. This encoding is very general, but has the disadvantage of producing networks with very many connections. 
Missing values can either be represented as 'all zero' or by simply treating 'is missing' as just another possible input value, resulting in a "1-of-(n+1)" encoding. 1.2 encoded binary, i.e., log2(n) inputs being used like the bits in a binary representation of the numbers 0...n-1 (or 1...n). Missing values can either be represented as just another possible input value (probably all-bits-zero is best) or by adding an additional network input which is 1 for 'is missing' and 0 for 'is present'. The original inputs should probably be all zero in the 'is missing' case. 2. continuous attributes (or attributes treated as continuous) 2.1 encoded as a single network input, perhaps using some monotone transformation to force the values into a certain distribution. Missing values are either encoded as a kind of 'best guess' (e.g. the average of the non-missing values for this attribute) or by using an additional network input being 0 for 'missing' and 1 for 'present' (or vice versa) and setting the original attribute input either to 0 or to the 'best guess'. (The 'best guess' variant also applies to nominal attributes above) 3. binary attributes (truth values) 3.1 encoded by one input: 0=false 1=true or vice versa Treat like (2.1) 3.2 encoded by one input: -1=false 1=true or vice versa In this case we may act as for (3.1) or may just use 0 to indicate 'missing'. 3.3 treat like nominal attribute with 2 possible values 4. ordinal attributes (having n different possible values, which are ordered) 4.1 treat either like continuous or like nominal attribute. If (1.2) is chosen, a Gray-Code should be used. Continuous representation is risky unless a 'sensible' quantification of the possible values is available. So far to my considerations. Now to my questions. a) Can you think of other encoding methods that seem reasonable ? Which ? b) Do you have experience with some of these methods that is worth sharing ? c) Have you compared any of the alternatives directly ? ------------------------------------------------------------------------ SUMMARY: For a), the following ideas were mentioned: 1. use statistical techniques to compute replacement values from the rest of the data set 2. use a Boltzman machine to do this for you 3. use an autoencoder feed forward network to do this for you 4. randomize on the missing values (correct in the Bayesian sense) For b), some experience was reported. I don't know how to summarize that nicely, so I just don't summarize at all. For c), no explicit quantitative results were given directly. Some replies suggest that data is not always missing randomly. The biases are often known and should be taken into account (e.g. medical tests are not carried out (resulting in missing data) for moreless healthy persons more often than for ill persons). Many replies contained references to published work on this area, from NN, machine learning, and mathematical statistics. To ease searching for these references in the replies below, I have marked them with the string ##REF## (if you have a 'grep' program that extracts whole paragraphs, you can get them all out with one command). Thanks to all who answered. These are the trimmed versions of the replies: ------------------------------------------------------------------------ From: tgd at research.CS.ORST.EDU (Tom Dietterich) [...for nominal attributes:] An alternative here is to encode them as bit-strings in a error-correcting code, so that the hamming distance between any two bit strings is constant. 
This would probably be better than a dense binary encoding. The cost in additional inputs is small. I haven't tried this though. My guess is that distributed representations at the input are a bad idea. One must always determine WHY the value is missing. In the heart disease data, I believe the values were not measured because other features were believed to be sufficient in each case. In such cases, the network should learn to down-weight the importance of the feature (which can be accomplished by randomizing it---see below). In other cases, it may be more appropriate to treat a missing value as a separate value for the feature, e.g., in survey research, where a subject chooses not to answer a question. [...for continuous attributes:] Ross Quinlan suggests encoding missing values as the mean observed output value when the value is missing. He has tried this in his regression tree work. Another obvious approach is to randomize the missing values--on each presentation of the training example, choose a different, random, value for each missing input feature. This is the "right thing to do" in the bayesian sense. [...for binary attributes:] I'm skeptical of the -1,0,1 encoding, but I think there is more research to be done here. [...for ordinal attributes:] I would treat them as continuous. ------------------------------------------------------------------------ From: shavlik at cs.wisc.edu (Jude W. Shavlik) We looked at some of the methods you talked about in the following article in the journal Machine Learning. ##REF## %T Symbolic and Neural Network Learning Algorithms: An Experimental Comparison %A J. W. Shavlik %A R. J. Mooney %A G. G. Towell %J Machine Learning %V 6 %N 2 %P 111-143 %D 1991 ------------------------------------------------------------------------ From: hertz at nordita.dk (John Hertz) It seems to me that the most natural way to handle missing data is to leave them out. You can do this if you work with a recurrent network (fx Boltzmann machine) where the inputs are fed in by clamping the input units to the given input values and the rest of the net relaxes to a fixed point, after which the output is read off the output units. If some of the input values are missing, the corresponding input units are just left unclamped, free to relax to values most consistent with the known inputs. I have meant for a long time to try this on some medical prognosis data I was working on, but I never got around to it, so I would be happy to hear how it works if you try it. ------------------------------------------------------------------------ From: jozo at sequoia.WPI.EDU (Jozo Dujmovic) In the case of clustering benchmark programs I frequently have the the problem of estimation of missing data. A relatively simple SW that implements a heuristic algorithm generates estimates having the average error of 8%. NN will somehow "implicitly estimate" the missing data. The two approaches might even be in some sense equivalent (?). Jozo [ I suspect that they are not: When you generate values for the missing items and put them in the training set, the network loses the information that this data is only estimated. Since estimations are not as reliable as true input data, the network will weigh inputs that have lots of generated values as less important. If it gets the 'is missing' information explicitly, it can discriminate true values from estimations instead. 
] ------------------------------------------------------------------------ From: guy at cs.uq.oz.au A final year student of mine worked on the problem of dealing with missing inputs, without much success. However, the student was not very good, so take the following opinions with a pinch of salt. We (very tentatively) came to the conclusion that if the inputs were redundant, the problem was easy; if the missing input contained vital information, the problem was pretty much impossible. We used the heart disease data. I don't recommend it for the missing inputs problem. All of the inputs are very good indicators of the correct result, so missing inputs were not important. Apparently there is a large literature in statistics on dealing with missing inputs. Anthony Adams (University of Tasmania) has published a technical report on this. His email address is "A.Adams at cs.utas.edu.au". ##REF## @techreport{kn:Vamplew-91, author = "P. Vamplew and A. Adams", address = {Hobart, Tasmania, Australia}, institution = {Department of Computer Science, University of Tasmania}, number = {R1-4}, title = {Real World Problems in Backpropagation: Missing Values and Generalisability}, year = {1991} }

------------------------------------------------------------------------ From: Mike Southcott ##REF## I wrote a paper for the Australian conference on neural networks in 1993. ``Classification of Incomplete Data using neural networks'' Southcott, Bogner. You may find it interesting. You may not be able to get the proceedings for this conference, but I am in the process of digging up a postscript copy for someone in the States, so when I do that, I will send you a copy.

------------------------------------------------------------------------ From: Eric Saund I have done some work on unsupervised learning of multiple cause clusters in binary data, for which an appropriate encoding scheme is -1 = FALSE, 1 = TRUE, and 0 = NO DATA. This has worked well for me, but my paradigm is not your standard feedforward network and uses a different activation function from the standard weighted sum followed by sigmoid squashing. I presented the paper on this work at NIPS: ##REF## Saund, Eric; 1994; "Unsupervised Learning of Mixtures of Multiple Causes in Binary Data," in Advances in Neural Information Processing Systems -6-, Cowan, J., Tesauro, G., and Alspector, J., eds. Morgan Kaufmann, San Francisco.

------------------------------------------------------------------------ From: Thierry.Denoeux at hds.univ-compiegne.fr In a recent mailing, Lutz Prechelt mentioned the interesting problem of how to encode attributes with missing values as inputs to a neural network. I have recently been faced with that problem while applying neural nets to rainfall prediction using weather radar images. The problem was to classify pairs of "echoes" -- defined as groups of connected pixels with reflectivity above some threshold -- taken from successive images as corresponding to the same rain cell or not. Each pair of echoes was described by a list of attributes. Some of these attributes, referring to the past of a sequence, were not defined for some instances. To encode these attributes with potentially missing values, we applied two different methods actually suggested by Lutz: - the replacement of the missing value by a "best-guess" value - the addition of a binary input indicating whether the corresponding attribute was present or absent. Significantly better results were obtained by the second method.
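[ For concreteness, a minimal sketch of the two encodings compared above -- not
  code from the posting or from the ICANN'93 paper. Missing entries are assumed
  to be marked with NaN, and the function name is made up for the example. ]

import numpy as np

def encode_continuous(values, scheme="flag"):
    """Turn one continuous attribute (with missing entries as NaN) into network inputs.

    scheme="guess": replace each missing entry by the mean of the observed values
                    (the 'best guess' of option 2.1 in the summary).
    scheme="flag" : same replacement, plus an extra 0/1 input marking 'is missing'.
    """
    col = np.asarray(values, dtype=float)
    missing = np.isnan(col)
    guess = np.nanmean(col)                       # best guess = mean of observed values
    filled = np.where(missing, guess, col)
    if scheme == "guess":
        return filled[:, None]                    # one network input per attribute
    return np.column_stack([filled, missing.astype(float)])  # value + presence flag

raw = [0.7, np.nan, 1.3, 0.9, np.nan]
print(encode_continuous(raw, "guess"))            # shape (5, 1): imputed values only
print(encode_continuous(raw, "flag"))             # shape (5, 2): imputed values + indicator

With the indicator input, the network can learn to discount the imputed value whenever the flag says the attribute was absent, which is one plausible reading of why the second method did better here.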
This work was presented at ICANN'93 last September: ##REF## X. Ding, T. Denoeux & F. Helloco (1993). Tracking rain cells in radar images using multilayer neural networks. In Proc. of ICANN'93, Springer-Verlag, p. 962-967.

------------------------------------------------------------------------ From: "N. Karunanithi" [...for nominal attributes:] Both methods have the problem of poor scalability. If the number of missing values increases then the number of additional inputs will increase linearly in 1.1 and logarithmically in 1.2. In fact, 1-of-n encoding may be a poor choice if (1) the number of input features is large and (2) such an expanded dimensional representation does not become a (semi) linearly separable problem. Even if it becomes a linearly separable problem, the overall complexity of the network can sometimes be very high. [...for continuous attributes:] This representation requires a GUESS. A nominal transformation may not be a proper representation in some cases. Assume that the output values range over a large numerical interval. For example, from 0.0 to 10,000.0. If you use a simple scaling like dividing by 10,000.0 to make it between 0.0 and 1.0, this will result in poor accuracy of prediction. If the attribute is on the input side, then in theory the scaling is unnecessary because the input layer weights will scale accordingly. However, in practice I had a lot of problems with this approach. Maybe a log transformation before scaling would not be a bad choice. If you use a closed scaling you may have problems whenever a future value exceeds the maximum value of the numerical interval. For example, assume that the attribute is time, say in milliseconds. Any future time from the point of reference can exceed the limit. Hence any closed scaling will not work properly. [...for ordinal attributes:] I have compared Binary Encoding (1.2), Gray-Coded representation and straightforward scaling. Closed scaling seems to do a good job. I have also compared open scaling and closed scaling and did find significant improvement in prediction accuracy. ###REF### N. Karunanithi, D. Whitley and Y. K. Malaiya, "Prediction of Software Reliability Using Connectionist Models", IEEE Trans. Software Eng., July 1992, pp 563-574. N. Karunanithi and Y. K. Malaiya, "The Scaling Problem in Neural Networks for Software Reliability Prediction", Proc. IEEE Int. Symposium on Rel. Eng., Oct. 1992, pp. 776-82. I have not found a simple solution that is general. I think representation in general and missing information in particular are open problems within connectionist research. I am not sure we will have a magic bullet for all problems. The best approach is to come up with a specific solution for a given problem.

------------------------------------------------------------------------ From: Bill Skaggs There is at least one kind of network that has no problem (in principle) with missing inputs, namely a Boltzmann machine. You just refrain from clamping the input node whose value is missing, and treat it like an output node or hidden unit. This may seem to be irrelevant to anything other than Boltzmann machines, but I think it could be argued that nothing very much simpler is capable of dealing with the problem. When you ask a network to handle missing inputs, you are in effect asking it to do pattern completion on the input layer, and for this a Boltzmann machine or some other sort of attractor network would seem to be required.

------------------------------------------------------------------------ From: "Scott E.
Fahlman" [Follow-up to Bill Skaggs:] Good point, but perhaps in need of clarification for some readers: There are two ways of training a Boltzmann machine. In one (the original form), there is no distinction between input and output units. During training we alternate between an instruction phase, in which all of the externally visible units are clamped to some pattern, and a normalization phase, in which the whole network is allow to run free. The idea is to modify the weights so that, when running free, the external units assume the various pattern values in the training set in their proper frequencies. If only some subset of the externally visible units are clamped to certain values, the net will produce compatible completions in the other units, again with frequencies that match this part of the training set. A net trained in this way will (in principle -- it might take a *very* long time for anything complicated) do what you suggest: Complete an "input" pattern and produce a compatible output at the same time. This works even if the input is *totally* missing. I believe it was Geoff Hinton who realized that a Boltzmann machine could be trained more efficiently if you do make a distinction between input and output units, and don't waste any of the training effort learning to reconstruct the input. In this model, the instruction phase clamps both input and output units to some pattern, while the normalization phase clamps only the input units. Since the input units are correct in both cases, all of the networks learning power (such as it is) goes into producing correct patterns on the output units. A net trained in this way will not do input-completion. I bring this up because I think many people will only have seen the latter kind of Boltzmann training, and will therefore misunderstand your observation. By the way, one alternative method I have seen proposed for reconstructing missing input values is to first train an auto-encoder (with some degree of bottleneck to get generalization) on the training set, and then feed the output of this auto-encoder into the classification net. The auto-encoder should be able to replace any missing values with some degree of accuracy. I haven't played with this myself, but it does sound plausible. If anyone can point to a good study of this method, please post it here or send me E-mail. ------------------------------------------------------------------------ From: "David G. Stork" ##REF## There is a provably optimal method for performing classification with missing inputs, described in Chapter 2 of "Pattern Classification and Scene Analysis" (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, which avoids the ad-hoc heuristics that have been described by others. Those interested in obtaining Chapter two via ftp should contact me. ------------------------------------------------------------------------ From: Wray Buntine This missing value problem is of course shared amongst all the learning communities, artificial intelligence, statistics, pattern recognition, etc., not just neural networks. A classic study in this area, which includes most suggestions I've read here so far, is ##REF## @inproceedings{quinlan:ml6, AUTHOR = "J.R. 
Quinlan", TITLE = "Unknown Attribute Values in Induction", YEAR = 1989, BOOKTITLE = "Proceedings of the Sixth International Machine Learning Workshop", PUBLISHER = "Morgan Kaufmann", ADDRESS = "Cornell, New York"} The most frequently cited methods I've seen, and they're so common amongst the different communities its hard to lay credit: 1) replace missings by their some best guess 2) fracture the example into a set of fractional examples each with the missing value filled in somehow 3) call the missing value another input value 3 is a good thing to do if they are "informative" missing, i.e. if someone leaves the entry "telephone number" blank in a questionaire, then maybe they don't have a telephone, but probably not good otherwise unless you have loads of data and don't mind all the extra example types generated (as already mentioned) 1 is a quick and dirty hack at 2. How good depends on your application. 2 is an approximation to the "correct" approach for handling "non-informative" missing values according to the standard "mixture model". The mathematics for this is general and applies to virtually any learning algorithm trees, feed-forward nets, linear regression, whatever. We do it for feed-forward nets in ##REF## @article{buntine.weigend:bbp, AUTHOR = "W.L. Buntine and A.S. Weigend", TITLE = "Bayesian Back-Propagation", JOURNAL = "Complex Systems", Volume = 5, PAGES = "603--643", Number = 1, YEAR = "1991" } and see Tresp, Ahmad & Neuneier in NIPS'94 for an implementation. But no doubt someone probably published the general idea back in the 50's. I certainly wouldn't call missing values an open problem. Rather, "efficient implementations of the standard approaches" is, in some cases, an open problem. ------------------------------------------------------------------------ From: Volker Tresp In general, the solution to the missing-data problem depends on the missing-data mechanism. For example, if you sample the income of a population and rich people tend to refuse the answer the mean of your sample is biased. To obtain an unbiased solution you would have to take into account the missing-data mechanism. The missing-data mechanism can be ignored if it is independent of the input and the output (in the example: the likelihood that a person refuses to answer is independent of the person's income). Most approaches assume that the missing-data mechanism can be ignored. There exist a number of ad hoc solutions to the missing-data problem but it is also possible to approach the problem from a statistical point of view. In our paper (which will be published in the upcoming NIPS-volume and which will be available on neuroprose shortly) we discuss a systematic likelihood-based approach. NN-regression can be framed as a maximum likelihood learning problem if we assume the standard signal plus Gaussian noise model P(x, y) = P(x) P(y|x) \propto P(x) exp(-1/(2 \sigma^2) (y - NN(x))^2). By deriving the probability density function for a pattern with missing features we can formulate a likelihood function including patterns with complete and incomplete features. The solution requires an integration over the missing input. In practice, the integral is approximated using a numerical approximation. For networks of Gaussian basis functions, it is possible to obtain closed-form solutions (by extending the EM algorithm). Our paper also discusses why and when ad hoc solutions --such as substituting the mean for an unknown input-- are harmful. 
For example, if the mapping is approximately linear substituting the mean might work quite well. In general, although, it introduces bias. Training with missing and noisy input data is described in: ##REF## ``Training Neural Networks with Deficient Data,'' V. Tresp, S. Ahmad and R. Neuneier, in Cowan, J. D., Tesauro, G., and Alspector, J. (eds.), {\em Advances in Neural Information Processing Systems 6}, Morgan Kaufmann, 1994. A related paper by Zoubin Ghahramani and Michael Jordan will also appear in the upcoming NIPS-volume. Recall with missing and noisy data is discussed in (available in neuroprose as ahmad.missing.ps.Z): ``Some Solutions to the Missing Feature Problem in Vision,'' S. Ahmad and V. Tresp, in {\em Advances in Neural Information Processing Systems 5,} S. J. Hanson, J. D. Cowan, and C. L. Giles eds., San Mateo, CA, Morgan Kaufman, 1993. ------------------------------------------------------------------------ From: Subhash Kak Missing values in feedback networks raise interesting questions: Should these values be considered "don't know" values or should these be generated in some "most likelihood" fashion? These issues are discussed in the following paper: ##REF## S.C. Kak, "Feedback neural networks: new characteristics and a generalization", Circuits, Systems, Signal Processing, vol. 12, no. 2, 1993, pp. 263-278. ------------------------------------------------------------------------ From: Zoubin Ghahramani I have also been looking into the issue of encoding and learning from missing values in a neural network. The issue of handling missing values has been addressed extensively in the statistics literature for obvious reasons. To learn despite the missing values the data has to be filled in, or the missing values integrated over. The basic question is how to fill in the missing data. There are many different methods for doing this in stats (mean imputation, regression imputation, Bayesian methods, EM, etc.). For good reviews see (Little and Rubin 1987; Little, 1992). I do not in general recommend encoding "missing" as yet another value to be learned over. Missing means something in a statistical sense -- that the input could be any of the values with some probability distribution. You could, for example, augment the original data filling in different values for the missing data points according to a prior distribution. Then the training would assign different weights to the artificially filled-in data points depending on how well they predict the output (their posterior probability). This is essentially the method proposed by Buntine and Weigand (1991). Other approaches have been proposed by Tresp et al. (1993) and Ahmad and Tresp (1993). I have just written a paper on the topic of learning from incomplete data. In this paper I bring a statistical algorithm for learning from incomplete data, called EM, into the framework of nonlinear function approximation and classification with missing values. This approach fits the data iteratively with a mixture model and uses that same mixture model to effectively fill in any missing input or output values at each step. You can obtain the preprint by ftp psyche.mit.edu login: anonymous cd pub get zoubin.nips93.ps To obtain code for the algorithm please contact me directly. ##REF## Ahmad, S and Tresp, V (1993) "Some Solutions to the Missing Feature Problem in Vision." In Hanson, S.J., Cowan, J.D., and Giles, C.L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. 
Buntine, WL, and Weigand, AS (1991) "Bayesian back-propagation." Complex Systems. Vol 5 no 6 pp 603-43 Ghahramani, Z and Jordan MI (1994) "Supervised learning from incomplete data via an EM approach" To appear in Cowan, J.D., Tesauro, G., and Alspector,J. (eds.). Advances in Neural Information Processing Systems 6. Morgan Kaufmann Publishers, San Francisco, CA, 1994. Little, RJA (1992) "Regression With Missing X's: A Review." Journal of the American Statistical Association. Volume 87, Number 420. pp. 1227-1237 Little, RJA. and Rubin, DB (1987). Statistical Analysis with Missing Data. Wiley, New York. Tresp, V, Hollatz J, Ahmad S (1993) "Network structuring and training using rule-based knowledge." In Hanson, S.J., Cowan, J.D., and Giles, C.~L., editors, Advances in Neural Information Processing Systems 5. Morgan Kaufmann Publishers, San Mateo, CA. ------------------------------------------------------------------------ That's it. Lutz Lutz Prechelt (email: prechelt at ira.uka.de) | Whenever you Institut fuer Programmstrukturen und Datenorganisation | complicate things, Universitaet Karlsruhe; 76128 Karlsruhe; Germany | they get (Voice: ++49/721/608-4068, FAX: ++49/721/694092) | less simple. From n.burgess at ucl.ac.uk Fri Feb 11 05:00:20 1994 From: n.burgess at ucl.ac.uk (Neil Burgess) Date: Fri, 11 Feb 94 10:00:20 +0000 Subject: pre-print in neuroprose Message-ID: <141927.9402111000@link-1.ts.bcc.ac.uk> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/burgess.hipmod.ps.Z *****do not forward to other groups***** Dear connectionists, the following preprint has been put on neuroprose, contact n.burgess at ucl.ac.uk with any retrieval problems, --Neil `A model of hippocampal function' Neil Burgess, Michael Recce and John O'Keefe Dept. of Anatomy, University College, London WC1E 6BT, U.K. The firing rate maps of hippocampal place cells recorded in a freely moving rat are viewed as a set of approximate radial basis functions over the (2-D) environment of the rat. It is proposed that these firing fields are constructed during exploration from `sensory inputs' (tuning curve responses to the distance of cues from the rat) and used by cells downstream to construct firing rate maps that approximate any desired surface over the environment. It is shown that, when a rat moves freely in an open field, the phase of firing of a place cell (with respect to the EEG $\theta$ rhythm) contains information as to the relative position of its firing field from the rat. A model of hippocampal function is presented in which the firing rate maps of cells downstream of the hippocampus provide a `population vector' encoding the instantaneous direction of the rat from a previously encountered reward site, enabling navigation to it. A neuronal simulation, involving reinforcement only at the goal location, provides good agreement with single cell recording from the hippocampal region, and can navigate to reward sites in open fields using sensory input from environmental cues. The system requires only brief exploration, performs latent learning, and can return to a goal location after encountering it only once. Neural Networks, to be published. 26 pages, 2Mbytes uncompressed. From eric at research.nj.nec.com Fri Feb 11 11:11:29 1994 From: eric at research.nj.nec.com (Eric B. 
Baum) Date: Fri, 11 Feb 94 11:11:29 EST Subject: No subject Message-ID: <9402111611.AA00562@yin> Fifth Annual NEC Research Symposium NATURAL AND ARTIFICIAL PARALLEL COMPUTATION PRINCETON, NJ MAY 4 - 5, 1994 NEC Research Institute is pleased to announce that the Fifth Annual NEC Research Symposium will be held at the Hyatt Regency Hotel in Princeton, New Jersey on May 4 and 5, 1994. The title of this year's symposium is Natural and Artificial Parallel Computation. The conference will feature ten invited talks. The speakers are: - Larry Abbott, Brandeis University, "Activity- Dependent Modulation of Intrinsic Neuronal Properties" - Catherine Carr, University of Maryland, "Time Coding in the Central Nervous System" - Bill Dally, MIT, "Bandwidth, Granularity, and Mechanisms: Key Issues in the Design of Parallel Computers" - Amiram Grinvald, Weitzmann Institute, "Architecture and Dynamics of Cell Assemblies in the Visual Cortex; New Perspectives From Fast and Slow Optical Imaging" - Akihiko Konagaya, NEC C&C Research Labs, "Knowledge Discovery in Genetic Sequences" - Chris Langton, Santa Fe Institute, "SWARM: An Agent Based Simulation System for Research in Complex Systems" - Thomas Ray, University of Delaware and ATR, "Evolution and Ecology of Digital Organisms" - Shuichi Sakai, Real World Computing Partnership, "RWC Massively Parallel ComputerProject" - Shigeru Tanaka, NEC Fundamental Research Labs, "A Mathematical Theory for the Experience- Dependent Development of Visual Cortex" - Leslie Valiant, Harvard University and NECI, "A Computational Model for Cognition" There will be no contributed papers. Registration is free of charge, but space is limited. Registrations will be accepted on a first come, first served basis. YOU MUST PREREGISTER. There will be no on-site registration. To preregister by e-mail, send a request to: symposium at research.nj.nec.com. Registrants will receive an acknowledgment, space allowing. A request for preregistration is also possible by regular mail to Mrs. Irene Parker, NEC Research Institute, 4 Independence Way, Princeton, NJ 08540. Registrants will also be invited to an Open House/Poster Session and Reception at NEC Research Institute on Tuesday, May 3. The Open House will begin at 3:30 PM and the Reception will begin at 5:30 PM. In order to estimate headcount, please indicate in your preregistration request whether you plan to attend the Open House on May 3. Registrants are expected to make their own arrangements for accommodations. Provided below is a list of hotels in the area together with daily room rates. Please ask for the NEC Corporate Rate when reserving a room. Sessions will start at 8:15 AM Wednesday, May 4 and will be scheduled to finish at approximately 3:30 PM on Thursday, May 5. 
Red Roof Inn, South Brunswick (908)821-8800 $37.99 Novotel Hotel, Princeton (609)520-1200 $68.00 ($74.00/w breakfast) Palmer Inn, Princeton (609)452-2500 $73.00 Marriott Residence Inn, Princeton (908)329-9600 $85.00 w/continental breakfast Summerfield Suites, Princeton (609)951-0009 $92.00 Hyatt Regency, Princeton (609)987-1234 $105.00 Marriott Hotel, Princeton (609)452-7900 $125.00 - - - - - - - - - - - - - - - - - - - - - - - - - - PLEASE RESPOND BY E-MAIL TO: symposium at research.nj.nec.com I would like to attend: _____ Open House _____ Symposium Name: ____________________________ Organization: ____________________________ E-mail address: ____________________________ Phone number: ____________________________ From bishopc at helios.aston.ac.uk Fri Feb 11 09:59:33 1994 From: bishopc at helios.aston.ac.uk (bishopc) Date: Fri, 11 Feb 94 14:59:33 GMT Subject: Postdoctoral Fellowships Message-ID: <27570.9402111459@sun.aston.ac.uk> ------------------------------------------------------------------- Aston University Neural Computing Research Group TWO POSTDOCTORAL RESEARCH FELLOWSHIPS: -------------------------------------- FUNDAMENTAL RESEARCH IN NEURAL NETWORKS Two postdoctoral fellowships, each with a duration of 3 years, will be funded by the U.K. Science and Engineering Research Council, and are to commence on or after 1 April 1994. These posts are part of a major project to be undertaken within the Neural Computing Research Group at Aston, and will involve close collaboration with Professors Chris Bishop and David Lowe, with additional input from Professor David Bounds. This interdisciplinary program requires researchers capable of extending theoretical concepts, and developing algorithmic and proof-of-principle demonstrations through software simulation. The two Research Fellows will work on distinct, though closely related, areas as follows: 1. Generalization in Neural Networks The usual approach to complexity optimisation and model order selection in neural networks makes use of computationally intensive cross-validation techniques. This project will build on recent developments in the use of Bayesian methods and the description length formalism to develop systematic techniques for model optimization in feedforward neural networks from a principled statistical perspective. In its later stages, the project will demonstrate the practical utility of the techniques which emerge, in the context of a wide range of real-world applications. 2. Dynamic Neural Networks Current embodiments of neural networks, when applied to `dynamic' events such as time series forecasting, are successful only if the underlying `generator' of the data is stationary. If the underlying generator is slowly varying in time then we do not have a principled basis for designing effective neural network structures, though ad hoc procedures do exist. This program will address some of the key issues in this area using techniques from statistical pattern processing and dynamical systems theory. In addition, application studies will be conducted which will focus on time series problems and tracking in non-stationary noise. If you wish to be considered for these positions, please send a CV and publications list, together with the names of 3 referees, to: Professor Chris M Bishop Neural Computing Research Group Aston University Birmingham B4 7ET, U.K. Tel: 021 359 3611 ext. 
4270 Fax: 021 333 6215 e-mail: c.m.bishop at aston.ac.uk From ahmad at interval.com Fri Feb 11 12:04:37 1994 From: ahmad at interval.com (ahmad@interval.com) Date: Fri, 11 Feb 94 09:04:37 -0800 Subject: Computing visual feature correspondences Message-ID: <9402111704.AA28021@iris10.interval.com> The following paper is available for anonymous ftp on archive.cis.ohio-state.edu (128.146.8.52), in directory pub/neuroprose, as file "ahmad.correspondence.ps.Z": Feature Densities are Required for Computing Feature Correspondences Subutai Ahmad Interval Research Corporation 1801-C Page Mill Road, Palo Alto, CA 94304 E-mail: ahmad at interval.com Abstract The feature correspondence problem is a classic hurdle in visual object-recognition concerned with determining the correct mapping between the features measured from the image and the features expected by the model. In this paper we show that determining good correspondences requires information about the joint probability density over the image features. We propose "likelihood based correspondence matching" as a general principle for selecting optimal correspondences. The approach is applicable to non-rigid models, allows nonlinear perspective transformations, and can optimally deal with occlusions and missing features. Experiments with rigid and non-rigid 3D hand gesture recognition support the theory. The likelihood based techniques show almost no decrease in classification performance when compared to performance with perfect correspondence knowledge. To appear in: Cowan, J.D., Tesauro, G., and Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. San Francisco CA: Morgan Kaufmann, 1994. From ahmad at interval.com Fri Feb 11 13:03:31 1994 From: ahmad at interval.com (ahmad@interval.com) Date: Fri, 11 Feb 94 10:03:31 -0800 Subject: Training NN's with missing or noisy data Message-ID: <9402111803.AA28794@iris10.interval.com> The following paper is available for anonymous ftp on archive.cis.ohio-state.edu (128.146.8.52), in directory pub/neuroprose, as file "tresp.deficient.ps.Z". (The companion paper, "Some Solutions to the Missing Feature Problem in Vision" is available as "ahmad.missing.ps.Z") Training Neural Networks with Deficient Data Volker Tresp Subutai Ahmad Siemens AG Interval Research Corporation Central Research 1801-C Page Mill Rd. 81730 Muenchen, Germany Palo Alto, CA 94304 tresp at zfe.siemens.de ahmad at interval.com Ralph Neuneier Siemens AG Central Research Otto-Hahn-Ring 6 81730 Muenchen, Germany ralph at zfe.siemens.de Abstract: We analyze how data with uncertain or missing input features can be incorporated into the training of a neural network. The general solution requires a weighted integration over the unknown or uncertain input although computationally cheaper closed-form solutions can be found for certain Gaussian Basis Function (GBF) networks. We also discuss cases in which heuristical solutions such as substituting the mean of an unknown input can be harmful. The paper will appear in: Cowan, J.D., Tesauro, G., and Alspector, J. (Eds.), Advances in Neural Information Processing Systems 6. San Francisco CA: Morgan Kaufmann, 1994. Subutai Ahmad Interval Research Corporation Phone: 415-354-3639 1801-C Page Mill Rd. 
Fax: 415-354-0872 Palo Alto, CA 94304 E-mail: ahmad at interval.com From mel at klab.caltech.edu Fri Feb 11 15:05:47 1994 From: mel at klab.caltech.edu (Bartlett Mel) Date: Fri, 11 Feb 94 12:05:47 PST Subject: NIPS*94 Call for Papers Message-ID: <9402112005.AA10791@plato.klab.caltech.edu> ********* PLEASE NOTE NEW SUBMISSIONS FORMAT FOR 1994 ********* CALL FOR PAPERS Neural Information Processing Systems -Natural and Synthetic- Monday, November 28 - Saturday, December 3, 1994 Denver, Colorado This is the eighth meeting of an interdisciplinary conference which brings together neuroscientists, engineers, computer scientists, cognitive scientists, physicists, and mathematicians interested in all aspects of neural processing and computation. The conference will include invited talks, and oral and poster presentations of refereed papers. There will be no parallel sessions. There will also be one day of tutorial presentations (Nov 28) preceding the regular session, and two days of focused workshops will follow at a nearby ski area (Dec 2-3). Major categories for paper submission, and examples of keywords within categories, are the following: Neuroscience: systems physiology, cellular physiology, signal and noise analysis, oscillations, synchronization, inhibition, neuromodulation, synaptic plasticity, computational models. Theory: computational learning theory, complexity theory, dynamical systems, statistical mechanics, probability and statistics, approximation theory. Implementations: VLSI, optical, parallel processors, software simulators, implementation languages. Algorithms and Architectures: learning algorithms, constructive/pruning algorithms, localized basis functions, decision trees, recurrent networks, genetic algorithms, combinatorial optimization, performance comparisons. Visual Processing: image recognition, coding and classification, stereopsis, motion detection, visual psychophysics. Speech, Handwriting and Signal Processing: speech recognition, coding and synthesis, handwriting recognition, adaptive equalization, nonlinear noise removal. Applications: time-series prediction, medical diagnosis, financial analysis, DNA/protein sequence analysis, music processing, expert systems. Cognitive Science & AI: natural language, human learning and memory, perception and psychophysics, symbolic reasoning. Control, Navigation, and Planning: robotic motor control, process control, navigation, path planning, exploration, dynamic programming. Review Criteria: All submitted papers will be thoroughly refereed on the basis of technical quality, novelty, significance and clarity. Submissions should contain new results that have not been published previously. Authors are encouraged to submit their most recent work, as there will be an opportunity after the meeting to revise accepted manuscripts before submitting final camera-ready copy. ********** PLEASE NOTE NEW SUBMISSIONS FORMAT FOR 1994 ********** Paper Format: Submitted papers may be up to eight pages in length. The page limit will be strictly enforced, and any submission exceeding eight pages will not be considered. Authors are encouraged (but not required) to use the NIPS style files obtainable by anonymous FTP at the sites given below. Papers must include physical and e-mail addresses of all authors, and must indicate one of the nine major categories listed above, keyword information if appropriate, and preference for oral or poster presentation. Unless otherwise indicated, correspondence will be sent to the first author. 
Submission Instructions: Send six copies of submitted papers to the address given below; electronic or FAX submission is not acceptable. Include one additional copy of the abstract only, to be used for preparation of the abstracts booklet distributed at the meeting. Submissions mailed first-class within the US or Canada must be postmarked by May 21, 1994. Submissions from other places must be received by this date. Mail submissions to: David Touretzky NIPS*94 Program Chair Computer Science Department Carnegie Mellon University 5000 Forbes Avenue Pittsburgh PA 15213-3890 USA Mail general inquiries/requests for registration material to: NIPS*94 Conference NIPS Foundation PO Box 60035 Pasadena, CA 91116-6035 USA (e-mail: nips94 at caltech.edu) FTP sites for LaTex style files "nips.tex" and "nips.sty": helper.systems.caltech.edu (131.215.68.12) in /pub/nips b.gp.cs.cmu.edu (128.2.242.8) in /usr/dst/public/nips NIPS*94 Organizing Committee: General Chair, Gerry Tesauro, IBM; Program Chair, David Touretzky, CMU; Publications Chair, Joshua Alspector, Bellcore; Publicity Chair, Bartlett Mel, Caltech; Workshops Chair, Todd Leen, OGI; Treasurer, Rodney Goodman, Caltech; Local Arrangements, Lori Pratt, Colorado School of Mines; Tutorials Chairs, Steve Hanson, Siemens and Gerry Tesauro, IBM; Contracts, Steve Hanson, Siemens and Scott Kirkpatrick, IBM; Government & Corporate Liaison, John Moody, OGI; Overseas Liaisons: Marwan Jabri, Sydney Univ., Mitsuo Kawato, ATR, Alan Murray, Univ. of Edinburgh, Joachim Buhmann, Univ. of Bonn, Andreas Meier, Simon Bolivar Univ. DEADLINE FOR SUBMISSIONS IS MAY 21, 1994 (POSTMARKED) -please post- From yamauchi at alpha.ces.cwru.edu Fri Feb 11 17:24:43 1994 From: yamauchi at alpha.ces.cwru.edu (Brian Yamauchi) Date: Fri, 11 Feb 94 17:24:43 -0500 Subject: Preprints Available Message-ID: <9402112224.AA03791@yuggoth.CES.CWRU.Edu> The following papers are available via anonymous ftp from yuggoth.ces.cwru.edu: ---------------------------------------------------------------------- Sequential Behavior and Learning in Evolved Dynamical Neural Networks Brian Yamauchi(1) and Randall Beer(1,2) Department of Computer Engineering and Science(1) Department of Biology(2) Case Western Reserve University Cleveland, OH 44106 Case Western Reserve University Technical Report CES-93-25 This paper will be appearing in Adaptive Behavior. Abstract This paper explores the use of a real-valued modular genetic algorithm to evolve continuous-time recurrent neural networks capable of sequential behavior and learning. We evolve networks that can generate a fixed sequence of outputs in response to an external trigger occurring at varying intervals of time. We also evolve networks that can learn to generate one of a set of possible sequences based upon reinforcement from the environment. Finally, we utilize concepts from dynamical systems theory to understand the operation of some of these evolved networks. A novel feature of our approach is that we assume neither an a priori discretization of states or time nor an a priori learning algorithm that explicitly modifies network parameters during learning. Rather, we merely expose dynamical neural networks to tasks that require sequential behavior and learning and allow the genetic algorithm to evolve network dynamics capable of accomplishing these tasks. 
Files: /pub/agents/yamauchi/seqlearn.ps.Z Article Text (73K) /pub/agents/yamauchi/seqlearn-fig.ps.Z Figures (654K) ---------------------------------------------------------------------- Integrating Reactive, Sequential, and Learning Behavior Using Dynamical Neural Networks Brian Yamauchi(1,3) and Randall Beer(1,2) Department of Computer Engineering and Science(1) Department of Biology(2) Case Western Reserve University Cleveland, OH 44106 Navy Center for Applied Research in Artificial Intelligence(3) Naval Research Laboratory Washington, DC 20375-5000 This paper has been submitted to the Third International Conference on Simulation of Adaptive Behavior. Abstract This paper explores the use of dynamical neural networks to control autonomous agents in tasks requiring reactive, sequential, and learning behavior. We use a genetic algorithm to evolve networks that can solve these tasks. These networks provide a mechanism for integrating these different types of behavior in a smooth, continuous manner. We applied this approach to three different task domains: landmark recognition using sonar on a real mobile robot, one-dimensional navigation using a simulated agent, and reinforcement-based sequence learning. For the landmark recognition task, we evolved networks capable of differentiating between two different landmarks based on the spatiotemporal information in a sequence of sonar readings obtained as the robot circled the landmark. For the navigation task, we evolved networks capable of associating the location of a landmark with a corresponding goal location and directing the agent to that goal. For the sequence learning task, we evolved networks that can learn to generate one of a set of possible sequences based upon reinforcement from the environment. A novel feature of the learning aspects of our approach is that we assume neither an a priori discretization of states or time nor an a priori learning algorithm that explicitly modifies network parameters during learning. Instead, we expose dynamical neural networks to tasks that require learning and allow the genetic algorithm to evolve network dynamics capable of accomplishing these tasks. Files: /pub/agents/yamauchi/integ.ps.Z Complete Article (233K) If your printer has problems printing the complete document as a single file, try printing the following two files: /pub/agents/yamauchi/integ-part1.ps.Z Pages 1-8 (77K) /pub/agents/yamauchi/integ-part2.ps.Z Pages 9-11 (147K) ---------------------------------------------------------------------- On the Dynamics of a Continuous Hopfield Neuron with Self-Connection Randall Beer Department of Computer Engineering and Science Department of Biology Case Western Reserve University Cleveland, OH 44106 Case Western Reserve University Technical Report CES-94-1 This paper has been submitted to Neural Computation. Continuous-time recurrent neural networks are being applied to a wide variety of problems. As a first step toward a comprehensive understanding of the dynamics of such networks, this paper studies the dynamical behavior of their basic building block: a continuous Hopfield neuron with self-connection. Specifically, we characterize the equilibria of this model neuron and the dependence of those equilibria on the parameters. We also describe the bifurcations of this model and derive very accurate approximate expressions for its bifurcation set. Finally, we indicate how the basic theory developed in this paper generalizes to a larger class of related model neurons. 
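For readers who want a concrete feel for the building block analyzed in the last abstract, here is a minimal simulation sketch of a continuous neuron with a self-connection. It assumes the standard leaky-integrator form tau*dy/dt = -y + w*sigma(y + theta) + I with a logistic sigma; the parameter values and the function name simulate() are illustrative assumptions, not taken from Beer's report.

# Minimal Euler-integration sketch of a single continuous neuron with a
# self-connection; parameter names and values are illustrative only.
import math

def simulate(w=6.0, theta=-3.0, I=0.0, tau=1.0, y0=0.1, dt=0.01, steps=2000):
    """Integrate tau * dy/dt = -y + w * sigma(y + theta) + I with Euler steps."""
    sigma = lambda x: 1.0 / (1.0 + math.exp(-x))   # logistic activation
    y = y0
    for _ in range(steps):
        dydt = (-y + w * sigma(y + theta) + I) / tau
        y += dt * dydt
    return y                                       # approximate equilibrium

if __name__ == "__main__":
    # With strong self-excitation the neuron is bistable: different initial
    # states settle into different stable equilibria.
    print(simulate(y0=0.1), simulate(y0=5.0))

With these made-up parameters the neuron settles to either a low or a high equilibrium depending on its initial state, which is the kind of parameter-dependent equilibrium structure the report characterizes.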
File: /pub/agents/beer/CTRNNDynamics1.ps.Z Complete Article (233K) ---------------------------------------------------------------------- FTP instructions: To retrieve and print a file (for example: seqlearn.ps), use the following commands: unix> ftp yuggoth.ces.cwru.edu Name: anonymous Password: (your email address) ftp> binary ftp> cd /pub/agents/yamauchi (or cd /pub/agents/beer for CTRNNDynamics1.ps.Z) ftp> get seqlearn.ps.Z ftp> quit unix> uncompress seqlearn.ps.Z unix> lpr seqlearn.ps (ls doesn't currently work properly on our ftp server. This will be fixed soon, but in the meantime, these files can still be copied, even though they don't appear in the directory listing.) _______________________________________________________________________________ Brian Yamauchi Case Western Reserve University yamauchi at alpha.ces.cwru.edu Department of Computer Engineering and Science _______________________________________________________________________________ From isabelle at neural.att.com Fri Feb 11 20:51:16 1994 From: isabelle at neural.att.com (Isabelle Guyon) Date: Fri, 11 Feb 94 20:51:16 EST Subject: robust statistics Message-ID: <9402120151.AA21483@neural> I would like to bring more arguments to Terry's remarks: > One man's outlier is another man's data point. If the data is perfectly clean, outliers are very valuable patterns. From mmoller at daimi.aau.dk Mon Feb 14 02:15:18 1994 From: mmoller at daimi.aau.dk (Martin Fodslette Møller) Date: Mon, 14 Feb 1994 08:15:18 +0100 Subject: Thesis available. Message-ID: <199402140715.AA18638@titan.daimi.aau.dk> /******************* PLEASE DO NOT FORWARD ***********************/ I finally finished up my thesis: Efficient Training of Feed-Forward Neural Networks The thesis has the following contents: Chapter 1. Resume in Danish (should anyone need that (-:) Chapter 2. Notation and basic definitions. Chapter 3. Training Methods: An Overview Chapter 4. Calculation of Hessian Information Chapter 5. Different Error Functions. Appendix A. A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning. Appendix B. Supervised Learning on Large Redundant Training Sets. Appendix C. Exact Calculation of the Product of the Hessian Matrix and a Vector in O(N) time. Appendix D. Adaptive Preconditioning of the Hessian Matrix. Appendix E. Improving Network Solutions. The appendices concern my own work (original contributions), while the chapters provide an overview. The thesis is now available in a limited number of hard copies. People interested in a copy should send an email with their address to me. Best Regards -martin ---------------------------------------------------------------- Martin Moller email: mmoller at daimi.aau.dk Computer Science Dept. Fax: +45 8942 3255 Aarhus University Phone: +45 8942 3371 Ny Munkegade, Build. 540, DK-8000 Aarhus C, Denmark ---------------------------------------------------------------- From edelman at wisdom.weizmann.ac.il Mon Feb 14 02:39:27 1994 From: edelman at wisdom.weizmann.ac.il (Edelman Shimon) Date: Mon, 14 Feb 1994 09:39:27 +0200 Subject: TR available: Representation of similarity in 3D ... Message-ID: <199402140739.JAA00503@eris.wisdom.weizmann.ac.il> FTP-host: eris.wisdom.weizmann.ac.il FTP-filename: /pub/tr-94-02.ps.Z URL: http://eris.wisdom.weizmann.ac.il/ Uncompressed size: 2.6 Mb. Preliminary version; comments welcome. Representation of similarity in 3D object discrimination Shimon Edelman \begin{abstract} How does the brain represent visual objects?
In simple perceptual generalization tasks, the human visual system performs as if it represents the stimuli in a low-dimensional metric psychological space \cite{Shepard87}. In theories of 3D shape recognition, the role of feature-space representations (as opposed to structural \cite{Biederman87} or pictorial \cite{Ullman89} descriptions) has been for a long time a major point of contention. If shapes are indeed represented as points in a feature space, patterns of perceived similarity among different objects must reflect the structure of this space. The feature space hypothesis can then be tested by presenting subjects with complex parameterized 3D shapes, and by relating the similarities among subjective representations, as revealed in the response data by multidimensional scaling \cite{Shepard80}, to the objective parameterization of the stimuli. The results of four such tests, reported below, support the notion that discrimination among 3D objects may rely on a low-dimensional feature space representation, and suggest that this space may be spanned by explicitly encoded class prototypes. \end{abstract} From grumbach at inf.enst.fr Mon Feb 14 03:51:22 1994 From: grumbach at inf.enst.fr (grumbach@inf.enst.fr) Date: Mon, 14 Feb 94 09:51:22 +0100 Subject: papers on time and neural networks Message-ID: <9402140851.AA10372@enst.enst.fr> As guest editors of a special issue of the Sigart Bulletin about: Time and Neural Networks we are looking for 4 articles of about 10 pages each. Sigart is a quarterly publication of the Association for Computing Machinery (ACM) special interest group on Artificial Intelligence. The paper may either deal with approaches to time processing using traditional connectionist architectures, or with more specific models that integrate time into their foundations. If you are interested, and if you can submit a paper (not already published) within a short time frame (about a month and a half), please send a draft (if possible a Word file): - preferably by giving ftp access to it (information via e-mail) - or sending it as "attached file" on e-mail - or posting a paper copy of it. Drafts should be received before April 1. Notification of acceptance will be sent before April 20. grumbach at enst.fr or chaps at enst.fr Alain Grumbach and Cedric Chappelier ENST dept INF 46 rue Barrault 75634 Paris Cedex 13 France From P.Refenes at cs.ucl.ac.uk Mon Feb 14 09:13:12 1994 From: P.Refenes at cs.ucl.ac.uk (P.Refenes@cs.ucl.ac.uk) Date: Mon, 14 Feb 94 14:13:12 +0000 Subject: robust statistics In-Reply-To: Your message of "Thu, 10 Feb 94 09:45:15 PST." <9402101745.AA28545@salk.edu> Message-ID: The term outliers does not mean that they are not part of the joint data probability distribution or that they contain no information for estimating the regression surface; it means rather that outliers are too small a fraction of the observations to be allowed to dominate the small-sample behaviour of the statistics to be calculated. With parametric regression modelling techniques it is easy to quantify this by simply computing the effect that each data point has on the regression surface. This is not a trivial problem in non-parametric modelling but the statistics literature is full of methods to deal with it. Paul Refenes From rsun at cs.ua.edu Mon Feb 14 12:22:20 1994 From: rsun at cs.ua.edu (Ron Sun) Date: Mon, 14 Feb 1994 11:22:20 -0600 Subject: No subject Message-ID: <9402141722.AA28238@athos.cs.ua.edu> A monograph on connectionist models is available from John Wiley and Sons, Inc.
Title: Integrating Rules and Connectionism for Robust Commonsense Reasoning ISBN 0-471-59324-9 Author: Ron Sun Assistant Professor Department of Computer Science The University of Alabama Tuscaloosa, AL 35487 contact John Wiley and Sons, Inc. at 1-800-call-wiley Or John Wiley and Sons, Inc. 605 Third Ave. New York, NY 10158-0012 USA (212) 850-6589 FAX: (212) 850-6088 ------------------------------------------------------------------ A brief description is as follows: One of the outstanding problems for artificial intelligence is the problem of better modeling commonsense reasoning and alleviating the brittleness of traditional symbolic rule-based models. This work tackles this problem by trying to combine rules with connectionist models in an integrated framework. This idea leads to the development of a connectionist architecture with dual representation combining symbolic and subsymbolic (feature-based) processing for evidential robust reasoning: {\sc CONSYDERR}. Reasoning data are analyzed based on the notions of {\it rules} and {\it similarity} and modeled by the architecture, which carries out rule application and similarity matching through interaction of the two levels; formal analyses are performed to understand rule encoding in connectionist models, in order to prove that it handles a superset of Horn clause logic and a nonmonotonic logic; the notion of causality is explored for the purpose of clarifying how the proposed architecture can better capture commonsense reasoning, and it is shown that causal knowledge can be well represented by {\sc CONSYDERR} and utilized in reasoning, which further justifies the design of the architecture; the variable binding problem is addressed, and a solution is proposed within this architecture and is shown to surpass existing ones; several aspects of the architecture are discussed to demonstrate how connectionist models can supplement, enhance, and integrate symbolic rule-based reasoning; large-scale application-oriented systems are prototyped. This architecture utilizes the synergy resulting from the interaction of the two different types of representation and processing, and is therefore capable of handling a large number of difficult issues in one integrated framework, such as partial and inexact information, cumulative evidential combination, lack of exact match, similarity-based inference, inheritance, and representational interactions, all of which are proven to be crucial elements of commonsense reasoning. The results show that connectionism coupled with symbolic processing capabilities can provide effective and efficient models of reasoning for both theoretical and practical purposes. Table of Contents 1 Introduction 1.1 Overview 1.2 Commonsense Reasoning 1.3 The Problem of Common Reasoning Patterns 1.4 What is the Point?
1.5 Some Clarifications 1.6 The Organization of the Book 1.7 Summary 2 Accounting for Commonsense Reasoning: A Framework with Rules and Similarities 2.1 Overview 2.2 Examples of Reasoning 2.3 Patterns of Reasoning 2.4 Brittleness of Rule-Based Reasoning 2.5 Towards a Solution 2.6 Some Reflections on Rules and Connectionism 2.7 Summary 3 A Connectionist Architecture for Commonsense Reasoning 3.1 Overview 3.2 A Generic Architecture 3.3 Fine-Tuning --- from Constraints to Specifications 3.4 Summary 3.5 Appendix 4 Evaluations and Experiments 4.1 Overview 4.2 Accounting for the Reasoning Examples 4.3 Evaluations of the Architecture 4.4 Systematic Experiments 4.5 Choice, Focus and Context 4.6 Reasoning with Geographical Knowledge 4.7 Applications to Other Domains 4.8 Summary 4.9 Appendix: Determining Similarities and CD representations 5 More on the Architecture: Logic and Causality 5.1 Overview 5.2 Causality in General 5.3 Shoham's Causal Theory 5.4 Defining FEL 5.5 Accounting for Commonsense Causal Reasoning 5.6 Determining Weights 5.7 Summary 5.8 Appendix: Proofs For Theorems 6 More on the Architecture: Beyond Logic 6.1 Overview 6.2 Further Analysis of Inheritance 6.3 Analysis of Interaction in Representation 6.4 Knowledge Acquisition, Learning, and Adaptation 6.5 Summary 7 An Extension: Variables and Bindings 7.1 Overview 7.2 The Variable Binding Problem 7.3 First-Order FEL 7.4 Representing Variables 7.5 A Formal Treatment 7.6 Dealing with Difficult Issues 7.7 Compilation 7.8 Correctness 7.9 Summary 7.10 Appendix 8 Reviews and Comparisons 8.1 Overview 8.2 Rule-Based Reasoning 8.3 Case-Based Reasoning 8.4 Connectionism 8.5 Summary 9 Conclusions 9.1 Overview 9.2 Some Accomplishments 9.3 Lessons Learned 9.4 Existing Limitations 9.5 Future Directions 9.6 Summary References From trevor at white.Stanford.EDU Mon Feb 14 17:37:50 1994 From: trevor at white.Stanford.EDU (Trevor Darrell) Date: Mon, 14 Feb 94 14:37:50 PST Subject: outlier, robust statistics In-Reply-To: Terry Sejnowski's message of Thu, 10 Feb 94 09:45:15 PST <9402101745.AA28545@salk.edu> Message-ID: <9402142237.AA24561@white.Stanford.EDU> [terry at salk.edu] One man's outlier is another man's data point. Another way to handle outliers is not to remove them but to model them explicitly. Geoff Hinton has pointed out that character recognition can be made more robust by including models for background noise such as postmarks. Explicitly modeling an occluding or transparently combined "outlier" process is a powerful way to build a robust estimator. As mentioned in other replies to this post, estimators which use a mixture model (either implicitly or explicitly), such as the EM algorithm, are promising methods to implement this type of strategy. One issue which often complicates matters is how to decide how many objects or processes there are in the signal, e.g. determine K in the EM estimator. I would like to ask if anyone has a pointer to work on estimating K in the context of an EM estimator or similar methods? Often the appropriate cardinality of the model is not easily known a priori. Steve Nowlan and I recently used mixtures of expert networks to separate multiple interpenetrating flow fields -- the transparency problem for visual motion. The gating network was used to select regions of the visual field that contained reliable estimates of local velocity for which there was coherent global support. 
There is evidence for such selection neurons in area MT of primate visual cortex, a region of cortex that specializes in the detection of coherent motion. I'd also like to add a pointer to some related work Sandy Pentland, Eero Simoncelli and I have done in this domain developing a strategy for robust estimation ("outlier exclusion") based on minimum description length theory. Our method effectively implements a clustering method to find how many processes there are (e.g. estimate K), and then iteratively refine estimates of the parameters and "support" (segmentation) of those processes. We have developed versions of this method for range and motion segmentation, both for occluded and transparently combined processes. [pluto at cs.ucsd.edu:] >I look forward to reading (Liu 94). Can you (or anyone else) >point me to other references utilizing a similar definition >of "outlier?" (IMHO) "outlier" is quite a value-laden term >that I tend to avoid since I feel it has multiple and >often ambiguous interpretations/definitions. Here are some references to conference papers on our work. A longer journal paper that combines these is in the works, email me if you would like a preprint when it becomes available. Darrell, Sclaroff and Pentland, "Segmentation by Minimal Description", Proc. 3rd Intl. Conf. Computer Vision, Osaka, Japan, 1990 (also avail. as MIT Media Lab Percom TR-163.) Darrell and Pentland, "Robust Estimation of a Multi-Layer Motion Representation", Proc. IEEE Workshop on Visual Motion, Princeton, October 1991 Darrell and Pentland, "Against Edges: Function Approximation with Multiple Support Maps", NIPS 4, 1992 Darrell and Simoncelli, "Separation of Transparent Motion into Layers using Velocity-tuned Mechanisms", Assn. for Resarch in Vision and Opthm. (ARVO) 1993, also available as MIT Media Lab Percom TR-244. (Percom TR's can be anon. ftp'ed from whitechapel.media.mit.edu) --trevor From jagota at next1.msci.memst.edu Mon Feb 14 20:18:56 1994 From: jagota at next1.msci.memst.edu (Arun Jagota) Date: Mon, 14 Feb 1994 19:18:56 -0600 Subject: DIMACS Challenge neural net papers Message-ID: <199402150118.AA02676@next1> Dear Connectionists: Expanded versions of two neural net papers presented at the DIMACS Challenge on Cliques, Coloring, and Satisfiability are now available via anonymous ftp (see below). First an excerpt from the Challenge announcement back in 1993: ---------------------- The purpose of this Challenge is to encourage high quality empirical research on difficult problems. The problems chosen are known to be difficult to solve in theory. How difficult are they to solve in practice? ---------------------- ftp ftp.cs.buffalo.edu (or 128.205.32.9 subject-to-change) Name : anonymous > cd users/jagota > binary > get DIMACS_Grossman.ps.Z > get DIMACS_Jagota.ps.Z > quit > uncompress *.Z Sorry, no hard copies. Copies may be requested by electronic mail to me (jagota at next1.msci.memst.edu) for those without access to ftp or for whom ftp fails. Please use as last resort. Applying The INN Model to the MaxClique Problem Tal Grossman, email: tal at goshawk.lanl.gov Complex Systems Group, T-13, and Center for Non Linear Studies MS B213, Los Alamos National Laboratory Los Alamos, NM 87545 Los Alamos Tech Report: LA-UR-93-3082 A neural network model, the INN (Inverted Neurons Network), is applied to the Maximum Clique problem. First, I describe the INN model and how it implements a given graph instance. 
The model has a threshold parameter $t$, which determines the character of the network stable states. As shown in an earlier work (Grossman-Jagota), the stable states of the network correspond to the $t$-codegree sets of its underlying graph, and, in the case of $t<1$, to its maximal cliques. These results are briefly reviewed. In this work I concentrate on improving the deterministic dynamics called $t$-annealing. The main issue is the initialization procedure and the choice of parameters. Adaptive procedures for choosing the initial state of the network and setting the threshold are presented. The result is the ``Adaptive t-Annealing" algorithm (AtA). This algorithm is tested on many benchmark problems and found to be more efficient than steepest descent or the simple t-annealing procedure. Approximately Solving Maximum Clique using Neural Network and Related Heuristics * Arun Jagota Laura Sanchis Memphis State University Colgate University Ravikanth Ganesan State University of New York at Buffalo We explore neural network and related heuristic methods for the fast approximate solution of the Maximum Clique problem. One of these algorithms, {\em Mean Field Annealing}, is implemented on the Connection Machine CM-5 and a fast annealing schedule is experimentally evaluated on random graphs, as well as on several benchmark graphs. The other algorithms, which perform certain randomized local search operations, are evaluated on the same benchmark graphs, and on {\bf Sanchis} graphs. One of our algorithms adjusts its internal parameters as its computation evolves. On {\bf Sanchis} graphs, it finds significantly larger cliques than the other algorithms do. Another algorithm, GSD$(\emptyset)$, works best overall, but is slower than the others. All our algorithms obtain significantly larger cliques than other simpler heuristics but run slightly slower; they obtain significantly smaller cliques on average than exact algorithms or more sophisticated heuristics but run considerably faster. All our algorithms are simple and inherently parallel. * - 24 pages in length (twice as long as its previous version). Arun Jagota From terry at salk.edu Tue Feb 15 02:56:04 1994 From: terry at salk.edu (Terry Sejnowski) Date: Mon, 14 Feb 94 23:56:04 PST Subject: outlier, robust statistics Message-ID: <9402150756.AA17907@salk.edu> I have received many requests for a reference to the motion model I mentioned recently in the context of robust statistics. An early version can be found in: Nowlan, S. J. and Sejnowski, T. J., Filter selection model for generating visual motion signals, In: C. L. Giles, S. J. Hanson and J. D. Cowan (Eds.) Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufman Publishers, 369-376 (1993). Two longer papers on the computational theory and the biological consequences are in review. Darrell and Pentland have an interesting iterative approach in which multiple hypotheses compete to include motion samples within their regions of support. A relaxation scheme must decide on the number of objects and the correct velocity assignments. Our approach to motion estimation is simpler in that hypotheses do not correspond to objects, but to distinct velocities, and the number of hypotheses is always fixed. This allows the selection of regions of support to be performed non-iteratively. The architecture of the model is feedforward with soft-max within layers, so it is quite fast. Mixtures of experts was used to optimize the weights in the network. 
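As a rough illustration of the soft-max gating computation at the heart of this kind of selection scheme, the sketch below softly assigns image regions to a fixed set of velocity hypotheses. The scores, velocities, and array names are made up for illustration; this is not the actual filter-selection model, only the gating operation it rests on.

# Rough sketch of soft-max gating over a fixed set of velocity hypotheses.
# All numbers below are invented for illustration.
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Five image regions each score how consistent their local evidence is with
# three fixed velocity hypotheses (higher = more consistent).
support = np.array([[2.0, 0.1, 0.0],
                    [1.8, 0.0, 0.3],
                    [0.2, 0.1, 2.2],                 # region carried by a second motion
                    [2.1, 0.2, 0.1],
                    [0.0, 0.3, 2.0]])

gate = softmax(support, axis=1)                      # per-region soft assignment
region_weight = gate.sum(axis=0)                     # total support per hypothesis

# Regions dominated by the second motion feed hypothesis 2 rather than
# corrupting hypothesis 0, so each motion keeps its own region of support.
print(np.round(gate, 2))
print(np.round(region_weight, 2))

In the full model the gating outputs are learned jointly with the velocity-tuned filters; here they are fixed numbers chosen only to show how the soft-max concentrates each region's vote on the hypothesis it supports.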
Terry ----- From schmidhu at informatik.tu-muenchen.de Tue Feb 15 04:06:19 1994 From: schmidhu at informatik.tu-muenchen.de (Juergen Schmidhuber) Date: Tue, 15 Feb 1994 10:06:19 +0100 Subject: postdoctoral thesis Message-ID: <94Feb15.100623met.42337@papa.informatik.tu-muenchen.de> ---------------- postdoctoral thesis ---------------- Juergen Schmidhuber Technische Universitaet Muenchen (submitted April 1993, accepted October 1993) ----------------------------------------------------- NETZWERKARCHITEKTUREN, ZIELFUNKTIONEN UND KETTENREGEL Es gibt relativ neuartige, auf R"uckkopplung basierende k"unstliche neuronale Netze (KNN), deren F"ahigkeiten betr"achtlich "uber simple Musterassoziation hinausge- hen. Diese KNN gestatten im Prinzip die Implementierung beliebiger auf einem herk"ommlichen sequentiell arbei- tenden Digitalrechner berechenbarer Funktionen. Im Ge- gensatz zu herk"ommlichen Rechnern l"a"st sich dabei jedoch die Qualit"at der Ausgaben (formal spezifiziert durch eine sinnvolle Zielfunktion) bez"uglich der ``Software'' (bei KNN die Gewichtsmatrix) mathematisch differenzieren, was die Anwendung der Kettenregel zur Herleitung gradientenbasierter Software"anderungsalgo- rithmen erm"oglicht. Die Arbeit verdeutlicht dies durch formale Herleitung einer Reihe neuartiger Lernalgorith- men aus folgenden Bereichen: (1) "uberwachtes Lernen sequentiellen Ein/Ausgabeverhaltens mit zyklischen und azyklischen Architekturen, (2) ``Reinforcement Lernen'' und Subzielgenerierung ohne informierten Lehrer, (3) un"uberwachtes Lernen zur Redundanzextraktion aus Ein- gaben und Eingabestr"omen. Zahlreiche Experimente zei- gen M"oglichkeiten und Schranken dieser Lernalgorithmen auf. Zum Abschluss wird ein ``selbstreferentielles'' neuronales Netzwerk pr"asentiert, welches theoretisch lernen kann, seinen eigenen Software"anderungsalgorith- mus zu "andern. ----------------------------------------------------- The postdoctoral thesis above is now available (in unrevised form) via ftp. To obtain a copy, follow the instructions at the end of this message. Here is additional information for those who are interested but don't understand German (or are unfamiliar with Germany's academic system): The postdoctoral thesis is part of a process called ``Habilitation'' which is seen as a qualification for tenure. The thesis is about learning algorithms derived by the chain rule. It addresses supervised sequence learning, variants of reinforcement learning, and unsupervised learning (for redundancy reduction). Unlike some previous papers of mine, it contains lots of experiments and lots of figures. Here is a very brief summary based on pointers to recent English publications upon which the thesis elaborates: Chapters 2 and 3 are on supervised sequence learning and extend publications [1] and [4]. Chapter 4 is on variants of learning with a ``distal teacher'' and extends publication [7] (robot experiments in chapter 4 were conducted by Eldracher and Baginski, see e.g. [9]). Chapters 5, 6 and 7 describe unsupervised learning algorithms based on detection of redundant information in input patterns and pattern sequences: Chapter 5 elaborates on publication [5], and chapter 6 extends publication [3]. Chapter 6 includes a result by Peter Dayan, Richard Zemel and A. Pouget (SALK Institute) who demonstrated that equation (4.3) in [3] with $\beta = 0, \alpha = = \gamma =1$ is essentially equivalent to equation (5.1). 
Chapter 6 also includes experiments conducted by Stefanie Lindstaedt who successfully applied the method in [3] to redundant images of letters presented according to the probabilities of English language, see [10]. Chapter 7 extends publications [2] and [8]. Experiments show how sequence processing neural nets using algorithms for redundancy reduction can learn to bridge time lags (between correlated events) of more than 1000 discrete time steps. Other experiments use neural nets for text compression and compare them to standard data compression algorithms. Finally, chapter 8 elaborates on publication [6]. -------------------------- References ------------------------------- [1] J. H. Schmidhuber. A fixed size storage O(n^3) time complexity learning algorithm for fully recurrent continually running networks. Neural Computation, 4(2):243--248, 1992. [2] J. H. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234--242, 1992. [3] J. H. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863--879, 1992. [4] J. H. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131--139, 1992. [5] J. H. Schmidhuber and D. Prelinger. Discovering predictable classifications. Neural Computation, 5(4):625--635, 1993. [6] J. H. Schmidhuber. A self-referential weight matrix. In Proc. of the Int. Conf. on Artificial Neural Networks, Amsterdam, pages 446--451. Springer, 1993. [7] J. H. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd Int. Conf. on Simulation of Adaptive Behavior, pages 196--202. MIT Press, 1992. [8] J. H. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Huening, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87--95. Augustinus, 1993. [9] M. Eldracher and B. Baginski. Neural subgoal generation using backpropagation. In George G. Lendaris, Stephen Grossberg and Bart Kosko, editors, Proc. of WCNN'93, Lawrence Erlbaum Associates, Inc., Hillsdale, pages = III-145--III-148, 1993. [10] S. Lindstaedt. Comparison of unsupervised neural networks for redundancy reduction. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman and A. S. Weigend, editors, Proc. of the 1993 Connectionist Models Summer School, pages 308-315. Hillsdale, NJ: Erlbaum Associates, 1993. ---------------------------------------------------------------------- The thesis comes in three parts. To obtain a copy, do: unix> ftp 131.159.8.35 Name: anonymous Password: (your email address, please) ftp> binary ftp> cd pub/fki ftp> get schmidhuber.habil.1.ps.Z ftp> get schmidhuber.habil.2.ps.Z ftp> get schmidhuber.habil.3.ps.Z ftp> bye unix> uncompress schmidhuber.habil.1.ps.Z unix> lpr schmidhuber.habil.1.ps . . . Note: The layout is designed for conventional European DINA4 format. Expect 145 pages. ---------------------------------------------------------------------- Dr. habil. J. H. 
Schmidhuber, Fakultaet fuer Informatik, Technische Universitaet Muenchen, 80290 Muenchen, Germany schmidhu at informatik.tu-muenchen.de --------- postdoctoral thesis (unrevised) ----------- NETZWERKARCHITEKTUREN, ZIELFUNKTIONEN UND KETTENREGEL Juergen Schmidhuber, TUM From Petri.Myllymaki at cs.Helsinki.FI Tue Feb 15 04:52:42 1994 From: Petri.Myllymaki at cs.Helsinki.FI (Petri Myllymaki) Date: Tue, 15 Feb 1994 11:52:42 +0200 Subject: Thesis in neuroprose Message-ID: <199402150952.LAA01783@keos.Helsinki.FI> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/Thesis/myllymaki.thesis.ps.Z The following report has been placed in the neuroprose archive. ----------------------------------------------------------------------- Bayesian Reasoning by Stochastic Neural Networks Petri Myllymaki Ph.Lic. Thesis Department of Computer Science, University of Helsinki Report C-1993-67, Helsinki, December 1993 78 pages This work has been motivated by problems in several research areas: expert system design, uncertain reasoning, optimization theory, and neural network research. From the expert system design point of view, our goal was to develop a generic expert system shell capable of handling uncertain data. The theoretical framework used here for handling uncertainty is probabilistic reasoning, in particular the theory of Bayesian belief network representations. The probabilistic reasoning task we are interested in is, given a Bayesian network representation of a probability distribution on a set of discrete random variables, to find a globally maximal probability state consistent with given initial constraints. To solve this NP-hard problem approximatively, we use an iterative stochastic method, Gibbs sampling. As this method can be quite inefficient when implemented on a conventional sequential computer, we show how to construct a Gibbs sampling process for a given Bayesian network on a massively parallel architecture, a harmony neural network, which is a special case of the Boltzmann machine architecture. To empirically test the method developed, we implemented a hybrid neural-symbolic expert system shell, NEULA. The symbolic part of the system consists of a high-level conceptual description language and a compiler, which can be used for constructing Bayesian networks and providing them with the corresponding parameters (conditional probabilities). As the number of parameters needed for a given network may generally be quite large, we restrict ourselves to Bayesian networks having a special hierarchical structure. The neural part of the system consists of a neural network simulator which performs massively parallel Gibbs sampling. The performance of the NEULA system was empirically tested by using a small artificial test example. Computing Reviews (1991) Categories and Subject Descriptors: G.3 [Probability and statistics]: Probabilistic algorithms F.1.1 [Models of computation]: Neural networks G.1.6 [Optimization]: Constrained optimization I.2.5 [Programming languages and software]: Expert system tools and techniques General Terms: Algorithms, Theory. 
Additional Key Words and Phrases: Monte Carlo algorithms, Gibbs sampling, simulated annealing, Bayesian belief networks, connectionism, massive parallelism ----------------------------------------------------------------------- To obtain a copy: ftp archive.cis.ohio-state.edu login: anonymous password: cd pub/neuroprose/Thesis binary get myllymaki.thesis.ps.Z quit Then at your system: uncompress myllymaki.thesis.ps.Z lpr myllymaki.thesis.ps ----------------------------------------------------------------------- Petri Myllymaki Petri.Myllymaki at cs.Helsinki.FI Department of Computer Science Int.+358 0 708 4212 (tel.) P.O.Box 26 (Teollisuuskatu 23) Int.+358 0 708 4441 (fax) FIN-00014 University of Helsinki, Finland ----------------------------------------------------------------------- From thrun at uran.cs.bonn.edu Tue Feb 15 08:25:02 1994 From: thrun at uran.cs.bonn.edu (Sebastian Thrun) Date: Tue, 15 Feb 1994 14:25:02 +0100 Subject: 2 papers on robot learning Message-ID: <199402151325.OAA17317@carbon.informatik.uni-bonn.de> This is to announce two recent papers in the connectionists' archive. Both papers deal with robot learning issues. The first paper describes two learning approaches (EBNN with reinforcement learning, COLUMBUS), and the second paper gives some empirical results for learning robot navigation using reinforcement learning and EBNN. Both approaches have been evaluated using real robot hardware. Enjoy reading! Sebastian ------------------------------------------------------------------------ LIFELONG ROBOT LEARNING Sebastian Thrun Tom Mitchell University of Bonn Carnegie Mellon University Learning provides a useful tool for the automatic design of autonomous robots. Recent research on learning robot control has predominantly focussed on learning single tasks that were studied in isolation. If robots encounter a multitude of control learning tasks over their entire lifetime, however, there is an opportunity to transfer knowledge between them. In order to do so, robots may learn the invariants of the individual tasks and environments. This task-independent knowledge can be employed to bias generalization when learning control, which reduces the need for real-world experimentation. We argue that knowledge transfer is essential if robots are to learn control with moderate learning times in complex scenarios. Two approaches to lifelong robot learning which both capture invariant knowledge about the robot and its environments are reviewed. Both approaches have been evaluated using a HERO-2000 mobile robot. Learning tasks included navigation in unknown indoor environments and a simple find-and-fetch task. (Technical Report IAI-TR-93-7, Univ. of Bonn, CS Dept.) ------------------------------------------------------------------------ AN APPROACH TO LEARNING ROBOT NAVIGATION Sebastian Thrun. Univ. of Bonn Designing robots that can learn by themselves to perform complex real-world tasks is still an open challenge for the fields of Robotics and Artificial Intelligence. In this paper we describe an approach to learning indoor robot navigation through trial-and-error. A mobile robot, equipped with visual, ultrasonic and infrared sensors, learns to navigate to a designated target object. In less than 10 minutes operation time, the robot is able to learn to navigate to a marked target object in an office environment. The underlying learning mechanism is the explanation-based neural network (EBNN) learning algorithm. 
EBNN initially learns functions from scratch using neural network representations. With increasing experience, EBNN employs domain knowledge to explain and to analyze training data in order to generalize in a knowledgeable way. (to appear in: Proceedings of the IEEE Conference on Intelligent Robots and Systems 1994) ------------------------------------------------------------------------ Postscript versions of both papers may be retrieved from Jordan Pollack's neuroprose archive by following the instructions below. unix> ftp archive.cis.ohio-state.edu ftp login name> anonymous ftp password> xxx at yyy.zzz ftp> cd pub/neuroprose ftp> bin ftp> get thrun.lifelong-learning.ps.Z ftp> get thrun.learning-robot-navg.ps.Z ftp> bye unix> uncompress thrun.lifelong-learning.ps.Z unix> uncompress thrun.learning-robot-navg.ps.Z unix> lpr thrun.lifelong-learning.ps unix> lpr thrun.learning-robot-navg.ps From chaps at inf.enst.fr Tue Feb 15 09:22:03 1994 From: chaps at inf.enst.fr (Cedric Chappelier) Date: Tue, 15 Feb 94 15:22:03 +0100 Subject: papers on time and neural networks (Correction) Message-ID: <9402151422.AA03059@ulysse.enst.fr.enst.fr> Yesterday we sent the following announcement. We want to make a small correction: the paper may be submitted either as a Word file (as mentioned in the first mail) OR AS A LATEX FILE. > > As guest editors of a special issue of the Sigart Bulletin about: > > Time and Neural Networks > > we are looking for 4 articles of about 10 pages each. > > Sigart is a quarterly publication of the Association for Computing > Machinery (ACM) special interest group on Artificial Intelligence. > > The paper may either deal with approaches to time processing using > traditional connectionist architectures, or with more specific models > that integrate time into their foundations. > > If you are interested, and if you can submit a paper (not already > published) within a short time frame (about a month and a half), please send a > draft (if possible a Word file) : ^^^^^^^^^^^^^^^^^^^^^^^ OR A LATEX FILE > - preferably by giving ftp access to it (information via e-mail) > - or sending it as "attached file" on e-mail > - or posting a paper copy of it. > > Drafts should be received before April 1. > Notification of acceptance will be sent before April 20. > > grumbach at enst.fr or chaps at enst.fr > > Alain Grumbach and Cedric Chappelier > ENST dept INF > 46 rue Barrault > 75634 Paris Cedex 13 > France > > Sorry for the oversight. --- E-mail: chaps at inf.enst.fr || Cedric.Chappelier at enst.fr P-mail: Telecom Paris 46, rue Barrault - 75634 Paris cedex 13 France From COTTRLL at FRMOP22.CNUSC.FR Tue Feb 15 18:42:00 1994 From: COTTRLL at FRMOP22.CNUSC.FR (COTTRELL) Date: Tue, 15 Feb 94 18:42 Subject: Available paper : Kohonen algorithm Message-ID: <"94-02-15-18:42:21.90*COTTRLL"@FRMOP22.CNUSC.FR> The following paper is available from anonymous ftp on archive.cis.ohio-state.edu (128.146.8.52) in directory pub/neuroprose as file cottrell.things.ps "Two or three things that we know about the Kohonen algorithm" 10 pages by Marie Cottrell, Jean-Claude Fort, Gilles Pages SAMOS, Universite Paris 1 90, rue de Tolbiac 75634 PARIS Cedex 13 FRANCE ABSTRACT Many theoretical papers are published about the Kohonen algorithm. It is not easy to understand what exactly is proved, because of the great variety of mathematical methods. Despite all these efforts, many problems remain without solution. In this small review paper, we intend to sum up the situation.
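For readers who have not seen the algorithm in question, the sketch below shows the basic on-line Kohonen update for a one-dimensional string of units with scalar inputs, the setting in which many of the theoretical results are stated. The unit count, learning rate, and neighborhood radius are arbitrary illustrative choices, not values taken from the paper.

# Minimal sketch of the on-line Kohonen (SOM) update, 1-D lattice, scalar inputs.
# Names, learning rate, and neighborhood radius are illustrative assumptions.
import random

def kohonen_1d(n_units=10, steps=5000, eps=0.1, radius=1):
    w = [random.random() for _ in range(n_units)]            # initial weights
    for _ in range(steps):
        x = random.random()                                  # input drawn from U[0,1]
        winner = min(range(n_units), key=lambda i: abs(w[i] - x))
        for i in range(n_units):
            if abs(i - winner) <= radius:                    # lattice neighborhood
                w[i] += eps * (x - w[i])                     # move toward the input
    return w

if __name__ == "__main__":
    # After training, the weights typically end up monotonically ordered along
    # the lattice, the self-organization property studied in this literature.
    print(kohonen_1d())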
To appear in the Proceedings of ESANN 94, Bruxelles To retrieve >ftp archive.cis.ohio-state.edu name : anonymous password: (use your e-mail address) ftp> cd pub/neuroprose ftp> get cottrell.things.ps ftp> quit From platt at synaptics.com Tue Feb 15 20:13:14 1994 From: platt at synaptics.com (John Platt) Date: Tue, 15 Feb 94 17:13:14 PST Subject: Neuroprose paper available Message-ID: <9402160113.AA18442@synaptx.synaptics.com> ****** PAPER AVAILABLE VIA NEUROPROSE *************************************** ****** AVAILABLE VIA FTP ONLY *********************************************** ****** PLEASE DO NOT FORWARD TO OTHER MAILING LISTS OR BOARDS. ************** FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/wolf.address-block.ps.Z The following paper has been placed in the Neuroprose archives at Ohio State. The file is wolf.address-block.ps.Z . Only the electronic version of this paper is available. This paper is 8 pages in length. NOTE: The uncompressed postscript file is approximately 2.7 megabytes in length, so it may take a while to print out. Also, you may have to tell the lpr program to use a symbolic link to copy into the spool directory (lpr -s under SunOS). ----------------------------------------------------------------------------- Postal Address Block Location Using A Convolutional Locator Network Ralph Wolf and John C. Platt Synaptics, Inc. 2698 Orchard Parkway San Jose, CA 95134 ABSTRACT: This paper describes the use of a convolutional neural network to perform address block location on machine-printed mail pieces. Locating the address block is a difficult object recognition problem because there is often a large amount of extraneous printing on a mail piece and because address blocks vary dramatically in size and shape. We used a convolutional locator network with four outputs, each trained to find a different corner of the address block. A simple set of rules was used to generate ABL candidates from the network output. The system performs very well: when allowed five guesses, the network will tightly bound the address delivery information in 98.2% of the cases. ----------------------------------------------------------------------------- John Platt platt at synaptics.com From terry at salk.edu Tue Feb 15 22:44:00 1994 From: terry at salk.edu (Terry Sejnowski) Date: Tue, 15 Feb 94 19:44:00 PST Subject: Telluride Workshops Message-ID: <9402160344.AA25170@salk.edu> CALL FOR PARTICIPATION IN TWO WORKSHOPS ON "NEUROMORPHIC ENGINEERING" JULY 3 - 9, 1994 AND JULY 10 - 16, 1994 TELLURIDE, COLORADO Christof Koch (Caltech) and Terry Sejnowski (Salk Institute/UCSD) invite applications for two different workshops that will be held in Telluride, Colorado in July 1994. Travel and housing expenses will be provided for ten to twenty active researchers for each workshop. Deadline for application is March 10, 1994. GOALS: Carver Mead has introduced the term "Neuromorphic Engineering" for a new field based on the design and fabrication of artificial neural systems, such as vision systems, head-eye systems, and roving robots, whose architecture and design principles are based on those of biological nervous systems. The goal of these workshops is to bring together young investigators and more established researchers from academia with their counterparts in industry and national laboratories, working on both neurobiological as well as engineering aspects of sensory systems and sensory-motor integration. 
The focus of the workshop will be on ``active" participation, with demonstration systems and hands-on-experience for all participants. Neuromorphic engineering has a wide range of applications from nonlinear adaptive control of complex systems to the design of smart sensors. Many of the fundamental principles in this field, such as the use of learning methods and the design of parallel hardware, are inspired by biological systems. However, existing applications are modest and the challenge of scaling up from small artificial neural networks and designing completely autonomous systems at the levels achieved by biological systems lies ahead. The assumption underlying these workshops is that the next generation of neuromorphic systems would benefit from closer attention to the principles found through experimental and theoretical studies of brain systems. WORKSHOPS: NEUROMORPHIC ANALOG VLSI SYSTEMS Sunday, July 3 to Saturday, July 9, 1994 Organized by Rodney Douglas (Oxford), Misha Mahowald (Oxford) and Stephen Lisberger (UCSF). The goal of this week is to bring together biologists and engineers who are interested in exploring neuromorphic systems through the medium of analog VLSI. The workshop will cover methods for the design and fabrication of multi-chip neuromorphic systems. This framework is suitable both for creating analogs of specific biological systems, which can serve as a modeling environment for biologists, and as a tool for engineers to create cooperative circuits based on biological principles. The workshop will provide the community with a common formal language for describing neuromorphic systems. Equipment will be present for participants to evaluate existing neuromorphic chips (including silicon retina, silicon neurons, oculomotor system). SYSTEMS LEVEL MODELS OF VISUAL BEHAVIOR Sunday, July 10 to Saturday, July 16, 1994 Organized by Dana Ballard (Rochester) and Richard Andersen (Caltech). The goal of this week is to bring together biologists and engineers who are interested in systems level modeling of visual behaviors and their interactions with the motor systems. Sessions will cover issues of sensory-motor integration in the mammalian brain. Special emphasis will be placed on understanding neural algorithms used by the brain which can provide insights into constructing electrical circuits which can accomplish similar tasks. Issues to be covered will include spatial localization and constancy, attention, motor planning, eye movements, and the use of visual motion information for motor control. Two or three prominent neuroscientists will be invited to give lectures on the above subjects. These researchers will also be asked to bring their own demonstrations, classroom experiments, and software for computer models. Demonstrations include recording eye movements and simple eye movement psychophysical experiments, neural network models for coordinate transformations and the representation of space, visual attention psychophysical experiments. Participants can conduct their own experiments using the Virtual Reality equipment. FORMAT: Time in both workshops will be divided between planned presentation, free interaction, and contributed material. Each day will consist of a lecture in the morning that covers the theory behind the hands-on investigation in the afternoon. Following each lecture, there will be a demonstration that introduces participants to the equipment that will be available in the afternoon session. 
Participants will be free to explore and play with whatever they choose in the afternoon. Participants are encouraged to bring their own material to share with others. After dinner, time for participants to provide an informal lecture/demonstration is reserved. LOCATION AND ARRANGEMENTS: The two workshops will take place at the "Telluride Summer Research Center," located in the small town of Telluride, 9000 feet high in Southwest Colorado, about 6 hours away from Denver (350 miles) and 4 hours from Aspen. Continental and United Airlines provide many daily flights directly into Telluride. Participants will be housed in shared condominiums, within walking distance of the Center. The workshop is intended to be very informal and hands-on. Participants are not required to have had previous experience in analog VLSI circuit design, computational or machine vision, systems level neurophysiology or modeling the brain at the systems level. However, we strongly encourage active researchers with relevant backgrounds from academia, industry and national laboratories to apply, in particular if they are prepared to talk about their work or to bring demonstrators to Telluride (e.g. robots, chips, software). We expect to be able to pay for shipping necessary equipment to Telluride and will have at least three technical staff present throughout both workshops to assist us with software and hardware problems. We will have a network of SUN workstations running UNIX and connected to the Internet at the Center available to us. All domestic travel and housing expenses will be provided. Participants are expected to pay for food and incidental expenses. HOW TO APPLY: The deadline for receipt of applications is March 10, 1994. Applicants should be at the level of graduate students or above (i.e. post-doctoral fellows, faculty, research and engineering staff and the equivalent positions in industry and national laboratories). We actively encourage qualified women and minority candidates to apply. Each participant can apply for only one workshop and the application should include: 1. Name, address, telephone, e-mail, FAX, and minority status (optional). 2. Resume. 3. One page summary of background and interests relevant to the workshop. 4. Description of special equipment needed for demonstrations. 5. Two letters of recommendation. Complete applications should be sent to: Prof. Terrence Sejnowski The Salk Institute Post Office Box 85800 San Diego, CA 92186-5800 Applicants will be notified by April 15, 1994. From venu at pixel.mipg.upenn.edu Wed Feb 16 17:28:00 1994 From: venu at pixel.mipg.upenn.edu (Venugopal) Date: Wed, 16 Feb 94 17:28:00 EST Subject: Paper available on ftp Message-ID: <9402162228.AA00373@pixel.mipg.upenn.edu> *** PLEASE DO NOT FORWARD TO OTHER GROUPS *** Preprint of the following paper (to appear in Circuits, Systems and Signal Processing) is available on ftp from the neuroprose archive: AN IMPROVED SCHEME FOR THE DIRECT ADAPTIVE CONTROL OF DYNAMICAL SYSTEMS USING BACKPROPAGATION NEURAL NETWORKS K. P. Venugopal, R. Sudhakar and A. S. Pandya Department of Electrical Eng. Department of Computer Science and Eng. Florida Atlantic University Abstract: This paper presents an improved direct control architecture for the on-line learning control of dynamical systems using backpropagation neural networks. The proposed architecture is compared with the other direct control schemes.
In the present scheme, the neural network interconnection strengths are updated based on the output error of the dynamical system directly, rather than using a transformed version of the error employed in other schemes. The ill effects of the controlled dynamics on the on-line updating of the network weights are moderated by including a compensating gain layer. An error feedback is introduced to improve the dynamic response of the control system. Simulation studies are performed using the nonlinear dynamics of an underwater vehicle and the promising results support the effectiveness of the proposed scheme. ----------------------------------------- The file at archive.cis.ohio-state.edu is venugopal.css.ps.Z (34 pages) to ftp the file: unix> ftp archive.cis.ohio-state.edu Name (archive.cis.ohio-state.edu:xxxxx): anonymous Password: your address ftp> cd pub/neuroprose ftp> binary ftp> get venugopal.css.ps.Z uncompress the file after transferring to your machine. unix> uncompress venugopal.css.ps.Z ________________________________________________________________ K. P. Venugopal Medical Image Processing Group University of Pennsylvania 423 Blockley Hall Philadelphia, PA 19104 (venu at pixel.mipg.upenn.edu) From anandan at sarnoff.com Wed Feb 16 09:22:51 1994 From: anandan at sarnoff.com (P. Anandan x3249) Date: Wed, 16 Feb 94 09:22:51 EST Subject: outlier, robust statistics In-Reply-To: <9402150756.AA17907@salk.edu> (message from Terry Sejnowski on Mon, 14 Feb 94 23:56:04 PST) Message-ID: <9402161422.AA13890@peanut.sarnoff.com> Hi Terry, It may be worth mentioning that a simple extension of your "fixed velocity" formulation leads to something quite powerful and is a decent approximation for many real situations. This is to formulate the hypothesis space as 2-D affine transforms of the image plane. Most of the references below have not used robust estimators but have focussed on the layered representation problem. However, recent extensions of all these algorithms at Sarnoff have included several different types of robust estimators as options. One noteworthy omission (simply because I have not yet updated my bib file) is the paper by Black and Jepson, CVPR93. I also did not include the paper by Wang and Adelson at CVPR93, because that can be viewed as falling into either category (affine hypotheses or object hypotheses). In general, when you use a parametric motion model (translation, affine, 8-parameter quadratic for planar surface motion), you have the choice of working with motion-parameters as hypotheses or the objects as hypotheses. But if you are working with non-parametric motion fields (e.g., smooth flow), it is not obvious how to work with motion parameters as hypotheses. Last but not least, I should mention a recent paper that we have written which is under review that goes beyond parametric layers to include residual flow to fully account for the scene motion. This is an alternative approach to the standard formulation of the spatial-coherence assumption as a "smoothness" constraint (e.g., minimum quadratic variation, etc.). This paper also describes a computational framework that identifies the critical choice points for layered motion estimation and shows how different algorithms fit into that framework. I should be in a position to send you a copy of the paper in a couple of weeks or so. -- anandan @article{Irani-Peleg:IJCV, author = {M. Irani and S.
Peleg}, title = {Computing Occluding and Transparent Motions}, journal = IJCV, year = {accepted for publication, 1993}, } @inproceedings{Bergen-etal:AICV91, author = {J.R. Bergen and P.J. Burt and K. Hanna and R. Hingorani and P. Jeanne and S. Peleg}, title = {Dynamic Multiple-Motion Computation}, booktitle = {Artificial Intelligence and Computer Vision: Proceedings of the Israeli Conference}, publisher = {Elsevier}, editor = {Y.A. Feldman and A. Bruckstein}, year = {1991}, pages = {147--156} } @inproceedings{Burt-etal:WVM89, title = {Object tracking with a moving camera, an application of dynamic motion analysis}, author ={P.J. Burt and J.R. Bergen and R. Hingorani and R. Kolczynski and W.A. Lee and A. Leung and J. Lubin and H. Shvaytser}, booktitle = WVM, address = {Irvine, CA}, month = {March}, year = {1989}, pages = {2--12} } @article{Bergen-etal:PAMI92, author = {J.R. Bergen and P.J. Burt and R. Hingorani and S. Peleg}, title = {A Three Frame Algorithm for Estimating Two-Component Image Motion}, journal = PAMI, month = {September}, year = {1992}, volume = {14}, pages = {886--896} } From M.Cooke at dcs.shef.ac.uk Wed Feb 16 09:22:17 1994 From: M.Cooke at dcs.shef.ac.uk (Martin Cooke) Date: Wed, 16 Feb 94 14:22:17 GMT Subject: missing values Message-ID: <9402161427.AA10510@dcs.shef.ac.uk> I've only just seen the discussion on missing values, so forgive this late response. The issue of training the Kohonen self-organising feature map with partial data is covered in Samad & Harp (1992) Self-organisation with partial data Network, 3, 205-212. Essentially, weight changes are restricted to the subspace of available data. Samad & Harp report three experiments using partial training data, and demonstrate that performance is essentially unchanged up to about 60% missing data. This is presumably due to the n -> 2 dimensionality reduction. We recently applied this result to training a speech recogniser on partial data, and got similar results [tech. rep. in preparation]. We're coming at this from the field of auditory scene analysis, where the result of source segregation is an inherently partial description of one or other source. I'd be happy to supply further details on request. Martin Cooke Computer Science Sheffield University UK From mmoller at daimi.aau.dk Wed Feb 16 11:10:00 1994 From: mmoller at daimi.aau.dk (Martin Fodslette M|ller) Date: Wed, 16 Feb 1994 17:10:00 +0100 Subject: copy of thesis. Message-ID: <199402161610.AA28147@titan.daimi.aau.dk> To all that have requested a copy of my thesis (and apologies to those that did not for sending this message). Thank you all for your interest in my thesis. Since so many have requested a copy (about 200), I will not be able to answer you all separately right now. Please accept my apologies. You will all receive a copy of the thesis in a few weeks. Best Regards -martin ---------------------------------------------------------------- Martin Moller email: mmoller at daimi.aau.dk Computer Science Dept. Fax: +45 8942 3255 Aarhus University Phone: +45 8942 3371 Ny Munkegade, Build. 540, DK-8000 Aarhus C, Denmark ---------------------------------------------------------------- From venu at pixel.mipg.upenn.edu Wed Feb 16 17:15:31 1994 From: venu at pixel.mipg.upenn.edu (Venugopal) Date: Wed, 16 Feb 94 17:15:31 EST Subject: Thesis available on ftp Message-ID: <9402162215.AA00370@pixel.mipg.upenn.edu> The following thesis is available on ftp from neuroprose archive: LEARNING IN CONNECTIONIST NETWORKS USING THE ALOPEX ALGORITHM K. P. 
Venugopal Florida Atlantic University Abstract: The ALOPEX algorithm is presented as a `universal' learning algorithm for connectionist models. It is shown that the ALOPEX procedure can be used efficiently as a supervised learning algorithm for such models. The algorithm is demonstrated successfully on a variety of network architectures. Such architectures include multi-layered perceptrons, time-delay models, asymmetric fully recurrent networks and memory neurons. The learning performance as well as the generalization capability of the ALOPEX algorithm are compared with those of the backpropagation procedure on a number of benchmark problems, and it is shown that ALOPEX has specific advantages. Results on the MONKS problems are the best reported ones so far. Two new architectures are proposed for the on-line, direct adaptive control of dynamical systems using neural networks. The proposed schemes are shown to provide better response and tracking characteristics than the other existing direct control schemes. A velocity reference scheme is introduced to improve the dynamic response of on-line learning controllers. The proposed learning algorithm and architectures are also studied on three practical problems: (i) classification of handwritten digits using Fourier descriptors, (ii) recognition of underwater targets from sonar returns, considering temporal dependencies of consecutive returns, and (iii) on-line learning control of autonomous underwater vehicles, starting from random initial conditions. Detailed studies are conducted on the learning control applications. Also, the ability of the neural network controllers to adapt to slowly and suddenly varying parameter disturbances and measurement noise is studied in detail. --------------------- Some of the related papers: K. P. Venugopal, A. S. Pandya and R. Sudhakar, 'A recurrent neural network controller and learning algorithm for the on-line learning control of autonomous underwater vehicles', to appear in Neural Networks (1994) K. P. Venugopal, R. Sudhakar and A. S. Pandya, 'On-line learning control of autonomous underwater vehicles using feedforward neural networks', IEEE Journal of Oceanic Engineering, vol. 17 (1992) K. P. Venugopal, R. Sudhakar and A. S. Pandya, 'An improved scheme for the direct adaptive control of dynamical systems using backpropagation neural networks', to appear in Circuits, Systems and Signal Processing (1994) K. P. Venugopal and S. M. Smith, 'Improving the dynamic response of neural network controllers using velocity reference feedback', IEEE Trans. on Neural Networks, vol. 4 (1993) K. P. Unnikrishnan and K. P. Venugopal, 'Alopex: a correlation based learning algorithm for feedforward and feedback neural networks', to appear in Neural Computation, vol. 6 (1994) A. S. Pandya and K. P. Venugopal, 'A stochastic parallel algorithm for learning in neural networks', to appear in IEICE Transactions on Information Processing (1994) ----------------------------------------- The files at archive.cis.ohio-state.edu are venugopal.thesis1.ps.Z venugopal.thesis2.ps.Z venugopal.thesis3.ps.Z venugopal.thesis4.ps.Z venugopal.thesis5.ps.Z venugopal.thesis6.ps.Z venugopal.thesis7.ps.Z (total 200 pages) to ftp the files: unix> ftp archive.cis.ohio-state.edu Name (archive.cis.ohio-state.edu:xxxxx): anonymous Password: your address ftp> cd pub/neuroprose/Thesis ftp> binary ftp> mget venugopal.thesis* uncompress the files after transferring to your machine.
unix> uncompress venugopal* ------------------------------------------------- K. P. Venugopal Medical Image Processing Group University of Pennsylvania 423 Blockley Hall Philadelphia, PA 19104 (venu at pixel.mipg.upenn.edu) From minton at ptolemy.arc.nasa.gov Wed Feb 16 21:03:21 1994 From: minton at ptolemy.arc.nasa.gov (Steve Minton) Date: Wed, 16 Feb 94 18:03:21 PST Subject: JAIR article Message-ID: <9402170203.AA27856@ptolemy.arc.nasa.gov> Readers of this newsgroup may be interested in the following article, which was recently published in the Journal of Artificial Intelligence Research: Ling, C.X. (1994) "Learning the Past Tense of English Verbs: The Symbolic Pattern Associator vs. Connectionist Models", Volume 1, pages 209-229 Postscript: volume1/ling94a.ps (247K) Online Appendix: volume1/ling-appendix.Z (109K) data file, compressed Abstract: Learning the past tense of English verbs - a seemingly minor aspect of language acquisition - has generated heated debates since 1986, and has become a landmark task for testing the adequacy of cognitive modeling. Several artificial neural networks (ANNs) have been implemented, and a challenge for better symbolic models has been posed. In this paper, we present a general-purpose Symbolic Pattern Associator (SPA) based upon the decision-tree learning algorithm ID3. We conduct extensive head-to-head comparisons on the generalization ability between ANN models and the SPA under different representations. We conclude that the SPA generalizes the past tense of unseen verbs better than ANN models by a wide margin, and we offer insights as to why this should be the case. We also discuss a new default strategy for decision-tree learning algorithms. JAIR's server can be accessed by WWW, FTP, gopher, or automated email. For further information, check out our WWW server (URL is gopher://p.gp.cs.cmu.edu/) or one of our FTP sites (/usr/jair/pub at p.gp.cs.cmu.edu), or send email to jair at cs.cmu.edu with the subject AUTORESPOND and the message body HELP. From COTTRLL at FRMOP22.CNUSC.FR Thu Feb 17 10:04:00 1994 From: COTTRLL at FRMOP22.CNUSC.FR (COTTRELL) Date: Thu, 17 Feb 94 10:04 Subject: Paper available Message-ID: <"94-02-17-10:04:06.72*COTTRLL"@FRMOP22.CNUSC.FR> Dear connectionists, Some people report that they cannot retrieve the paper cottrell.things.ps that I put in the neuroprose archive some days ago. I will try to solve the problem as soon as possible. Please wait a little before trying again. Yours sincerely, Marie Cottrell SAMOS Universite Paris 1 90, rue de Tolbiac F-75634 PARIS 13 FRANCE E-mail : cottrll at frmop22.cnusc.fr From COTTRLL at FRMOP22.CNUSC.FR Thu Feb 17 19:54:00 1994 From: COTTRLL at FRMOP22.CNUSC.FR (COTTRELL) Date: Thu, 17 Feb 94 19:54 Subject: Paper available : Kohonen algorithm Message-ID: <"94-02-17-19:54:08.03*COTTRLL"@FRMOP22.CNUSC.FR> Dear connectionists, The problem that some of you encounter in retrieving the paper "Two or three..." (file cottrell.things.ps in the neuroprose repository) comes from a change in its name: its name is now cottrell.things.ps.Z, in pub/neuroprose at archive.cis.ohio-state.edu. It has been compressed. Sorry for the delay. Yours sincerely, Marie Cottrell From reza at ai.mit.edu Thu Feb 17 09:03:53 1994 From: reza at ai.mit.edu (Reza Shadmehr) Date: Thu, 17 Feb 94 09:03:53 EST Subject: Tech reports from CBCL at MIT Message-ID: <9402171403.AA02835@corpus-callosum> Hello, Following is a list of recent technical reports from the Center for Biological and Computational Learning at M.I.T.
These reports are available via anonymous ftp. (see end of this message for details) -------------------------------- :CBCL Paper #78/AI Memo #1405 :author Amnon Shashua :title On Geometric and Algebraic Aspects of 3D Affine and Projective Structures from Perspective 2D Views :date July 1993 :pages 14 :keywords visual recognition, structure from motion, projective geometry, 3D reconstruction We investigate the differences --- conceptually and algorithmically --- between affine and projective frameworks for the tasks of visual recognition and reconstruction from perspective views. It is shown that an affine invariant exists between any view and a fixed view chosen as a reference view. This implies that for tasks for which a reference view can be chosen, such as in alignment schemes for visual recognition, projective invariants are not really necessary. We then use the affine invariant to derive new algebraic connections between perspective views. It is shown that three perspective views of an object are connected by certain algebraic functions of image coordinates alone (no structure or camera geometry needs to be involved). -------------- :CBCL Paper #79/AI Memo #1390 :author Jose L. Marroquin and Federico Girosi :title Some Extensions of the K-Means Algorithm for Image Segmentation and Pattern Classification :date January 1993 :pages 21 :keywords K-means, clustering, vector quantization, segmentation, classification We present some extensions to the k-means algorithm for vector quantization that permit its efficient use in image segmentation and pattern classification tasks. We show that by introducing a certain set of state variables it is possible to find the representative centers of the lower dimensional manifolds that define the boundaries between classes; this permits one, for example, to find class boundaries directly from sparse data or to efficiently place centers for pattern classification. The same state variables can be used to determine adaptively the optimal number of centers for clouds of data with space-varying density. Some examples of the application of these extensions are also given. -------------- :CBCL Paper #80/AI Memo #1431 :title Example-Based Image Analysis and Synthesis :author David Beymer, Amnon Shashua and Tomaso Poggio :date November, 1993 :pages 21 :keywords computer graphics, networks, computer vision, teleconferencing, image compression, computer interfaces Image analysis and graphics synthesis can be achieved with learning techniques using directly image examples without physically-based, 3D models. In our technique: 1) the mapping from novel images to a vector of ``pose'' and ``expression'' parameters can be learned from a small set of example images using a function approximation technique that we call an analysis network; 2) the inverse mapping from input ``pose'' and ``expression'' parameters to output images can be synthesized from a small set of example images and used to produce new images using a similar synthesis network. The techniques described here have several applications in computer graphics, special effects, interactive multimedia and very low bandwidth teleconferencing. -------------- :CBCL Paper #81/AI Memo #1432 :title Conditions for Viewpoint Dependent Face Recognition :author Philippe G. Schyns and Heinrich H. 
B\"ulthoff :date August 1993 :pages 6 :keywords face recognition, RBF Network, Symmetry Face recognition stands out as a singular case of object recognition: although most faces are very much alike, people discriminate between many different faces with outstanding efficiency. Even though little is known about the mechanisms of face recognition, viewpoint dependence, a recurrent characteristic of much research on faces, could inform algorithms and representations. Poggio and Vetter's symmetry argument predicts that learning only one view of a face may be sufficient for recognition, if this view allows the computation of a symmetric, "virtual," view. More specifically, as faces are roughly bilaterally symmetric objects, learning a side-view---which always has a symmetric view---should give rise to better generalization performances than learning the frontal view. It is also predicted that among all new views, a virtual view should be best recognized. We ran two psychophysical experiments to test these predictions. Stimuli were views of 3D models of laser-scanned faces. Only shape was available for recognition; all other face cues---texture, color, hair, etc.---were removed from the stimuli. The first experiment tested which single views of a face give rise to the best generalization performances. The results were compatible with the symmetry argument: face recognition from a single view is always better when the learned view allows the computation of a symmetric view. -------------- :CBCL Paper #82/AI Memo #1437 :author Reza Shadmehr and Ferdinando A. Mussa-Ivaldi :title Geometric Structure of the Adaptive Controller of the Human Arm :date July 1993 :pages 34 :keywords Motor learning, reaching movements, internal models, force fields, virtual environments, generalization, motor control The objects with which the hand interacts may significantly change the dynamics of the arm. How does the brain adapt control of arm movements to this new dynamics? We show that adaptation is via composition of a model of the task's dynamics. By exploring generalization capabilities of this adaptation we infer some of the properties of the computational elements with which the brain formed this model: the elements have broad receptive fields and encode the learned dynamics as a map structured in an intrinsic coordinate system closely related to the geometry of the skeletomusculature. The low-level nature of these elements suggests that they may represent a set of primitives with which movements are represented in the CNS. -------------- :CBCL Paper #83/AI Memo #1440 :author Michael I. Jordan and Robert A. Jacobs :title Hierarchical Mixtures of Experts and the EM Algorithm :date August 1993 :pages 29 :keywords supervised learning, statistics, decision trees, neural networks We present a tree-structured architecture for supervised learning. The statistical model underlying the architecture is a hierarchical mixture model in which both the mixture coefficients and the mixture components are generalized linear models (GLIM's). Learning is treated as a maximum likelihood problem; in particular, we present an Expectation-Maximization (EM) algorithm for adjusting the parameters of the architecture. We also develop an on-line learning algorithm in which the parameters are updated incrementally. Comparative simulation results are presented in the robot dynamics domain.
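As a concrete picture of the E- and M-steps mentioned in the abstract above, the following minimal Python/numpy sketch performs one EM pass for a flat, single-level mixture of linear experts with a softmax gate. It is not the authors' code: the hierarchical architecture in the memo nests the same two steps over a tree of gates and fits the gate by IRLS, and the names here (em_step, W_experts, V_gate, noise_var) are invented for this sketch.

import numpy as np

def em_step(X, y, W_experts, V_gate, noise_var=1.0):
    # X: (N, d) inputs; y: (N,) targets;
    # W_experts: (K, d) linear expert weights; V_gate: (K, d) softmax gate weights.
    # E-step: posterior responsibility h[i, k] of expert k for example i.
    g = np.exp(X @ V_gate.T)
    g /= g.sum(axis=1, keepdims=True)          # softmax gating probabilities
    mu = X @ W_experts.T                       # each expert's prediction
    lik = np.exp(-0.5 * (y[:, None] - mu) ** 2 / noise_var)
    h = g * lik
    h /= h.sum(axis=1, keepdims=True)
    # M-step: each expert solves a responsibility-weighted least-squares problem.
    new_W = np.empty_like(W_experts)
    for k in range(W_experts.shape[0]):
        Xw = X * h[:, k:k+1]                   # rows of X weighted by h[:, k]
        new_W[k] = np.linalg.lstsq(Xw.T @ X, Xw.T @ y, rcond=None)[0]
    # Gate update: a single gradient step toward the responsibilities
    # (a cheaper stand-in for the IRLS fit of the GLIM gate).
    new_V = V_gate + 0.1 * (h - g).T @ X / X.shape[0]
    return new_W, new_V

Iterating em_step on a toy piecewise-linear regression problem should show the responsibilities splitting the input space between the experts, which is the division of labor the abstract describes.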
-------------- :CBCL Paper #84/AI Memo #1441 :title On the Convergence of Stochastic Iterative Dynamic Programming Algorithms :author Tommi Jaakkola, Michael I. Jordan and Satinder P. Singh :date August 1993 :pages 15 :keywords reinforcement learning, stochastic approximation, convergence, dynamic programming Recent developments in the area of reinforcement learning have yielded a number of new algorithms for the prediction and control of Markovian environments. These algorithms, including the TD(lambda) algorithm of Sutton (1988) and the Q-learning algorithm of Watkins (1989), can be motivated heuristically as approximations to dynamic programming (DP). In this paper we provide a rigorous proof of convergence of these DP-based learning algorithms by relating them to the powerful techniques of stochastic approximation theory via a new convergence theorem. The theorem establishes a general class of convergent algorithms to which both TD(lambda) and Q-learning belong. -------------- :CBCL Paper #86/AI Memo #1449 :title Formalizing Triggers: A Learning Model for Finite Spaces :author Partha Niyogi and Robert Berwick :pages 14 :keywords language learning, parameter systems, Markov chains, convergence times, computational learning theory :date November 1993 In a recent seminal paper, Gibson and Wexler (1993) take important steps toward formalizing the notion of language learning in a (finite) space whose grammars are characterized by a finite number of {\it parameters\/}. They introduce the Triggering Learning Algorithm (TLA) and show that even in finite space convergence may be a problem due to local maxima. In this paper we explicitly formalize learning in finite parameter space as a Markov structure whose states are parameter settings. We show that this captures the dynamics of TLA completely and allows us to explicitly compute the rates of convergence for TLA and other variants of TLA, e.g. random walk. Also included in the paper are a corrected version of GW's central convergence proof, a list of ``problem states'' in addition to local maxima, and batch and PAC-style learning bounds for the model. -------------- :CBCL Paper #87/AI Memo #1458 :title Convergence Results for the EM Approach to Mixtures of Experts Architectures :author Michael Jordan and Lei Xu :pages 33 :date September 1993 The Expectation-Maximization (EM) algorithm is an iterative approach to maximum likelihood parameter estimation. Jordan and Jacobs (1993) recently proposed an EM algorithm for the mixture of experts architecture of Jacobs, Jordan, Nowlan and Hinton (1991) and the hierarchical mixture of experts architecture of Jordan and Jacobs (1992). They showed empirically that the EM algorithm for these architectures yields significantly faster convergence than gradient ascent. In the current paper we provide a theoretical analysis of this algorithm. We show that the algorithm can be regarded as a variable metric algorithm with its searching direction having a positive projection on the gradient of the log likelihood. We also analyze the convergence of the algorithm and provide an explicit expression for the convergence rate. In addition, we describe an acceleration technique that yields a significant speedup in simulation experiments. -------------- :CBCL Paper #89/AI Memo #1461 :title Face Recognition under Varying Pose :author David J.
Beymer :pages 14 :date December 1993 :keywords computer vision, face recognition, facial feature detection, template matching While researchers in computer vision and pattern recognition have worked on automatic techniques for recognizing faces for the last 20 years, most systems specialize in frontal views of the face. We present a face recognizer that works under varying pose, the difficult part of which is to handle face rotations in depth. Building on successful template-based systems, our basic approach is to represent faces with templates from multiple model views that cover different poses from the viewing sphere. Our system has achieved a recognition rate of 98% on a data base of 62 people containing 10 testing and 15 modelling views per person. -------------- :CBCL Paper #90/AI Memo #1452 :title Algebraic Functions for Recognition :author Amnon Shashua :pages 11 :date January 1994 In the general case, a trilinear relationship between three perspective views is shown to exist. The trilinearity result is shown to be of much practical use in visual recognition by alignment --- yielding a direct method that cuts through the computations of camera transformation, scene structure and epipolar geometry. The proof of the central result may be of further interest as it demonstrates certain regularities across homographies of the plane and introduces new view invariants. Experiments on simulated and real image data were conducted, including a comparative analysis with epipolar intersection and the linear combination methods, with results indicating a greater degree of robustness in practice and a higher level of performance in re-projection tasks. ============================ How to get a copy of a report: The files are in compressed postscript format and are named by their AI memo number. They are put in a directory named after the year in which the paper was written. Here is the procedure for ftp-ing: unix> ftp publications.ai.mit.edu (128.52.32.22, log-in as anonymous) ftp> cd ai-publications/1993 ftp> binary ftp> get AIM-number.ps.Z ftp> quit unix> zcat AIM-number.ps.Z | lpr Best wishes, Reza Shadmehr Center for Biological and Computational Learning M. I. T. Cambridge, MA 02139 From mel at klab.caltech.edu Thu Feb 17 21:00:32 1994 From: mel at klab.caltech.edu (Bartlett Mel) Date: Thu, 17 Feb 94 18:00:32 PST Subject: NIPS*94 Call for Workshops Message-ID: <9402180200.AA20549@plato.klab.caltech.edu> CALL FOR PROPOSALS NIPS*94 Post-Conference Workshops December 2 and 3, 1994 Vail, Colorado Following the regular program of the Neural Information Processing Systems 1994 conference, workshops on current topics in neural information processing will be held on December 2 and 3, 1994, in Vail, Colorado. Proposals by qualified individuals interested in chairing one of these workshops are solicited. Past topics have included: active learning and control, architectural issues, attention, Bayesian analysis, benchmarking neural network applications, computational complexity issues, computational neuroscience, fast training techniques, genetic algorithms, music, neural network dynamics, optimization, recurrent nets, rules and connectionist models, self-organization, sensory biophysics, speech, time series prediction, vision and audition, implementations, and grammars. The goal of the workshops is to provide an informal forum for researchers to discuss important issues of current interest.
Sessions will meet in the morning and in the afternoon of both days, with free time in between for ongoing individual exchange or outdoor activities. Concrete open and/or controversial issues are encouraged and preferred as workshop topics. Representation of alternative viewpoints and panel-style discussions are particularly encouraged. Individuals proposing to chair a workshop will have responsibilities including: 1) arranging short informal presentations by experts working on the topic, 2) moderating or leading the discussion and reporting its high points, findings, and conclusions to the group during evening plenary sessions (the ``gong show''), and 3) writing a brief summary. Submission Procedure: Interested parties should submit a short proposal for a workshop of interest postmarked by May 21, 1994. (Express mail is not necessary. Submissions by electronic mail will also be accepted.) Proposals should include a title, a description of what the workshop is to address and accomplish, the proposed length of the workshop (one day or two days), and the planned format. It should motivate why the topic is of interest or controversial, why it should be discussed and what the targeted group of participants is. In addition, please send a brief resume of the prospective workshop chair, a list of publications and evidence of scholarship in the field of interest. Mail submissions to: Todd K. Leen, NIPS*94 Workshops Chair Department of Computer Science and Engineering Oregon Graduate Institute of Science and Technology P.O. Box 91000 Portland Oregon 97291-1000 USA (e-mail: tleen at cse.ogi.edu) Name, mailing address, phone number, fax number, and e-mail net address should be on all submissions. PROPOSALS MUST BE POSTMARKED BY MAY 21, 1994 Please Post From scheler at informatik.tu-muenchen.de Fri Feb 18 11:10:21 1994 From: scheler at informatik.tu-muenchen.de (Gabriele Scheler) Date: Fri, 18 Feb 1994 17:10:21 +0100 Subject: TR announcement: Adaptive Distance Measures Message-ID: <94Feb18.171027met.42273@papa.informatik.tu-muenchen.de> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/scheler.adaptive.ps.Z The file scheler.adaptive.ps.Z is now available for copying from the Neuroprose repository: Pattern Classification with Adaptive Distance Measures Gabriele Scheler Technische Universit"at M"unchen (25 pages) also available as Report FKI-188-94 from Institut f"ur Informatik TU M"unchen D 80290 M"unchen ftp-host: flop.informatik.tu-muenchen.de ftp-file: pub/fki/fki-188-94.ps.gz ABSTRACT: In this paper, we want to explore the notion of learning the classification of patterns from examples by synthesizing distance functions. A working implementation of a distance classifier is presented. Its operation is illustrated with the problem of classification according to parity (highly non-linear) and a classification of feature vectors which involves dimension reduction (a linear problem). A solution to these problems is sought in two steps: (a) a parametrized distance function (called a `distance function scheme') is chosen, (b) setting parameters to values according to the classification of training patterns results in a specific distance function. This induces a classification on all remaining patterns. The general idea of this approach is to find restricted functional shapes in order to model certain cognitive functions of classification exactly, i.e. 
performing classifications that occur as well as excluding classifications that do not naturally occur and may even be experimentally proven to be excluded from learnability by a living organism. There are also certain technical advantages in using restricted function shapes and simple learning rules, such as reducing learning time, generating training sets and individual patterns to set certain parameters, determining the learnability of a specific problem with a given function scheme or providing additions to functions for individual exceptions, while retaining the general shape for generalization. From soller at asylum.cs.utah.edu Fri Feb 18 19:13:34 1994 From: soller at asylum.cs.utah.edu (Jerome Soller) Date: Fri, 18 Feb 94 17:13:34 -0700 Subject: 2nd An. Utah Workshop on the Applicat. of Intelligent and Adap. Systems Message-ID: <9402190013.AA09689@asylum.cs.utah.edu> ------------------------------------------------ 2nd Annual Utah Workshop on: "Applications of Intelligent and Adaptive Systems" Sponsored by: The University of Utah Cognitive Science Industrial Advisory Board and The Joint Services Software Technology Conference '94 -------------------------------------------------- Date: April 15, 1994 Time: 8:00 a.m.-2:30 p.m. Cost: contact Jerome Soller or Dale Sanders for the cost for non-conference attendees, free for conference attendees Location: Salt Lake City Marriott, Salon E, 75 South and West Temple -------------------------------------------------- Talk 1: "The Use of Genetic Algorithms and Neural Networks in the Automatic Interpretation of Medical Images", Dr. Charles Rosenberg Research Investigator, VA Geriatric, Research, Education, and Clinical Center and Adjunct Assistant Professor, Department of Psychology, University of Utah (crr at cogsci.psych.utah.edu) ((801) 582-1565, x-2458) -------------------------------------------------- Talk 2: "A Hybrid On-line Handwriting Recognition System" Dr. Nicholas S. Flann. Assistant Professor, Computer Science Department, Utah State University. (flann at nick.cs.usu.edu) ((801) 750-2451) -------------------------------------------------- Talk 3: "Prototyping Activities in Robotics, Control, and Manufacturing" Dr. Tarek M. Sobh Research Assistant Professor Computer Science Department University of Utah (sobh at wingate.cs.utah.edu) ((801) 585-5047) -------------------------------------------------- Talk 4: "Software Architecture and Unmanned Ground Vehicles" Dr. David Morgenthaler Program Manager Sarcos Research Corporation Salt Lake City, UT (David_Morgenthaler at ced.utah.edu) ((801) 581-0155) -------------------------------------------------- Lunch Break: 11:45 a.m.-12:45 p.m. -------------------------------------------------- Talk 5: "Use of Decision Support in a Hospital Information System" Dr. Allan Pryor Professor of Medical Informatics University of Utah and Assistant Vice President of Informatics Intermountain Health Care Salt Lake City UT (tapryor at cc.utah.edu) ((801) 321-2128) -------------------------------------------------- Talk 6: "Applications of Neural Networks in Critical Care Monitoring" Dr. 
Joe Orr Research Instructor Department of Anesthesiology University of Utah (jorr at soma.med.utah.edu) ((801) 581-6393) -------------------------------------------------- Pre-registration required; For registration, copies of the abstracts, or references for publications relating to these talks, please contact: Jerome Soller, Veterans Affairs Medical Center and University of Utah Computer Science (801) 582-1565, ext 2469; (801) 581-7977 soller at cs.utah.edu or Dale Sanders, TRW Inc., Ogden Engineering Services (801) 625-8343 dale_sanders at oz.bmd.trw.com -------------------------------------------------- We wish to thank the following for their support of this workshop: Applied Information and Management Systems, Inc.; Intermountain Health Care; The Joint Services Software Technology Conference; Salt Lake Veterans Affairs Geriatric Research, Education, and Clinical Center; Sarcos Corporation; 3M Health Information Systems; TRW Systems Integration Group; University of Utah Departments of Computer Science, Medical Informatics, and Physiology; Utah Information Technology Association From judd at scr.siemens.com Fri Feb 18 21:31:24 1994 From: judd at scr.siemens.com (Stephen Judd) Date: Fri, 18 Feb 1994 21:31:24 -0500 Subject: Optimal Stopping Time paper Message-ID: <199402190231.VAA27524@tern.siemens.com> ***Do not forward to other bboards*** FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/wang.optistop.ps.Z The file wang.optistop.ps.Z is now available for copying from the Neuroprose repository: Optimal Stopping and Effective Machine Complexity in Learning Changfeng Wang U.Penn Santosh S. Venkatesh U.Penn J. Stephen Judd Siemens Abstract: We study the problem of when to stop training a class of feedforward networks -- networks with fixed input weights, one hidden layer, and a linear output -- when they are trained with a gradient descent algorithm on a finite number of examples. Under general regularity conditions, it is shown analytically that there are, in general, three distinct phases in the generalization performance in the learning process. In particular, the network has better generalization performance when learning is stopped at a certain time before the global minimum of the empirical error is reached. A notion of "effective size" of a machine is defined and used to explain the trade-off between the complexity of the machine and the training error in the learning process. The study leads naturally to a network size selection criterion, which turns out to be a generalization of Akaike's Information Criterion for the learning process. It is shown that stopping learning before the global minimum of the empirical error has the effect of network size selection. (8 pages) To appear in NIPS-6- (1993) sj Stephen Judd Siemens Corporate Research, (609) 734-6573 755 College Rd. East, fax (609) 734-6565 Princeton, judd at learning.scr.siemens.com NJ usa 08540 From mjolsness-eric at CS.YALE.EDU Mon Feb 21 10:58:26 1994 From: mjolsness-eric at CS.YALE.EDU (Eric Mjolsness) Date: Mon, 21 Feb 94 10:58:26 EST Subject: clustering & matching papers Message-ID: <199402211558.AA05604@NEBULA.SYSTEMSZ.CS.YALE.EDU> ****** PLEASE DO NOT FORWARD TO OTHER MAILING LISTS OR BOARDS. 
************** ****** PAPER AVAILABLE VIA NEUROPROSE *************************************** FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/gold.object-clustering.ps.Z FTP-filename: /pub/neuroprose/lu.object-matching.ps.Z The following two NIPS papers have been placed in the Neuroprose archive at Ohio State. The files are "gold.object-clustering.ps.Z" and "lu.object-matching.ps.Z". Each is 8 pages in length. The uncompressed postscript file for the second paper, "lu.object-matching.ps.Z", contains images and is 4.3 megabytes long. So you may need to use a symbolic link in printing it: "lpr -s" under SunOS. ----------------------------------------------------------------------------- Clustering with a Domain-Specific Distance Measure Stephen Gold, Eric Mjolsness and Anand Rangarajan Yale Computer Science Department With a point matching distance measure which is invariant under translation, rotation and permutation, we learn 2-D point-set objects, by clustering noisy point-set images. Unlike traditional clustering methods which use distance measures that operate on feature vectors - a representation common to most problem domains - this object-based clustering technique employs a distance measure specific to a type of object within a problem domain. Formulating the clustering problem as two nested objective functions, we derive optimization dynamics similar to the Expectation-Maximization algorithm used in mixture models. ----------------------------------------------------------------------------- Two-Dimensional Object Localization by Coarse-to-Fine Correlation Matching Chien-Ping Lu and Eric Mjolsness Yale Computer Science Department We present a Mean Field Theory method for locating two-dimensional objects that have undergone rigid transformations. The resulting algorithm is a coarse-to-fine correlation matching. We first consider problems of matching synthetic point data, and derive a point matching objective function. A tractable line segment matching objective function is derived by considering each line segment as a dense collection of points, and approximating it by a sum of Gaussians. The algorithm is tested on real images from which line segments are extracted and matched. ----------------------------------------------------------------------------- - Eric Mjolsness mjolsness at cs.yale.edu ------- From pkso at castle.ed.ac.uk Tue Feb 22 13:54:42 1994 From: pkso at castle.ed.ac.uk (P Sollich) Date: Tue, 22 Feb 94 18:54:42 GMT Subject: Preprint on query learning in Neuroprose archive Message-ID: <9402221854.aa28409@uk.ac.ed.castle> FTP-host: archive.cis.ohio-state.edu FTP-filename: /pub/neuroprose/sollich.queries.ps.Z The file sollich.queries.ps.Z (16 pages) is now available via anonymous ftp from the Neuroprose archive. Title and abstract are given below. We regret that hardcopies are not available. --------------------------------------------------------------------------- Query Construction, Entropy and Generalization in Neural Network Models Peter Sollich Department of Physics, University of Edinburgh, Kings Buildings, Mayfield Road, Edinburgh EH9 3JZ, U.K. (To appear in Physical Review E) Abstract We study query construction algorithms, which aim at improving the generalization ability of systems that learn from examples by choosing optimal, non-redundant training sets. 
We set up a general probabilistic framework for deriving such algorithms from the requirement of optimizing a suitable objective function; specifically, we consider the objective functions entropy (or information gain) and generalization error. For two learning scenarios, the high-low game and the linear perceptron, we evaluate the generalization performance obtained by applying the corresponding query construction algorithms and compare it to training on random examples. We find qualitative differences between the two scenarios due to the different structure of the underlying rules (nonlinear and `non-invertible' vs. linear); in particular, for the linear perceptron, random examples lead to the same generalization ability as a sequence of queries in the limit of an infinite number of examples. We also investigate learning algorithms which are ill-matched to the learning environment and find that in this case, minimum entropy queries can in fact yield a lower generalization ability than random examples. Finally, we study the efficiency of single queries and its dependence on the learning history, i.e. on whether the previous training examples were generated randomly or by querying, and the difference between globally and locally optimal query construction. --------------------------------------------------------------------------- Peter Sollich Dept. of Physics University of Edinburgh e-mail: P.Sollich at ed.ac.uk Kings Buildings Tel. +44-31-650 5236 Mayfield Road Edinburgh EH9 3JZ, U.K. --------------------------------------------------------------------------- From B344DSL at UTARLG.UTA.EDU Tue Feb 22 22:18:10 1994 From: B344DSL at UTARLG.UTA.EDU (B344DSL@UTARLG.UTA.EDU) Date: Tue, 22 Feb 1994 21:18:10 -0600 (CST) Subject: Conference announcement Message-ID: <01H9786W7CBM0004O8@UTARLG.UTA.EDU> ANNOUNCEMENT AND CALL FOR ABSTRACTS Conference on Oscillations in Neural Systems, Sponsored by the Metroplex Institute for Neural Dynamics (MIND) and the University of Texas at Arlington. To be held Thursday through Saturday, MAY 5-7, 1994 Location: UNIVERSITY OF TEXAS AT ARLINGTON MAIN LIBRARY, 6TH FLOOR PARLOR Official Conference Motel: Park Inn 703 Benge Drive Arlington, TX 76013 1-800-777-0100 or 817-860-2323 A block of rooms has been reserved at the Park Inn for $35 a night (single or double). Room sharing arrangements are possible. Reservations should be made directly through the motel. Official Conference Travel Agent: Airline reservations to Dallas-Fort Worth airport should be made through Dan Dipert travel in Arlington, 1-800-443-5335. For those who wish to fly on American Airlines, a Star File account has been set up for a 5% discount off lowest available fares (two-week advance, staying over Saturday night) or 10% off regular coach fare; arrangements for Star File reservations should be made through Dan Dipert. Please let the conference organizers know (by e-mail or telephone) when you plan to arrive: some people can be met at the airport (about 30 minutes from Arlington), others can call Super Shuttle at 817-329-2000 upon arrival for transportation to the Park Inn (about $14-$16 per person). Registration for the conference is $25 for students, $65 for non-student oral or poster presenters, $85 for others. MIND members will have $20 (or $10 for students) deducted from the registration. A registration form is attached to this announcement. Registrants will receive the MIND monthly newsletter (on e-mail when possible) for the remainder of 1994.
Invited speakers: Bill Baird (University of California, Berkeley) Adi Bulsara (Naval Research Laboratories, San Diego) Alianna Maren (Accurate Automation Corporation) George Mpitsos (Oregon State University) Martin Stemmler (California Institute of Technology) Roger Traub (IBM, Tarrytown, New York) Robert Wong (Downstate Medical Center, Brooklyn) Geoffrey Yuen (Northwestern University) Those interested in presenting are invited to submit abstracts (1-2 paragraphs) of any work related to the theme of the conference any time between now and March 15, 1994. The topic of neural oscillation is currently of great interest to psychologists and neuroscientists alike. Recently it has been observed that neurons in separate areas of the brain will oscillate in synchrony in response to certain stimuli. One hypothesized function for such synchronized oscillations is to solve the "binding problem," that is, how is it that disparate features of objects (e.g., a person's face and their voice) are tied together into a single unitary whole? Some bold speculators (such as Francis Crick in his recent book, The Astonishing Hypothesis) even argue that synchronized neural oscillations form the basis for consciousness. Talks will be 1 hour for invited speakers and 45 minutes for contributed speakers, including questions. There will be no parallel sessions. Contributors whose work is considered worthy of presentation but who cannot be fit into the schedule will be invited to present posters. Presenters will not be required to write complete papers. After the conference is over, we will attempt to obtain a contract with a publisher for a book based on the conference. Oral and poster presenters will be invited to submit chapters to this book, although it is not a precondition for being a speaker. Two books based on previous MIND conferences (Motivation, Emotion, and Goal Direction in Neural Networks and Neural Networks for Knowledge Representation and Inference) have been published by Lawrence Erlbaum Associates, and a book based on our last conference (Optimality in Biological and Artificial Networks?) is now in progress, under contract with Erlbaum as part of their joint series with INNS. Abstracts should be submitted, by e-mail, snail mail, or fax, to: Professor Daniel S. Levine Department of Mathematics, University of Texas at Arlington 411 S. Nedderman Drive Arlington, TX 76019-0408 Office telephone: 817-273-3598, fax: 817-794-5802 e-mail: b344dsl at utarlg.uta.edu Further inquiries about the conference can be addressed to Professor Levine or to the other two conference organizers: Professor Vincent Brown Mr. Timothy Shirey 817-273-3247 214-495-3500 or 214-422-4570 b096vrb at utarlg.uta.edu 73353.3524 at compuserve.com Please distribute this announcement to anyone you think may be interested in the conference.
REGISTRATION FOR MIND/INNS CONFERENCE ON OSCILLATIONS IN NEURAL SYSTEMS, UNIVERSITY OF TEXAS AT ARLINGTON, MAY 5-7, 1994 Name ______________________________________________________________ Address ___________________________________________________________ ___________________________________________________________ ___________________________________________________________ ____________________________________________________________ E-Mail __________________________________________________________ Telephone _________________________________________________________ Registration fee enclosed: _____ $15 Student, member of MIND _____ $25 Student _____ $65 Non-student oral or poster presenter _____ $65 Non-student member of MIND _____ $85 All others Will you be staying at the Park Inn? ____ Yes ____ No Are you planning to share a room with someone you know? ____ Yes ____ No If so, please list that person's name __________________________ If not, would you be interested in sharing a room with another conference attendee to be assigned? ____ Yes ____ No PLEASE REMEMBER TO CALL THE PARK INN DIRECTLY FOR YOUR RESERVATION (WHETHER SINGLE OR DOUBLE) AT 1-800-777-0100 OR 817-860-2323. From fellous at selforg.usc.edu Tue Feb 22 23:31:06 1994 From: fellous at selforg.usc.edu (Jean-Marc Fellous) Date: Tue, 22 Feb 94 20:31:06 PST Subject: Research Associate Message-ID: <9402230431.AA00747@selforg.usc.edu> Could you please post this announcement? Thanks, Jean-Marc >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>><<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< TENNESSEE STATE UNIVERSITY CENTER FOR NEURAL ENGINEERING RESEARCH ASSOCIATE Applications are invited for a research associate position with a unique consortium involving a medical school, an engineering college, Oak Ridge National Laboratory and a private high-tech company. A Ph.D. in Biomedical/Electrical Engineering (or related fields) with strong interest in artificial and biological neural networks is required, in the areas of auditory system modeling and sensory motor control. This position will be supported for at least two years and possibly longer. Teaching of a graduate or an undergraduate course is optional. Send resume to: Dr. Mohan J. Malkani Director, Center for Neural Engineering Tennessee State University 3500 John Merritt Blvd. Nashville, TN 37209-1561 (615)320-3550 Fax: (615)320-3554 e-mail: malkani at harpo.tnstate.edu From sbh at eng.cam.ac.uk Tue Feb 22 12:00:33 1994 From: sbh at eng.cam.ac.uk (S.B. Holden) Date: Tue, 22 Feb 94 17:00:33 GMT Subject: PhD dissertation available by anonymous ftp Message-ID: <5730.199402221700@tw700.eng.cam.ac.uk> The following PhD dissertation is available by anonymous ftp from the archive of the Speech, Vision and Robotics Group at the Cambridge University Engineering Department. On the Theory of Generalization and Self-Structuring in Linearly Weighted Connectionist Networks Sean B. Holden Technical Report CUED/F-INFENG/TR161 Cambridge University Engineering Department Trumpington Street Cambridge CB2 1PZ England Abstract The study of connectionist networks has often been criticized for an overall lack of rigour, and for being based on excessively ad hoc techniques. Even though connectionist networks have now been the subject of several decades of study, the available body of research is characterized by the existence of a significant body of experimental results, and a large number of different techniques, with relatively little supporting, explanatory theory.
This dissertation addresses the theory of {\em generalization performance\/} and {\em architecture selection\/} for a specific class of connectionist networks; a subsidiary aim is to compare these networks with the well-known class of multilayer perceptrons. After discussing in general terms the motivation for our study, we introduce and review the class of networks of interest, which we call {\em $\Phi$-networks\/}, along with the relevant supervised training algorithms. In particular, we argue that $\Phi$-networks can in general be trained significantly faster than multilayer perceptrons, and we demonstrate that many standard networks are specific examples of $\Phi$-networks. Chapters 3, 4 and 5 consider generalization performance by presenting an analysis based on tools from computational learning theory. In chapter 3 we introduce and review the theoretical apparatus required, which is drawn from {\em Probably Approximately Correct (PAC) learning theory\/}. In chapter 4 we investigate the {\em growth function\/} and {\em VC dimension\/} for general and specific $\Phi$-networks, obtaining several new results. We also introduce a technique which allows us to use the relevant PAC learning formalism to gain some insight into the effect of training algorithms which adapt architecture as well as weights (we call these {\em self-structuring training algorithms\/}). We then use our results to provide a theoretical explanation for the observation that $\Phi$-networks can in practice require a relatively large number of weights when compared with multilayer perceptrons. In chapter 5 we derive new necessary and sufficient conditions on the number of training examples required when training a $\Phi$-network such that we can expect a particular generalization performance. We compare our results with those derived elsewhere for feedforward networks of Linear Threshold Elements, and we extend one of our results to take into account the effect of using a self-structuring training algorithm. In chapter 6 we consider in detail the problem of designing a good self-structuring training algorithm for $\Phi$-networks. We discuss the best way in which to define an optimum architecture, and we then use various ideas from linear algebra to derive an algorithm, which we test experimentally. Our initial analysis allows us to show that the well-known {\em weight decay\/} approach to self-structuring is not guaranteed to provide a network which has an architecture close to the optimum one. We also extend our theoretical work in order to provide a basis for the derivation of an improved version of our algorithm. Finally, chapter 7 provides conclusions and suggestions for future research. ************************ How to obtain a copy ************************ a) Via FTP: unix> ftp svr-ftp.eng.cam.ac.uk Name: anonymous Password: (type your email address) ftp> cd reports ftp> binary ftp> get holden_tr161.ps.Z ftp> quit unix> uncompress holden_tr161.ps.Z unix> lpr holden_tr161.ps (or however you print PostScript) b) Via postal mail: Request a hardcopy from Dr. Sean B. Holden, Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, England. 
or email me: sbh at eng.cam.ac.uk

From viola at salk.edu Wed Feb 23 14:17:52 1994
From: viola at salk.edu (Paul Viola)
Date: Wed, 23 Feb 94 11:17:52 PST
Subject: Heinous Patent
Message-ID: <9402231917.AA24448@salk.edu>

From: Vision-List moderator Phil Kahn
VISION-LIST Digest  Tue Feb 22 11:26:42 PDT 94  Volume 13 : Issue 8

Date: Thu, 17 Feb 1994 22:23:00 GMT
From: eledavis at ubvms.cc.buffalo.edu (Elliot Davis)
Organization: University at Buffalo
Subject: Error Reduction

I would greatly appreciate your thoughts on the:

ERROR TEMPLATE TECHNIQUE

The "Error Template" technique (patent 4,802,231) provides an alternative method for reducing false alarms in pattern recognition systems. In this approach, a pattern representing a mismatched pattern is stored in the reference lexicon. It is a reference pattern to an error rather than to what is desired. THIS IS DONE WITH THE EXPECTATION THAT IF THE ERROR PATTERN OR A VARIATION OF IT IS REPEATED IT WILL TEND TO BE CLOSER TO ITSELF THAN TO THE PATTERN THAT IT FALSED OUT TO. ...

Unless this patent is very old, I find it terrifying. It is a concept that is clearly part of the pattern recognition literature of the 70's. Essentially, pattern classification works by finding clusters that represent classes. These clusters, along with a measurement model, define a probability density over the pattern space. All this technique is doing is adding an additional cluster which represents a particular type of measurement error made when sensing a class. Pattern classification theory tells us that this should be done whenever there is a particular measurement error that is not modeled well by our measurement model. You add a cluster when the distribution of data is different from the probability density predicted by the model -- i.e. a particular measurement error is more common than your model predicts. You can add these clusters by hand, as the patent suggests, or you can let a density estimation scheme discover them for you (a mixture of Gaussians model trained with EM works nicely). End of story.

So remember, anytime someone adds another cluster to a pattern classification model, they owe the owner of this patent money. I wonder what the date of this fine patent is??

Paul Viola

From cohn at psyche.mit.edu Wed Feb 23 18:15:17 1994
From: cohn at psyche.mit.edu (David Cohn)
Date: Wed, 23 Feb 94 18:15:17 EST
Subject: Paper available: Exploration using optimal experiment design
Message-ID: <9402232315.AA21110@psyche.mit.edu>

Those who find Peter Sollich's paper on query construction of interest may also wish to look at the following paper, now available by anonymous ftp. This is a slightly revised version of the paper that is to appear in Advances in Neural Information Processing Systems 6, but includes a correction to Equation 2 that was made too late to be included in the NIPS volume.

#####################################################################

Neural Network Exploration Using Optimal Experiment Design

David A. Cohn
Dept. of Brain and Cognitive Sciences
Massachusetts Inst. of Technology
Cambridge, MA 02139

Consider the problem of learning input/output mappings through exploration, e.g. learning the kinematics or dynamics of a robotic manipulator. If actions are expensive and computation is cheap, then we should explore by selecting a trajectory through the input space which gives us the most information in the fewest number of steps.
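As a rough illustration of this kind of selection criterion (and not of the specific method used in the paper), the short Python sketch below picks the next query for a learner that is linear in a fixed feature expansion by maximizing the predicted output variance, a standard heuristic from optimal experiment design; the feature map, parameter values and function names are illustrative assumptions only.

# Minimal sketch of variance-based query selection (a generic optimal
# experiment design heuristic); NOT the formulation used in the paper above.
import numpy as np

def features(x):
    # Illustrative feature map for a one-dimensional input: [1, x, x^2].
    return np.array([1.0, x, x * x])

def select_next_query(x_seen, candidates, noise_var=0.01):
    # For a linear-in-features model, the predictive variance at x is
    # proportional to phi(x)^T (Phi^T Phi + noise_var * I)^{-1} phi(x);
    # we query where that variance is largest.
    Phi = np.array([features(x) for x in x_seen])
    A_inv = np.linalg.inv(Phi.T @ Phi + noise_var * np.eye(Phi.shape[1]))
    variances = [features(x) @ A_inv @ features(x) for x in candidates]
    return candidates[int(np.argmax(variances))]

rng = np.random.default_rng(0)
x_seen = list(rng.uniform(-1.0, 0.0, size=5))   # data gathered so far
candidates = list(np.linspace(-1.0, 1.0, 21))   # possible next actions
print("next query:", select_next_query(x_seen, candidates))

The selected query falls where the model is least certain, i.e. far from the existing data; the paper develops this idea for neural network learners rather than the toy linear model used here.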
I discuss how results from the field of optimal experiment design may be used to guide such exploration, and demonstrate its use on a simple kinematics problem.

#####################################################################

The paper may be retrieved by anonymous ftp to "psyche.mit.edu" using the following protocol:

unix> ftp psyche.mit.edu
Name (psyche.mit.edu:joebob): anonymous       <- use "anonymous" here
331 Guest login ok, send ident as password.
Password: joebob at machine.univ.edu          <- use your email address here
230 Guest login ok, access restrictions apply.
ftp> cd pub/cohn                              <- go to the directory
250 CWD command successful.
ftp> binary                                   <- change to binary transfer
200 Type set to I.
ftp> get cohn.explore.ps.Z                    <- get the file
200 PORT command successful.
150 Binary data connection for cohn.explore.ps.Z ...
226 Binary Transfer complete.
local: cohn.explore.ps.Z remote: cohn.explore.ps.Z
301099 bytes received in 2.8 seconds (1e+02 Kbytes/s)
ftp> quit                                     <- all done
221 Goodbye.

From terry at salk.edu Thu Feb 24 05:49:35 1994
From: terry at salk.edu (Terry Sejnowski)
Date: Thu, 24 Feb 94 02:49:35 PST
Subject: Shakespeare and Neural Nets
Message-ID: <9402241049.AA02725@salk.edu>

from New Scientist, 22 January 1994, p. 23

In an interesting article on the use of statistical measures to assess the attribution of texts to authors, Robert Matthews and Tom Merriam report that:

"Applying our neural network to disputed works such as 'The Two Noble Kinsmen' has produced some interesting results and helped to settle some bitter arguments over authorship of controversial texts. ...

"The first task was to train the network. This we did by exposing it to data extracted from a large number of samples of Shakespeare's undisputed work, together with that of his successor with The King's Men [a theater company], John Fletcher. ... We then set the network loose on 'The Two Noble Kinsmen'. Drawing on a wide variety of essentially subjective evidence, scholars have claimed that Shakespeare's hand dominates Acts I and V, with much of the rest appearing to be by Fletcher. In March last year, our neural network agreed with these attributions -- and proffered the extra opinion that Fletcher may have received considerable help from Shakespeare in Act IV. In short, our neural network quantitatively supports the subjective view of its much more sophisticated human counterparts that 'The Two Noble Kinsmen' is a genuine collaboration between Shakespeare and one of his contemporaries."

These results will appear in the journal 'Literary and Linguistic Computing'. A similar approach might be used to determine the contributions of coauthors to scientific papers.

Terry

-----

From efiesler at maya.idiap.ch Fri Feb 25 09:16:09 1994
From: efiesler at maya.idiap.ch (E. Fiesler)
Date: Fri, 25 Feb 94 15:16:09 +0100
Subject: NN Formalization paper available by ftp.
Message-ID: <9402251416.AA04305@maya.idiap.ch>

PLEASE POST
-----------

The following paper is available via anonymous ftp from the neuroprose archive. It is 13 A4-size PostScript pages long, and replaces a shorter preliminary version. Instructions for retrieval follow the abstract.

NEURAL NETWORK CLASSIFICATION AND FORMALIZATION

E. Fiesler
IDIAP
c.p. 609
CH-1920 Martigny
Switzerland

This paper has been accepted for publication in the special issue on Neural Network Standards of "Computer Standards & Interfaces", volume 16, edited by J. Fulcher. Elsevier Science Publishers, Amsterdam, 1994.
ABSTRACT

In order to assist the field of neural networks in maturing, a formalization and a solid foundation are essential. Additionally, to permit the introduction of formal proofs, it is essential to have an all-encompassing formal mathematical definition of a neural network. This publication offers a neural network formalization consisting of a topological taxonomy, a uniform nomenclature, and an accompanying consistent mnemonic notation. Supported by this formalization, a flexible mathematical definition is presented.

------------------------------

To obtain a copy of this paper, please use the following FTP instructions:

unix> ftp archive.cis.ohio-state.edu   (or: ftp 128.146.8.52)
login: anonymous
password:
ftp> cd pub/neuroprose
ftp> binary
ftp> get fiesler.formalization.ps.Z
ftp> bye
unix> zcat fiesler.formalization.ps.Z | lpr
(or however you uncompress and print postscript)

For convenience of those outside the US, the paper has also been placed on the IDIAP ftp site:

unix> ftp Maya.IDIAP.CH   (or: ftp 192.33.221.1)
login: anonymous
password:
ftp> cd pub/papers/neural
ftp> binary
ftp> get fiesler.formalization.ps.Z   (OR get fiesler.formalization.ps)
ftp> bye
unix> zcat fiesler.formalization.ps.Z | lpr
OR
unix> lpr fiesler.formalization.ps

(Hard copies of the paper are unfortunately not available.)

P.S. Thanks for the update, Jordan!

From giles at research.nj.nec.com Fri Feb 25 18:28:59 1994
From: giles at research.nj.nec.com (Lee Giles)
Date: Fri, 25 Feb 94 18:28:59 EST
Subject: Available
Message-ID: <9402252328.AA28936@fuzzy>

********************************************************************************

Reprint: USING RECURRENT NEURAL NETWORKS TO LEARN THE STRUCTURE OF INTERCONNECTION NETWORKS

The following reprint is available via the University of Maryland Department of Computer Science Technical Report archive:

________________________________________________________________________________

"Using Recurrent Neural Networks to Learn the Structure of Interconnection Networks"

UNIVERSITY OF MARYLAND TECHNICAL REPORT UMIACS-TR-94-20 AND CS-TR-3226

G.W. Goudreau(a) and C.L. Giles(b,c)
goudreau at cs.ucf.edu, giles at research.nj.nec.com

(a) Department of Computer Science, U. of Central Florida, Orlando, FL 32816
(b) NEC Research Inst., 4 Independence Way, Princeton, NJ 08540
(c) Inst. for Advanced Computer Studies, U. of Maryland, College Park, MD 20742

A modified Recurrent Neural Network (RNN) is used to learn a Self-Routing Interconnection Network (SRIN) from a set of routing examples. The RNN is modified so that it has several distinct initial states. This is equivalent to a single RNN learning multiple different synchronous sequential machines. We define such a sequential machine structure as "augmented" and show that a SRIN is essentially an Augmented Synchronous Sequential Machine (ASSM). As an example, we learn a small six-switch SRIN. After training we extract the network's internal representation of the ASSM and corresponding SRIN.

--------------------------------------------------------------------------------

FTP INSTRUCTIONS

unix> ftp cs.umd.edu (128.8.128.8)
Name: anonymous
Password: (your_userid at your_site)
ftp> cd pub/pub/papers/TRs
ftp> binary
ftp> get 3226.ps.Z
ftp> quit
unix> uncompress 3226.ps.Z

---------------------------------------------------------------------------------
-- C.
Lee Giles / NEC Research Institute / 4 Independence Way
Princeton, NJ 08540 / 609-951-2642 / Fax 2482
==

From terry at salk.edu Fri Feb 25 12:59:53 1994
From: terry at salk.edu (Terry Sejnowski)
Date: Fri, 25 Feb 94 09:59:53 PST
Subject: NEURAL COMPUTATION 6:2
Message-ID: <9402251759.AA18225@salk.edu>

Neural Computation
March 1994  Volume 6  Issue 2

Article:

Hierarchical Mixtures of Experts and the EM Algorithm
    Michael I. Jordan and Robert A. Jacobs

Notes:

TD-Gammon, A Self-Teaching Backgammon Program, Achieves Master-Level Play
    Gerald Tesauro

Correlated Attractors from Uncorrelated Stimuli
    L.F. Cugliandolo

Letters:

Learning of Phase-lags in Coupled Neural Oscillators
    Bard Ermentrout and Nancy Kopell

A Mechanism for Neuronal Gain Control by Descending Pathways
    Mark E. Nelson

The Role of Weight Normalization in Competitive Learning
    Geoffrey J. Goodhill and Harry G. Barrow

A Probabilistic Resource Allocating Network for Novelty Detection
    Stephen Roberts and Lionel Tarassenko

Diffusion Approximations for the Constant Learning Rate Backpropagation Algorithm and Resistance to Local Minima
    William Finnoff

Relating Real-time Backpropagation and Back-propagation Through Time: An Application of Flow Graph Interreciprocity
    Francoise Beaufays and Eric A. Wan

Smooth On-line Learning Algorithms for Hidden Markov Models
    Pierre Baldi and Yves Chauvin

On Functional Approximation with Normalized Gaussian Units
    Michel Benaim

Statistical Physics, Mixtures of Distributions and the EM Algorithm
    Yuille, A.L., Stolorz, P., and Utans, J.

-----

SUBSCRIPTIONS - 1994 - VOLUME 6 - BIMONTHLY (6 issues)

______ $40  Student and Retired
______ $65  Individual
______ $166 Institution

Add $22 for postage and handling outside USA (+7% GST for Canada). (Back issues from Volumes 1-5 are regularly available for $28 each to institutions and $14 each for individuals. Add $5 for postage per issue outside USA; +7% GST for Canada.)

MIT Press Journals, 55 Hayward Street, Cambridge, MA 02142.
Tel: (617) 253-2889  FAX: (617) 258-6779
e-mail: hiscox at mitvma.mit.edu

-----

From heger at Informatik.Uni-Bremen.DE Mon Feb 28 07:27:12 1994
From: heger at Informatik.Uni-Bremen.DE (Matthias Heger)
Date: Mon, 28 Feb 94 13:27:12 +0100
Subject: paper available
Message-ID: <9402281227.AA06748@Informatik.Uni-Bremen.DE>

FTP-host: ftp.gmd.de
FTP-filename: /Learning/rl/papers/heger.consider-risk.ps.Z

The file heger.consider-risk.ps.Z is now available for copying from the RL papers repository:

***************************************************
* Consideration of Risk in Reinforcement Learning *
***************************************************

(Revised submission to the 11th International Conference on Machine Learning (ML94), 15 pages)

Abstract
--------

Most Reinforcement Learning (RL) work regards as optimal those policies for sequential decision tasks that minimize the expected total discounted cost (e.g. Q-Learning [Wat 89], AHC [Bar Sut And 83]). On the other hand, it is well known that it is not always reliable and can be treacherous to use the expected value as a decision criterion [Tha 87]. A lot of alternative decision criteria have been suggested in decision theory to allow a more sophisticated consideration of risk, but most RL researchers have not concerned themselves with this subject until now. The purpose of this paper is to draw the reader's attention to the problems of the expected value criterion in Markov Decision Processes and to give Dynamic Programming algorithms for an alternative criterion, namely the Minimax criterion. A counterpart to Watkins' Q-Learning related to the Minimax criterion is presented. The new algorithm, called Q^-Learning (Q-hat-Learning), finds policies that minimize the >>worst-case<< total discounted costs. Most mathematical details aren't presented here but can be found in [Heg 94].

----------------------------------------------------------------------------

Here is an example of retrieving and printing the file:

-> ftp ftp.gmd.de
Connected to gmdzi.gmd.de.
220 gmdzi FTP server (Version 5.72 Fri Nov 20 20:35:05 MET 1992) ready.
Name (ftp.gmd.de:heger): anonymous
331 Guest login ok, send your email-address as password.
Password:
230-This is an experimental FTP Server. See /README for details.
    This site is in Germany, Europe. Please restrict downloads to our
    non-working hours (i.e outside of 08:00-18:00 MET, Mo-Fr)
    *** Local time is 12:25:22 MET
230 Guest login ok, access restrictions apply.
ftp> cd Learning/rl/papers
250 CWD command successful.
ftp> binary
200 Type set to I.
ftp> get heger.consider-risk.ps.Z
200 PORT command successful.
150 Opening BINARY mode data connection for heger.consider-risk.ps.Z (100477 bytes).
226 Transfer complete.
local: heger.consider-risk.ps.Z remote: heger.consider-risk.ps.Z
100477 bytes received in 3.2e+02 seconds (0.3 Kbytes/s)
ftp> quit
221 Goodbye.

-> uncompress heger.consider-risk.ps.Z
-> lpr heger.consider-risk.ps

-------------------------------------------------------------------------------

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+ Matthias Heger                                                +
+ Zentrum fuer Kognitionswissenschaften, Universitaet Bremen,   +
+ Postfach 330 440                                              +
+ D-28334 Bremen, Germany                                       +
+                                                               +
+ email: heger at informatik.uni-bremen.de                      +
+ Tel.: +49 (0) 421 218 4659                                    +
+ Fax: +49 (0) 421 218 3054                                     +
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

From gerda at ai.univie.ac.at Mon Feb 28 10:42:04 1994
From: gerda at ai.univie.ac.at (Gerda Helscher)
Date: Mon, 28 Feb 1994 16:42:04 +0100
Subject: EMCSR'94
Message-ID: <199402281542.AA23377@anif.ai.univie.ac.at>

After the general info which appeared in this mailing list recently about the

T W E L F T H   E U R O P E A N   M E E T I N G   O N
C Y B E R N E T I C S   A N D   S Y S T E M S   R E S E A R C H
( E M C S R ' 9 4 )

here is the detailed programme of Neural Network-related events:

Plenary Lecture by S t e p h e n   G r o s s b e r g :
"Neural Networks for Learning, Recognition and Prediction"
Wednesday, April 6, 9:00 a.m., University of Vienna, Main Building, Room 47

Symposium
A r t i f i c i a l   N e u r a l   N e t w o r k s   a n d   A d a p t i v e   S y s t e m s
Chairpersons: S.Grossberg, USA, and G.Dorffner, Austria
Tuesday, April 5, and Wednesday, April 6, Univ.
of Vienna, Main Building, Room 47 Tuesday, April 5: 14.00-14.30: Synchronization in a Large Neural Network of Phase Oscillators with the Central Element Y.Kazanovich, Russian Academy of Sciences, Moscow, Russia 14.30-15.00: Synchronization in a Neural Network Model with Time Delayed Coupling T.B.Luzyanina, Russian Academy of Sciences, Moscow, Russia 15.00-15.30: Reinforcement Learning in a Network Model of the Basal Ganglia R.M.Borisyuk, J.R.Wickens, R.Koetter, University of Otago, New Zealand Wednesday, April 6: 11.00-11.30: Adaptive High Performance Classifier Based on Random Threshold Neurons E.M.Kussul, T.N.Baidyk, V.V.Lukovich, D.A.Rachkovskij, Ukrainian Academy of Science, Kiev, Ukraine 11.30-12.00: Dynamics of Ordering for One-dimensional Topological Mappings R.Folk, A.Kartashov, University of Linz, Austria 12.00-12.30: Informational Properties of Willshaw-like Neural Networks Capable of Autoassociative Learning A.Kartashov, R.Folk, A.Goltsev, A.Frolov, University of Linz, Austria 12.30-13.00: Relaxing the Hyperplane Assumption in the Analysis and Modification of Back-propagation Neural Networks L.Y.Pratt, A.N.Christensen, Colorado School of Mines, Golden, CO, USA 14.00-14.30: Improving Discriminability Based Transfer by Modifying the IM Metric to Use Sigmoidal Activations L.Y.Pratt, V.I.Gough, Colorado School of Mines, Golden, CO, USA 14.30-15.00: Order-theoretic View of Families of Neural Network Architectures M.Holena, University of Paderborn, Germany 15.00-15.30: A New Class of Neural Networks: Recognition Invariant to Arbitrary Transformation Groups A.Kartashov, K.Erman, University of Linz, Austria 16.00-16.30: Neural Assembly Architecture for Texture Recognition A.Goltsev, A.Kartashov, R.Folk, University of Linz, Austria 16.30-17.00: A Neural System for Character Recognition on Isovalue Maps E.P.L.Passos, L.E.S.Varella, M.A.Santos, R.L.de Araujo, Engineering Military Institute, Rio de Janeiro, Brazil 17.00-17.30: Neurocomputing Model Inference for Nonlinear Signal Processing Z.Zografski, T.Durrani, University of Strathclyde, Glasgow, United Kingdom 17.30-18.00: Learning from Examples and VLSI Implementation of Neural Networks V.Beiu, J.A.Peperstraete, J.Vandewalle, R.Lauwereins, Catholic University of Leuven, Heverlee, Belgium For more information please contact: sec at ai.univie.ac.at From ZECCHINA at to.infn.it Mon Feb 28 13:22:01 1994 From: ZECCHINA at to.infn.it (Riccardo Zecchina - tel.11-5647358, fax. 11-5647399) Date: Mon, 28 Feb 1994 19:22:01 +0100 (WET) Subject: role of response functions in ANN's. Message-ID: <940228192201.20800db9@to.infn.it> FTP-host: archive.cis.ohio-state.edu FTP-file: pub/neuroprose/zecchina.response.ps.Z The file zecchina.response.ps.Z is available for copying from the Neuroprose repository: "Response Functions Improving Performance in Analog Attractor Neural Networks" N .Brunel, R. Zecchina (13 pages, to appear in Phys. Rev. E Rapid Comm.) ABSTRACT: In the context of attractor neural networks, we study how the equilibrium analog neural activities, reached by the network dynamics during memory retrieval, may improve storage performance by reducing the interferences between the recalled pattern and the other stored ones. We determine a simple dynamics that stabilizes network states which are highly correlated with the retrieved pattern, for a number of stored memories that does not exceed $\alpha_{\star} N$, where $\alpha_{\star}\in[0,0.41]$ depends on the global activity level in the network and $N$ is the number of neurons.  
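To give a concrete picture of the kind of dynamics involved, the short Python sketch below simulates retrieval in an analog attractor network with Hebbian couplings; it uses a plain tanh response function rather than the optimized response functions derived in the paper, and the network size, gain and noise level are illustrative assumptions only.

# Rough sketch of memory retrieval in an analog attractor network with
# Hebbian couplings; uses a generic tanh response function, NOT the
# optimized response functions of the paper announced above.
import numpy as np

rng = np.random.default_rng(1)
N, P = 500, 10                                   # neurons, stored patterns
patterns = rng.choice([-1.0, 1.0], size=(P, N))  # random binary memories

W = patterns.T @ patterns / N                    # Hebbian coupling matrix
np.fill_diagonal(W, 0.0)                         # no self-coupling

# Start from a noisy version of pattern 0 and iterate the analog dynamics
# s <- f(W s) until the activities settle near a fixed point.
s = patterns[0] + 0.4 * rng.standard_normal(N)
for _ in range(50):
    s = np.tanh(2.0 * (W @ s))                   # gain of 2.0 is arbitrary

overlaps = patterns @ s / N                      # correlation with each memory
print("overlap with the retrieved pattern:", round(float(overlaps[0]), 3))
print("largest overlap with other patterns:", round(float(np.max(np.abs(overlaps[1:]))), 3))

At the fixed point the overlap with the retrieved memory is close to one while the residual overlaps with the other memories remain small; the paper analyses how a suitable choice of response function can reduce this cross-talk further and thereby improve storage performance.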
From andre at physics.uottawa.ca Mon Feb 28 12:13:53 1994 From: andre at physics.uottawa.ca (Andre Longtin) Date: Mon, 28 Feb 94 12:13:53 EST Subject: Hebb Symposium Message-ID: <9402281713.AA23088@miro.physics.uottawa.ca.physics.uottawa.ca> ******* Preliminary Announcement ******* THE FIELDS INSTITUTE FOR RESEARCH IN MATHEMATICAL SCIENCES HEBB SYMPOSIUM ON NEURONS AND BIOLOGICAL DYNAMICS Sunday, May 15 to Friday May 20, 1994 Koffler Pharmaceutical Center University of Toronto D.O. Hebb's classic, "The Organization of Behavior" published in 1949, sketched out how behavior might emerge from the properties of nerve cells and assemblies of nerve cells. This book was a landmark achievement in neurophysiological psychology. The modifiable synapse, discussed at length by Hebb and now known as the "Hebb synapse", was a lasting contribution. Hebb was from Nova Scotia and spent most of his professional life at McGill in the Psychology Department. We are having this symposium in his honor. Topics will range from cellular level to systems level, with an eye towards interesting dynamics and connections between dynamics and functions. We will bring together physiological and mathematical researchers with some didactic and research talks oriented towards graduate students and postdoctoral fellows. SCIENTIFIC PROGRAM: Lectures will be presented by Nancy Kopell (Boston University) and David Mumford (Harvard) in the Institute's Distinguished Lecture Series. Invited talks by Larry Abbott (Brandeis), *Moshe Abeles (Hebrew U., Jerusalem), Harold Atwood (U. Toronto), David Brillinger (Berkeley), Jos Eggermont (U. Calgary), Bard Ermentrout (U. Pittsburg), Leon Glass (McGill), Ilona Kovacs (Rutgers), Gilles Laurent (Caltech), Andre Longtin (U. Ottawa), Leonard Maler (U. Ottawa), Karl Pribram (Radford U.), Paul Rapp (Med. Coll. Penn.), John Rinzel (NIH), Mike Shadlin (Stanford), Matt Wilson (Tucson), Martin Wojtowicz (U. Toronto), Steve Zucker (McGill). Invited Attendees: Jose Segundo (UCLA), Alessandro Villa (Lausanne) The meeting will emphasize poster sessions as well as discussion groups where participants can give short oral presentations of their work. 
(*=tentative) TOPICS Larry Abbott: Population vectors and Hebbian learning Moshe Abeles: Information processing of synchronized activity Harold Atwood: Synaptic transmission and plasticity David Brillinger: Statistical analysis of neurophysiological data Jos Eggermont: Spatial and temporal interactions in auditory cortex Bard Ermentrout: Patterns in visual cortex Leon Glass: Nonlinear dynamics of neural networks Ilona Kovacs: Visual psychophysics/perceptual organization Gilles Laurent: Oscillations in olfaction Andre Longtin: Stochastic nonlinear dynamics of sensory transduction Leonard Maler: Bursting and recurrent feedback in electroreception Karl Pribram: Behavioral neurodynamics Paul Rapp: Dynamical characterization of neurological data John Rinzel: Thalamic rhythmogenesis in sleep and epilepsy Mike Shadlin: Analysis of visual motion Matt Wilson: Behaviorally induced changes in hippocampal connectivity Martin Wojtowicz: Membranes, channels and synapses Steve Zucker: Neural networks and visual computations IMPORTANT DATES: Monday April 11: Last date to return questionnaire Friday April 22: Cut-off for registrations and Deadline for hotel/residence booking Sunday May 15: Arrival and registration (9 am - 12 noon) Sunday May 15 to Friday May 20 Scientific program (ending Friday noon) INFORMATION ON SCIENTIFIC PROGRAM: David Brillinger (brill at stat.berkeley.edu) Andre Longtin (andre at physics.uottawa.ca) REGISTRATION AND ORGANIZATIONAL INFORMATION: To receive registration information, please fill out the questionnaire below and return it to: Sheri Albers The Fields Institute 185 Columbia St. W. Waterloo, Ontario, Canada N2L 5Z5 Phone: (519) 725-0096 Fax: (519) 725-0704 e-mail: hebb at fields.uwaterloo.ca ------------------------------------------------------------- ******* Questionnaire ******* TO BE COMPLETED BY ANYONE WISHING TO ATTEND THE HEBB SYMPOSIUM ON NEURONS AND BIOLOGICAL DYNAMICS Name: Institution: Department: Address: Phone: Fax: E-mail: I plan to attend: Yes ( ) No ( ) Maybe ( ) I plan to participate in the discussion groups: Yes ( ) No ( ) Maybe ( ) I plan to present a poster: Yes ( ) No ( ) Maybe ( ) Topic or tentative title: Arrival and departure dates (if other than May 14-20): FAX TO: (519)725-0704 or e-mail: hebb at fields.uwaterloo.ca