Tal Grossman Memorial Workshop in Vail (NIPS95)

Wed Nov 15 13:45:09 EST 1995

                    NIPS95 TAL GROSSMAN MEMORIAL WORKSHOP

         MACHINE LEARNING APPROACHES IN COMPUTATIONAL MOLECULAR BIOLOGY 

                             December 1, 1995
                                  Vail, CO 

CURRENT LIST OF SCHEDULED PRESENTATIONS:

Alan Lapedes 
Neural Network Representations of Empirical Protein Potentials.

Gary Stormo 
The Use of Neural Networks for Identification of Common Domains by 
Maximizing Specificity.

Ajay N. Jain 
Machine Learning Techniques for Drug Design: Lead Discovery, Lead
Optimization, and Screening Strategies.

Anders Krogh 
Maximum Entropy Weighting of Aligned Sequences of Proteins or DNA.

Paul Stolorz 
Applying Dynamic Programming Ideas to Monte Carlo Sampling.

Soren Brunak 
Bendability of Exons and Introns in Human DNA.

Pierre Baldi 
Mining Data Bases of Fragments with HMMs.

CURRENT LIST OF ABSTRACTS:

Alan Lapedes (Los Alamos National Laboratory)
asl at t13.lanl.gov

Neural Network Representations of Empirical Protein Potentials.

Recently, there has been considerable interest in deriving and applying
knowledge-based, empirical potential functions for proteins.  These empirical
potentials have been derived from the statistics of interacting, spatially
neighboring residues, as may be obtained from databases of known protein
crystal structures. 

We employ neural networks to redefine empirical potential functions from the
point of view of discrimination functions.  This approach generalizes
previous work, in which simple frequency counting statistics are used on a
database of known protein structures. This generalization allows us to avoid
restriction to strictly pairwise interactions. Instead of frequency counting
to fix adjustable parameters, one now optimizes an objective function
involving a parameterized probability distribution. 
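
As a point of reference, the frequency-counting baseline mentioned above can
be sketched as a log-odds contact potential. This is a generic illustration,
not the authors' code: the function name contact_potential, the pseudocount,
and the random-pairing reference state are all assumptions made here.

```python
import math
from collections import Counter
from itertools import combinations_with_replacement


def contact_potential(contacts, pseudocount=1.0):
    """Log-odds potential from observed residue contacts (illustrative only).

    contacts: list of (residue, residue) pairs seen as spatial neighbors.
    Returns {(a, b): energy} with a <= b; negative means favorable.
    """
    pair_counts = Counter(tuple(sorted(pair)) for pair in contacts)
    residue_counts = Counter(res for pair in contacts for res in pair)
    total_pairs = sum(pair_counts.values())
    total_residues = sum(residue_counts.values())
    alphabet = sorted(residue_counts)
    n_pair_types = len(alphabet) * (len(alphabet) + 1) // 2

    potential = {}
    for a, b in combinations_with_replacement(alphabet, 2):
        # Smoothed observed frequency of the unordered pair (a, b).
        observed = (pair_counts[(a, b)] + pseudocount) / (
            total_pairs + pseudocount * n_pair_types
        )
        # Assumed reference state: residues pair at random by composition.
        fa = residue_counts[a] / total_residues
        fb = residue_counts[b] / total_residues
        expected = fa * fb if a == b else 2.0 * fa * fb
        potential[(a, b)] = -math.log(observed / expected)
    return potential


toy_contacts = [("L", "L")] * 8 + [("L", "K")] * 2 + [("K", "K")] * 2
toy_potential = contact_potential(toy_contacts)
```

In this toy data the L-L pair is over-represented relative to random pairing
and gets a negative (favorable) energy, while the K-L pair is
under-represented and gets a positive one.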

We show how our method reduces to previous work in special situations,
illustrating in this context the relationship of neural networks to
statistical methodology. A key feature in the approach we advocate is the
development of a representation to describe the location of interacting
residues that exist in a sphere of small fixed radius around each residue.
This is a natural ``shape representation'' for the interaction neighborhoods
of protein residues. We demonstrate that this shape representation and the
network's improved abilities enhance discrimination over that obtained by
previous methodologies. 

This work is with Robert Farber and the late Tal Grossman (Los Alamos
National Laboratory). 

Gary Stormo (University of Colorado, Boulder)
stormo at exon.biotech.washington.edu

The Use of Neural Networks for Identification of Common Domains by Maximizing
Specificity. 

We describe an unsupervised learning procedure in which the objective to be
maximized is ``specificity'', defined as the probability of obtaining a
particular set of strings within a much larger collection of background
strings.  We demonstrate its use for identifying protein binding sites on
unaligned DNA sequences, common sequence/structure motifs in RNA and common
motifs in protein sequences.  The idea behind the ``specificity'' criterion
is to discover a probability distribution for strings such that the
difference between the probabilities of the particular strings and the
background strings is maximized.  Both the probability distribution and the
set of particular strings need to be discovered; the probability distribution
can be any allowable distribution over the string alphabet, and the
particular strings are contained within a set of longer strings, but their
locations are not known in advance. Previous methods have viewed this problem
as one of multiple alignment, whereas our method is more flexible in the
types of patterns that can be allowed and in the treatment of the background
strings. When the patterns are linearly separable from the background, a
simple Perceptron works well to identify the patterns.  We are currently
testing more complicated networks for more complicated patterns. 
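
The linearly separable case can be illustrated with a textbook Perceptron
over one-hot encoded DNA strings. This sketch is a simplification and not the
procedure of the talk: it assumes the sites have already been located and
labeled, whereas the method described above must also discover their
locations. The function names and toy sequences are made up for illustration.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as a flat 0/1 vector, four entries per position."""
    return [1.0 if ch == a else 0.0 for ch in seq for a in alphabet]


def train_perceptron(positives, negatives, epochs=100):
    """Classic Perceptron rule; stops early once an epoch has no mistakes."""
    data = [(one_hot(s), 1) for s in positives]
    data += [(one_hot(s), -1) for s in negatives]
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in data:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified (or on the boundary)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def is_site(weights, bias, seq):
    """True if the trained Perceptron places seq on the positive side."""
    return sum(w * xi for w, xi in zip(weights, one_hot(seq))) + bias > 0


# Hypothetical toy data: site-like strings versus GC-rich background.
sites = ["TATAAT", "TATGAT", "TACAAT"]
background = ["GCGCGC", "CCGGCC", "GGCCGG"]
weights, bias = train_perceptron(sites, background)
```

Because the toy sets are linearly separable, the Perceptron rule converges
and classifies every training string correctly.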

This work is in collaboration with Alan Lapedes of Los Alamos National
Laboratory and the Santa Fe Institute, and John Heumann of Hewlett-Packard. 

Ajay N. Jain (Arris Pharmaceutical Corporation)
jain at arris.com

Machine Learning Techniques for Drug Design: Lead Discovery, Lead
Optimization, and Screening Strategies. 

At its core, the drug discovery process involves designing small organic
molecules that satisfy the physical constraints of binding to a specific site
on a particular protein (usually an enzyme or receptor).  Machine learning
techniques can play a significant role in all phases of the process.  When
the structure of the protein is known, it is possible to "dock" candidate
molecules into the structure and compute the likelihood that a molecule will
bind well. Fundamentally, this is a thermodynamic event that is too
complicated to simulate accurately.  Machine learning techniques can be used
to empirically construct functions that are predictive of binding affinities. 
Similarly, when no protein structure is known, but there exists some data on
molecules exhibiting a range of binding affinities, it is possible to use
machine learning techniques to capture the 3D pattern that is responsible for
binding.  Lastly, in cases where one has the capacity to make large numbers of
small molecules (libraries) to screen against multiple diverse protein
targets, one can use clustering techniques to design maximally diverse
libraries. This talk will briefly discuss each of these techniques in the
context of drug discovery at Arris Pharmaceutical Corporation. 

Anders Krogh (The Sanger Centre)
krogh at sanger.ac.uk

Maximum Entropy Weighting of Aligned Sequences of Proteins or DNA.

In a family of proteins or other biological sequences like DNA, the various
subfamilies are often very unevenly represented.  For this reason a scheme
for assigning weights to each sequence can greatly improve performance at
tasks such as database searching with profiles or other consensus models
based on multiple alignments.  A new weighting scheme for this type of
database search is proposed.  In a statistical description of the searching
problem it is derived from the maximum entropy principle.  It can be proved
that, in a certain sense, it corrects for uneven representation.  It is shown
that finding the maximum entropy weights is an easy optimization problem for
which standard techniques are applicable. 
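
A minimal sketch of one way such weights might be computed follows. The
abstract does not state the objective, so this assumes one plausible reading:
choose non-negative weights summing to one that maximize the total entropy of
the weighted residue distributions over all alignment columns. The optimizer
(exponentiated-gradient ascent), the step size, and all names are assumptions
made for illustration, and gap handling is ignored.

```python
import math


def column_distributions(seqs, weights):
    """Weighted residue distribution for each alignment column."""
    columns = []
    for j in range(len(seqs[0])):
        probs = {}
        for seq, w in zip(seqs, weights):
            probs[seq[j]] = probs.get(seq[j], 0.0) + w
        columns.append(probs)
    return columns


def total_entropy(seqs, weights):
    """Sum over columns of the entropy of the weighted distribution."""
    return -sum(
        p * math.log(p)
        for probs in column_distributions(seqs, weights)
        for p in probs.values()
        if p > 0
    )


def max_entropy_weights(seqs, steps=200, eta=0.5):
    """Exponentiated-gradient ascent on the simplex (assumed optimizer)."""
    n, length = len(seqs), len(seqs[0])
    weights = [1.0 / n] * n
    for _ in range(steps):
        columns = column_distributions(seqs, weights)
        # d(entropy)/d(w_i) = -sum_j log p_j(x_ij), up to a constant that
        # cancels when the weights are renormalized.
        grads = [
            -sum(math.log(columns[j][seq[j]]) for j in range(length))
            for seq in seqs
        ]
        scaled = [w * math.exp(eta * g / length) for w, g in zip(weights, grads)]
        norm = sum(scaled)
        weights = [v / norm for v in scaled]
    return weights


# Toy alignment: three identical sequences and one outlier.
alignment = ["AAAA", "AAAA", "AAAA", "CCCC"]
me_weights = max_entropy_weights(alignment)
```

On this toy alignment the over-represented subfamily shares half of the total
weight and the lone outlier receives the other half, which is the kind of
correction for uneven representation the abstract describes.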

Paul Stolorz (Jet Propulsion Laboratory, Caltech)
stolorz at telerobotics.jpl.nasa.gov

Applying Dynamic Programming Ideas to Monte Carlo Sampling.

Monte Carlo sampling methods developed originally for physics and chemistry
calculations have turned out to be very useful heuristics for problems in
fields such as computational biology, traditional computer science and
statistics. Macromolecular structure prediction and alignment, combinatorial
optimization, and more recently probabilistic inference, are classic examples
of their use. This talk will swim against the tide a bit by showing that
computer science, in the guise of dynamic programming, can in turn supply
substantial insight into the Monte Carlo process. This insight allows the
construction of powerful novel Monte Carlo methods for a range of
calculations in areas such as computational biology, computational vision and
statistical inference. The methods are especially useful for problems plagued
by multiple modes in the integrand, and for problems containing important,
though not overwhelming, long-range information. Applications to protein
folding, and to generalized Hidden Markov Models, will be described to
illustrate how to systematically implement and test these algorithms. 

Soren Brunak (The Technical University of Denmark)
brunak at cbs.dtu.dk

Bendability of Exons and Introns in Human DNA.

We analyze the sequential structure of human exons and introns by hidden
Markov models. We find that exons -- besides the reading frame -- hold a
specific periodic pattern. The pattern has the triplet consensus: 
non-T(A/T)G and a minimal periodicity of roughly 10 nucleotides. It is not a
consequence of the nucleotide statistics in the three codon positions, nor of
the previously well known periodicity caused by the encoding of alpha-helices
in proteins. Using DNA triplet bendability parameters from DNase I
experiments, we show that the pattern corresponds to a periodic `in-phase'
bending potential towards the major groove of the DNA. Similarly, nucleosome
positioning data show that the consensus triplets have a preference for
locations on a bent double helix where the major groove faces inward and is
compressed.  We discuss the relation between the bending potential of coding
regions and its importance for the recognition of genes by the
transcriptional machinery. 

This work is in collaboration with P. Baldi (Caltech), Y. Chauvin (Net-ID,
Inc.), and Anders Krogh (The Sanger Centre). 

Pierre Baldi (Caltech)
pfbaldi at ccosun.caltech.edu

Mining Data Bases of Fragments with HMMs. 

Hidden Markov Model (HMM) techniques are applied to the problem of mining
large data bases of protein fragments. The study is focused on one particular
protein family, the G-Protein-Coupled Receptors (GPCR). A large data base is
first constructed, by randomly extracting fragments from the entire
SWISS-PROT data base, at different lengths, positions, and simulated noise
levels, in a way that roughly matches other existing, but not always publicly
accessible, data bases. An HMM trained on the GPCR family is then used to
score all the fragments, in terms of their negative log-likelihood. The
discrimination power of the HMM is assessed, and quantitative results are
derived on how performance degrades, as a function of fragment length,
truncation position, and noise level, and on how to set discrimination
thresholds. The raw score performance is further improved by deriving
additional filters, based on the structure of the alignments of the fragments
to the HMM. 
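
Scoring a fragment by negative log-likelihood can be sketched with the
standard scaled forward algorithm. The two-state model below is a made-up toy
with arbitrary parameters, not the GPCR model of the talk, and the function
name and dictionary layout are assumptions chosen for illustration.

```python
import math


def negative_log_likelihood(seq, start, trans, emit):
    """Scaled forward algorithm for a discrete HMM; returns -log P(seq).

    start[s], trans[r][s] and emit[s][symbol] are probabilities keyed by
    state name. Symbols missing from emit[s] are treated as impossible.
    """
    if not seq:
        return 0.0
    states = list(start)
    nll = 0.0
    alpha = None
    for symbol in seq:
        if alpha is None:
            # Initialization: start probability times first emission.
            alpha = {s: start[s] * emit[s].get(symbol, 0.0) for s in states}
        else:
            # Recursion: sum over predecessor states, then emit.
            alpha = {
                s: emit[s].get(symbol, 0.0)
                * sum(alpha[r] * trans[r][s] for r in states)
                for s in states
            }
        scale = sum(alpha.values())
        if scale == 0.0:
            return math.inf  # the model assigns zero probability
        nll -= math.log(scale)
        alpha = {s: a / scale for s, a in alpha.items()}
    return nll


# Hypothetical toy model: a "sticky" hydrophobic state H and polar state P.
start = {"H": 0.5, "P": 0.5}
trans = {"H": {"H": 0.9, "P": 0.1}, "P": {"H": 0.1, "P": 0.9}}
emit = {"H": {"L": 0.8, "K": 0.2}, "P": {"L": 0.2, "K": 0.8}}

score_run = negative_log_likelihood("LLLL", start, trans, emit)
score_mixed = negative_log_likelihood("LKLK", start, trans, emit)
```

A lower score means a better fit to the model; here the uniform run "LLLL"
scores lower than the alternating "LKLK" because the toy states rarely
switch. A discrimination threshold would then be a cutoff on this score,
possibly normalized for fragment length.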

This work is in collaboration with Y. Chauvin (Net-ID, Inc.), F. Tobin and A.
Williams (SmithKline Beecham).