Tech report available on hidden Markov models for proteins

Fri Oct 2 14:52:12 EDT 1992

University of California at Santa Cruz
Department of Computer and Information Sciences

The following technical report is available electronically or as
a paper copy.  Instructions for getting either follow the abstract.

PROTEIN MODELING USING HIDDEN MARKOV MODELS: ANALYSIS OF GLOBINS
David Haussler, Anders Krogh, Saira Mian, Kimmen Sjolander
UCSC-CRL-92-23   (available electronically as ucsc-crl-92-23.ps.Z)
June 1992, revised September 1992
(Shorter version will appear in Proc. of 26th Hawaii Int. Conf. on System
Sciences, Biocomputing technology track, Jan. 5-8, 1993)

Abstract:  We apply Hidden Markov Models (HMMs) to the problem of
statistical modeling and multiple alignment of protein families.  In a
detailed series of experiments, we have taken 625 unaligned globin
sequences from the Swiss Protein database, and produced a statistical
model entirely automatically from the primary (unaligned) sequences
using no prior knowledge of globin structure.  The produced model
includes all the known positions in the 7 major alpha-helices, along
with the distribution for the 20 amino acids for each of these 
positions, as well as the probability of and average length of 
insertions between these positions, and the probability that each 
position is not present at all.  Using this model, we obtained a
multiple alignment of all 625 sequences that agrees almost perfectly
with the structural alignment given in [1].  In our tests, we have
found that 400 of the 625 globins (selected at random) are enough to
produce a model of the same quality.  This model based on 400 globins
can discriminate the remaining (225) globins from nonglobin protein
sequences with greater than 99% accuracy, and can thus be used for
database searches.  The method we use to obtain the statistical
model from the unaligned sequences is a variant of the Expectation
Maximization (EM) algorithm known as the Viterbi algorithm.  This
method starts with an initial "neutral" model (same amino acid
distribution in each position, fixed probabilities for insertions and
deletions), optimally aligns the training sequences to this model
(using dynamic programming), and then reestimates the probability
parameters of the model.  These last two steps are iterated until no
further changes are made.  A simple heuristic is used to automatically
adjust the number of positions that are modeled by deleting positions
that are not being used and inserting new positions where needed.  We
then repeat the whole process above on the new model.
Our method is more general and more flexible than previous applications
of HMMs and the EM algorithm to alignment and modeling problems in
molecular biology.
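
The training loop described in the abstract (neutral model, optimal
alignment by dynamic programming, reestimation, repeat until nothing
changes) can be illustrated with a much-simplified sketch.  This is not
the authors' implementation.  It assumes fixed log-penalties for
insertions and deletions that are never reestimated, a pseudocount of 1
per amino acid, and no heuristic for adding or deleting model positions.
All function names and constants are hypothetical, and sequences are
assumed to contain only the 20 standard one-letter codes.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 amino acids
LOG_INSERT = math.log(0.01)         # assumed fixed cost per inserted residue
LOG_DELETE = math.log(0.01)         # assumed fixed cost per skipped position


def neutral_model(length):
    """Initial model: the same uniform distribution at every position."""
    return [{a: 1.0 / len(ALPHABET) for a in ALPHABET} for _ in range(length)]


def align(model, seq):
    """Best alignment of seq to the model by dynamic programming.

    Returns one entry per model position: the index of the residue
    matched to it, or None if the position is skipped (deleted).
    """
    m, n = len(model), len(seq)
    score = [[float("-inf")] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    score[0][0] = 0.0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i > 0 and j > 0:   # residue j-1 matched to position i-1
                emit = math.log(model[i - 1][seq[j - 1]])
                options.append((score[i - 1][j - 1] + emit, "M"))
            if j > 0:             # residue j-1 is an insertion
                options.append((score[i][j - 1] + LOG_INSERT, "I"))
            if i > 0:             # position i-1 is deleted
                options.append((score[i - 1][j] + LOG_DELETE, "D"))
            score[i][j], back[i][j] = max(options)
    path = [None] * m
    i, j = m, n
    while i > 0 or j > 0:         # trace the best path back to the origin
        move = back[i][j]
        if move == "M":
            path[i - 1] = j - 1
            i, j = i - 1, j - 1
        elif move == "I":
            j -= 1
        else:
            i -= 1
    return path


def reestimate(seqs, paths, length, pseudo=1.0):
    """New per-position distributions from the aligned residues."""
    model = []
    for pos in range(length):
        counts = Counter(s[p[pos]] for s, p in zip(seqs, paths)
                         if p[pos] is not None)
        total = sum(counts.values()) + pseudo * len(ALPHABET)
        model.append({a: (counts[a] + pseudo) / total for a in ALPHABET})
    return model


def viterbi_train(seqs, length, max_iter=50):
    """Alternate alignment and reestimation until alignments stop changing."""
    model, paths = neutral_model(length), None
    for _ in range(max_iter):
        new_paths = [align(model, s) for s in seqs]
        if new_paths == paths:
            break
        paths = new_paths
        model = reestimate(seqs, paths, length)
    return model, paths
```

For example, viterbi_train(["ACDE", "ACDE", "ACE", "AGCDE"], length=4)
returns the per-position distributions together with one alignment path
per sequence.  Reading the paths column by column gives a multiple
alignment of the training sequences.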

This technical report is available electronically through either 
of the following methods:
1.  through anonymous ftp from ftp.cse.ucsc.edu, in /pub/tr. Log in 
    as "anonymous", use your email address as your password, specify 
    "binary" before getting the file.  Uncompress before printing.
2.  by mail to the automatic mail server rnalib at ftp.cse.ucsc.edu.
    Put this command on the subject line or in the body of the message:
	@@ send ucsc-crl-92-23.ps.Z from tr
    To get the index or abstract list:
	@@ send INDEX from tr
	@@ send ABSTRACTS.1992 from tr
    To get the list of the tr directory:
	@@ list tr
    To get the list of commands and their syntax:
	@@ help commands
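
Method 1 above can also be scripted.  The following is a minimal sketch
using Python's standard ftplib module.  The host, directory and file
name are taken from this announcement; the function name and the
ftp_factory parameter are hypothetical, and the sketch assumes the
server still accepts anonymous logins.

```python
from ftplib import FTP

HOST = "ftp.cse.ucsc.edu"          # from the announcement
DIRECTORY = "/pub/tr"              # from the announcement
FILENAME = "ucsc-crl-92-23.ps.Z"   # from the announcement


def fetch_report(email, dest_path, ftp_factory=FTP):
    """Download the compressed report with anonymous FTP.

    ftp_factory is a hypothetical hook that lets a stand-in object
    replace ftplib.FTP, for example when no network is available.
    """
    ftp = ftp_factory(HOST)
    try:
        # Log in as "anonymous" with the email address as password.
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(DIRECTORY)
        # retrbinary transfers in binary mode, as the instructions require.
        with open(dest_path, "wb") as out:
            ftp.retrbinary("RETR " + FILENAME, out.write)
    finally:
        ftp.quit()
    return dest_path
```

The downloaded file is in Unix compress (.Z) format, which the Python
standard library does not read, so it still has to be expanded with an
external tool such as uncompress before printing.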

Order paper copies from:  Technical Library, Baskin Center for Computer 
Engineering & Information Sciences, UCSC, Santa Cruz  CA  95064.

Questions:  jean at cse.ucsc.edu