New genetic database and paper available
Salvatore Rampone
rampon at tin.it
Sun Sep 9 18:02:29 EDT 2001
Dear Connectionists,
This new genetic dataset may be of interest to you:
Database Name: HS3D - Homo Sapiens Splice Site Dataset
URL: http://www.sci.unisannio.it/docenti/rampone/
The database is described in the paper:
HS3D - Homo Sapiens Splice Site Dataset
by Pollastro, P., Rampone, S.
Università del Sannio - ITALY
Accepted in Nucleic Acids Research 2002 Database Issue
The extended abstract is available at http://space.tin.it/scienza/srampone/
(publication page) or directly from
http://space.tin.it/scienza/srampone/ramp0201.pdf.
-------
Abstract: In recent years, many computational tools for gene
identification and characterization, mostly based on machine learning
approaches, have come into use. In the machine learning approach, a learning
algorithm receives a set of training examples, each labelled as belonging to
a particular class. The algorithm's goal is to produce a classification rule
for correctly assigning new examples to these classes. The success of these
methods depends largely on the quality of the data sets that are used as the
training set. Furthermore, a common data set is necessary when the prediction
accuracy of different programs needs to be compared.
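
As an illustration of the approach just described, here is a minimal sketch
in Python. It is not the method of any particular gene-finding tool: the
learner (a per-position nucleotide frequency model with add-one smoothing),
the function names train, classify and accuracy, and the toy sequences are
all assumptions made for this example, and none of the data comes from HS3D.
It shows labelled training examples going in, a classification rule coming
out, and accuracy being measured on a shared held-out set.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def train(examples):
    """Learn per-class, per-position nucleotide log-probabilities.

    examples: list of (sequence, label) pairs; all sequences share one length.
    """
    length = len(examples[0][0])
    by_class = {}
    for seq, label in examples:
        by_class.setdefault(label, []).append(seq)
    model = {}
    for label, seqs in by_class.items():
        table = []
        for pos in range(length):
            counts = Counter(seq[pos] for seq in seqs)
            total = len(seqs) + len(ALPHABET)  # add-one smoothing
            table.append({n: math.log((counts[n] + 1) / total) for n in ALPHABET})
        model[label] = table
    return model

def classify(model, seq):
    """Assign seq to the class with the highest summed log-probability."""
    def score(label):
        return sum(model[label][pos][n] for pos, n in enumerate(seq))
    return max(model, key=score)

def accuracy(model, examples):
    """Fraction of labelled examples that the model classifies correctly."""
    correct = sum(classify(model, seq) == label for seq, label in examples)
    return correct / len(examples)

# Toy, made-up 8-base windows (NOT taken from HS3D or any real dataset).
TRAIN = [
    ("AGGTAAGT", "true"),
    ("CGGTGAGT", "true"),
    ("AGGTAAGA", "true"),
    ("TCGTCCTA", "false"),
    ("ACGTTTCC", "false"),
    ("CTGTCATC", "false"),
]
TEST = [("AGGTGAGT", "true"), ("TCGTTCTC", "false")]

model = train(TRAIN)
print(accuracy(model, TEST))
```

Two programs can only be compared fairly if both are scored on the same
TEST list, which is the role a common data set plays.
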
The Irvine Primate Splice Junctions Dataset (UCI Machine Learning Repository
http://www.ics.uci.edu/~mlearn/MLRepository.html) is a de facto standard
in the machine learning community, but it is now badly out of date and does
not include enough material for the needs of most learning algorithms. A
more recent, EST-confirmed data set is similarly limited in the extent of
its data. More recently, Burset et al. developed an extensive database, but
its data do not include false splice sites (negative examples) and,
specifically, proximal false splice sites. The latter are a well-known
critical point for classification systems.
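
To make the notion of a proximal false splice site concrete, here is a
rough Python sketch under stated assumptions. It takes "proximal" to mean
"within a fixed number of bases of an annotated true site", which is one
plausible reading and not a definition given in this message. It considers
only donor sites with the canonical GT dinucleotide. The function name
proximal_false_donors, the window parameter and the sequence are all
hypothetical.

```python
def proximal_false_donors(sequence, true_donor_positions, window=10):
    """Return indices of GT dinucleotides that lie within `window` bases of
    an annotated true donor site but are not themselves annotated as true.

    The window size is an illustrative assumption, not a published value.
    """
    true_set = set(true_donor_positions)
    found = []
    for i in range(len(sequence) - 1):
        if sequence[i:i + 2] != "GT" or i in true_set:
            continue
        if any(abs(i - t) <= window for t in true_set):
            found.append(i)
    return found

# Made-up sequence; the true donor GT is assumed to start at index 8.
seq = "ACGTACCAGTAAGTTT" + "C" * 16 + "GTAA"
print(proximal_false_donors(seq, [8], window=6))
```

Such decoys share the GT signal with the true site and sit in similar
sequence context, which is why they are harder to reject than false sites
far from any true junction.
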
We developed a new database (HS3D - Homo Sapiens Splice Site Dataset) of
Homo Sapiens Exon, Intron and Splice regions. The aim of this data set is to
give standardized material to train and to assess the prediction accuracy of
computational approaches for gene identification and characterization.