PhD Thesis on Machine Learning/Information Access
Mehran Sahami
sahami at Robotics.Stanford.EDU
Tue Jan 19 22:01:39 EST 1999
[Apologies if you receive this more than once.]
Dear colleagues,
I am very pleased to announce the availability of my PhD thesis,
entitled "Using Machine Learning to Improve Information Access" at
the following URL:
http://robotics.stanford.edu/users/sahami/papers-dir/thesis.ps
The dissertation examines the use of novel clustering, feature
selection and classification algorithms applied to text data (as well
as some non-text domains). It also presents a working system, SONIA,
that makes use of these technologies to enable the automatic topical
organization of retrieval results.
The table of contents and a more detailed abstract are appended below.
Best,
Mehran
------------------+----------------------------------
Mehran Sahami | http://xenon.stanford.edu/~sahami
Systems Scientist | phone: (650) 496-2399
Epiphany, Inc. | http://www.epiphany.com
------------------+----------------------------------
----------------------------------------------------------------------
Using Machine Learning to Improve Information Access
Part I: Preliminaries
Chapter 1: Introduction
1.1 Challenges of Information Access
1.2 System Overview
1.3 Reader's Guide
Chapter 2: Document Representation
2.1 Defining a Vector Space
2.2 Controlling Dimensionality
Chapter 3: Probabilistic Framework
3.1 Bayesian Networks
3.2 Machine Learning Overview
Chapter 4: Related Work in Information Access
4.1 Probabilistic Retrieval
4.2 Feature Selection for Text
4.3 Document Clustering
4.4 Document Classification
Part II: Clustering
Chapter 5: Feature Selection for Clustering
5.1 Introduction
5.2 Mixture Modeling Revisited
5.3 Theoretical Underpinnings
5.4 Feature Selection Algorithms
5.5 Empirical Results
5.6 Conclusions
Chapter 6: A New Model for Document Clustering
6.1 Introduction
6.2 Probabilistic Document Overlap
6.3 Clustering Algorithms
6.4 Results
6.5 Comparison With Mixture Modeling
6.6 Conclusion
Part III: Classification
Chapter 7: Feature Selection for Classification
7.1 Introduction
7.2 Theoretical Framework
7.3 An Approximate Algorithm
7.4 Initial Results on Non-Text Domains
7.5 Results on Text Domains
7.6 Conclusions
Chapter 8: Limited Dependence Bayesian Classifiers
8.1 Introduction
8.2 Probabilistic Classification Models
8.3 The KDB Algorithm
8.4 Initial Results on Non-Text Domains
8.5 Results on Text Domains
8.6 Conclusions and Related Work
Chapter 9: Hierarchical Classification
9.1 Introduction
9.2 Hierarchical Classification Scheme
9.3 Results
9.4 Extensions to Directed Acyclic Graphs
9.5 Conclusions
Part IV: Putting It All Together
Chapter 10: SONIA -- A Complete System
10.1 Introduction
10.2 SONIA on the InfoBus
10.3 A Component View of SONIA
10.4 Examples of System Usage
10.5 Conclusions
Chapter 11: Conclusions and Future Work
11.1 Where Have We Been?
11.2 Where Are We Going?
ABSTRACT
The explosion of on-line information has given rise to many
query-based search engines (such as Alta Vista) and manually
constructed topic hierarchies (such as Yahoo!). But with the
current growth rate in the amount of information, query results
grow incomprehensibly large and manual classification in topic
hierarchies creates an immense information bottleneck. Therefore,
these tools are rapidly becoming inadequate for addressing users'
information needs.
In this dissertation, we address these problems with a system for
topical information space navigation that combines the query-based and
taxonomic approaches. Our system, named SONIA (Service for Organizing
Networked Information Autonomously), is implemented as part of the
Stanford Digital Libraries testbed. It enables the creation of
dynamic hierarchical document categorizations based on the full-text
of articles. Using probability theory as a formal foundation, we
develop several Machine Learning methods to allow document collections
to be automatically organized at a topical level. First, to generate
such topical hierarchies, we employ a novel probabilistic clustering
scheme that outperforms traditional methods used in both Information
Retrieval and Probabilistic Reasoning. Furthermore, we develop
methods for classifying new articles into such automatically
generated, or existing manually generated, hierarchies. In contrast
to standard classification approaches which do not make use of the
taxonomic relations in a topic hierarchy, our method explicitly uses
the existing hierarchical relationships between topics, leading to
improvements in classification accuracy. Much of this improvement is
derived from the fact that the classification decisions in such a
hierarchy can be made by considering only the presence (or absence) of
a small number of features (words) in each document. The choice of
relevant words is made using a novel information theoretic algorithm
for feature selection. Many of the components developed as part of
SONIA are also general enough that they have been successfully applied
to data mining problems in domains other than text.
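As a rough illustration of what selecting a small number of words by an
information theoretic criterion can look like, the sketch below ranks words
by the mutual information between their presence (or absence) in a document
and the document's topic. This is the standard textbook measure, used here
only as a stand-in; it is not the feature selection algorithm developed in
the thesis. The function names and the four toy documents are hypothetical.

```python
import math
from collections import Counter


def mutual_information(docs, labels, word):
    """I(W; C) in bits, where W is presence/absence of `word` in a document
    and C is the document's topic label. Each document is a set of words."""
    n = len(docs)
    # Joint counts over (word present?, topic label).
    joint = Counter((word in doc, label) for doc, label in zip(docs, labels))
    word_counts = Counter()
    label_counts = Counter()
    for (present, label), count in joint.items():
        word_counts[present] += count
        label_counts[label] += count
    mi = 0.0
    for (present, label), count in joint.items():
        # p(w, c) * log2( p(w, c) / (p(w) * p(c)) ), written with raw counts.
        ratio = count * n / (word_counts[present] * label_counts[label])
        mi += (count / n) * math.log2(ratio)
    return mi


def select_features(docs, labels, k):
    """Return the k words with the highest mutual information with the topic.
    Ties are broken alphabetically (the sort is stable)."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(
        vocab,
        key=lambda w: mutual_information(docs, labels, w),
        reverse=True,
    )
    return ranked[:k]


# Toy data, invented for illustration only.
docs = [
    {"goal", "match", "team"},
    {"team", "score", "goal"},
    {"stock", "market", "team"},
    {"market", "price", "stock"},
]
labels = ["sports", "sports", "finance", "finance"]

print(select_features(docs, labels, 3))
```

On this toy data, words such as "goal" and "stock" separate the two topics
perfectly and rank first, while "team", which occurs under both topics,
ranks lower. In a topic hierarchy, a selection step of this kind could be
run separately at each node so that every decision looks at only a few words.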
The integration of hierarchical clustering and classification will
allow large amounts of information to be organized and presented to
users in an individualized and comprehensible way. By alleviating the
information bottleneck, we hope to help users with the problems of
information access on the Internet.