PhD Thesis on Machine Learning/Information Access

Tue Jan 19 22:01:39 EST 1999

         [Apologies if you receive this more than once.]

Dear colleagues,

I am very pleased to announce the availability of my PhD thesis,
entitled "Using Machine Learning to Improve Information Access" at
the following URL:
http://robotics.stanford.edu/users/sahami/papers-dir/thesis.ps

The dissertation examines the use of novel clustering, feature
selection and classification algorithms applied to text data (as well
as some non-text domains).  It also presents a working system, SONIA,
that makes use of these technologies to enable the automatic topical
organization of retrieval results.

The table of contents and a more detailed abstract are appended below.

Best,
Mehran

------------------+----------------------------------
Mehran Sahami     | http://xenon.stanford.edu/~sahami
Systems Scientist | phone: (650) 496-2399
Epiphany, Inc.    | http://www.epiphany.com
------------------+----------------------------------

----------------------------------------------------------------------
        Using Machine Learning to Improve Information Access

Part I:   Preliminaries

  Chapter 1: Introduction
             1.1 Challenges of Information Access
             1.2 System Overview
             1.3 Reader's Guide

  Chapter 2: Document Representation
             2.1 Defining a Vector Space
             2.2 Controlling Dimensionality

  Chapter 3: Probabilistic Framework
             3.1 Bayesian Networks
             3.2 Machine Learning Overview

  Chapter 4: Related Work in Information Access
             4.1 Probabilistic Retrieval
             4.2 Feature Selection for Text
             4.3 Document Clustering
             4.4 Document Classification

Part II:  Clustering

  Chapter 5: Feature Selection for Clustering
             5.1 Introduction
             5.2 Mixture Modeling Revisited
             5.3 Theoretical Underpinnings
             5.4 Feature Selection Algorithms
             5.5 Empirical Results
             5.6 Conclusions

  Chapter 6: A New Model for Document Clustering
             6.1 Introduction
             6.2 Probabilistic Document Overlap
             6.3 Clustering Algorithms
             6.4 Results
             6.5 Comparison With Mixture Modeling
             6.6 Conclusion

Part III: Classification

  Chapter 7: Feature Selection for Classification
             7.1 Introduction
             7.2 Theoretical Framework
             7.3 An Approximate Algorithm
             7.4 Initial Results on Non-Text Domains
             7.5 Results on Text Domains
             7.6 Conclusions

  Chapter 8: Limited Dependence Bayesian Classifiers
             8.1 Introduction
             8.2 Probabilistic Classification Models
             8.3 The KDB Algorithm
             8.4 Initial Results on Non-Text Domains
             8.5 Results on Text Domains
             8.6 Conclusions and Related Work

  Chapter 9: Hierarchical Classification
             9.1 Introduction
             9.2 Hierarchical Classification Scheme
             9.3 Results
             9.4 Extensions to Directed Acyclic Graphs
             9.5 Conclusions

Part IV:  Putting It All Together

  Chapter 10: SONIA -- A Complete System
              10.1 Introduction
              10.2 SONIA on the InfoBus
              10.3 A Component View of SONIA
              10.4 Examples of System Usage
              10.5 Conclusions

  Chapter 11: Conclusions and Future Work
              11.1 Where Have We Been?
              11.2 Where Are We Going?

                            ABSTRACT

The explosion of on-line information has given rise to many
query-based search engines (such as Alta Vista) and manually
constructed topic hierarchies (such as Yahoo!).  But with the
current growth rate in the amount of information, query results
grow incomprehensibly large and manual classification in topic
hierarchies creates an immense information bottleneck.  Therefore,
these tools are rapidly becoming inadequate for addressing users'
information needs.

In this dissertation, we address these problems with a system for
topical information space navigation that combines the query-based and
taxonomic approaches. Our system, named SONIA (Service for Organizing
Networked Information Autonomously), is implemented as part of the
Stanford Digital Libraries testbed.  It enables the creation of
dynamic hierarchical document categorizations based on the full text
of articles.  Using probability theory as a formal foundation, we
develop several Machine Learning methods to allow document collections
to be automatically organized at a topical level.  First, to generate
such topical hierarchies, we employ a novel probabilistic clustering
scheme that outperforms traditional methods used in both Information
Retrieval and Probabilistic Reasoning.  Furthermore, we develop
methods for classifying new articles into such automatically
generated, or existing manually generated, hierarchies.  In contrast
to standard classification approaches, which do not make use of the
taxonomic relations in a topic hierarchy, our method explicitly uses
the existing hierarchical relationships between topics, leading to
improvements in classification accuracy.  Much of this improvement is
derived from the fact that the classification decisions in such a
hierarchy can be made by considering only the presence (or absence) of
a small number of features (words) in each document.  The choice of
relevant words is made using a novel information-theoretic algorithm
for feature selection.  Many of the components developed as part of
SONIA are also general enough that they have been successfully applied
to data mining problems in domains other than text.
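
As a rough illustration of the idea of classifying down a topic
hierarchy using only a few word-presence features at each node, here is
a minimal sketch.  It is not the algorithm from the dissertation: plain
mutual information stands in for the feature selection method, a
Bernoulli naive Bayes model stands in for the classifiers, and the toy
documents, topic names and all function names are made up for the
example.

```python
import math
from collections import Counter

def mutual_information(docs, labels, word):
    """Mutual information (in nats) between presence of `word` and the label."""
    n = len(docs)
    joint = Counter((word in d, y) for d, y in zip(docs, labels))
    px = Counter(word in d for d in docs)
    py = Counter(labels)
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def select_features(docs, labels, k):
    """Keep the k words most informative about the label (ties: alphabetical)."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda w: -mutual_information(docs, labels, w))[:k]

class NodeClassifier:
    """Bernoulli naive Bayes over a small, node-specific set of words."""

    def __init__(self, docs, labels, k):
        self.features = select_features(docs, labels, k)
        self.classes = sorted(set(labels))
        self.prior, self.p_word = {}, {}
        for c in self.classes:
            members = [d for d, y in zip(docs, labels) if y == c]
            self.prior[c] = len(members) / len(docs)
            # Laplace smoothing keeps every probability strictly inside (0, 1).
            self.p_word[c] = {
                w: (sum(w in d for d in members) + 1) / (len(members) + 2)
                for w in self.features
            }

    def predict(self, doc):
        def score(c):
            s = math.log(self.prior[c])
            for w in self.features:
                p = self.p_word[c][w]
                s += math.log(p if w in doc else 1 - p)
            return s
        return max(self.classes, key=score)

def train_hierarchy(examples, k):
    """Train one classifier at the root and one under each top-level topic."""
    docs = [d for d, _ in examples]
    root = NodeClassifier(docs, [path[0] for _, path in examples], k)
    children = {}
    for top in root.classes:
        sub = [(d, path[1]) for d, path in examples if path[0] == top]
        children[top] = NodeClassifier([d for d, _ in sub],
                                       [y for _, y in sub], k)
    return root, children

def classify(model, doc):
    """Route a document from the root down to a leaf topic."""
    root, children = model
    top = root.predict(doc)
    return top, children[top].predict(doc)

# Hypothetical toy corpus: each document is a set of words plus a
# (top-level topic, subtopic) path.
EXAMPLES = [
    ({"lab", "atom", "energy"},     ("science", "physics")),
    ({"lab", "quantum", "energy"},  ("science", "physics")),
    ({"lab", "cell", "gene"},       ("science", "biology")),
    ({"lab", "protein", "gene"},    ("science", "biology")),
    ({"match", "goal", "team"},     ("sports", "soccer")),
    ({"match", "goal", "league"},   ("sports", "soccer")),
    ({"match", "serve", "racket"},  ("sports", "tennis")),
    ({"match", "racket", "set"},    ("sports", "tennis")),
]

if __name__ == "__main__":
    model = train_hierarchy(EXAMPLES, k=2)
    print(classify(model, {"lab", "gene", "microscope"}))  # ('science', 'biology')
```

In this toy setting each node looks at a different pair of words: the
root picks "lab" and "match", while the node under "science" picks
"energy" and "gene".  A word such as "lab", which separates the
top-level topics, carries no information one level down, so it is
dropped there.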

The integration of hierarchical clustering and classification will
allow large amounts of information to be organized and presented to
users in an individualized and comprehensible way.  By alleviating the
information bottleneck, we hope to help users with the problems of
information access on the Internet.