thesis announcement

Dunja Mladenic Dunja.Mladenic at ijs.si
Mon Mar 8 07:08:38 EST 1999


I'm glad to announce the availability of my PhD thesis "Machine
Learning on non-homogeneous, distributed text data". Advisors:
Prof. Ivan Bratko, Prof. Tom M. Mitchell.  The thesis is available at
  http://www.cs.cmu.edu/~TextLearning/pww/PhD.html 
as well as at
  http://www-ai.ijs.si/DunjaMladenic/PhD.html

Best regards,
              Dunja Mladenic

================

ABSTRACT

This dissertation proposes new machine learning methods where the
corresponding learning problem is characterized by a high number of
features, unbalanced class distribution and asymmetric
misclassification costs. The input is given as a set of text documents
or their Web addresses (URLs). The induced target concept is
appropriate for the classification of new documents including
shortened documents describing individual hyperlinks. The proposed
methods are based on several new solutions.

Proposed is a new, enriched document representation that extends the
bag-of-words representation by adding word sequences and document
topic categories. Features that represent word sequences are generated
using a new efficient procedure. Features giving topic categories are
obtained from background knowledge constructed using the new machine
learning method for learning from class hierarchies. When learning
from a class hierarchy, the high number of class values, examples and
features is handled by (1) dividing the problem into subproblems based
on the hierarchical structure of class values and examples, (2)
applying feature subset selection and (3) pruning unpromising class
values during classification.
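
As an illustration of the enriched representation described above, the
following is a minimal sketch that adds word sequences to a
bag-of-words representation. It is a generic n-gram extraction with a
frequency cutoff, not the procedure developed in the thesis; the
function names and the parameters max_len and min_count are
assumptions made for this example, and topic-category features are
left out.

```python
from collections import Counter


def ngrams(tokens, max_len):
    """Yield all word sequences of length 1..max_len as space-joined strings."""
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


def extract_features(docs, max_len=2, min_count=2):
    """Build feature vectors over words and frequent word sequences.

    max_len and min_count are illustrative assumptions: sequences longer
    than max_len, or occurring fewer than min_count times in the whole
    collection, are dropped.
    """
    tokenized = [doc.lower().split() for doc in docs]

    # Count every candidate feature over the whole collection.
    totals = Counter()
    for tokens in tokenized:
        totals.update(ngrams(tokens, max_len))

    # Keep only features that are frequent enough.
    vocab = {feat for feat, count in totals.items() if count >= min_count}

    # Represent each document by the counts of its retained features.
    vectors = [
        Counter(feat for feat in ngrams(tokens, max_len) if feat in vocab)
        for tokens in tokenized
    ]
    return vocab, vectors


docs = [
    "machine learning on text",
    "machine learning on the web",
    "text data",
]
vocab, vectors = extract_features(docs)
```

With these three toy documents, the sequence "machine learning" is
retained because it occurs twice, while "the web" occurs once and is
dropped.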

Several new feature scoring measures are proposed as a result of
comparison and analysis of different feature scoring measures used in
feature subset selection on text data. The new measures are
appropriate for text domains with several tens or hundreds of
thousands of features, and can handle unbalanced class distribution
and asymmetric misclassification costs.
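
To make the idea of a feature scoring measure concrete, here is a
minimal sketch of feature subset selection with one well-known
measure, the log odds ratio of a word for a chosen target class. This
abstract does not name the measures proposed in the thesis, so the
choice of odds ratio, the smoothing constants, and the function names
are all assumptions made for this example.

```python
import math
from collections import Counter


def odds_ratio_scores(docs, labels, positive):
    """Score each word by its log odds ratio for the `positive` class.

    Probabilities are estimated from document frequencies. The additive
    smoothing (0.5 and 1.0) is an assumption of this sketch; it keeps
    the estimates strictly between 0 and 1.
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos

    # Document frequency of each word in positive and negative documents.
    pos_df, neg_df = Counter(), Counter()
    for doc, y in zip(docs, labels):
        words = set(doc.lower().split())
        (pos_df if y == positive else neg_df).update(words)

    scores = {}
    for word in set(pos_df) | set(neg_df):
        p_pos = (pos_df[word] + 0.5) / (n_pos + 1.0)
        p_neg = (neg_df[word] + 0.5) / (n_neg + 1.0)
        scores[word] = math.log(p_pos * (1 - p_neg) / ((1 - p_pos) * p_neg))
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


docs = [
    "cheap pills online",
    "cheap offer",
    "meeting notes",
    "project meeting notes",
    "lunch notes",
]
labels = ["spam", "spam", "ham", "ham", "ham"]
scores = odds_ratio_scores(docs, labels, positive="spam")
top = select_top_k(scores, 1)
```

Scoring with respect to a single target class is one simple way to
focus on a rare class when the class distribution is unbalanced.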

The developed methods are suitable for the classification of documents
including shortened documents. We build descriptions of hyperlinks,
and treat these as shortened documents.  Since each hyperlink on the
Web points to some document, the classification of hyperlinks
(corresponding shortened documents) could potentially be improved by
using this information. We give the results of preliminary experiments
for learning in domains with mutually dependent class attributes.

Training examples are used for learning `a next state function on the
Web', where document content (class attributes) is predicted from the
hyperlink (feature-vector) that points to the document. The document
content we are predicting is represented as a feature-vector, each
feature being one of the mutually dependent class attributes.

The proposed methods and solutions are implemented and experimentally
evaluated on real-world data collected from the Web in three
independent projects. It is shown that document classification,
categorization and prediction using the proposed methods perform well
on large, real-world domains.

The experimental findings further indicate that the developed methods
can efficiently be used to support analysis of large amounts of text
data, automatic document categorization and abstraction, document
content prediction based on the hyperlink content, classification of
shortened documents, development of user-customized text-based
systems, and user-customized Web browsing. As such, the proposed
machine learning methods contribute to machine learning and to related
fields of text-learning, data mining, intelligent data analysis,
information retrieval, intelligent user interfaces, and intelligent
agents.  Within machine learning, this thesis contributes an approach
to learning on large, distributed text data, learning on hypertext,
and learning from class hierarchies.  Within computer science, it
contributes to better design of Web browsers and software assistants
for people using the Web.

