Ben's thesis proposal this Wednesday at 4pm

Mon Nov 29 16:47:14 EST 2021

Team,

Please consider joining Benedikt Boecking's thesis proposal
presentation, which will be on Zoom only, this Wednesday, December
1st, at 4pm.

Details below.

Cheers,
Artur

Title:
Learning with Diverse Forms of Imperfect and Indirect Supervision

Abstract:
High-capacity Machine Learning (ML) models trained on large, annotated
datasets have driven impressive advances in several fields including
natural language processing and computer vision, in turn leading to
impactful applications of ML in areas such as healthcare, e-commerce,
and predictive maintenance. However, obtaining annotated datasets at
the scale required for training such models is costly and often
becomes a bottleneck for promising applications of ML. In this thesis,
I study imperfect and indirect forms of supervision (weak supervision)
such as partial rules and pairwise constraints as a mechanism to
encode domain knowledge, as these are frequently easy to obtain at
scale and can enable learning without pointillistic ground truth
annotations.

I begin by studying the utility of small amounts of pairwise
supervision for clustering, by using known group-membership
constraints to learn a kernel to improve constrained clustering
performance. Next, I propose a methodology that uses imperfect
pairwise labels to augment learning for programmatic data labeling
methods, which traditionally learn only from Labeling Functions (LFs),
i.e., user-defined functions that directly but imperfectly label
subsets of data. Such label models aggregate sources of imperfect
supervision to estimate the latent ground truth and act as teachers to
end models, thereby playing an essential role in achieving
generalization. Preliminary results show promising performance
improvements.
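
As a rough illustration of the LF idea described above (not taken from
the proposal itself), the sketch below defines three hypothetical LFs
for a toy spam task and combines them by majority vote. Majority vote
is only a simple baseline stand-in for a label model; the LF names,
the label constants, and the spam task are all assumptions made for
this example.

```python
from collections import Counter

# Hypothetical label constants for a toy spam-detection task.
ABSTAIN = -1  # returned by an LF when its rule does not apply
HAM, SPAM = 0, 1

# Each LF labels only the subset of inputs its rule covers,
# and may be wrong on that subset.
def lf_contains_link(text):
    return SPAM if "http" in text else ABSTAIN

def lf_short_message(text):
    return HAM if len(text.split()) <= 3 else ABSTAIN

def lf_mentions_prize(text):
    return SPAM if "prize" in text.lower() else ABSTAIN

LFS = [lf_contains_link, lf_short_message, lf_mentions_prize]

def majority_vote(text, lfs=LFS):
    """Baseline aggregation: most common non-abstaining vote.

    Returns ABSTAIN when no LF fires or when the top votes tie.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN  # tie between labels
    return counts[0][0]
```

A learned label model would replace majority_vote by estimating how
accurate each LF is, rather than weighting all votes equally.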

I further the study of programmatic data labeling methods by
introducing integrated, end-to-end learning frameworks and novel label
models. I first introduce a framework for jointly learning a label
model and an end model from LFs, showing improved end model
performance on downstream test sets over prior work. I then
propose a new methodology based on discrete latent variable modeling
in generative adversarial networks to improve estimates of the
unobserved ground truth through uncovering of disentangled, discrete
structures in the features.

Finally, I study two extremes on the spectrum of domain knowledge
acquisition in weak supervision: user interactivity for discovering
useful sources of imperfect labels, and learning merely from data
paired with unstructured natural language descriptions. I first
introduce an interactive learning framework that aids users in
discovering weak supervision sources to systematically and proactively
capture subject matter experts’ knowledge of the application domain in
an efficient and effective fashion. I then propose to study how
unstructured natural language descriptions (such as doctors' notes)
paired with images can be exploited for image representation learning
and zero-shot classification, without requiring experts to define
rules on the text or images as in prior related work.

Together, these works provide novel methodologies and frameworks to
more efficiently encode expert domain knowledge in ML models, reducing
the bottleneck created by the need for pointillistic ground truth
annotations.

Thesis Committee
Artur Dubrawski (Chair)
Barnabás Póczos
Jeff Schneider
Hoifung Poon (Microsoft Research)

Zoom Meeting ID: 936 5393 2784 Passcode: 877794