THE DATA ALLOCATION PROBLEM

Thu Oct 1 14:10:05 EDT 1998

I have recently become interested in the following problem and would be
grateful for any feedback (mailto:kehagias at egnatia.ee.auth.gr).

The Setup: Consider a collection of data: y(1), y(2), y(3), ..., generated
by more than one source. At time t one of the sources is activated
(perhaps randomly) and generates the datum y(t). We want to identify the
number of active sources and extract some information regarding each source
(e.g. an input/output model, or some statistics such as mean value,
standard deviation etc. of the source's output). No a priori information is
available regarding the number, behavior etc. of the sources. In
particular, the observed data are unlabelled, i.e. it is not known which
source is active at time t.
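
To make the setup concrete, here is a minimal Python sketch of such a data
stream. Everything specific in it is an assumption made for illustration
only: the sources are taken to be Gaussian with fixed mean and standard
deviation, the active source is drawn uniformly at random at each step, and
the name generate_stream is hypothetical. The setup itself places no such
restrictions on the sources.

```python
import random

def generate_stream(sources, n, seed=0):
    """Simulate n time steps of a multi-source stream.

    sources: list of (mean, std) pairs, one per source (an assumed
             Gaussian form, chosen only for illustration).
    Returns (labels, data). The labels are returned for later
    evaluation; an allocation algorithm would see only the data.
    """
    rng = random.Random(seed)
    labels, data = [], []
    for _ in range(n):
        k = rng.randrange(len(sources))    # hidden active source at time t
        mean, std = sources[k]
        labels.append(k)
        data.append(rng.gauss(mean, std))  # observed datum y(t)
    return labels, data

# Hypothetical example: two sources with well-separated means.
labels, y = generate_stream([(0.0, 1.0), (10.0, 1.0)], n=1000)
```

The observer receives only y; the list labels corresponds to the unknown
information about which source is active at time t.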

The Online Data Allocation Task: It seems to me that in such a situation
the major task is data allocation. I mean this: if the observed data were
partitioned into groups, each group containing data generated by a single
source, then each data group could be used to train a model for the
respective source. Generally speaking, training on clean data groups should
not be too hard. However, since the data are not labelled, it is not
immediately clear how to allocate them between groups. As I will explain a
little later, the problem seems harder for the online case (with a
continuously incoming stream of data) than for the offline case, where a
finite data set is involved. 
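
As an illustration of what an online allocation step might look like, the
following Python sketch assigns each incoming datum to the group with the
nearest running mean and opens a new group when no mean is close enough.
This is only one simple possibility, not the method of any particular
paper; the function name allocate_online, the scalar data, and the
threshold parameter are all assumptions made for the example.

```python
def allocate_online(stream, threshold):
    """Allocate scalar data to groups one datum at a time.

    Each group is summarized by a running mean (a very simple "model"
    of its source). A datum farther than `threshold` from every
    existing mean starts a new group.
    Returns (assignment, means).
    """
    means, counts, assignment = [], [], []
    for y in stream:
        # Index of the nearest group, or None if no group exists yet.
        g = min(range(len(means)),
                key=lambda i: abs(y - means[i]),
                default=None)
        if g is None or abs(y - means[g]) > threshold:
            means.append(y)                 # open a new group
            counts.append(1)
            g = len(means) - 1
        else:
            counts[g] += 1                  # update the group's model
            means[g] += (y - means[g]) / counts[g]
        assignment.append(g)
    return assignment, means

# Hypothetical stream alternating between two well-separated sources.
assignment, means = allocate_online(
    [0.1, 9.9, -0.2, 10.3, 0.0, 10.1], threshold=4.0)
# assignment is [0, 1, 0, 1, 0, 1]
```

Note that each allocation decision depends on models trained on earlier
allocations, so early mistakes can affect later decisions. An offline
method can revisit the whole data set instead.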

The Convergence Question: Special cases of the above problem and various
solutions have appeared in the literature. I am interested in obtaining
quite general sufficient (and necessary?) conditions for an online data
allocation process to converge to a correct solution. By "correct
solution", I mean a partition of the observed data into groups such that
every group contains predominantly data from one source and every source
corresponds to only one data group. The convergence conditions should be
fairly general, so as to allow a unified treatment of many different data
allocation algorithms and different kinds of sources (and data).
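
The notion of a "correct solution" can be stated as a check on a finite
labelled sample, for use in simulations where the true sources are known.
The Python sketch below is one possible reading, under stated assumptions:
"predominantly" is taken to mean that the most frequent source in a group
accounts for more than a fraction purity of that group, and "every source
corresponds to only one data group" is taken to mean that each source is
the dominant source of exactly one group. The function name
is_correct_partition and the default purity of 0.5 are hypothetical.

```python
from collections import Counter

def is_correct_partition(true_sources, groups, purity=0.5):
    """Check a partition against known source labels.

    true_sources[t] is the source that generated datum t;
    groups[t] is the group to which datum t was allocated.
    """
    # Count, for each group, how many data came from each source.
    by_group = {}
    for s, g in zip(true_sources, groups):
        by_group.setdefault(g, Counter())[s] += 1

    dominant = []
    for counts in by_group.values():
        source, n = counts.most_common(1)[0]
        if n / sum(counts.values()) <= purity:
            return False        # group is not predominantly one source
        dominant.append(source)

    # Each source must dominate exactly one group.
    return (len(set(dominant)) == len(dominant)
            and set(dominant) == set(true_sources))

ok = is_correct_partition([0, 0, 1, 1], [5, 5, 7, 7])      # True
split = is_correct_partition([0, 0, 1, 1], [0, 1, 2, 2])   # False
```

In the second call the data of source 0 are spread over two groups, so the
partition is rejected even though every group is perfectly pure.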

We have obtained some results, which appear in our recent book (announced
in a separate posting) and in a series of papers (also announced in a
separate posting). I summarize our results in my web site at

http://skiron.control.ee.auth.gr/~kehagias/thn/thn030.htm

At this point I am interested in getting some feedback regarding possible
approaches to the problem, relevant bibliographic pointers and so on. I
already have a modestly sized bibliography on this. I will summarize all
responses and post the summary.

___________________________________________________________________
Ath. Kehagias
--Assistant Prof. of Mathematics, American College of Thessaloniki 
--Research Ass., Dept. of Electrical and Computer Eng. Aristotle Univ.,
Thessaloniki, GR54006, GREECE

--email:	 kehagias at egnatia.ee.auth.gr, kehagias at ac.anatolia.edu.gr
--web:	http://skiron.control.ee.auth.gr/~kehagias/index.htm