Synthetically enlarging the database

hicks at cs.titech.ac.jp
Fri Jul 30 15:49:15 EDT 1993


About synthetically enlarging the database:

Harris Drucker writes:
>In our work in OCR using multilayer networks (single
>layer networks are not powerful enough) boosting has
>ALWAYS improved performance. Synthetically enlarging
>the database using deformations of the original data is
>essential.

(Point of view from outside the learning system)  It seems to me that the cost
of obtaining training data is an issue implicit in the above statement, and it
ought to be made explicit.  As the size of the original training set grows, the
benefit of synthetically created data should diminish.  Moreover, wouldn't it
be correct to say that one could always do better with N extra randomly
selected real training samples than with N extra synthetically created ones?
Nevertheless, the cost of obtaining real training data is a real factor, while
synthetically created training data may be virtually free.
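
To make that concrete, here is a minimal sketch (in Python with numpy/scipy,
purely illustrative; the deformation ranges and helper names are my own
assumptions, not Drucker's actual procedure) of enlarging a training set by
deforming a single image:

import numpy as np
from scipy.ndimage import rotate, shift

def synthesize(image, n_copies=5, max_angle=10.0, max_shift=2.0, rng=None):
    # Create n_copies deformed versions of one training image by applying
    # small random rotations (in degrees) and translations (in pixels).
    rng = np.random.default_rng(rng)
    copies = []
    for _ in range(n_copies):
        angle = rng.uniform(-max_angle, max_angle)
        dy, dx = rng.uniform(-max_shift, max_shift, size=2)
        deformed = rotate(image, angle, reshape=False, mode="nearest")
        deformed = shift(deformed, (dy, dx), mode="nearest")
        copies.append(deformed)
    return copies

Each synthetic copy inherits the label of its parent image, so the training
set grows by a factor of (1 + n_copies) at essentially zero data-collection
cost; the only cost is the extra learning, which is the point taken up below.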

(Point of view from inside the system)  But what about the cost of learning any
training data, synthetic or otherwise?  Synthesizing training data may be
cheaper than obtaining real training data, but the synthetic samples still have
to be learned.  Is it possible to have synthesis without extra learning cost?
Consider that synthetically creating data has the effect of compressing the
size of the input space (and thus enforcing smoothing) in the same way as would
a preprocessing front end providing translational invariance.  In both cases a
single input is given to the system and the system learns many samples:
explicitly in the case of synthetic creation, implicitly in the case of
translational invariance.  The former incurs extra learning cost, the latter
none.  I know this is not a good example, because translational invariance is a
trivial problem, and the difficult problems do require more learning.
Synthetically creating data is one way to smooth the region around a
(non-synthetic) training sample, but aren't there others?  For example, one
could add a penalty term for the complexity of the output function (or of some
internal representation, if there is no continuous output function) around the
sample point.
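
As a rough illustration of what such a penalty might look like (a sketch only,
again in Python; the finite-difference estimate, the function names, and the
weighting factor lam are my own assumptions, not a worked-out proposal):

import numpy as np

def smoothness_penalty(f, x, eps=1e-2, n_dirs=4, rng=None):
    # Finite-difference estimate of how much the output function f
    # varies in a small neighbourhood of the sample point x.
    rng = np.random.default_rng(rng)
    fx = f(x)
    penalty = 0.0
    for _ in range(n_dirs):
        d = rng.normal(size=x.shape)
        d *= eps / np.linalg.norm(d)        # small random perturbation
        penalty += np.sum((f(x + d) - fx) ** 2)
    return penalty / n_dirs

# total_loss = data_term + lam * sum(smoothness_penalty(f, x) for x in batch)

Adding such a term to the training criterion enforces smoothness around each
real sample directly, rather than by generating and then learning many
deformed copies of it.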


Craig Hicks           hicks at cs.titech.ac.jp
Ogawa Laboratory, Dept. of Computer Science
Tokyo Institute of Technology, Tokyo, Japan
lab: 03-3726-1111 ext. 2190  		home: 03-3785-1974
fax: +81(3)3729-0685 (from abroad), 03-3729-0685  (from Japan)