Finiteness of VC Dimension of Sigmoidal Feedforward "Neural Nets"

sontag@control.rutgers.edu
Mon Oct 5 18:28:35 EDT 1992


     Finiteness of VC Dimension of Sigmoidal Feedforward "Neural Nets"
		Angus Macintyre, Oxford University
		Eduardo Sontag, Rutgers University

	[This is NOT a TeX file; it is just for reading on the screen.]

It was until now, as far as we know, an open question whether "sigmoidal neural
networks" lead to learnable (in the sense of sample complexity) classes of
concepts.  We wish to remark via this announcement that indeed the
corresponding VC dimension is finite.  The result holds without imposing any
bounds on weights.  The proof, which is outlined below, consists of simply
putting together a couple of recent results in model theory.
(A detailed paper, exploring other consequences of these results in the context
of neural networks, is being prepared.  This problem is also related to issues
of learnability for sparse polynomials [="fewnomials"].)

More precisely, we define a _sigma-circuit_ as an unbounded fan-in circuit
(i.e., a directed acyclic graph) whose edges are labeled by real numbers
(called "weights"), and, except for the input nodes (i.e., the nodes of
in-degree zero), every node is also labeled by a real number (called its
"bias").  There is a single output node (i.e., node of out-degree zero).  We
think of such a circuit as computing a function R^m -> R, where m is the
number of input nodes.  This function is inductively defined on nodes V as
follows.  If V is the i-th input node, it computes just the projection
F(u1,...,um)=ui.
If V is a noninput node, its function is

                   s( w0 + w1.u1 + ... + wk.uk )

where w0 is the bias of the node V, u1,...,uk are the functions computed by
the nodes V1,...,Vk with edges into V, and wi is the weight on the edge from
Vi to V.
The function computed by the output node is the function computed by the
circuit.  Here "s" denotes the "standard sigmoid":

                  s(x) = 1/(1 + e^-x) .

(The results will depend critically on the choice of this particular s(x),
which is standard in neural network theory.  Minor modifications are possible,
but the result is most definitely false if, e.g., s(x) = sin(x).  The result
is even false for other "sigmoidal-type" functions s; see e.g. [5].)
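
For illustration only (this sketch and its conventions are ours, not part of
the announcement), a sigma-circuit can be evaluated in a few lines of Python,
listing the noninput nodes in topological order:

    import math

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Each noninput node is (name, bias, [(predecessor_name, weight), ...]);
    # the last node in the list is taken to be the output node.
    def eval_circuit(nodes, inputs):
        vals = dict(inputs)              # input node name -> real value
        for name, bias, in_edges in nodes:
            z = bias + sum(w * vals[p] for p, w in in_edges)
            vals[name] = sigmoid(z)      # s( w0 + w1.u1 + ... + wk.uk )
        return vals[nodes[-1][0]]

    # Example: two inputs, one hidden node h, output node o.
    nodes = [("h", 0.5, [("u1", 1.0), ("u2", -2.0)]),
             ("o", -1.0, [("h", 3.0)])]
    print(eval_circuit(nodes, {"u1": 0.2, "u2": 0.7}))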

We also define a _sigmoidal feedforward architecture_ as a circuit in which
all weights and biases are left as variables.  In that case, we write F(u,w)
for the function computed by the circuit obtained by specializing all weights
and biases to a particular vector w.  We view an architecture in the obvious
way as a map

    F : R^m x R^r  -> R

where 'r' is the total number of weights and biases.
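
For instance (again ours, using sigmoid from the sketch above), the map F for
a one-hidden-layer architecture with m inputs and n hidden nodes, with all
r = n(m+1) + n + 1 weights and biases packed into a single vector w:

    # Per hidden node, w holds a bias followed by m input weights; after
    # these n*(m+1) entries come the output node's bias and its n weights.
    def F(u, w, n):
        m = len(u)
        h = [sigmoid(w[i*(m+1)] +
                     sum(w[i*(m+1) + 1 + j] * u[j] for j in range(m)))
             for i in range(n)]
        base = n * (m + 1)
        return sigmoid(w[base] + sum(w[base + 1 + i] * h[i] for i in range(n)))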

For each subset S of R^m, a _dichotomy_ on S is a function c: S -> {-1,+1}.
We say that a function  f: R^m -> R  _implements_ this dichotomy if it holds
that
        c(x) > 0   <==>  f(x) > 0     for every x in S .

THEOREM.
Let F be an architecture.  Then there exists a (finite) integer d = VC(F) such
that, for each subset S of R^m of cardinality > d, there is some dichotomy  c
on S which cannot be implemented by any function f = F(.,w),  w in R^r.
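
To make the statement concrete, here is a rough one-sided shattering test in
Python (ours, for illustration only; it builds on the sketches above).  It
samples random weight vectors and records which dichotomies of S they
implement; finding all 2^|S| of them certifies that S is shattered, while
failing to find them proves nothing.  The 1/2 threshold is our choice, made
because the standard sigmoid is everywhere positive:

    import random

    def seems_shattered(S, F, r, n, trials=100000):
        found = set()
        for _ in range(trials):
            w = [random.uniform(-20.0, 20.0) for _ in range(r)]
            # Dichotomy implemented by F(.,w): +1 where output exceeds 1/2.
            found.add(tuple(F(u, w, n) > 0.5 for u in S))
            if len(found) == 2 ** len(S):
                return True              # S is shattered by the architecture
        return False                     # inconclusive

By the theorem, if |S| > VC(F) then this function can never return True.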

Sketch of proof:

First note that one can write a formula Q(u,w) in the first-order language of
the real numbers with addition, multiplication, order, and real exponentiation,

                Th(R,+,.,0,1,<,exp(.))

(all real constants are included as well), such that:

   for each (u,w) in R^m x R^r,  F(u,w) > 0 if and only if Q(u,w) is true .

(Divisions, as needed in computing s(x), can be encoded by adding new
variables z, including an atomic formula of the type

                z . (1 + e^-x) = 1,

and then existentially quantifying over z.)
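
As a worked instance of this encoding (for illustration only), consider the
architecture F(u,w) = s( w3 + w2.s( w0 + w1.u ) ) with one input and one
hidden node.  Writing z and y for the values of the hidden and output nodes,

    F(u,w) > 0  <==>  exists z exists y [ z . (1 + e^-(w0 + w1.u)) = 1
                                     and  y . (1 + e^-(w3 + w2.z)) = 1
                                     and  y > 0 ] ,

a formula involving only addition, multiplication, order, exponentiation,
and real constants.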

The paper [1] deals precisely with the problem of proving the existence of
such finite integers d, for formulas in first-order theories.  On page 383,
first paragraph, it is shown that the desired result will be true if there is
_order-minimality_ for the corresponding theory.  In our context, this latter
property means that every set of the form:

      { x in R  | P(x,w) true } ,  w in R^r ,

where P is any formula in the above language, with SCALAR x, must be a finite
union of intervals (possibly points).  (This is where the choice of s is
critical: for instance, { x in R | sin(x) > 0 } is an infinite union of
intervals, so the analogous theory with sin is not order-minimal.)  Now, the
papers [2]-[4] prove that order-minimality indeed holds.  (The paper [1]
stated that order-minimality was an open problem for Th(R,+,.,0,1,<,exp(.));
in fact, the papers [2]-[4] were written while [1] was in press.)

Remarks:

Note that we do not give explicit bounds here, but only remark that finiteness
holds.  However, the results being used are essentially constructive, and it
is in principle possible to compute the VC dimension of a given architecture.
Observe also that one can extend this result to more general architectures
than the ones considered here.  High-order nets (for which products of inputs
are allowed) can be treated with the same proof, as can nets with if-then-else
nodes (an encoding of the latter is sketched below).  If-then-else nodes allow
the application of the same techniques to the Blum-Shub-Smale model of
computation as well.
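
For illustration, an if-then-else node computing b when a > 0 and c otherwise
stays within the same first-order language, since its graph is definable:

    y = if-then-else(a,b,c)  <==>  ( a > 0  and  y = b )
                               or  ( a <= 0  and  y = c ) ,

so such nodes can be eliminated from the formula Q(u,w) just as divisions were.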

A number of further questions about sigmoidal nets, including decidability,
loading, interpolation, and teaching-dimension issues, can also be treated
using model-theoretic techniques, and will be the subject of a forthcoming
paper.

References:

[1] Laskowski, Michael C., "Vapnik-Chervonenkis classes of definable sets,"
J. London Math. Soc. (2) 45 (1992): 377-384.

[2] Wilkie, Alec J., "Some model completeness results for expansions of the
ordered field of reals by Pfaffian functions," preprint, Oxford, 1991,
submitted.

[3] Wilkie, Alec J., "Smooth o-minimal theories and the model completeness of
the real exponential field," preprint, Oxford, 1991, submitted.

[4] Macintyre, Angus, Lou van den Dries, and David Marker, "The elementary
theory of restricted analytic fields with exponentiation," preprint, Oxford,
1991.

[5] Sontag, Eduardo D., "Feedforward nets for interpolation and
classification," J. Comp. Syst. Sci. 45 (1992): 20-48.

