Connectionists: weight guessing quickly solves n-bit parity

Schmidhuber Juergen juergen at idsia.ch
Wed Jul 20 10:55:25 EDT 2022


I have never understood the difference between "symbolic" and "sub-symbolic" reasoning. A recurrent neural net (RNN) is a general computer that can do both. n-bit parity for arbitrary n can be solved by a tiny RNN with only a few connections, reading the bits sequentially, one at a time [1]. The best learning algorithm is NOT gradient descent. Instead, keep randomly initializing the RNN weights between -100 and +100 until the RNN solves parity for a few training examples of various large sizes n (this will take just 1000 trials or so). The RNN will then probably generalize to ANY n. BTW, try that with a Transformer - it will never generalize like that.

[1] J. Schmidhuber and S. Hochreiter. Guessing can outperform many long time lag algorithms. Technical Note IDSIA-19-96, IDSIA, 1996
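
For concreteness, here is a minimal sketch of the guessing procedure in Python. This is an illustration only, not the code from [1]: the 2-unit tanh architecture, the training sequence lengths, and the trial budget are arbitrary choices, and the number of guesses actually needed will vary with them.

```python
# Minimal sketch of weight guessing for n-bit parity (illustrative assumptions,
# not a reproduction of the setup in [1]).
import numpy as np

rng = np.random.default_rng(0)
H = 2  # tiny hidden layer

def run_rnn(bits, Wxh, Whh, Why, bh, by):
    """Read the bits one at a time; return the scalar output after the last bit."""
    h = np.zeros(H)
    for b in bits:
        h = np.tanh(Wxh * b + Whh @ h + bh)
    return np.tanh(Why @ h + by)

def parity(bits):
    return sum(bits) % 2

# A few training sequences of various lengths n.
train = [list(rng.integers(0, 2, n)) for n in (3, 5, 8, 13) for _ in range(4)]

# Keep guessing weights in [-100, 100] until every training example is solved.
solved = None
for trial in range(200_000):
    params = (rng.uniform(-100, 100, H),       # Wxh: input-to-hidden
              rng.uniform(-100, 100, (H, H)),  # Whh: hidden-to-hidden
              rng.uniform(-100, 100, H),       # Why: hidden-to-output
              rng.uniform(-100, 100, H),       # bh:  hidden bias
              rng.uniform(-100, 100))          # by:  output bias
    if all((run_rnn(x, *params) > 0) == parity(x) for x in train):
        solved = params
        print(f"training set solved after {trial + 1} guesses")
        break

# If a solution was found, check generalization to much longer, unseen sequences.
if solved is not None:
    for n in (50, 200, 1000):
        x = list(rng.integers(0, 2, n))
        ok = (run_rnn(x, *solved) > 0) == parity(x)
        print(f"{n}-bit test sequence: {'correct' if ok else 'wrong'}")
```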

Jürgen




On 19 Jul 2022, at 23:08, Danko Nikolic <danko.nikolic at gmail.com> wrote:

Dear Barak,

Thank you for the pointers. I have to read the papers. I then need to understand why, if parity is so easy to learn, my deep learning models had such a hard time with it that the number of parameters they needed grew exponentially with each additional input bit. Strange.

I will report back.

Best

Danko

Dr. Danko Nikolić
www.danko-nikolic.com
https://www.linkedin.com/in/danko-nikolic/
-- I wonder, how is the brain able to generate insight? --


On Mon, Jul 18, 2022 at 1:12 PM Barak A. Pearlmutter <barak at pearlmutter.net> wrote:
On Mon, 18 Jul 2022 at 08:28, Danko Nikolic <danko.nikolic at gmail.com> wrote:
In short, learning mechanisms cannot discover generalized XOR functions with simple connectivity -- only with complex connectivity. This problem results in exponential growth of needed resources as the number of bits in the generalized XOR increases.

Assuming that "generalized XOR" means parity, this must rely on some
unusual definitions which you should probably state in order to avoid
confusion.

Parity is a poster boy for an *easy* function to learn, albeit a
nonlinear one. This is because in the (boolean) Fourier domain its
spectrum consists of a single nonzero coefficient, and functions that
are sparse in that domain are very easy to learn. See N. Linial, Y. Mansour, and N. Nisan, "Constant depth circuits, Fourier transform and learnability," FOCS 1989; or Y. Mansour, "Learning Boolean Functions via the Fourier Transform," in Theoretical Advances in Neural Computation and Learning, pp. 391-424, 1994, doi:10.1007/978-1-4615-2696-4_11.
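
To make "sparse in the Fourier domain" concrete, here is a small sketch (mine, not code from either reference): writing the function values as +1 (even) / -1 (odd), n-bit parity is exactly the degree-n monomial over the hypercube, so its boolean Fourier spectrum has a single nonzero coefficient, sitting on the full coordinate set. The brute-force check below confirms this for n = 4.

```python
# Brute-force boolean Fourier spectrum of 4-bit parity (illustration only).
from itertools import combinations, product

n = 4  # 4-bit parity, as in the thread

def f(x):
    """Parity of 0/1 bits, with values encoded as +1 (even) / -1 (odd)."""
    return (-1) ** sum(x)

def chi(S, x):
    """Fourier character for the subset S of coordinates."""
    return (-1) ** sum(x[i] for i in S)

# Fourier coefficient  f_hat(S) = E_x[ f(x) * chi_S(x) ]  over the uniform cube.
for k in range(n + 1):
    for S in combinations(range(n), k):
        coeff = sum(f(x) * chi(S, x) for x in product([0, 1], repeat=n)) / 2 ** n
        if abs(coeff) > 1e-12:
            print("nonzero coefficient:", S, "=", coeff)
# Prints only  (0, 1, 2, 3) = 1.0  -- every other coefficient is zero.
```

Algorithms that exploit spectral sparsity, like those in the cited papers, handle such functions easily precisely because all of the mass sits on one coefficient.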

--Barak Pearlmutter



On 18 Jul 2022, at 18:01, Gary Marcus <gary.marcus at nyu.edu> wrote:

Sure, but a person can learn the idea for n bits from a few examples with a small number of bits and generalize it to large values of n. Most current systems learn it for a certain number of bits and don't generalize beyond that number.

On Jul 18, 2022, at 7:17 AM, Barak A. Pearlmutter <barak at pearlmutter.net> wrote:



On Mon, 18 Jul 2022 at 14:43, Danko Nikolic <danko.nikolic at gmail.com> wrote:
[attached image]

It is a hard problem to learn for a connectionist network.

We don't need to invent new terminology, like "inverters problem" or "generalized xor." This is parity. Four (4) bit parity.

https://en.wikipedia.org/wiki/Parity_function

Parity is *not* a hard function to learn. Even for a connectionist network.

It is an interesting function for historic reasons (n-bit parity cannot be loaded by a k-th order perceptron for k < n, although there are loopholes if random bits are available or if you are only required to almost load it) and because it appears in many mathematical constructions. See the above Wikipedia page for some details. But it is not super difficult to learn.

--Barak Pearlmutter.




On 18 Jul 2022, at 00:58, gary at ucsd.edu <gary at eng.ucsd.edu> wrote:

Sorry, I can't let this go by:
And that is a good thing, because generalized XOR scales worse than a power law. It scales exponentially! This is a more aggressive form of explosion than a power law.
I'm not sure exactly what you mean by this, but a single-hidden-layer network with N inputs and N hidden units can solve N-bit parity. The hidden units have increasing thresholds: the first turns on when one input bit is on and drives the output with a weight of +1; when two input bits are on, a second hidden unit also turns on and cancels the activation of the first via a weight of -1; and so on.
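
A quick sketch of that construction, with hard thresholds for clarity (an illustration of the idea, not any particular published implementation): hidden unit k fires when at least k input bits are on, and with output weights alternating +1, -1, +1, ... the contributions cancel in pairs, so the output unit fires exactly when an odd number of input bits are on.

```python
# N-bit parity with one hidden layer of N threshold units (illustrative sketch).
import numpy as np
from itertools import product

def parity_net(x):
    x = np.asarray(x)
    N = len(x)
    s = x.sum()
    hidden = np.array([1 if s >= k else 0 for k in range(1, N + 1)])    # increasing thresholds
    out_weights = np.array([(-1) ** (k - 1) for k in range(1, N + 1)])  # +1, -1, +1, ...
    return int(hidden @ out_weights > 0.5)                              # output unit threshold

# Exhaustive check for 4-bit parity (the construction works for any N).
N = 4
assert all(parity_net(x) == sum(x) % 2 for x in product([0, 1], repeat=N))
print("N =", N, ": network output matches parity on all", 2 ** N, "inputs")
```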

g.


On Sat, Jul 16, 2022 at 12:03 AM Danko Nikolic <danko.nikolic at gmail.com> wrote:
Dear Thomas,

Thank you for reading the paper and for the comments.

I quote: "In my experience, supervised classification scales linearly in the number of classes."
It would be good to quantify that as a plot; maybe it deserves a research paper of its own. The reason is that everyone else who has tried to quantify that relation seems to have found a power law, so at this point a linear relationship would be a surprising result. It would probably make a widely read paper.

But please do not forget that my argument states that even a linear relationship is not good enough to match biological brains. When it comes to model size, we need something closer to a power law with exponent zero, i.e., a constant number of parameters in the model. And when it comes to learning time, we need a linear relationship: each newly learned object should need about as much learning effort as each previous object.

I quote: "The real world is not dominated by generalized XOR problems."
Agreed. And that is a good thing, because generalized XOR scales worse than a power law. It scales exponentially! This is a more aggressive form of explosion than a power law.
Importantly, a generalized AND operation also scales exponentially (though with a smaller exponent). I guess we would agree that the real world probably contains a lot of AND problems. The only logical operation that could be learned with a linear increase in the number of parameters was a generalized OR. Finally, I found that a mixture of AND and OR resulted in power law-like scaling of the number of parameters. So, a mixture of AND and OR seemed to scale as well (or as badly) as the real world. I have put this information into the Supplementary Materials.

The conclusion I derived from those analyses is: connectionism is not sustainable for reaching human (or animal) levels of intelligence. Therefore, I hunted for an alternative paradigm.

Greetings,

Danko


