<div dir="ltr"><div><br></div><div>I believe one way to address this<span style="color:rgb(0,0,0)"> problem, that learning with different types of features may</span><span style="color:rgb(0,0,0)"> give up other virtues (of Transformers etc.), is to scale better by:</span><div><font color="#000000"><br></font><div><span style="color:rgb(0,0,0)">1) reducing the cost of learning, so that the same information over more feature types can be learned at once. </span></div><div><span style="color:rgb(0,0,0)"><br></span></div><div><div><font color="#000000">2) allowing new features/learning to be added modularly to the model (e.g. avoiding catastrophic forgetting)</font></div><div><font color="#000000"><br></font></div><div><font color="#000000">3) </font><span style="color:rgb(0,0,0)"> </span><span style="color:rgb(0,0,0)">not deciding ahead of time which features are most important</span></div><br class="gmail-Apple-interchange-newline"><div><font color="#000000">4) taking a shotgun approach and learning with as many features as possible</font></div><div><font color="#000000"><br></font></div><div><font color="#000000">These goals can be better achieved if the network's learning (or at least top-layer learning) does not require iid (independent and identically distributed) rehearsal and is super scalable.</font></div><div><font color="#000000"><br></font></div><div><font color="#000000">Feedforward methods (e.g. current neural networks) have issues with 1 & 2, while most other methods, such as Bayesian Networks, have problems with 1 & scalability. 
</font></div><div><font color="#000000"><br></font></div><div>My 2c,<br class="gmail-Apple-interchange-newline"></div><div><font color="#000000">-Tsvi</font></div><div><br></div></div></div></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Tue, Jul 19, 2022 at 12:09 AM Gary Marcus <<a href="mailto:gary.marcus@nyu.edu">gary.marcus@nyu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">sure, but this goes back to Tom’s point about representation; i addressed this sort of thing at length in Chapter 3 of The Algebraic Mind. <br>

<br>

You can solve this one problem in that way but then give up most of the other virtues of Transformers etc if you build a different network and representational scheme for each problem you encounter.<br>

<br>

> On Jul 18, 2022, at 9:19 AM, Barak A. Pearlmutter <<a href="mailto:barak@pearlmutter.net" target="_blank">barak@pearlmutter.net</a>> wrote:<br>

> <br>

> On Mon, 18 Jul 2022 at 17:02, Gary Marcus <<a href="mailto:gary.marcus@nyu.edu" target="_blank">gary.marcus@nyu.edu</a>> wrote:<br>

>> <br>

>> sure, but a person can learn [n-bit parity] from a few examples with a small number of bits, generalizing it to large values of n. most current systems learn it for a certain number of bits and don’t generalize beyond that number of bits.<br>

> <br>

> Really? Because I would not think that induction of a two-state DFA<br>

> over a two-symbol alphabet would be beyond the current state of the<br>

> art.<br>

> <br>

> --Barak Pearlmutter.<br>

<br>

<br>

</blockquote></div>