<div dir="ltr"><div>Dear Connectionists,<br><br clear="all"></div><div>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">Currently, the

leading, and very successful, paradigm of machine intelligence is Convolutional

Networks (ConvNets), introduced by Yann LeCun.<span style="mso-spacerun:yes"> 

</span>As Dr. LeCun states in the </span><a href="http://research.microsoft.com/apps/video/default.aspx?id=259574&r=1"><span style="mso-fareast-language:AR-SA">Deep Learning Tutorial</span></a><span style="mso-fareast-language:AR-SA"> (Hinton, Bengio & LeCun) given about

two months ago at NIPS, “Everyone uses ConvNets” (at ~minute 49 of the talk).  <span style="mso-spacerun:yes"><br></span></span></p><p class="MsoNormal"><span style="mso-fareast-language:AR-SA"><span style="mso-spacerun:yes"><br></span></span></p>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">At the same time,

within the hardware community, it has long been understood that most of the

computational time / power used in executing algorithms on the essentially

ubiquitous von Neumann computer model is expended in moving data between memory

and processor.<span style="mso-spacerun:yes">  </span>For this reason, there is a strong imperative

to build new architectures in which the processor is physically co-localized

with memory.<span style="mso-spacerun:yes">  </span>Clearly, whatever

algorithm(s) underlie natural intelligence, they run on brains which are

networks of neurons, each one of which is both processor and memory.<span style="mso-spacerun:yes">  </span>More specifically, each synapse is

essentially both processor and memory: it’s memory because it retains (stably

over potentially very long periods) knowledge of the history of signals that it

has mediated; it’s processor because it effectively multiplies the signal being

mediated by the instantaneous weight.<span style="mso-spacerun:yes">  </span>A Memristor,

for example, perfectly co-localizes processor and memory.</span></p><p class="MsoNormal"><span style="mso-fareast-language:AR-SA"><br></span></p>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">The above two

observations compel me to submit for discussion what I believe is a fundamental problem regarding

the scalability of ConvNets.<span style="mso-spacerun:yes">  </span>Specifically, the technique of “shared

parameters”, upon which, I think most would concur, ConvNets depend in order to

scale to massive problems, is <i style="mso-bidi-font-style:normal">fundamentally</i>

incompatible with processor-memory co-localization (PMC).<span style="mso-spacerun:yes">  <br></span></span></p><p class="MsoNormal"><span style="mso-fareast-language:AR-SA"><span style="mso-spacerun:yes"><br></span></span></p>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">In the shared

parameters technique, a single filter (kernel) is learned for each feature

map.<span style="mso-spacerun:yes">  </span>The sharing occurs essentially through

<i style="mso-bidi-font-style:normal">averaging the gradients</i> computed for all

of the units comprising that map, say <i style="mso-bidi-font-style:normal">U</i>×<i style="mso-bidi-font-style:normal">V</i> of them.<span style="mso-spacerun:yes"> 

</span>The justification for learning a single filter for an entire input

surface is that, for natural input domains, the local statistics in any

filter-scale 2D patch are highly similar across the entire input surface: e.g.,

virtually all patches of a visual input can be approximately decomposed into a small number of

low-order visual features, e.g., oriented Gabors, edges, etc.<span style="mso-spacerun:yes">  </span>Sharing parameters has two major benefits: a)

it greatly boosts the number of samples informing the learning of the filter,

yielding a better model; and b) it drastically reduces the number of parameters

that need to be learned (typically via stochastic gradient descent), which in

turn drastically reduces learning time.</span></p><p class="MsoNormal"><span style="mso-fareast-language:AR-SA"><br></span></p>
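<p class="MsoNormal">To make the gradient-averaging step concrete, here is a minimal NumPy sketch. It assumes a single input channel, a single feature map, and a stride of one; the function names <code>per_position_gradients</code> and <code>shared_gradient</code>, and the sizes used, are hypothetical and chosen only for illustration. Each of the <i>U</i>×<i>V</i> units computes its own <i>X×Y</i> local gradient, and the shared kernel is then updated from the average of all of them.</p>
<pre>
```python
import numpy as np

def per_position_gradients(image, delta, kx, ky):
    """Gradient of the loss w.r.t. the kernel, computed separately at each map unit.

    image: 2D input of shape (U + kx - 1, V + ky - 1)
    delta: 2D array of shape (U, V), the backpropagated error at each map unit
    Returns an array of shape (U, V, kx, ky): one local gradient per unit.
    """
    U, V = delta.shape
    grads = np.empty((U, V, kx, ky))
    for u in range(U):
        for v in range(V):
            patch = image[u:u + kx, v:v + ky]   # input patch seen by unit (u, v)
            grads[u, v] = delta[u, v] * patch   # local gradient held at this unit
    return grads

def shared_gradient(grads):
    # Tie the weights: collapse all U*V local gradients into one kernel update.
    return grads.mean(axis=(0, 1))

# Illustrative sizes only (assumptions, not values from the post).
rng = np.random.default_rng(0)
U, V, X, Y = 4, 5, 3, 3
image = rng.standard_normal((U + X - 1, V + Y - 1))
delta = rng.standard_normal((U, V))

grads = per_position_gradients(image, delta, X, Y)
g = shared_gradient(grads)
print(grads.shape, g.shape)
```
</pre>
<p class="MsoNormal">The array <code>grads</code> holds what the individual units know locally, while <code>g</code> is the single quantity that all of them must agree on. The call to <code>mean</code> is where information from every map position is brought together.</p>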


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">However, averaging

the gradients implies aggregating, at a central location, information originating

from multiple spatial locales (i.e., <i style="mso-bidi-font-style:normal">U</i>×<i style="mso-bidi-font-style:normal">V</i> locales) on the input surface.<span style="mso-spacerun:yes">  </span>Suppose the kernel itself is an <i style="mso-bidi-font-style:normal">X×Y</i> array.<span style="mso-spacerun:yes">  </span>The <i style="mso-bidi-font-style:normal">X×Y</i>

values collected for the (0,0)<sup>th</sup> position of the map are physically distinct from

those collected for the (<i style="mso-bidi-font-style:normal">U</i>-1,<i style="mso-bidi-font-style:normal">V</i>-1)<sup>th</sup> position (modulo overlap).<span style="mso-spacerun:yes">  </span>Even if we were to grant that the <i style="mso-bidi-font-style:normal">X×Y</i> array positions (i.e., “synaptic”

weights) for the (0,0)<sup>th</sup> position of the map were represented by Memristors, they

cannot be the same Memristors that represent the weights for the (<i style="mso-bidi-font-style:normal">U</i>-1,<i style="mso-bidi-font-style:normal">V</i>-1)<sup>th</sup>

position, nor, in general, the weights for any of the other <i style="mso-bidi-font-style:normal">U</i>×<i style="mso-bidi-font-style:normal">V</i>‑1 map positions.<span style="mso-spacerun:yes">  </span>Thus, in order to do the gradient averaging,

there must be macroscopic movement of large amounts of data, not simply between

the processor and memory of any single “node”, but <i>between nodes</i> (or,

from all <i style="mso-bidi-font-style:normal">U</i>×<i style="mso-bidi-font-style:normal">V</i> nodes to a central point).<span style="mso-spacerun:yes"> 

</span>There appears to be no way around the fact that the shared (tied) parameters

technique of ConvNets entails massive movement of data.<span style="mso-spacerun:yes">  </span>Note that the situation described here

applies to every feature map learned at every level of a network.</span></p><p class="MsoNormal"><span style="mso-fareast-language:AR-SA"><br></span></p>
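<p class="MsoNormal">As a rough back-of-the-envelope illustration of the scale involved, the sketch below counts the local gradient values that would have to be brought together for one update of the shared kernels. The function name <code>values_aggregated_per_update</code> and the sizes are hypothetical, and the count ignores overlap between neighbouring patches, multiple input channels, and any hardware-specific reduction scheme.</p>
<pre>
```python
def values_aggregated_per_update(u, v, x, y, num_maps=1):
    """Count the local gradient values that must reach a common point per update.

    Each of the u*v units in a feature map holds its own x*y local gradient,
    and all of them feed one shared kernel. Overlap between neighbouring
    patches is ignored, as in the text's "modulo overlap" caveat.
    """
    return num_maps * u * v * x * y

# Illustrative sizes only (assumptions, not figures from the post).
U, V, X, Y = 55, 55, 11, 11
per_map = values_aggregated_per_update(U, V, X, Y)
shared_weights = X * Y
print(per_map, shared_weights, per_map // shared_weights)
```
</pre>
<p class="MsoNormal">Under these assumed sizes, the number of values gathered per feature map exceeds the number of shared weights by a factor of <i>U</i>×<i>V</i>, and the total grows linearly with the number of feature maps and levels.</p>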


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">Thus we have the situation

that: a) in order to scale to massive problem sizes, ConvNets require parameter

sharing; b) to achieve huge reductions in the amount of time and power expended

in computation, we need PMC; and c) parameter sharing is incompatible with PMC.</span></p>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA"> </span></p>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">In fact, Dr. LeCun acknowledges

that sharing parameters entails a large amount of data movement (at ~minute 46

of the talk, Slide 66 “Distributed Learning”).<span style="mso-spacerun:yes"> 

</span>On that same slide, he references efforts to address this issue, e.g.,

asynchronous stochastic gradient descent, but indicates that substantial challenges

remain.<span style="mso-spacerun:yes">  </span>There has been some exploration

of “locally connected” ConvNets (Gregor, Szlam, LeCun, 2011), i.e., ConvNets

which do not use parameter sharing.<span style="mso-spacerun:yes">  </span>However,

the fact remains that without parameter sharing, scalability of ConvNets to

massive problems has to be considered an open question.<span style="mso-spacerun:yes">  </span>For example, what would be the training time on a larger

benchmark like ImageNet, without parameter sharing?</span></p><p class="MsoNormal"><span style="mso-fareast-language:AR-SA"><br></span></p>


<p class="MsoNormal"><span style="mso-fareast-language:AR-SA">The brain

is, of course, locally connected.<span style="mso-spacerun:yes">  </span>That

would seem to make it overwhelmingly likely that the actual algorithm of

intelligence requires local connectivity. The existence proof that is the

brain, therefore, constitutes another serious challenge to ConvNets.<span style="mso-spacerun:yes">  </span>It is clear that ConvNets have risen to the

forefront of machine intelligence over the past decade.<span style="mso-spacerun:yes">  </span>But that preeminent stature only makes the

point I raise that much more important.<span style="mso-spacerun:yes">  </span>If,

as I’ve claimed, there is fundamentally no way to reconcile parameter sharing

with PMC, there could be massive economic implications, e.g., regarding future hardware.</span></p><p class="MsoNormal"><br></p><p class="MsoNormal">I hope this post stimulates thought on this matter and leads to a lively discussion.</p><p class="MsoNormal"><br></p></div><div>Sincerely,<br></div>Rod Rinkus<br><div><div>-- <br><div class="gmail_signature"><div dir="ltr"><div>Gerard (Rod) Rinkus, PhD<br>President,<br>rod at neurithmicsystems dot com<br><a href="http://sparsey.com" target="_blank">Neurithmic Systems LLC</a><br>275 Grove Street, Suite 2-400<br>Newton, MA 02466<br>617-997-6272<br><br>Visiting Scientist, Lisman Lab<br>Volen Center for Complex Systems<br>Brandeis University, Waltham, MA<br>grinkus at brandeis dot edu<br><a href="http://people.brandeis.edu/%7Egrinkus/" target="_blank">http://people.brandeis.edu/~grinkus/</a><a href="http://people.brandeis.edu/%7Egrinkus/" target="_blank"></a>

</div></div></div>

</div></div></div>