<html><head><meta http-equiv="content-type" content="text/html; charset=utf-8"></head><body dir="auto"><div dir="ltr"><div dir="ltr">Dear Geoff, </div><div dir="ltr"><br></div><div dir="ltr">I addressed this question previously, but maybe you missed it? One can certainly have knowledge of some domains and not others (e.g., one could be knowledgeable about cars but not cognitive science, or vice versa).</div><div dir="ltr"><br></div><div dir="ltr">One can also have incomplete knowledge of a given domain.</div><div dir="ltr"><br></div><div dir="ltr">But the kinds of errors that we see in LLMs are indicative of machines that rely on textual similarity rather than abstraction.</div><div dir="ltr"><br></div><div dir="ltr">This is seen most clearly in Razeghi et al. (2022), one of the few cases in which outputs are carefully compared to training sets: <a href="https://arxiv.org/abs/2202.07206">https://arxiv.org/abs/2202.07206</a>. They conclude:</div><div dir="ltr"><blockquote type="cite"><i>“Our results consistently demonstrate that models are more accurate on instances whose terms are more prevalent, in some cases above 70% (absolute) more accurate on the top 10% frequent terms in comparison to the bottom 10%. Overall, although LMs exhibit strong performance at few-shot numerical reasoning tasks, our results raise the question of how much models actually generalize beyond pretraining data.”</i></blockquote></div><div dir="ltr"><br></div><div dir="ltr">They continue, and I could not agree more strongly: “we encourage researchers to take the pretraining data into account when interpreting evaluation results.”</div><div dir="ltr"><br></div><div dir="ltr">Looking only at correct answers in a system where the training set isn’t even disclosed tells us very little.</div>
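<div dir="ltr"><br></div><div dir="ltr">To make that kind of analysis concrete, here is a minimal sketch (in Python) of the frequency-vs-accuracy comparison they describe, assuming one already has per-instance correctness and pretraining term counts; the numbers and names below are made up for illustration and are not from their code:</div><div dir="ltr">---<br>
from statistics import mean<br>
<br>
# Hypothetical records: (pretraining-corpus frequency of the instance's term, was the model correct?)<br>
instances = [(920000, True), (410000, True), (3200, False), (870000, True), (1100, False),<br>
             (56000, True), (640000, True), (2500, False), (730000, True), (900, False)]<br>
<br>
# Order instances by how often their term appears in the pretraining corpus<br>
instances.sort(key=lambda pair: pair[0])<br>
<br>
# Compare accuracy on the least-frequent vs. most-frequent decile<br>
k = max(1, len(instances) // 10)<br>
bottom_acc = mean(correct for _, correct in instances[:k])<br>
top_acc = mean(correct for _, correct in instances[-k:])<br>
<br>
print(f"bottom-decile accuracy: {bottom_acc:.2f}")<br>
print(f"top-decile accuracy: {top_acc:.2f}")<br>
print(f"absolute gap: {top_acc - bottom_acc:.2f}")<br>
---</div><div dir="ltr"><br></div><div dir="ltr">A large gap between those two deciles is precisely the pretraining-frequency effect they report, and it is invisible if one only tallies overall accuracy.</div>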
<div dir="ltr"><br></div><div dir="ltr">A related work from Guy Van den Broeck’s lab is also illuminating: <a href="https://arxiv.org/abs/2205.11502">https://arxiv.org/abs/2205.11502</a></div><div dir="ltr"><blockquote type="cite"><i>“Logical reasoning is needed in a wide range of NLP tasks. Can a BERT model be trained end-to-end to solve logical reasoning problems presented in natural language? We attempt to answer this question in a confined problem space where there exists a set of parameters that perfectly simulates logical reasoning. We make observations that seem to contradict each other: BERT attains near-perfect accuracy on in-distribution test examples while failing to generalize to other data distributions over the exact same problem space.”</i></blockquote></div><div dir="ltr"><br></div><div dir="ltr">When you see a system that works perfectly in distribution but fails to generalize reliably out of distribution, you have every reason to doubt that there is genuine understanding there.</div><div dir="ltr"><br></div><div dir="ltr"><div dir="ltr">Gary</div><div dir="ltr"><br></div><div dir="ltr"><br><blockquote type="cite">On Mar 17, 2023, at 08:39, Geoffrey Hinton <geoffrey.hinton@gmail.com> wrote:<br><br></blockquote></div><blockquote type="cite"><div dir="ltr"><div dir="ltr">But if a learning disabled person with communication difficulties exhibited the same behaviour, would you then say they did not understand ANYTHING?<div>Understanding is not all-or-none.</div><div><br></div><div>Geoff</div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Mar 16, 2023 at 2:44 AM Kagan Tumer <<a href="mailto:kagan.tumer@oregonstate.edu">kagan.tumer@oregonstate.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><br>
I'm very reluctant to use the word "understand" beyond perhaps ChatGPT <br>
understanding the structure of language (statistically). Here's an <br>
example of a ChatGPT response to a simple arithmetic operation where:<br>
<br>
1- ChatGPT was wrong;<br>
2- it worked out an example that showed it was wrong, but it didn't <br>
register that and doubled down on its wrong conclusion;<br>
3- it gave a high-level explanation (also wrong) of why it was right even <br>
though it was wrong.<br>
<br>
You can forgive 1, but 2 and 3 clearly show ChatGPT does not actually <br>
understand what it is saying.<br>
<br>
Kagan<br>
<br>
<br>
<br>
On 3/14/23 9:54 AM, Natsuki Oka wrote:<br>
> <br>
> Judging from the responses below, ChatGPT understands counting and <br>
> sorting well enough to write a correct Python program, but not well <br>
> enough to produce the correct execution results itself.<br>
> [attached image: count_and_sort.png]<br>
> Here's the program that ChatGPT created:<br>
> ---<br>
> sentences = [<br>
> "The quick brown fox jumps over the lazy dog",<br>
> "Python is a popular programming language",<br>
> "I like to eat pizza for dinner",<br>
> "The capital of France is Paris",<br>
> "The cat in the hat wears a red and white striped hat",<br>
> "My favorite color is blue",<br>
> "The United States has fifty states",<br>
> "There are seven days in a week",<br>
> "I enjoy listening to music while I work",<br>
> "Mount Everest is the highest mountain in the world"<br>
> ]<br>
> <br>
> # sort the sentences by number of words<br>
> sentences.sort(key=lambda x: len(x.split()))<br>
> <br>
> # print the sorted sentences with the number of words in parentheses<br>
> for sentence in sentences:<br>
>     num_words = len(sentence.split())<br>
>     print(f"{sentence} ({num_words})")<br>
> ---<br>
> <br>
> The execution of this program yields the following correct results:<br>
> ---<br>
> My favorite color is blue (5)<br>
> Python is a popular programming language (6)<br>
> The capital of France is Paris (6)<br>
> The United States has fifty states (6)<br>
> I like to eat pizza for dinner (7)<br>
> There are seven days in a week (7)<br>
> I enjoy listening to music while I work (8)<br>
> The quick brown fox jumps over the lazy dog (9)<br>
> Mount Everest is the highest mountain in the world (9)<br>
> The cat in the hat wears a red and white striped hat (12)<br>
> ---<br>
> <br>
> Oka Natsuki<br>
> Miyazaki Sangyo-keiei University<br>
> <br>
<br>
<br>
-- <br>
Kagan Tumer<br>
Director, Collaborative Robotics and Intelligent Systems Institute<br>
Professor, School of MIME<br>
Oregon State University<br>
<a href="https://urldefense.proofpoint.com/v2/url?u=http-3A__engr.oregonstate.edu_-7Ektumer&d=DwMFaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=wQR1NePCSj6dOGDD0r6B5Kn1fcNaTMg7tARe7TdEDqQ&m=-tJl8_K0Wxb2ulUHzAE8o1dVx_f6CYH5XloApKXaEhRn8WoksIq5vqPQJEnpkPKg&s=fgq1m-vd3UnRqVja7PLtcyLXhUoVvCG4yIqf6bS6HoE&e=" rel="noreferrer" target="_blank">http://engr.oregonstate.edu/~ktumer</a><br>
<a href="https://urldefense.proofpoint.com/v2/url?u=https-3A__kagantumer.com&d=DwMFaQ&c=slrrB7dE8n7gBJbeO0g-IQ&r=wQR1NePCSj6dOGDD0r6B5Kn1fcNaTMg7tARe7TdEDqQ&m=-tJl8_K0Wxb2ulUHzAE8o1dVx_f6CYH5XloApKXaEhRn8WoksIq5vqPQJEnpkPKg&s=wVJdIqZbbzDfyM9PhLMkSPvdXib8snFqOCemuTX6Z_s&e=" rel="noreferrer" target="_blank">https://kagantumer.com</a><br>
</blockquote></div>
</div></blockquote></div></body></html>