<div dir="ltr"><div>This is potentially trickier than it seems, partly because we can only see so much of what goes on behind the curtain in both ChatGPT and Bard.</div><div>I can confirm that GPT-3.5 totally fumbles this task even with extensive help from me, and GPT-4 aces it with no help.</div><div>GPT-3.5, however, seems to be just a model, whereas GPT-4 is definitely an agent, since it shows us when it uses function calling and the code interpreter. I'm also not sure how different the system prompts are for the two.<br></div><div>Both GPT-3.5 and GPT-4 were able to generate the correct list of 50 US states, but GPT-4 did this within the code interpreter and just wrote Python code to check whether each name contains 'a' or not.</div><div>GPT-3.5 could easily be hurt on this task by the random sampling of the next token. For instance, "Does" may be the most likely token, but the sampler randomly picks "Doesn't" because of the temperature, and the model then trips over itself for the rest of the answer. GPT-4 might have both better token predictions and whatever fine-tuning or prompting makes it know when to call a function. I wouldn't be surprised if GPT-3.5 with function calling and a code interpreter could also ace this task.</div><div><br></div><div>While we strive for better architectures that can do this by themselves, I definitely feel that [LLM + function-call + code-interpreter + external-source search] style agents can greatly extend what the LLM alone is capable of, and there seems to be a lot of activity in the literature in this direction. We ourselves are probably closer to the Agent than to the Model in how we do tasks. For instance, I don't know the 50 US states and had to search for the list on Google, so why not allow the model to do the same, and so on. 
Whether these hits or misses entail understanding or not will continue to be a tricky debate but I think these experiments are useful to see what helps these models (and agents) increase the hit rate.<br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Feb 19, 2024 at 10:51 AM Thomas Trappenberg <<a href="mailto:tt@cs.dal.ca">tt@cs.dal.ca</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="auto">Good point, but Dave's point stands as the models he is referring to did not even comprehend that they made mistakes. <div dir="auto"><br></div><div dir="auto">Cheers, Thomas</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Feb 19, 2024, 4:43 a.m.  <<a href="mailto:wuxundong@gmail.com" target="_blank">wuxundong@gmail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr">That can be attributed to the models' underlying text encoding and processing mechanisms, specifically tokenization that removes the spelling information from those words. If you use GPT-4 instead, it can process it properly by resorting to external tools.<br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Feb 19, 2024 at 3:45 PM Dave Touretzky <<a href="mailto:dst@cs.cmu.edu" rel="noreferrer" target="_blank">dst@cs.cmu.edu</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">My favorite way to show that LLMs don't know what they're talking about<br>

is this simple prompt:<br>
<br>
   List all the US states whose names don't contain the letter "a".<br>
<br>
ChatGPT, Bing, and Gemini all make a mess of this, e.g., putting "Texas"<br>
or "Alaska" on the list and leaving out states like "Wyoming" and<br>
"Tennessee".  And you can have a lengthy conversation with them about<br>
this, pointing out their errors one at a time, and they still can't<br>
manage to get it right.  Gemini insisted that all 50 US states have an<br>
"a" in their name.  It also claimed "New Jersey" has two a's.<br>
<br>
-- Dave Touretzky<br>

</blockquote></div>

</blockquote></div>

</blockquote></div>
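<div>For reference, the check that an agent can hand off to a code interpreter is tiny. Below is a minimal sketch of that kind of check, not the code GPT-4 actually wrote; the names <code>US_STATES</code> and <code>states_without_letter</code> are hypothetical and chosen only for illustration. The point is that, in code, the test operates on characters rather than tokens, so the spelling information is available directly.</div>

```python
# Illustrative sketch only: not the code GPT-4 produced in its code interpreter.
# US_STATES and states_without_letter are hypothetical names for this example.

US_STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California",
    "Colorado", "Connecticut", "Delaware", "Florida", "Georgia",
    "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa",
    "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland",
    "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri",
    "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
    "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
    "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina",
    "South Dakota", "Tennessee", "Texas", "Utah", "Vermont",
    "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming",
]


def states_without_letter(states, letter):
    """Return the names that do not contain `letter`, ignoring case."""
    letter = letter.lower()
    return [name for name in states if letter not in name.lower()]


if __name__ == "__main__":
    for name in states_without_letter(US_STATES, "a"):
        print(name)
```

<div>Run as a script, this prints the states whose names have no "a"; "Wyoming" and "Tennessee" are on the list, while "Texas" and "Alaska" are not.</div>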