Connectionists: Statistics versus “Understanding” in Generative AI.

Iam Palatnik iam.palat at gmail.com
Mon Feb 19 09:15:57 EST 2024


This is potentially trickier than it seems, partly because we can only
see so much of what is going on behind the curtain in both ChatGPT and
Bard. I can confirm that GPT-3.5 totally fumbles this task even with
extensive help from me, while GPT-4 aces it with no help.
GPT-3.5, however, seems to be just a model, whereas GPT-4 is effectively
an agent, since it shows us when it uses function calling and the code
interpreter. I'm also not sure how different the system prompts are for
the two.
Both GPT-3.5 and GPT-4 were able to generate the correct list of 50 US
states, but GPT-4 did this within the code interpreter, simply writing
Python code to check whether each name contains an 'a' or not (a sketch
of what such code might look like follows below).
GPT-3.5 could easily be hurt on this task by the random sampling of the
next token: for instance, "Does" may be the most likely token, but the
model randomly picks "Doesn't" because of the temperature and therefore
trips itself up for the rest of the answer. GPT-4 probably has better
token predictions on top of whatever fine-tuning or prompting lets it
know when to call a function. I wouldn't be surprised if GPT-3.5
equipped with function calling and a code interpreter could also ace
this task.
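
For what it's worth, the check itself is trivial once it is done in code
rather than token by token. Here is a minimal sketch of the kind of
filter the code interpreter plausibly wrote; the hard-coded state list
and the printed result are mine, not taken from either model's output:

# Filter the US state names that do not contain the letter "a".
states = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]
no_a = [name for name in states if "a" not in name.lower()]
print(len(no_a), no_a)
# 14 ['Connecticut', 'Illinois', 'Kentucky', 'Mississippi', 'Missouri',
#     'New Jersey', 'New Mexico', 'New York', 'Ohio', 'Oregon',
#     'Tennessee', 'Vermont', 'Wisconsin', 'Wyoming']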
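
The sampling point can also be made concrete with a toy example (the
numbers are invented, not real GPT-3.5 probabilities): even when "Does"
is the most likely next token, a nonzero temperature will sometimes emit
"Doesn't" instead.

import numpy as np

tokens = ["Does", "Doesn't"]
logits = np.array([2.0, 1.0])   # hypothetical scores from the model
rng = np.random.default_rng(0)

def next_token(temperature):
    probs = np.exp(logits / temperature)
    probs /= probs.sum()        # softmax over temperature-scaled logits
    return rng.choice(tokens, p=probs)

# At temperature 1.0, "Does" gets ~73% and "Doesn't" ~27%, so roughly one
# answer in four starts on the wrong foot; near temperature 0 the sampling
# becomes effectively greedy and "Does" is picked almost every time.
print([next_token(1.0) for _ in range(10)])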

Even as we strive for better architectures that can do this by
themselves, I definitely feel that [LLM + function-call +
code-interpreter + external-source search] style agents can greatly
increase what the LLM alone is capable of, and there seems to be a lot
of activity in the literature in this direction (a minimal sketch of
such a loop follows below). We ourselves are probably closer to the
Agent than to the Model in how we do tasks: for instance, I don't know
the 50 US states from memory and had to search for the list on Google,
so why not allow the model to do the same? Whether these hits or misses
entail understanding will continue to be a tricky debate, but I think
these experiments are useful for seeing what helps these models (and
agents) increase their hit rate.
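
To make the agent framing concrete, here is a hypothetical sketch of
such a loop; llm, run_python and web_search below are placeholders I
made up for illustration, not any particular vendor's API:

# Hypothetical agent loop: at each step the LLM either answers directly
# or requests a tool call (e.g. code interpreter, web search), and the
# tool's output is appended to the transcript for the next step.
def run_agent(question, llm, tools, max_steps=5):
    transcript = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm(transcript)  # {"answer": ...} or {"tool": ..., "args": ...}
        if "answer" in reply:
            return reply["answer"]
        result = tools[reply["tool"]](reply["args"])
        transcript.append({"role": "tool", "name": reply["tool"],
                           "content": result})
    return None  # gave up within the step budget

# e.g. tools = {"python": run_python, "search": web_search}

The point is only that the model's own next-token prediction is one
component of such a system; the loop lets it defer the letter counting
to code, much as I deferred the list of states to Google.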

On Mon, Feb 19, 2024 at 10:51 AM Thomas Trappenberg <tt at cs.dal.ca> wrote:

> Good point, but Dave's point stands as the models he is referring to did
> not even comprehend that they made mistakes.
>
> Cheers, Thomas
>
> On Mon, Feb 19, 2024, 4:43 a.m. <wuxundong at gmail.com> wrote:
>
>> That can be attributed to the models' underlying text encoding and
>> processing mechanisms, specifically tokenization that removes the spelling
>> information from those words. If you use GPT-4 instead, it can process it
>> properly by resorting to external tools.
>>
>> On Mon, Feb 19, 2024 at 3:45 PM Dave Touretzky <dst at cs.cmu.edu> wrote:
>>
>>> My favorite way to show that LLMs don't know what they're talking about
>>> is this simple prompt:
>>>
>>>    List all the US states whose names don't contain the letter "a".
>>>
>>> ChatGPT, Bing, and Gemini all make a mess of this, e.g., putting "Texas"
>>> or "Alaska" on the list and leaving out states like "Wyoming" and
>>> "Tennessee".  And you can have a lengthy conversation with them about
>>> this, pointing out their errors one at a time, and they still can't
>>> manage to get it right.  Gemini insisted that all 50 US states have an
>>> "a" in their name.  It also claimed "New Jersey" has two a's.
>>>
>>> -- Dave Touretzky
>>>
>>

