Fwd: If you do multimodal or vision AI you should check this out
Artur Dubrawski
awd at cs.cmu.edu
Fri Apr 3 18:34:08 EDT 2026
Sharing Szymon's insights, as I think they have broader appeal.
Also, the paper confirms our much earlier observation that AI benchmarks do
not measure up to the models they are supposed to assess.
That's why we have invested time and effort in developing benchmarking
*frameworks* that let us dynamically generate new benchmarks,
which can hopefully stay ahead of the capabilities of AI
technology as it continues to evolve.
Basically, putting the horses in front of the carriage again. Big thanks to
TimeSeriesGym and TimeSeriesExamAgent teams
for spearheading these efforts here at the Auton Lab!
Cheers,
Artur
PS It is hard to blame an AI model for accomplishing its tasks with
whatever we give it.
It has always been the case in ML that we (or our AI agents these days)
should be careful to test the models properly,
to make sure they do things the way we expect them to.
---------- Forwarded message ---------
From: Szymon Rusiecki <srusieck at andrew.cmu.edu>
Date: Fri, Apr 3, 2026 at 9:57 AM
Subject: Re: If you do multimodal or vision AI you should check this out
To: Artur Dubrawski <awd at cs.cmu.edu>
After reproducing the methodology presented in their paper, I found that
the mirage issue occurs only for “big”
models. The “small” ones often don’t have this issue.
On Fri, Apr 3, 2026 at 15:41 Szymon Rusiecki <srusieck at andrew.cmu.edu>
wrote:
> I am actually surprised. I recently broke my collarbone, so I decided to
> test Gemini 3 Flash on an OOD sample (I think Google doesn’t have an
> image from my iPhone, and even if it does, the photo has no description)
> with the prompt “what do you see on this image?”, and it responded with the
> same answer as my doctor.
>
> SR
>
> On Fri, Apr 3, 2026 at 12:48 Artur Dubrawski <awd at cs.cmu.edu> wrote:
>
>> https://x.com/heygurisingh/status/2039012548260082082?s=20
>>
>