Meta Uses Somewhat Misleading Benchmarks for Its New AI Models

Read Time: 1 minute

On Saturday, Meta unveiled Llama 4, a new version of its Llama family of AI models. There are three new models in total: Llama 4 Scout, Llama 4 Maverick, and Llama 4 Behemoth. According to Meta, they were all trained on "large amounts of unstructured text, image, and video data" to give them "broad visual understanding."

Meta stated in a blog post:

"The Llama ecosystem is entering a new era with these Llama 4 models. The Llama 4 collection is just getting started."

Maverick on LM Arena

Maverick, one of Meta's new flagship AI models, took second place on LM Arena, a benchmark that asks human raters to compare model outputs and select the one they prefer.

However, the Maverick version that Meta deployed to LM Arena appears to differ from the one developers can generally access. Researchers observed a significant behavioral difference between the Maverick hosted on LM Arena and the version available for public download.

Researchers' Reactions

As several AI researchers pointed out on X, Meta noted in its release that the Maverick on LM Arena is an "experimental chat version." Meanwhile, a chart on the official Llama website states that Meta's LM Arena testing was carried out using "Llama 4 Maverick optimized for conversationality."

For a variety of reasons, LM Arena has never been the most reliable indicator of an AI model's performance. Still, most AI firms have not customized or otherwise tuned their models to score better on LM Arena, or at least have not acknowledged doing so.

Compared with the ordinary version, the LM Arena edition stood out for its lengthy responses and heavy emoji use. Researcher Nathan Lambert posted the discovery on X, remarking jokingly:

"Well, Llama4 is definitely a touch overcooked, hehe. Where in Yaph City is this?"

The Flaws in LM Arena

Tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model makes it difficult for developers to predict exactly how the model will perform in specific scenarios. It is also misleading. Benchmarks, however inadequate, should ideally give an overview of a single model's strengths and weaknesses across a range of tasks.

Meta's approach appears to break with that norm, and it has sparked a broader debate about transparency in how AI models are evaluated.