Meta recently released its Llama 4 series of AI models, making headlines for outranking GPT-4o and Gemini 2.0 Pro in Chatbot Arena (formerly LMSYS). The company claimed that its Llama 4 Maverick model, a mixture-of-experts (MoE) model that activates only 17 billion of its roughly 400 billion total parameters across 128 experts, achieved an impressive Elo score of 1,417 on the Chatbot Arena benchmark.
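For readers unfamiliar with the architecture: in a mixture-of-experts layer, a lightweight router sends each token to a small subset of expert networks, so only a fraction of the model’s total parameters is active per token. Below is a minimal, illustrative PyTorch sketch of top-k expert routing; the dimensions, expert design, and routing scheme are assumptions chosen for clarity, not Llama 4’s actual implementation.

```python
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    """Toy mixture-of-experts layer: route each token to its top-k experts."""

    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router scores every expert for every token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (num_tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)  # (num_tokens, top_k)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            # Only the selected experts run, so per-token compute scales with
            # the active parameters (17B for Maverick), not the total (400B).
            for e in idx[:, k].unique().tolist():
                mask = idx[:, k] == e
                out[mask] += weights[mask, k].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(10, 64)
print(MoELayer()(tokens).shape)  # torch.Size([10, 64])
```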
This result raised eyebrows across the AI community, as a relatively small MoE model outranked much larger LLMs such as GPT-4.5 and Grok 3, prompting many to test the model independently. Surprisingly, the real-world performance of Llama 4 Maverick didn’t match Meta’s benchmark claims, particularly in coding tasks.
On 1Point3Acres, a popular forum for the Chinese community in North America, a user claiming to be a former Meta employee posted a bombshell. According to the post, which has been translated into English on Reddit, Meta’s leadership allegedly mixed “the test sets of various benchmarks in the post-training process” to inflate benchmark scores and meet internal targets.
The employee found the practice unacceptable, chose to resign, and asked the team to exclude their name from the Llama 4 technical report. The user further claims that the recent resignation of Joelle Pineau, Meta’s Head of AI research, is directly linked to the Llama 4 benchmark hacking.
In response to the growing allegations, Ahmad Al-Dahle, head of Meta’s Generative AI division, shared a post on X. He firmly dismissed the claim that Llama 4 was post-trained on the test sets. Al-Dahle writes:
We’ve also heard claims that we trained on test sets — that’s simply not true and we would never do that. Our best understanding is that the variable quality people are seeing is due to needing to stabilize implementations.
He acknowledged that Llama 4’s performance has been inconsistent across platforms and urged the AI community to give it a few days for the implementations to get “dialed in.”
LMSYS Responds to Llama 4 Benchmark Manipulation Allegations
Following concerns from the AI community, LMSYS, the organization behind the Chatbot Arena leaderboard, issued a statement to improve transparency. LMSYS clarified that the model submitted to Chatbot Arena was “Llama-4-Maverick-03-26-Experimental”, a custom variant optimized for human preference.
LMSYS acknowledged that “style and model response tone was an important factor”, which may have given an undue advantage to the custom Llama 4 Maverick variant. The organization also admitted that Meta did not make this sufficiently clear. In addition, LMSYS stated, “Meta’s interpretation of our policy did not match what we expect from model providers.”
To be fair, Meta did mention in its official Llama 4 blog that “an experimental chat version” scored 1,417 on Chatbot Arena, but it offered no further explanation.
As a further transparency measure, LMSYS added the Hugging Face version of Llama 4 Maverick to Chatbot Arena. It has also released over 2,000 head-to-head battle results, including prompts, model responses, and user preferences, for the public to review.
I reviewed the battle results, and it was baffling to see users consistently prefer Llama 4’s often incorrect and overly verbose responses. This raises deeper questions about how much we can trust community-driven benchmarks like Chatbot Arena.
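For context, Chatbot Arena turns these pairwise human preferences into leaderboard scores with an Elo-style rating. The sketch below shows the textbook Elo update after a single battle; it is only illustrative, since the leaderboard’s actual statistical methodology differs in detail.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    """Textbook Elo update after one head-to-head battle.

    score_a: 1.0 if model A wins, 0.5 for a tie, 0.0 if A loses.
    k: the K-factor, controlling how fast ratings move.
    """
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# A 1,417-rated model beating a 1,400-rated one gains little,
# because the win was already expected.
print(elo_update(1417, 1400, 1.0))  # ≈ (1432.2, 1384.8)
```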
Not the First Time Meta Has Been Accused of Gaming Benchmarks
This isn’t the first time Meta has been accused of gaming benchmarks through data contamination, i.e., mixing benchmark datasets into the training corpus. Back in February this year, Susan Zhang, a former Meta AI researcher who now works at Google DeepMind, shared a revealing study in response to a post by Yann LeCun, Meta AI’s chief scientist.
The study found that over 50% of test samples from key benchmarks were present in Meta’s Llama 1 pretraining data. The paper says: “In particular, Big Bench Hard, HumanEval, HellaSwag, MMLU, PiQA, and TriviaQA show substantial contamination levels across both corpora”.
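Studies like this typically detect contamination by measuring n-gram overlap between benchmark test samples and the pretraining corpus. Here is a simplified sketch of that general technique; the whitespace tokenization, n-gram size, and threshold are assumptions for illustration, not the cited paper’s exact protocol.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text (naive tokenization)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_sample, corpus_ngrams, n=8, threshold=0.5):
    """Flag a test sample if a large share of its n-grams
    also appear in the pretraining corpus."""
    sample_grams = ngrams(test_sample, n)
    if not sample_grams:
        return False
    overlap = sum(g in corpus_ngrams for g in sample_grams) / len(sample_grams)
    return overlap >= threshold
```

A benchmark sample that trips such a check was likely seen, verbatim or nearly so, during pretraining, which inflates the model’s measured score without reflecting real capability.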
Now, amid the latest benchmark hacking allegations around Llama 4, Zhang has sarcastically noted that Meta should at least cite its “previous work” from Llama 1 for this “unique approach.” The jab implies that, for the Zuckerberg-led company, benchmark manipulation is not an accident but a strategy to artificially boost performance metrics.