- We conducted a series of tests on Llama 3.1 405B and ChatGPT 4o to compare their intelligence, reasoning capabilities, and coding proficiency.
- Llama 3.1 405B is impressive in following user instructions and handling long context memory. However, it doesn't beat GPT-4o in reasoning tests.
- I didn't notice any spark of intelligence in the Llama 3.1 405B model. In my book, it ranks even below Claude 3.5 Sonnet and Gemini 1.5 Pro.
Meta recently released its largest model, Llama 3.1 405B, and claimed that it beats OpenAI’s GPT-4o in key benchmarks. It also comes with a large context window and can process up to 128K tokens. So, in this post, we have pitted Llama 3.1 405B against ChatGPT 4o to evaluate their performance on various reasoning and coding tests. We have also performed a test to check their memory recall capability. So, let’s not beat around the bush and dive right in!
1. Find the Bigger Number
In the first test, I asked Meta’s Llama 3.1 405B and OpenAI’s GPT-4o models to find which is the bigger number: 9.11 or 9.9. And guess what? ChatGPT 4o got the answer right and said 9.9 is bigger than 9.11, since the first digit after the decimal in 9.9 (9) is greater than the first digit after the decimal in 9.11 (1). I ran the test twice to double-check, and it gave the right answer again.
On the other hand, Llama 3.1 surprisingly got it wrong. I ran the prompt twice on HuggingChat, and it gave a wrong answer on both runs.
I then moved to fireworks.ai to run the prompt on the Llama 3.1 405B model again. On the first run, it got the answer right, but when I re-ran the test to double-check, it got the answer wrong. Just so you know, out of 5 runs, Llama 3.1 405B got the answer right only once. It seems Llama 3.1 405B is not consistent when it comes to handling commonsense reasoning questions.
Which one is bigger? 9.11 or 9.9?
Winner: ChatGPT 4o
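For reference, the comparison both models were asked to make is trivial to verify. Here is a quick Python sanity check using exact decimal arithmetic:

```python
from decimal import Decimal

# Compare the two numbers exactly, as decimals rather than floats.
a, b = Decimal("9.11"), Decimal("9.9")
print(max(a, b))  # 9.9
print(a < b)      # True: the tenths digit of 9.9 (9) beats that of 9.11 (1)
```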
2. Towel Drying Time
In our next test, I threw a tricky question at both models and asked them to calculate the drying time under the sun. ChatGPT 4o said that drying 20 towels will still take 1 hour, which is correct, since the towels dry in parallel rather than one after another. But Llama 3.1 405B started calculating the time proportionally and ended up with 1 hour and 20 minutes, which is incorrect. At least in this initial “vibe test”, Llama 3.1 405B does not appear very smart.
If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?
Winner: ChatGPT 4o
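To spell out where Llama 3.1 405B went wrong, here is a minimal sketch of both readings of the question, assuming there is enough space to spread all 20 towels out at once:

```python
# The trap: treating drying as divisible work, as Llama 3.1 405B did.
towels_known, hours_known, towels_asked = 15, 1, 20
proportional = hours_known / towels_known * towels_asked
print(f"Proportional (wrong): {proportional:.2f} hours")  # 1.33 hours, i.e. 1h 20m

# The catch: towels dry in parallel under the sun, so the time doesn't change.
print(f"Parallel (right): {hours_known} hour")  # 1 hour
```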
3. Evaluate the Weight
In this reasoning test, both ChatGPT 4o and Llama 3.1 405B got the answer right. Both AI models converted the units and said that a kilo of feathers is heavier than a pound of steel. In fact, since a kilogram is about 2.2 pounds, a kilo of any material will be heavier than a pound of any other.
What's heavier, a kilo of feathers or a pound of steel?
Winner: ChatGPT 4o and Llama 3.1 405B
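The conversion both models performed comes down to a single constant. A quick check:

```python
KG_PER_LB = 0.45359237  # the avoirdupois pound, defined in kilograms

kilo_of_feathers = 1.0          # kg
pound_of_steel = 1 * KG_PER_LB  # kg

print(kilo_of_feathers > pound_of_steel)  # True
print(f"1 kg = {1 / KG_PER_LB:.2f} lb")   # 1 kg = 2.20 lb
```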
4. Locate the Apple
Next, I presented a tricky puzzle and asked both AI models to locate the apples. Well, ChatGPT 4o got it right and clearly said that “The apples would remain in the box on the ground“.
On the other hand, Llama 3.1 405B came close and said the apples fall “onto the ground (or the box, if it’s directly below)“. While Llama’s answer is acceptable, ChatGPT 4o’s response is more precise: since the box is directly below the bottomless basket, the apples end up in the box. Nevertheless, I am going to give this round to both models.
There is a basket without a bottom in a box, which is on the ground. I put three apples into the basket and move the basket onto a table. Where are the apples?
Winner: ChatGPT 4o and Llama 3.1 405B
5. Arrange the Items
After that, I asked both models to stack the following items in a stable manner: a book, 9 eggs, a laptop, a bottle, and a nail. In this test, both ChatGPT 4o and Llama 3.1 405B got it wrong. Both models suggested placing the 9 eggs on top of the bottle, which is impossible to do stably.
Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.
Winner: None
6. Follow User Instructions
As far as following user instructions is concerned, both models are pretty impressive. The earlier Llama 3 70B model demonstrated great strength in this test, and the larger Llama 3.1 405B follows suit. Both ChatGPT 4o and Llama 3.1 405B followed the instructions extremely well and generated 10/10 correct sentences.
Generate 10 sentences that end with the word "Google"
Winner: ChatGPT 4o and Llama 3.1 405B
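If you want to score this test mechanically rather than by eye, a small checker does the job. This is just a sketch; the `sentences` list below is a hypothetical stand-in for the models’ actual output:

```python
import string

def ends_with_google(sentence: str) -> bool:
    # Ignore trailing punctuation and whitespace before checking the last word.
    return sentence.rstrip(string.punctuation + " ").endswith("Google")

# Hypothetical samples standing in for the real responses.
sentences = [
    "For most people, the first stop for any question is Google.",
    "Few brand names have become verbs as completely as Google.",
]
score = sum(ends_with_google(s) for s in sentences)
print(f"{score}/{len(sentences)} sentences end with 'Google'")
```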
7. Find the Needle
The Llama 3.1 405B model comes with a large context window of 128K tokens. So I gave it a large text of around 21K characters (about 5K tokens) and inserted a needle (a random, out-of-place statement) somewhere in the middle. I asked it to find the needle, and Llama 3.1 405B found it without any issues.
ChatGPT 4o also did a great job and took no time to find the needle. So for long context memory, both models are remarkable.
Winner: ChatGPT 4o and Llama 3.1 405B
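For anyone who wants to reproduce this test, here is a rough sketch of how such a prompt can be put together. The needle and filler text below are made up for illustration; any long document and any out-of-place sentence will do:

```python
import random

def build_haystack(filler: str, needle: str, seed: int = 42) -> str:
    """Insert one out-of-place sentence at a random spot in long filler text."""
    random.seed(seed)
    sentences = filler.split(". ")
    sentences.insert(random.randrange(1, len(sentences) - 1), needle)
    return ". ".join(sentences)

# Both the needle and the filler are hypothetical stand-ins.
needle = "The secret passphrase is 'blue pineapple'"
filler = "Some long article text goes here. " * 600  # roughly 21K characters
prompt = build_haystack(filler, needle) + "\n\nWhat is the secret passphrase?"
print(len(prompt), "characters")
```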
8. Create a Game
To test the coding ability of both models, I asked them to create a Tetris-like game in Python. I ran the code generated by Llama 3.1 405B, but couldn’t play the game: the controls were not working at all.
ChatGPT, however, did a splendid job. It created a complete game in Python with working controls, a resume option, a scoring system, colorful shapes, and more. Simply put, in code generation, I feel ChatGPT 4o is much better than the Llama 3.1 405B model.
Winner: ChatGPT 4o
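Broken keyboard handling is a common failure mode in generated pygame code. For context, here is a minimal, working pygame input loop of the sort a Tetris clone needs; this is my own sketch (it requires the pygame package), not either model’s output. If KEYDOWN events aren’t polled like this, the controls simply do nothing:

```python
import pygame

# Move a single block with the arrow keys: the bare minimum of Tetris-style input.
pygame.init()
screen = pygame.display.set_mode((300, 600))
clock = pygame.time.Clock()
x, y, size = 120, 0, 30

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:
            if event.key == pygame.K_LEFT:
                x = max(0, x - size)
            elif event.key == pygame.K_RIGHT:
                x = min(300 - size, x + size)
            elif event.key == pygame.K_DOWN:
                y = min(600 - size, y + size)
    screen.fill((0, 0, 0))
    pygame.draw.rect(screen, (0, 200, 255), (x, y, size, size))
    pygame.display.flip()
    clock.tick(30)  # cap the loop at 30 FPS

pygame.quit()
```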
Llama 3.1 vs ChatGPT 4o: The Verdict
After running the above reasoning tests, it’s evident that Llama 3.1 405B doesn’t beat ChatGPT 4o at all. In fact, after having tested multiple models in the past, I can confidently say that Llama 3.1 405B ranks below Claude 3.5 Sonnet and Gemini 1.5 Pro.
Lately, AI companies have been chasing benchmark numbers and trying to outrank the competition based on MMLU scores. However, in practical tests, their models rarely show a spark of intelligence.
Aside from following user instructions and handling long context memory, which was also a strength of the older Llama 3 70B model, there is not much else that stands out. Despite its 405 billion parameters, Llama 3.1 405B performs oddly similarly to Llama 3.1 70B.
Moreover, Llama 3.1 405B is not a multimodal model; Meta says multimodality isn’t ready yet and will arrive sometime in the future. So we can’t perform visual tests on Meta’s largest AI model. To conclude, Llama 3.1 405B is a good addition to the open-source community and can be immensely helpful for fine-tuning, but it doesn’t outclass proprietary models yet.