- We have compared Anthropic's latest Claude 3.5 Sonnet model with OpenAI's ChatGPT 4o and Google's Gemini 1.5 Pro models.
- Claude 3.5 Sonnet displays strong reasoning capability and is excellent at following user instructions.
- In coding, too, Claude 3.5 Sonnet outperforms ChatGPT 4o and Gemini 1.5 Pro.
Anthropic recently released its latest Claude 3.5 Sonnet model, claiming that it beats ChatGPT 4o and Gemini 1.5 Pro on multiple benchmarks. To test that claim, we have put together this detailed comparison. Just like our earlier comparison between Claude 3 Opus, GPT-4, and Gemini 1.5 Pro, we have evaluated reasoning capability, multimodal reasoning, code generation, and more. On that note, let’s begin.
1. Find Drying Time
Although it seems like a basic question, I always start my testing with this tricky reasoning question because LLMs often get it wrong. Claude 3.5 Sonnet made the same mistake and treated it as a proportion problem: it says it will take 1 hour and 20 minutes to dry 20 towels (20 × 60/15 = 80 minutes), which is incorrect. ChatGPT 4o and Gemini 1.5 Pro got the answer right: towels dry in parallel under the Sun, so 20 towels still take 1 hour, assuming there is space to spread them all out.
If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?
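To make the trap concrete, here is a minimal sketch (assuming unlimited space to spread the towels out) contrasting the proportional math the models fall for with the correct parallel-drying logic:

```python
towels_known, minutes_known = 15, 60

# The trap: scaling drying time linearly with the towel count.
wrong_answer = 20 * (minutes_known / towels_known)  # 80 minutes, i.e. 1 hr 20 min

# The reality: towels dry in parallel under the Sun, so 20 towels
# take the same time as 15 (given enough space to spread them out).
right_answer = minutes_known  # still 60 minutes

print(f"Proportional (wrong): {wrong_answer:.0f} min | Parallel (right): {right_answer} min")
```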
Winner: ChatGPT 4o and Gemini 1.5 Pro
2. Evaluate Weight
Next, in this classic reasoning question, I am happy to report that all three models, Claude 3.5 Sonnet, ChatGPT 4o, and Gemini 1.5 Pro, got the answer right. A kilo of feathers, or of anything else, is always heavier than a pound of steel or any other material, since a kilogram weighs about 2.2 pounds.
What's heavier, a kilo of feathers or a pound of steel?
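The underlying arithmetic is just a unit conversion; a quick sketch:

```python
KG_TO_LB = 2.20462  # one kilogram expressed in pounds

kilo_of_feathers = 1 * KG_TO_LB  # ~2.2 lb
pound_of_steel = 1.0             # 1 lb by definition

# True regardless of material: the mass, not the substance, decides it.
print(kilo_of_feathers > pound_of_steel)
```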
Winner: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
3. Word Puzzle
In the next reasoning test, Claude 3.5 Sonnet correctly answers that David has no brothers, and he is the only brother among the siblings. ChatGPT 4o and Gemini 1.5 Pro also got the answer right.
David has three sisters. Each of them have one brother. How many brothers does David have?
Winner: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
4. Arrange the Items
After that, I asked all three models to arrange these items in a stable manner. Alas, all three got it wrong. The models took an identical approach: first place the laptop, then the book, then the bottle, and finally the 9 eggs on the base of the bottle, which is impossible to balance. For what it’s worth, the older GPT-4 model got the answer right.
Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.
Winner: None
5. Follow User Instructions
In its blog post, Anthropic mentioned that Claude 3.5 Sonnet is excellent at following instructions, and that seems to be true. It generated all 10 sentences ending with the word “AI”. ChatGPT 4o also scored a perfect 10/10. However, Gemini 1.5 Pro could only generate 5 such sentences out of 10. Google needs to tune the model for better instruction following.
Generate 10 sentences that end with the word "AI"
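To score the outputs consistently, a quick checker like this sketch works (the two sample sentences are placeholders, not actual model output):

```python
sentences = [
    "The future of software is being reshaped by AI.",
    "Many startups now build their entire product around AI.",
    # ...paste the rest of a model's 10 sentences here...
]

def ends_with_ai(sentence: str) -> bool:
    # Strip trailing punctuation and quotes before checking the final word.
    return sentence.rstrip(".!?\"' ").split()[-1] == "AI"

score = sum(ends_with_ai(s) for s in sentences)
print(f"{score}/{len(sentences)} sentences end with 'AI'")
```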
Winner: Claude 3.5 Sonnet and ChatGPT 4o
6. Find the Needle
Anthropic has been one of the first companies to offer a large context length, starting at 100K tokens and now offering a 200K-token context window. So for this test, I fed the models a large text of about 25K characters (roughly 6K tokens) and planted a needle, an out-of-place statement, somewhere in the middle.
I then asked all three models about the needle, but only Claude 3.5 Sonnet was able to find the out-of-place statement; ChatGPT 4o and Gemini 1.5 Pro couldn’t. So for processing large documents, I think Claude 3.5 Sonnet is the better model.
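If you want to reproduce the test, here is a minimal sketch using Anthropic’s Python SDK; the file name and the needle sentence are hypothetical placeholders:

```python
import anthropic  # pip install anthropic

# Hypothetical needle planted in the middle of long_document.txt (~25K chars).
NEEDLE = "The secret password for the treasure chest is sunflower42."

with open("long_document.txt", encoding="utf-8") as f:
    haystack = f.read()

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": haystack + "\n\nOne statement in the text above is out of "
                              "place. Quote that statement exactly.",
    }],
)
reply = message.content[0].text
print(reply)
print("Needle found:", NEEDLE in reply)
```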
Winner: Claude 3.5 Sonnet
7. Vision Test
To test the vision capability, I uploaded an image of barely legible handwriting to see how well the models can detect the characters and extract them. To my surprise, all three models did a great job and correctly extracted the text. As far as OCR is concerned, all three models are quite capable.
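For reference, here is roughly how such an OCR request looks against the Claude API (a sketch; handwriting.png is a placeholder, and the other two models accept images through equivalent endpoints):

```python
import base64
import anthropic  # pip install anthropic

with open("handwriting.png", "rb") as f:  # placeholder image file
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image",
             "source": {"type": "base64",
                        "media_type": "image/png",
                        "data": image_b64}},
            {"type": "text",
             "text": "Extract all the handwritten text from this image."},
        ],
    }],
)
print(message.content[0].text)
```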
Winner: Claude 3.5 Sonnet, ChatGPT 4o and Gemini 1.5 Pro
8. Create a Game
Finally, we come to the last round. In this test, I uploaded an image of the classic Tetris game without divulging the name and simply asked the models to create a game like this in Python. Well, all three models correctly guessed the game, but only Sonnet’s generated code ran successfully. Both ChatGPT 4o and Gemini 1.5 Pro failed to generate bug-free code.
The game ran successfully in one shot using Sonnet’s code; I just had to install the pygame library. Many programmers use ChatGPT 4o for coding assistance, but it appears that Anthropic’s model may become the new favorite among coders.
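For context, any working Tetris clone rests on a pygame event-and-render loop like the minimal skeleton below. This is my own illustrative sketch (a single falling block standing in for full piece logic), not Sonnet’s actual output:

```python
import pygame  # pip install pygame

CELL, COLS, ROWS = 30, 10, 20  # classic Tetris grid dimensions

pygame.init()
screen = pygame.display.set_mode((COLS * CELL, ROWS * CELL))
clock = pygame.time.Clock()

x, y = COLS // 2, 0  # position of the falling block, in grid cells
fall_timer = 0       # milliseconds since the block last moved down

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:
            if event.key == pygame.K_LEFT and x > 0:
                x -= 1
            elif event.key == pygame.K_RIGHT and x < COLS - 1:
                x += 1

    # Advance the block one row every 500 ms.
    fall_timer += clock.get_time()
    if fall_timer >= 500:
        fall_timer = 0
        y = (y + 1) % ROWS  # wrap around instead of stacking, for brevity

    screen.fill((0, 0, 0))
    pygame.draw.rect(screen, (0, 200, 255), (x * CELL, y * CELL, CELL, CELL))
    pygame.display.flip()
    clock.tick(60)

pygame.quit()
```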
Claude 3.5 Sonnet scored 92% on the HumanEval benchmark, which evaluates coding ability. On the same benchmark, GPT-4o stands at 90.2% and Gemini 1.5 Pro at 84.1%. Clearly, for coding, there is a new SOTA model in town, and it’s Claude 3.5 Sonnet.
Winner: Claude 3.5 Sonnet
Conclusion
After running various tests on all three models, I sense that Claude 3.5 Sonnet is as good as ChatGPT 4o, if not better. In coding particularly, Anthropic’s new model is seriously impressive. The remarkable thing is that the latest Sonnet is not even Anthropic’s largest model.
The company says Claude 3.5 Opus is coming later this year, which should perform even better. Google’s Gemini 1.5 Pro also did better than in our earlier tests, which suggests it has been improved significantly. Overall, I would say that OpenAI is not the only AI lab doing great work in the LLM field; Anthropic’s Claude 3.5 Sonnet is a testament to that fact.