Claude 3 Opus vs GPT-4 vs Gemini 1.5 Pro AI Models Tested

In Short
  • The Claude 3 Opus model is capable, but it did not come close to beating GPT-4 and Gemini 1.5 Pro in my testing.
  • The Opus model doesn't perform well in commonsense reasoning tests and lags behind GPT-4 and Gemini 1.5 Pro.
  • However, as our testing shows, there are specialized areas where Claude 3 can outperform its peers.

In line with our earlier comparison between Gemini 1.5 Pro and GPT-4, we are back with a new AI model test focusing on Anthropic’s Claude 3 Opus model. The company states that Claude 3 Opus has finally beaten OpenAI’s GPT-4 model on popular benchmarks. To test the claims, we’ve done a detailed comparison between Claude 3 Opus, GPT-4, and Gemini 1.5 Pro.

If you want to find out how the Claude 3 Opus model performs in advanced reasoning, maths, long-context data, image analysis, etc., go through our comparison below.

1. The Apple Test

I have 3 apples today, yesterday I ate an apple. How many apples do I have now?

Let’s start with the popular Apple test, which evaluates the reasoning capability of LLMs. In this test, the Claude 3 Opus model answers correctly and says you have three apples now. However, to get a correct response, I had to set a system prompt stating, “You are an intelligent assistant who is an expert in advanced reasoning.”


Without the system prompt, the Opus model gave a wrong answer. Meanwhile, Gemini 1.5 Pro and GPT-4 answered correctly, in line with our earlier tests.
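If you want to reproduce this setup, here’s a minimal sketch using Anthropic’s Python SDK. The model string, token limit, and exact system prompt wording are assumptions on our part:

```python
# Minimal sketch: the Apple test with a reasoning-focused system prompt.
# Assumes the `anthropic` SDK and an ANTHROPIC_API_KEY in your environment.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY

response = client.messages.create(
    model="claude-3-opus-20240229",  # Opus model ID at the time of writing
    max_tokens=256,
    system="You are an intelligent assistant who is an expert in advanced reasoning.",
    messages=[{
        "role": "user",
        "content": "I have 3 apples today, yesterday I ate an apple. How many apples do I have now?",
    }],
)
print(response.content[0].text)  # should say you still have three apples
```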

Winner: Claude 3 Opus, Gemini 1.5 Pro, and GPT-4

2. Calculate the Time

If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?

In this test, we try to trick AI models to see if they exhibit any sign of intelligence. The correct answer is still one hour, since the towels dry simultaneously under the Sun, not one after another. Sadly, Claude 3 Opus fails the test, much like Gemini 1.5 Pro. I even added to the system prompt that the questions can be tricky, so think intelligently. However, the Opus model delved into mathematics and came to a wrong conclusion.


In our earlier comparison, GPT-4 also gave the wrong answer in this test. However, since we published those results, GPT-4 has been generating inconsistent output, often wrong and sometimes right. We ran the same prompt again this morning, and GPT-4 gave a wrong output, even when told not to use the Code Interpreter.

Winner: None

3. Evaluate the Weight

What's heavier, a kilo of feathers or a pound of steel?

Next, we asked all three AI models whether a kilo of feathers is heavier than a pound of steel. And well, Claude 3 Opus gave a wrong answer, saying that a pound of steel and a kilogram of feathers weigh the same.

Gemini 1.5 Pro and GPT-4 responded with correct answers. A kilo of any material weighs more than a pound of steel, since a kilogram is about 2.2 times a pound (1 kg ≈ 2.2 lb).
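For a quick sanity check of the arithmetic, here’s a trivial sketch using the standard conversion factor (1 kg ≈ 2.20462 lb):

```python
# Sanity check: a kilogram of anything outweighs a pound of anything.
KG_TO_LB = 2.20462            # standard kilogram-to-pound conversion

feathers_lb = 1.0 * KG_TO_LB  # one kilo of feathers, expressed in pounds
steel_lb = 1.0                # one pound of steel

print(f"{feathers_lb:.2f} lb of feathers vs {steel_lb:.2f} lb of steel")
print(feathers_lb > steel_lb)  # True: the feathers are heavier
```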

Winner: Gemini 1.5 Pro and GPT-4

4. Solve a Maths Problem

If x and y are the tens digit and the units digit, respectively, of the product 725,278 * 67,066, what is the value of x + y? Can you explain the easiest solution without calculating the whole number?

In our next question, we asked the Claude 3 Opus model to solve a mathematical problem without calculating the whole number. And it failed again. Every time I ran the prompt, with or without a system prompt, it gave answers that were wrong to varying degrees.

I was excited to see Claude 3 Opus’ 60.1% score in the MATH benchmark, outranking the likes of GPT-4 (52.9%) and Gemini 1.0 Ultra (53.2%).


It seems that with chain-of-thought prompting, you can get better results from the Claude 3 Opus model. For now, with zero-shot prompting, GPT-4 and Gemini 1.5 Pro gave correct answers.
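For reference, the intended shortcut is that the last two digits of a product depend only on the last two digits of its factors: 78 × 66 = 5,148, so the product ends in 48, which gives x = 4, y = 8, and x + y = 12. A quick sketch to verify the trick:

```python
# The last two digits of a product depend only on the last two digits
# of the factors (arithmetic mod 100).
a, b = 725_278, 67_066

shortcut = (a % 100) * (b % 100) % 100  # 78 * 66 = 5148 -> ends in 48
full = a * b % 100                      # last two digits of the full product
assert shortcut == full

x, y = divmod(shortcut, 10)             # tens digit, units digit
print(x + y)                            # 4 + 8 = 12
```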

Winner: Gemini 1.5 Pro and GPT-4

5. Follow User Instructions

Generate 10 sentences that end with the word "apple"

When it comes to following user instructions, the Claude 3 Opus model performs remarkably well, beating every other AI model we have tested on this task. When asked to generate 10 sentences that end with the word “apple”, it generates 10 perfectly logical sentences ending with the word “apple”.


In comparison, GPT-4 generates nine such sentences, and Gemini 1.5 Pro performs the worst, struggling to generate even three. I would say that if following user instructions is crucial to your task, Claude 3 Opus is a solid option.
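If you want to score this test yourself, here’s a rough sketch that counts how many generated sentences actually end with the word “apple” (the sample sentences below are made up):

```python
# Rough scorer: how many sentences end with the word "apple"?
import re

sentences = [
    "For lunch I packed a crisp red apple.",
    "Nothing beats a slice of warm apple pie.",  # fails: ends with "pie"
]

def ends_with_apple(sentence: str) -> bool:
    words = re.findall(r"[a-z']+", sentence.lower())
    return bool(words) and words[-1] == "apple"

score = sum(ends_with_apple(s) for s in sentences)
print(f"{score}/{len(sentences)} sentences end with 'apple'")
```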

We saw this in action when an X user asked Claude 3 Opus to follow multiple complex instructions and create a book chapter on Andrej Karpathy’s Tokenizer video. The Opus model did a great job and created a beautiful book chapter with instructions, examples, and relevant images.

Winner: Claude 3 Opus

6. Needle In a Haystack (NIAH) Test

Anthropic has been one of the companies pushing AI models toward larger context windows. While Gemini 1.5 Pro lets you load up to a million tokens (in preview), Claude 3 Opus comes with a context window of 200K tokens. According to Anthropic’s internal NIAH findings, the Opus model retrieved the needle with over 99% accuracy.


In our test with just 8K tokens, Claude 3 Opus couldn’t find the needle, whereas GPT-4 and Gemini 1.5 Pro easily found it. We also ran the test on Claude 3 Sonnet, and it failed too. We need to do more extensive testing of the Claude 3 models to understand their performance over long-context data, but for now, it does not look good for Anthropic.
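For context, a basic NIAH test buries one fact (the “needle”) inside a long filler document and asks the model to retrieve it. Here’s a simplified sketch; the needle, filler text, and token estimate are all illustrative:

```python
# Bare-bones Needle-In-A-Haystack setup: hide one fact in long filler text,
# then ask the model to retrieve it.
filler = "The quick brown fox jumps over the lazy dog. " * 50  # ~450 words

needle = "The secret passcode for the vault is 7421."

chunks = [filler] * 13                    # roughly 8K tokens of filler
chunks.insert(len(chunks) // 2, needle)   # bury the needle in the middle
haystack = "\n\n".join(chunks)

prompt = (
    haystack
    + "\n\nBased only on the document above, what is the secret passcode for the vault?"
)
# Send `prompt` to each model and check whether "7421" appears in the reply.
```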

Winner: Gemini 1.5 Pro and GPT-4

7. Guess the Movie (Vision Test)

Given the play on words of these images, guess the name of the movie

Claude 3 Opus is a multimodal model and supports image analysis too. So we added a still from Google’s Gemini demo and asked it to guess the movie. And it gave the right answer: Breakfast at Tiffany’s. Well done, Anthropic!


GPT-4 also responded with the right movie name, but strangely, Gemini 1.5 Pro gave a wrong answer. I don’t know what Google is cooking. Nevertheless, Claude 3 Opus’ image processing is pretty good and on par with GPT-4.
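For those curious, here’s a minimal sketch of how an image can be sent to Claude 3 Opus via Anthropic’s Python SDK; the filename and prompt wording are placeholders:

```python
# Minimal sketch: asking Claude 3 Opus to identify a movie from a still.
# Assumes the `anthropic` SDK; "movie_still.jpg" is a placeholder filename.
import base64

import anthropic

with open("movie_still.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64",
                "media_type": "image/jpeg",
                "data": image_data,
            }},
            {"type": "text", "text": "Given the play on words in this image, guess the name of the movie."},
        ],
    }],
)
print(response.content[0].text)
```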


Winner: Claude 3 Opus and GPT-4

The Verdict

After testing the Claude 3 Opus model for a day, it seems like a capable model, but it falters on tasks where you expect it to excel. In our commonsense reasoning tests, the Opus model doesn’t perform well, and it’s behind GPT-4 and Gemini 1.5 Pro. Apart from following user instructions, where it shines, it struggles with maths and with NIAH, which is supposed to be its strong suit.

Also, keep in mind that Anthropic has compared Claude 3 Opus’ benchmark scores with GPT-4’s initially reported scores from its March 2023 release. When compared with GPT-4’s latest benchmark scores, Claude 3 Opus loses to GPT-4, as pointed out by Tolga Bilge on X.

That said, Claude 3 Opus has its own strengths. A user on X reported that Claude 3 Opus was able to translate from Russian to Circassian (a rare language spoken by very few) with just a database of translation pairs. Kevin Fischer further shared that Claude 3 understood the nuances of PhD-level quantum physics. Another user demonstrated that Claude 3 Opus learns self type annotations in one shot, better than GPT-4.

So beyond benchmarks and tricky questions, there are specialized areas where Claude 3 can perform better. Go ahead and check out the Claude 3 Opus model to see whether it fits your workflow. If you have any questions, let us know in the comments section below.

Comments (7)
  • VisionedCap3395 says:

    Hey guys, avid Beebom fan here. One addition I’d like to request: try AgentGPT in this comparison too (also try Perplexity and Microsoft/Bing Copilot).

  • EuroAi says:

    Why do most people underestimate Bing Chat (Copilot) and not include it in these lists? You are overrating Claude here. The best for me, with really updated and more recent knowledge, is Copilot (Bing Chat). It can precisely describe the images I give it and also gives me the correct keywords. It is better than all of them, even ChatGPT, whose knowledge is still limited to 2021. And above all, it is free and has the DALL-E 3 image generator, which gives better images than Midjourney. What else could you ask for? I really wonder why you left that giant out.

    • Arjun Sha says:

      Hey, Microsoft Copilot is powered by the GPT-4 model (in Creative mode), so it’s already covered in the comparison.

  • Biswa says:

    Which platform did you use for testing? Was it Vertex AI?

  • Himanshu Chhabra says:

    If I am willing to spend 20 dollars/month, which model subscription should I buy right now in March 2024? (I want to do coding, as well as creative writing, plus some reasoning and research.) Please recommend. I am assuming I will receive an email if you respond.

    • Arjun Sha says:

      I would recommend GPT-4 for being a solid model on almost all counts.
