ChatGPT 4o vs Gemini 1.5 Pro: It’s Not Even Close

In Short
  • We ran a series of commonsense reasoning and multimodal tests on both ChatGPT 4o and Gemini 1.5 Pro.
  • ChatGPT 4o outperforms Gemini 1.5 Pro across a variety of tasks, including reasoning, code generation, and multimodal understanding.
  • In one of our tests, ChatGPT 4o created a working Python game within seconds, while Gemini 1.5 Pro failed to generate correct code.

OpenAI introduced its flagship GPT-4o model at the Spring Update event and made it free for everyone. Just a day later, at the Google I/O 2024 event, Google debuted the Gemini 1.5 Pro model for consumers via Gemini Advanced. Now that both flagship models are available to consumers, let's compare ChatGPT 4o and Gemini 1.5 Pro and see which one does a better job.

Note: To ensure consistency, we performed all our tests on Google AI Studio and Gemini Advanced. Both host the latest Gemini 1.5 Pro model.

1. Calculate Drying Time

We ran the classic towel-drying riddle on ChatGPT 4o and Gemini 1.5 Pro to test their reasoning. OpenAI's ChatGPT 4o aced it, while the improved Gemini 1.5 Pro model struggled to spot the trick: the towels dry in parallel, so 20 towels take the same one hour as 15, assuming they all fit under the Sun. Gemini instead dove into proportional math and came to the wrong conclusion.

If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?

Winner: ChatGPT 4o

reasoning test on chatgpt 4o

2. Magic Elevator Test

In the magic elevator test, the earlier ChatGPT 4 model had failed to answer correctly. This time, however, ChatGPT 4o responded with the right answer: floor 4 (the elevator stops on even floor 4 and drops you at floor 1, and three flights of stairs then take you to floor 4). Gemini 1.5 Pro also generated the right answer.

There is a tall building with a magic elevator in it. When stopping on an even floor, this elevator connects to floor 1 instead.
Starting on floor 1, I take the magic elevator 3 floors up. Exiting the elevator, I then use the stairs to go 3 floors up again.
Which floor do I end up on?

Winner: ChatGPT 4o and Gemini 1.5 Pro

3. Locate the Apple

In this test, Gemini 1.5 Pro outright failed to grasp the nuance of the question: the basket has no bottom, so the apples fall through into the box, which stays on the ground when the basket is moved to the table. The Gemini model seems to overlook key details like this. ChatGPT 4o, on the other hand, correctly says the apples are in the box on the ground. Kudos, OpenAI!

There is a basket without a bottom in a box, which is on the ground. I put three apples into the basket and move the basket onto a table. Where are the apples?

Winner: ChatGPT 4o

4. Which is Heavier?

In this commonsense reasoning test, Gemini 1.5 Pro gets the answer wrong and says both weigh the same. ChatGPT 4o rightly points out that the units are different: a kilogram is about 2.2 pounds, so a kilo of any material weighs more than a pound of another. It looks like the improved Gemini 1.5 Pro model has regressed on questions like this.

What's heavier, a kilo of feathers or a pound of steel?

Winner: ChatGPT 4o
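The units make this unambiguous: the international pound is defined as exactly 0.45359237 kg, so a kilogram of anything outweighs a pound of anything. A quick sanity check:

```python
# A kilogram is a larger unit than a pound, so a kilo of feathers
# outweighs a pound of steel regardless of material.
KG_PER_POUND = 0.45359237  # exact definition of the international pound

def to_kg(pounds: float) -> float:
    """Convert pounds to kilograms."""
    return pounds * KG_PER_POUND

feathers_kg = 1.0          # 1 kilo of feathers
steel_kg = to_kg(1.0)      # 1 pound of steel, expressed in kilograms

print(f"1 lb of steel = {steel_kg:.3f} kg")
print("Feathers are heavier:", feathers_kg > steel_kg)
```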

5. Follow User Instructions

I asked ChatGPT 4o and Gemini 1.5 Pro to generate 10 sentences ending with the word "mango". Guess what? ChatGPT 4o got all 10 sentences right, but Gemini 1.5 Pro managed only 6.

Prior to GPT-4o, only Llama 3 70B was able to follow this instruction properly in my tests; the older GPT-4 model struggled as well. It means OpenAI has indeed improved its model.

Generate 10 sentences that end with the word "mango"

Winner: ChatGPT 4o
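To score this test without eyeballing every line, you can check the outputs programmatically. A minimal sketch (the sample sentences below are illustrative, not the models' actual responses):

```python
import string

def ends_with_word(sentence: str, word: str) -> bool:
    """True if the sentence's last word (ignoring punctuation/case) matches."""
    words = sentence.translate(str.maketrans("", "", string.punctuation)).split()
    return bool(words) and words[-1].lower() == word.lower()

# Illustrative outputs, not the models' actual responses
outputs = [
    "She peeled the ripe mango.",
    "Mango smoothies are my favorite breakfast.",  # starts, not ends, with mango
    "The parrot feasted on a juicy mango.",
]
passed = sum(ends_with_word(s, "mango") for s in outputs)
print(f"{passed}/{len(outputs)} sentences end with 'mango'")
```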

6. Multimodal Image Test

François Fleuret, author of The Little Book of Deep Learning, performed a simple image analysis test on ChatGPT 4o and shared the results on X (formerly Twitter). He has since deleted the tweet to avoid blowing the issue out of proportion because, he says, it's a general issue with vision models.

That said, I ran the same test on Gemini 1.5 Pro and ChatGPT 4o to reproduce the results. Gemini 1.5 Pro performed much worse, giving wrong answers to every question. ChatGPT 4o got one answer right but failed the rest.

This goes to show that there are many areas where multimodal models need improvement. I was particularly disappointed with Gemini's multimodal capability; its answers were far off from the correct ones.

Winner: None

7. Character Recognition Test

In another multimodal test, I uploaded the specifications of two phones (the Pixel 8a and Pixel 8) as images. I didn't disclose the phone names, and the screenshots didn't contain them either. I then asked ChatGPT 4o which phone I should buy.

It successfully extracted the text from the screenshots, compared the specifications, and correctly told me to get Phone 2, which was in fact the Pixel 8. I then asked it to guess the phone, and again ChatGPT 4o gave the right answer: the Pixel 8.

I ran the same test on Gemini 1.5 Pro via Google AI Studio (Gemini Advanced doesn't support batch upload of images yet). As for the results, it simply failed to extract the text from either screenshot and kept asking for more details. In tests like these, you see how far behind Google is when it comes to getting things done seamlessly.

Winner: ChatGPT 4o

8. Create a Game

Now, to test the coding ability of ChatGPT 4o and Gemini 1.5 Pro, I asked both models to create a game. I uploaded a screenshot of the Atari Breakout game (without divulging the name, of course) and asked ChatGPT 4o to recreate it in Python. Within seconds, it generated the complete code and told me to install the additional "pygame" library.

I installed the library with pip and ran the code with Python. The game launched successfully without any errors. Amazing! No back-and-forth debugging was needed. I then asked ChatGPT 4o to improve the experience by adding a Resume hotkey, and it quickly added the functionality. That's pretty cool.

Next, I uploaded the same image to Gemini 1.5 Pro and asked it to generate the code for this game. It produced code, but when I ran it, the window kept closing and I couldn't play the game at all. Simply put, for coding tasks, ChatGPT 4o is much more reliable than Gemini 1.5 Pro.

Winner: ChatGPT 4o
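For reference, the core rules of a Breakout clone (the ball advances each tick, destroys a brick on contact, and reflects) fit in a few lines of plain Python. Here is a graphics-free sketch of the brick-collision step; the names and grid layout are my own illustration, not taken from either model's output:

```python
from dataclasses import dataclass

@dataclass
class Ball:
    x: float
    y: float
    vx: float
    vy: float

def step(ball: Ball, bricks: set, brick_size=(4, 2)) -> None:
    """Advance the ball one tick; destroy and bounce off any brick it enters."""
    ball.x += ball.vx
    ball.y += ball.vy
    # Which brick cell (if any) does the ball now occupy?
    cell = (int(ball.x // brick_size[0]), int(ball.y // brick_size[1]))
    if cell in bricks:
        bricks.remove(cell)   # brick destroyed
        ball.vy = -ball.vy    # reflect vertically, as in classic Breakout

# One row of three bricks; ball moving straight up into the middle one
bricks = {(0, 0), (1, 0), (2, 0)}
ball = Ball(x=5.0, y=2.5, vx=0.0, vy=-1.0)
step(ball, bricks)
print(bricks)    # middle brick (1, 0) removed
print(ball.vy)   # velocity flipped to 1.0
```

A full implementation, like the pygame code ChatGPT 4o produced, wraps logic like this in a render loop with a paddle and wall bounces.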

The Verdict

It's clear that Gemini 1.5 Pro is far behind ChatGPT 4o. Even after months of improving the 1.5 Pro model in preview, Google can't compete with OpenAI's latest GPT-4o. From commonsense reasoning to multimodal and coding tests, ChatGPT 4o performs intelligently and follows instructions attentively. And remember, OpenAI has made ChatGPT 4o free for everyone.

The only thing going for Gemini 1.5 Pro is its massive context window, with support for up to 1 million tokens. You can also upload videos, which is an advantage. However, since the model is not very smart, I'm not sure many people would use it just for the larger context window.

At the Google I/O 2024 event, Google didn't announce any new frontier model; it is sticking with the incremental Gemini 1.5 Pro. There is no word on Gemini 1.5 Ultra or Gemini 2.0. If Google wants to compete with OpenAI, a substantial leap is required.

Comments (20)
  • Apriyano Oscar says:

    I used the questions that you said Gemini gave wrong answers to. But in my test, Gemini gave all the correct answers. I got answers from Gemini that are different from what you have here.

  • Faizal Zain says:

    i’ve tried both for my scenarios of using AI, both have pros and cons and both may not answer the same question exactly the same each time you asked. if i want to subscribe in the future, i may leaning towards gemini because it has extensions to some google apps that i already use like gmail. i can ask gemini to summarize my ev charging cost for the month based on receipts in the gmail for example

  • Rick Vidallon says:

    Perplexity AI. Much better for general search queries

  • Ben W says:

    I also got better results from Gemini pro 1.5 with the default settings, including the mass vs weight (it correctly answered similar to chatgpt) and the magic elevator.
    Seems fair to say that chatgpt is doing well, but the “not even close” is either bias or click bait.

  • Tomas says:

    You are wrong on task 4, Which is Heavier? Did you read Gemini's answer correctly? In the screenshot it says the correct answer.

  • Daniel says:

    I tested some of your questions and got vastly different answers. And your towel question is malformed: it likely won't take 1 hour in reality. There will be some difference in time, and we just can't know how much without specifics, because the angle of the sun changes how much sun you are getting. And putting 20 towels on the line takes more time than putting 10. We don't know if they can all fit, or only 1, 5, or 10 at a time. That is all the things my brain was thinking when I was working out an answer. That's the difference between intelligence and a dumb trick question. These AI are stupid, and the questioner is not much smarter

  • Confused at Comparison says:

    Why are you comparing temperature=1.0 to temperature=0.0?

  • AInonymous says:

    Maybe figure out what the temperature does at first and then do the comparisons

  • Hunter says:

    This is unfair to Gemini, Chatgpt is in a different league

  • Bob Dylan says:

    SomeDude: those answers are still wrong. For 1. answer is 1h, for 3. it’s not on the ground, for 4. pound is a unit of mass not weight, for 5. the 8th sentence doesn’t make sense, it just randomly added the word mango at the end.
    Peter J: Astra is not available, it’s just a prototype. You would need to compare Astra with GPT-5 to be fair.

    • Arjun Sha says:

      Astra is in the prototype phase. It’s not available to public or developers yet.

  • Shakoure says:

    Actually, the free version is laughably limited in use. Access was available for about 5 prompts before hitting the limitation.

  • Peter J says:

    This isn’t really a fair comparison is it? What’s your objective take on Astra that Google announced at IO? Did you get to review it?

    • Arjun Sha says:

      Astra is in the prototype phase. It’s not available in public yet.

  • K H says:

    I am not sure you used pro, I switched it to titanium to make sure it wasn’t cached: A kilo of feathers is heavier than a pound of titanium.
    A kilo is a unit of mass, while a pound is a unit of weight. A kilo is roughly equivalent to 2.2 pounds.

  • Vio says:

    Comparing ChatGPT from its official interface and Gemini from AI Studio with temperature set to 1 is highly inconsistent, despite the disclaimer at the beginning of the article. Of course, if temperature is set to 1 it will give you all the wrong answers; the higher the temperature, the more creative and unusual the answers it will generate.

  • Evan R says:

    Isn’t this an unfair comparison? You have Temperature set to 1 in all your examples which means more randomness, so answers will be less deterministic. I’d like to see a comparison with Gemini’s temperature settings set to 0 instead.

    • Arjun Sha says:

      Hey, the temperature is set by Google as the default. I have not tweaked any default parameters.

  • SomeDude says:

    I copied and pasted numbers 1, 3, 4, and 5 of your tests into Gemini Pro 1.5 on my Pixel 8 Pro and these were the answers it gave me.

    1. It will take about 1 hour and 20 minutes to dry 20 towels.

    3. On the ground (inside the box).

    4. A kilo of feathers is heavier.
    * A kilo is a unit of mass, a pound is a unit of weight.
    * 1 kilo is roughly equal to 2.2 pounds.

    5. * The sweet aroma of ripe fruit filled the air as I bit into the juicy mango.
    * She carefully peeled the vibrant skin, revealing the golden flesh of the mango.
    * The smoothie was a refreshing blend of tropical flavors, with a hint of mango.
    * A colorful parrot perched on a branch, feasting on a succulent mango.
    * He savored the sweet and tangy taste of the chutney, made from fresh mango.
    * The vibrant yellow dessert was garnished with slices of sweet mango.
    * The market stall overflowed with a variety of exotic fruits, including mango.
    * She carefully sliced the ripe fruit, adding it to the refreshing fruit salad – mango.
    * The warm summer breeze carried the scent of blooming flowers and ripe mango.
    * He scooped the creamy ice cream into a bowl, topping it with chunks of mango.

    Not sure why you’re getting vastly different answers.

    • Arjun Sha says:

      LLMs can generate different outputs. Sometimes they get it right, sometimes wrong. I have tested various AI models in the past, and I can say from my experience that Google’s AI models are highly inconsistent. On the other hand, OpenAI’s models are quite consistent and generate almost the same output, even if you tweak the prompt to test its intelligence. Much smaller models from Mistral and open-source models like Meta’s Llama 3 offer high consistency. This is disappointing from an industry leader like Google.
