ChatGPT 4o vs Gemini 1.5 Pro: It’s Not Even Close

In Short
  • We ran a series of commonsense reasoning, coding, and multimodal tests on both ChatGPT 4o and Gemini 1.5 Pro.
  • ChatGPT 4o performs much better than Gemini 1.5 Pro in a variety of tasks including reasoning, code generation, multimodal understanding, and more.
  • In one of my tests, ChatGPT 4o created a Python game within seconds, but Gemini 1.5 Pro failed to generate the correct code.

OpenAI introduced its flagship GPT-4o model at the Spring Update event and made it free for everyone. Just a day later, at the Google I/O 2024 event, Google debuted the Gemini 1.5 Pro model for consumers via Gemini Advanced. Now that both flagship models are available to consumers, let’s compare ChatGPT 4o and Gemini 1.5 Pro and see which one does a better job. On that note, let’s begin.

Note: To ensure consistency, we have performed all our tests on Google AI Studio and Gemini Advanced. Both host the latest Gemini 1.5 Pro model.

1. Calculate Drying Time

We ran the classic drying-time riddle on ChatGPT 4o and Gemini 1.5 Pro to test their reasoning. OpenAI’s ChatGPT 4o aced it, while the improved Gemini 1.5 Pro model failed to spot the trick: the towels dry in parallel under the Sun, so 20 towels should still take about an hour (assuming there is room for all of them). Gemini instead dove into proportional math and arrived at the wrong conclusion.

If it takes 1 hour to dry 15 towels under the Sun, how long will it take to dry 20 towels?

Winner: ChatGPT 4o

2. Magic Elevator Test

In the magic elevator test, the earlier ChatGPT 4 model had failed to find the correct answer. This time, however, ChatGPT 4o responded with the right answer, and Gemini 1.5 Pro got it right too: the elevator stops on floor 4, an even floor, so it drops you back to floor 1, and the three flights of stairs then take you to floor 4.

There is a tall building with a magic elevator in it. When stopping on an even floor, this elevator connects to floor 1 instead.
Starting on floor 1, I take the magic elevator 3 floors up. Exiting the elevator, I then use the stairs to go 3 floors up again.
Which floor do I end up on?
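
Just to spell out the logic both models followed, here is a toy walkthrough in Python. It is my own encoding of the puzzle's rules, not either model's output:

# Toy walkthrough of the magic elevator puzzle (my own encoding of the rules).
def magic_elevator(start, floors_up):
    stop = start + floors_up
    return 1 if stop % 2 == 0 else stop  # even stops redirect you to floor 1

floor = magic_elevator(1, 3)  # elevator stops on floor 4 (even), so you exit on floor 1
floor += 3                    # stairs: 1 -> 4
print(floor)                  # 4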

Winner: ChatGPT 4o and Gemini 1.5 Pro

3. Locate the Apple

In this test, Gemini 1.5 Pro outright failed to grasp the nuances of the question; it seems the model isn't attentive and overlooks key details. ChatGPT 4o, on the other hand, correctly says that the apples are in the box on the ground: since the basket has no bottom, the apples fall straight through into the box, and moving the basket doesn't move them. Kudos, OpenAI!

There is a basket without a bottom in a box, which is on the ground. I put three apples into the basket and move the basket onto a table. Where are the apples?

Winner: ChatGPT 4o

4. Which is Heavier?

In this commonsense reasoning test, Gemini 1.5 Pro gets the answer wrong and says both weigh the same. ChatGPT 4o, however, rightly points out that the units are different: a kilogram is roughly 2.2 pounds, so a kilo of feathers is heavier than a pound of steel. It seems the improved Gemini 1.5 Pro model has actually regressed on this kind of trick question.

What's heavier, a kilo of feathers or a pound of steel?
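
The arithmetic is trivial once you keep the units straight. Here is a quick sanity check in Python; the conversion factor is the standard one, not something pulled from either model's answer:

# Quick unit check (standard kg-to-lb conversion factor).
KG_TO_LB = 2.20462

feathers_lb = 1 * KG_TO_LB     # one kilogram of feathers, expressed in pounds
steel_lb = 1.0                 # one pound of steel
print(feathers_lb > steel_lb)  # True: the kilo of feathers is heavier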

Winner: ChatGPT 4o

5. Follow User Instructions

I asked ChatGPT 4o and Gemini 1.5 Pro to generate 10 sentences ending with the word “mango”. Guess what? ChatGPT 4o got all 10 sentences right, but Gemini 1.5 Pro managed only 6.

Prior to GPT-4o, only Llama 3 70B could follow this instruction properly in my testing, and the older GPT-4 model struggled with it too. OpenAI has clearly improved its model's instruction following.

Generate 10 sentences that end with the word "mango"
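
If you want to score this test yourself, a throwaway snippet like the one below does the job. It's a hypothetical helper, not something either model generated, and the sample sentences are toy examples rather than the models' real responses:

# Hypothetical scoring helper: counts how many generated sentences actually
# end with the word "mango".
import re

def count_mango_endings(sentences):
    hits = 0
    for sentence in sentences:
        words = re.findall(r"[A-Za-z']+", sentence)  # ignore trailing punctuation
        if words and words[-1].lower() == "mango":
            hits += 1
    return hits

sample_output = [
    "She handed me a perfectly ripe mango.",
    "Mango lassi is great, but nothing beats a fresh mango.",
    "I like mangoes.",  # does not end with "mango"
]
print(count_mango_endings(sample_output))  # 2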

Winner: ChatGPT 4o

6. Multimodal Image Test

François Fleuret, author of The Little Book of Deep Learning, performed a simple image analysis test on ChatGPT 4o and shared the results on X (formerly Twitter). He has since deleted the tweet to avoid blowing the issue out of proportion because, as he says, it's a general issue with vision models.

That said, I performed the same test on Gemini 1.5 Pro and ChatGPT 4o from my end to reproduce the results. Gemini 1.5 Pro performed much worse, giving wrong answers to every question. ChatGPT 4o, on the other hand, got one answer right but failed the rest.

It just goes to show that multimodal models still need improvement in many areas. I am particularly disappointed with Gemini's multimodal capability because its answers were far off from the correct ones.

Winner: None

7. Character Recognition Test

In another multimodal test, I uploaded the specifications of two phones (the Pixel 8a and the Pixel 8) as images. I didn't disclose the phone names, and the screenshots didn't include them either. I then asked ChatGPT 4o which phone I should buy.

It successfully extracted the text from the screenshots, compared the specifications, and correctly told me to get Phone 2, which was actually the Pixel 8. I then asked it to guess the phone, and again, ChatGPT 4o gave the right answer: the Pixel 8.

I ran the same test on Gemini 1.5 Pro via Google AI Studio, since Gemini Advanced doesn't support uploading multiple images at once yet. The result? It simply failed to extract the text from either screenshot and kept asking for more details. Tests like these show how far behind OpenAI Google is when it comes to getting things done seamlessly.
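
If you'd rather reproduce this kind of multi-image prompt outside the web UI, the Gemini API's Python SDK accepts several images in a single request. Here is a rough sketch; the file names, prompt wording, and model string are my own assumptions, not the exact setup used in this test:

# Rough sketch using the google-generativeai Python SDK.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

spec_sheet_1 = Image.open("phone1_specs.png")  # hypothetical screenshot paths
spec_sheet_2 = Image.open("phone2_specs.png")

response = model.generate_content([
    "Here are the spec sheets of two phones. Compare them, tell me which "
    "phone I should buy, and then guess which phones these are.",
    spec_sheet_1,
    spec_sheet_2,
])
print(response.text)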

Winner: ChatGPT 4o

8. Create a Game

Now, to test the coding ability of ChatGPT 4o and Gemini 1.5 Pro, I asked both models to create a game. I uploaded a screenshot of the Atari Breakout game (without divulging the name, of course) and asked ChatGPT 4o to recreate it in Python. In just a few seconds, it generated the entire code and asked me to install the additional “pygame” library.

I installed the library with pip and ran the code. The game launched successfully without any errors. Amazing! No back-and-forth debugging was needed. In fact, I asked ChatGPT 4o to improve the experience by adding a Resume hotkey, and it quickly added the functionality. That's pretty cool.
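
For reference, here is a minimal sketch of the kind of Breakout-style clone GPT-4o produced. This is not the model's actual output; the window size, speeds, brick layout, and the “P” pause/resume hotkey are my own stand-ins:

# Minimal Breakout-style sketch with pygame (illustrative, not GPT-4o's code).
import pygame

pygame.init()
WIDTH, HEIGHT = 640, 480
screen = pygame.display.set_mode((WIDTH, HEIGHT))
pygame.display.set_caption("Breakout sketch")
clock = pygame.time.Clock()

paddle = pygame.Rect(WIDTH // 2 - 50, HEIGHT - 30, 100, 12)
ball = pygame.Rect(WIDTH // 2, HEIGHT // 2, 12, 12)
ball_vel = [4, -4]

# A simple grid of bricks across the top of the window
bricks = [pygame.Rect(10 + col * 62, 40 + row * 22, 58, 18)
          for row in range(5) for col in range(10)]

paused = False
running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN and event.key == pygame.K_p:
            paused = not paused  # "P" toggles pause/resume (stand-in hotkey)

    if not paused:
        keys = pygame.key.get_pressed()
        if keys[pygame.K_LEFT]:
            paddle.move_ip(-6, 0)
        if keys[pygame.K_RIGHT]:
            paddle.move_ip(6, 0)
        paddle.clamp_ip(screen.get_rect())

        ball.move_ip(ball_vel[0], ball_vel[1])
        if ball.left <= 0 or ball.right >= WIDTH:
            ball_vel[0] = -ball_vel[0]          # bounce off the side walls
        if ball.top <= 0:
            ball_vel[1] = -ball_vel[1]          # bounce off the ceiling
        if ball.colliderect(paddle) and ball_vel[1] > 0:
            ball_vel[1] = -ball_vel[1]          # bounce off the paddle
        hit = ball.collidelist(bricks)
        if hit != -1:
            bricks.pop(hit)                     # remove the brick that was hit
            ball_vel[1] = -ball_vel[1]
        if ball.top > HEIGHT:                   # ball missed: reset to the center
            ball.center = (WIDTH // 2, HEIGHT // 2)
            ball_vel = [4, -4]

    screen.fill((20, 20, 30))
    pygame.draw.rect(screen, (200, 200, 220), paddle)
    pygame.draw.ellipse(screen, (250, 220, 80), ball)
    for brick in bricks:
        pygame.draw.rect(screen, (180, 90, 90), brick)
    pygame.display.flip()
    clock.tick(60)

pygame.quit()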

Next, I uploaded the same image to Gemini 1.5 Pro and asked it to generate the code for the game. It produced code, but when I ran it, the window kept closing and I couldn't play the game at all. Simply put, for coding tasks, ChatGPT 4o is far more reliable than Gemini 1.5 Pro.

Winner: ChatGPT 4o

The Verdict

It's abundantly clear that Gemini 1.5 Pro is far behind ChatGPT 4o. Even after months of improvements while in preview, the 1.5 Pro model can't compete with OpenAI's latest GPT-4o. From commonsense reasoning to multimodal and coding tests, ChatGPT 4o performs intelligently and follows instructions attentively. And remember, OpenAI has made ChatGPT 4o free for everyone.

The only thing going for Gemini 1.5 Pro is its massive context window, with support for up to 1 million tokens. You can also upload videos, which is an advantage. However, since the model is not very smart, I'm not sure many people will use it just for the larger context window.

At the Google I/O 2024 event, Google didn't announce any new frontier model; the company is sticking with its incremental Gemini 1.5 Pro. There's no word on Gemini 1.5 Ultra or Gemini 2.0 either. If Google wants to compete with OpenAI, a substantial leap is required.
