Gemini 2.0 Flash Thinking vs ChatGPT o1: OpenAI Thinks Deeper

After OpenAI introduced its o1 reasoning models in ChatGPT, the whole AI industry took notice and started working on “test-time compute”, also known as inference scaling. The consensus shifted from training ever-larger models to giving models more time to “think” during inference to unlock stronger reasoning capability.

Recently, Google announced its first reasoning model, “Gemini 2.0 Flash Thinking”, which, just like ChatGPT o1, re-evaluates its response before generating the final answer. The idea is to let the model verify its answer by rigorously checking the possible outcomes. Inference scaling has led to far better performance even on smaller models.

Now that Google has joined the “test-time compute” bandwagon, let’s compare its new model with OpenAI’s o1 and o1-mini models. To make the comparison more interesting, I have also included China’s DeepSeek-R1-Lite-Preview model, which takes a similar approach. On that note, here is how Gemini 2.0 Flash Thinking, ChatGPT o1, and DeepSeek R1 Lite stack up.

Reasoning Tests

Let’s start with the popular Strawberry question, in which AI models are asked to count the occurrences of the letter ‘r’ in the word. In this first test, Google’s Gemini 2.0 Flash Thinking stumbles and says there are two r’s in “Strawberry”. On the other hand, ChatGPT o1 and the smaller o1-mini model get the answer right on the first try. Finally, DeepSeek’s reasoning model also correctly says there are three r’s.

testing the strawberry question on gemini 2.0 flash thinking
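
For reference, the expected answer to the Strawberry question is easy to confirm deterministically. Below is a minimal Python sketch; the helper name count_letter is hypothetical and only used for illustration.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


# The word used in the test prompt above.
r_count = count_letter("Strawberry", "r")
print(r_count)  # 3
```

The point of the test is that language models work on tokens rather than individual characters, so a task that is trivial in code can still trip them up.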

Moving to another test, I asked all the models to list the Indian states whose names don’t contain the letter ‘a’. While Gemini 2.0 Flash Thinking correctly names Sikkim, it also includes three other states that do contain the letter ‘a’. It simply fails to reason about the letters in words. As for ChatGPT o1, o1-mini, and DeepSeek, they pass with flying colors and mention only Sikkim.

testing reasoning question on gemini 2.0 flash thinking
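
The expected answer here can also be checked with a simple filter. The sketch below assumes a hard-coded list of the 28 Indian states (union territories excluded); the list and the variable names are supplied for illustration and are not part of the original test.

```python
# Assumption: the 28 Indian states, union territories excluded.
INDIAN_STATES = [
    "Andhra Pradesh", "Arunachal Pradesh", "Assam", "Bihar", "Chhattisgarh",
    "Goa", "Gujarat", "Haryana", "Himachal Pradesh", "Jharkhand", "Karnataka",
    "Kerala", "Madhya Pradesh", "Maharashtra", "Manipur", "Meghalaya",
    "Mizoram", "Nagaland", "Odisha", "Punjab", "Rajasthan", "Sikkim",
    "Tamil Nadu", "Telangana", "Tripura", "Uttar Pradesh", "Uttarakhand",
    "West Bengal",
]

# Keep only the names with no 'a' in them, ignoring case.
states_without_a = [name for name in INDIAN_STATES if "a" not in name.lower()]
print(states_without_a)  # ['Sikkim']
```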

Next, I tried a complicated prompt crafted by Riley Goodside to check how well AI models can weave connections and come up with the right answer. Well, Gemini 2.0 Flash Thinking, o1-mini, and DeepSeek hallucinated a lot and got the answer wrong.

Name a specific instance of the entertainment form whose acronym could also stand for the first names of a group who visited a country whose future leader married an Italian.

ChatGPT o1 was the only model that correctly answered “Final Fantasy VII”, which is a JRPG video game. The letters of JRPG match the first names of the Beatles (John, Ringo, Paul, and George), who visited India, whose future leader Rajiv Gandhi married an Italian.

advanced reasoning question on gemini 2.0 flash thinking
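
The acronym step of this puzzle is the only part that can be verified mechanically. As a small sketch, the check below compares the initials of the four first names against the letters of “JRPG”; the function name initials_match is hypothetical.

```python
from collections import Counter

beatles = ["John", "Ringo", "Paul", "George"]
acronym = "JRPG"


def initials_match(names: list[str], letters: str) -> bool:
    """Return True if the names' initials use exactly the given letters."""
    initials = Counter(name[0].upper() for name in names)
    return initials == Counter(letters.upper())


print(initials_match(beatles, acronym))  # True
```

The remaining links in the chain (the visit to India and the marriage) are factual recall, which is where the other models hallucinated.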

Since both Gemini 2.0 Flash Thinking and ChatGPT o1 support image input, I uploaded an image containing a math problem from the Gemini Cookbook. In this multimodal test, Gemini 2.0 Flash Thinking clearly outperforms the ChatGPT o1 model.

a maths problem including circle and triangle — Image Credit: Google via GitHub

Gemini correctly identifies the triangle as right-angled and deduces that the overlapping region is one-fourth of the circle. From there, it simply divides the circle’s area by 4. With a radius of 3, the circle’s area is 9π, so the overlap is 9π/4, which is approximately 7.07 (or 7.065 if you take π as 3.14).

maths problem on gemini 2.0 flash thinking
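
The arithmetic behind Gemini’s answer can be reproduced in a few lines. This sketch assumes, as described above, that the overlap is exactly a quarter of a circle of radius 3; the variable names are illustrative.

```python
import math

radius = 3  # radius of the circle, as given in the problem

# Area of the full circle: pi * r^2 = 9 * pi.
circle_area = math.pi * radius ** 2

# A right angle spans 90 of the 360 degrees, i.e. one quarter of the circle.
overlap_area = circle_area / 4

print(round(overlap_area, 3))      # 7.069
print(round(3.14 * 9 / 4, 3))      # 7.065 when pi is rounded to 3.14
```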

ChatGPT o1, on the other hand, incorrectly identifies the triangle as an isosceles triangle and comes to a wrong conclusion. I feel Google is ahead of the competition when it comes to multimodal queries, especially image processing.

Early Thoughts

Google’s Gemini 2.0 Flash Thinking model is definitely better and faster, but my initial impression is that it’s not smarter than ChatGPT o1, or even the smaller o1-mini model. In my testing so far, I found ChatGPT o1 to be much more thoughtful and grounded in facts.

To be fair to Gemini 2.0 Flash Thinking, its reasoning system has been built on the smaller Gemini 2.0 Flash model, so comparing it with the SOTA ChatGPT o1 is a bit lopsided. I think we should wait for the larger Gemini 2.0 Pro Thinking model, which should scale better and deliver stronger reasoning performance.

That said, Gemini 2.0 Flash Thinking’s strength lies in its multimodal understanding, including video, audio, and image processing. In that area, it’s simply superior to competing reasoning models. Apart from that, many users have found that Gemini 2.0 Flash Thinking solves a Putnam 2024 problem and the Three Gambler’s Problem. Clearly, its use cases go beyond just reasoning.

Nevertheless, the race to solve reasoning and intelligence has just begun, and in 2025, we will see significant improvements on this front.
