ChatGPT o1 vs DeepSeek R1: Battle of Frontier AI Models

The Chinese AI lab DeepSeek recently released its frontier R1 model, which it claims matches or even surpasses OpenAI’s ChatGPT o1 model. The DeepSeek app has already soared to the top of the Apple App Store, overtaking ChatGPT, and the US tech stock market has been rattled by DeepSeek’s remarkably cost-efficient model. So, to evaluate both AI models and find out which is more capable, we’ve compared ChatGPT o1 and DeepSeek R1 on a variety of complex reasoning tests below.

ChatGPT o1 vs DeepSeek R1: Misguided Attention

Large language models are often dismissively called “Stochastic Parrots” because they lack true generalization and rely heavily on statistical pattern matching and memorization to predict the next word or token. However, with recent advancements in the AI field (e.g. OpenAI o3), the narrative is changing rather quickly as frontier models demonstrate some degree of generalization and exhibit emergent behaviors that weren’t programmed into them.

There are many common puzzles, riddles, and thought experiments in the data AI models are trained on. Hence, when you ask one of these well-known riddles, an LLM largely draws its answer from its training corpus.

However, when you slightly change the puzzle in order to misguide the model, most LLMs fall flat and repeat the learned pattern. This is where you can judge whether an AI model is genuinely reasoning or merely reciting from memory.

[Image: asking DeepSeek R1 a complex riddle]
The surgeon, who is the boy's father, says "I cannot operate on this boy, he's my son!" Who is the surgeon to the boy?

In the above problem, the prompt clearly states that the surgeon is the boy’s father, yet both ChatGPT o1 and DeepSeek R1 get it wrong. Both models answer that the surgeon is the boy’s mother, challenging the assumption that surgeons are male. The tweaked wording is designed to bait models into pattern-matching the classic riddle and landing on the wrong answer. Interestingly, Gemini 2.0 Flash (not the Thinking model) gets it right.

Winner: None

ChatGPT o1 vs DeepSeek R1: Math with Reasoning

Google has added some great problems to test reasoning models on its Cookbook page. I took one of the multimodal reasoning (+math) questions and converted it to text since DeepSeek R1 doesn’t support multimodal input yet.

I have four pool balls, labeled 7, 9, 11, and 13. How do I use three of the pool balls to sum up to 30?

In my testing, both ChatGPT o1 and DeepSeek R1 solved the problem correctly. Both models flipped the ‘9’ ball to make it ‘6’ and added 6 + 11 + 13 to reach 30. Great work by both models!
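The trick is a parity argument: all four labels are odd, and three odd numbers always sum to an odd number, so 30 is unreachable unless one ball changes value. A minimal brute-force sketch (my own illustration, not either model’s output) confirms that flipping the ‘9’ into a ‘6’ yields the only solution:

```python
from itertools import combinations

# The four ball labels; all are odd, so any three sum to an odd
# number -- 30 is unreachable unless one ball changes parity.
# The trick: the '9' ball read upside down becomes a '6'.
balls = [7, 9, 11, 13]
readings = {7: [7], 9: [9, 6], 11: [11], 13: [13]}

solutions = []
for trio in combinations(balls, 3):
    for a in readings[trio[0]]:
        for b in readings[trio[1]]:
            for c in readings[trio[2]]:
                if a + b + c == 30:
                    solutions.append((a, b, c))

print(solutions)  # [(6, 11, 13)]
```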

[Image: asking DeepSeek R1 a math reasoning question]

Winner: ChatGPT o1 and DeepSeek R1

ChatGPT o1 vs DeepSeek R1: A Question from Humanity’s Last Exam

Recently, the Center for AI Safety (CAIS) announced a benchmark called “Humanity’s Last Exam (HLE)” to track rapid AI progress across a variety of academic subjects. It contains questions from top scientists, professors, and researchers from around the world. CAIS has publicly released some of the questions as examples on its website. I picked a question from Greek mythology and tested it on ChatGPT o1 and DeepSeek R1.

In Greek mythology, who was Jason's maternal great-grandfather?
[Image: asking DeepSeek R1 about Greek mythology]

The ChatGPT o1 model thought for about 30 seconds and said the god Hermes is Jason’s maternal great-grandfather, which is correct. DeepSeek R1 thought for 28 seconds and reconstructed the lineage, but answered Aeolus, which is incorrect. While this test largely evaluates memorization, it’s still a useful way to check whether AI models can track logical relationships.

Winner: ChatGPT o1

ChatGPT o1 vs DeepSeek R1: The Trolley Problem

You must have heard of the popular Trolley Problem. Here, however, the question has been slightly changed to misguide the model, as part of the Misguided Attention evaluation (GitHub). Let’s see whether these models can get the answer right.

Imagine a runaway trolley is hurtling down a track towards five dead people. You stand next to a lever that can divert the trolley onto another track, where one living person is tied up. Do you pull the lever?

First, ChatGPT o1 thought for 29 seconds and spotted the twist — five already-dead people on one track and one living person on the other. ChatGPT o1 didn’t waste time and said not to pull the lever, because you cannot harm those who are already dead.

[Image: asking DeepSeek R1 about the trolley problem]

DeepSeek R1, on the other hand, overlooked the “dead people” part due to its over-reliance on training patterns and went off on a morality tangent, saying there is no universally correct answer. Obviously, ChatGPT o1 gets the point in this round.

Winner: ChatGPT o1

ChatGPT o1 vs DeepSeek R1: Mathematical Reasoning

In another mathematical reasoning question, I asked ChatGPT o1 and DeepSeek R1 to measure exactly 4 liters using 6- and 12-liter jugs. ChatGPT o1 thought for 1 minute and 47 seconds and said it’s mathematically impossible, which is correct. When given a problem, AI models generally try to force out an answer anyway.

I have a 6- and a 12-liter jug. I want to measure exactly 4 liters.
[Image: asking DeepSeek R1 a Misguided Attention question]

But ChatGPT o1 took a step back, calculated the greatest common divisor (GCD) of 6 and 12, which is 6, and noted that 4 is not a multiple of 6. So no sequence of “fill, empty, pour” operations can measure exactly 4 liters.

Remarkably, DeepSeek R1 thought for only 47 seconds, took the same approach, and responded, “It is mathematically impossible with these specific jug sizes.”
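The GCD argument both models relied on can be checked in a few lines. This is a sketch of my own, with a hypothetical `measurable` helper — any amount you can obtain by filling, emptying, and pouring between two jugs is a multiple of their GCD:

```python
from math import gcd

def measurable(target, a, b):
    """Any amount reachable by filling, emptying, and pouring between
    two jugs is a multiple of gcd(a, b) -- a consequence of Bezout's
    identity -- and can be at most the larger jug's capacity."""
    return target % gcd(a, b) == 0 and 0 <= target <= max(a, b)

print(measurable(4, 6, 12))  # False: gcd(6, 12) = 6, and 4 % 6 != 0
print(measurable(6, 6, 12))  # True: just fill the 6-liter jug
```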

Winner: ChatGPT o1 and DeepSeek R1

ChatGPT o1 vs DeepSeek R1: Political Censorship and Bias

Since DeepSeek is a Chinese AI lab, I expected that it would censor itself on many contentious topics related to the PRC (People’s Republic of China). However, DeepSeek R1 goes many steps further and doesn’t even let you run prompts that mention Xi Jinping – the President of China. It simply refuses to run them.

[Image: DeepSeek R1 refusing to write about Xi Jinping]

So I tried to circumvent it by asking DeepSeek R1, “Who is the president of China?” The moment it starts thinking, the model abruptly stops itself and says, “Sorry, I’m not sure how to approach this type of question yet. Let’s chat about math, coding, and logic problems instead!”

Similarly, you can’t run prompts mentioning Jack Ma, Uyghurs, dictatorship, government, or even democracy, which is baffling.

[Image: ChatGPT o1 joking about Donald Trump]

On the other hand, I asked ChatGPT o1 to write a joke about Donald Trump – the current president of the United States – and it obliged without any issues. I even asked ChatGPT o1 to make the joke a bit nasty, and it did a great job. ChatGPT o1 responded: “Donald Trump’s hair has endured more comb-overs than his business record — and both keep going under.”

Put simply, if you are looking for an AI model that is not highly censored on political topics, you should go with ChatGPT o1.

Winner: ChatGPT o1

ChatGPT o1 vs DeepSeek R1: Which Should You Use?

Setting political topics aside, DeepSeek R1 is a free and capable alternative to ChatGPT, nearly on par with the o1 model. I wouldn’t say DeepSeek R1 outperforms ChatGPT o1, though; as these tests demonstrate, OpenAI’s model consistently performs better.

That said, DeepSeek R1’s appeal lies in its affordability. You can use DeepSeek R1 for free, while OpenAI charges $20 per month for access to ChatGPT o1.

Not to forget, for developers, DeepSeek R1’s API is about 27x cheaper than ChatGPT o1’s, a monumental shift in model pricing. As for the research community, the DeepSeek team has released the model weights and open-sourced its RL (Reinforcement Learning) method for achieving test-time compute scaling, similar to OpenAI’s new paradigm with o1 models.

Furthermore, the new model architecture DeepSeek developed to train its R1 model for just $5.8 million on older GPUs will help other AI labs build frontier models at a much lower cost. Expect other AI companies to replicate DeepSeek AI’s work in the coming months.

All in all, DeepSeek R1 is more than just an AI model; it has introduced a new way to train frontier AI models on a shoestring budget, without clusters of high-priced hardware.
