OpenAI Releases o3 and o4-mini, Says o3 Can ‘Generate Novel Hypotheses’

OpenAI releases o3 and o4-mini reasoning models
Image Credit: OpenAI
In Short
  • OpenAI has finally released the full o3 reasoning AI model along with a smaller o4-mini (and o4-mini-high) model.
  • The new reasoning models can use multiple tools inside ChatGPT to solve complex tasks. They can also analyze images, charts, and graphs.
  • The o3 model delivers state-of-the-art performance in coding, math, science, and visual tasks.

In December 2024, OpenAI announced o3, its most advanced reasoning AI model, and said the model would be released after proper safety testing. Four months later, the frontier AI lab has launched the full o3 model. Alongside it, OpenAI has also released the next-generation o4-mini (and o4-mini-high) reasoning model.

In these four months, OpenAI has improved the o3 model even further and says o3 is the “most powerful reasoning model” developed by the company. Both o3 and o4-mini models can use multiple agentic tools inside ChatGPT, including web search, Python tools, and more. The reasoning models can finally analyze images as well. Both o3 and o4-mini are trained to pick the right tools, depending on the task.

OpenAI says o3 sets a new benchmark in coding, math, science, and visual tasks such as analyzing images, charts, and graphics. Early testers say that o3 can “generate and critically evaluate novel hypotheses—particularly within biology, math, and engineering contexts.”

o3 and o4-mini benchmark scores
Image Credit: OpenAI

The new o4-mini, by contrast, is a smaller model designed for speed and cost-efficiency. It excels in math, coding, and visual tasks. In fact, o4-mini achieves 99.5% on AIME 2025 when given access to a Python interpreter.

As for benchmarks, both models have nearly saturated AIME 2024 and 2025. On GPQA Diamond, o3 achieves 83.3% and o4-mini gets 81.4%. On Humanity’s Last Exam, o3 scores 20.32% without tools and 24.9% with tools. Finally, on SWE-Bench Verified, the o3 model scores 69.1%, higher than Google’s Gemini 2.5 Pro (63.8%).
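To put those numbers side by side, here is a minimal Python sketch that computes the gaps in percentage points. The scores are copied from the paragraph above. The `scores` table and the `gap` helper are illustrative names, not part of any OpenAI tooling, and nothing here is measured independently.

```python
# Scores (in percent) as reported in the article; not independently verified.
scores = {
    "GPQA Diamond": {"o3": 83.3, "o4-mini": 81.4},
    "Humanity's Last Exam": {"o3 (no tools)": 20.32, "o3 (with tools)": 24.9},
    "SWE-Bench Verified": {"o3": 69.1, "Gemini 2.5 Pro": 63.8},
}


def gap(benchmark: str, a: str, b: str) -> float:
    """Return score(a) - score(b) in percentage points, rounded to 2 places."""
    row = scores[benchmark]
    return round(row[a] - row[b], 2)


if __name__ == "__main__":
    # o3 vs. o4-mini on GPQA Diamond
    print(gap("GPQA Diamond", "o3", "o4-mini"))
    # Effect of tool access for o3 on Humanity's Last Exam
    print(gap("Humanity's Last Exam", "o3 (with tools)", "o3 (no tools)"))
    # o3 vs. Gemini 2.5 Pro on SWE-Bench Verified
    print(gap("SWE-Bench Verified", "o3", "Gemini 2.5 Pro"))
```

On these reported figures, o3 leads o4-mini by about 2 points on GPQA Diamond, tool access adds roughly 4.6 points on Humanity’s Last Exam, and o3 is about 5 points ahead of Gemini 2.5 Pro on SWE-Bench Verified.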

o3 and o4-mini multimodal and coding benchmarks
Image Credit: OpenAI

On multimodal benchmarks, both models are competitive and achieve high accuracy on MMMU, MathVista, and CharXiv-Reasoning.

Lastly, OpenAI also released Codex CLI, a new command-line agentic tool, somewhat similar to Anthropic’s Claude Code. You can run it from your terminal and take advantage of multimodal reasoning using o3 and o4-mini.

Availability: OpenAI o3 and o4-mini

As for availability, o3 and o4-mini are rolling out to ChatGPT Plus, Pro, and Team users, starting today. The two new models will replace o1, o3-mini, and o3-mini-high. OpenAI says ChatGPT Enterprise and Edu users will get access in one week. Thankfully, o4-mini is also coming to free-tier ChatGPT users, who can access it through the ‘Think’ button.

OpenAI also says that o3-pro is coming in a few weeks with support for all tools. Meanwhile, ChatGPT Pro users can continue to use the o1-pro model.

OpenAI o3 is a Powerful Reasoning Model

In case you missed the 2024 announcement, OpenAI’s o3 reasoning model was the first to crack the ARC-AGI benchmark, scoring an impressive 87.5% on the ARC-AGI Semi-Private Evaluation set in a high-compute configuration. François Chollet, the creator of ARC-AGI, noted in a blog post:

This is not merely incremental improvement, but a genuine breakthrough, marking a qualitative shift in AI capabilities compared to the prior limitations of LLMs. o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain.

However, it was also revealed that o3 had been trained on 75% of the ARC-AGI Public Training set, raising questions about how much of o3’s performance came from generalized intelligence and how much from benchmark-specific tuning.

Nevertheless, a recent report from The Information says that o3 can blend information from multiple fields, much as Nikola Tesla did. It can come up with novel scientific ideas and experiments in areas like nuclear fusion and pathogen detection. In fact, OpenAI reportedly believes that its capabilities are powerful enough to justify a $20,000 per month pricing tier and calls it a “PhD-level AI.”
