OpenAI Unveils o3 Model and Becomes First to Crack the ARC-AGI Benchmark in 5 Years

In Short
  • OpenAI announced a new state-of-the-art o3 reasoning model that finally cracked the hallowed ARC-AGI benchmark.
  • On the ARC-AGI Semi-Private Evaluation Set, OpenAI o3 scored a whopping 87.5%, beating the 85% mark that humans generally achieve.
  • OpenAI also announced o3-mini, which is optimized for coding and offers faster performance at a lower cost.

On the final day of its “12 Days of OpenAI” announcements, OpenAI revealed its biggest update: the o3 and o3-mini reasoning models. Most notably, o3 became the first AI model to crack the hallowed ARC-AGI benchmark, which had gone unbeaten for five years.

Image: OpenAI o3 results on the ARC-AGI benchmark (Credit: OpenAI via YouTube)

On the ARC-AGI Semi-Private Evaluation Set, OpenAI’s o3 model scored a whopping 87.5% when using high-compute resources and given more time to think. The ARC Prize threshold is set at 85%, close to what humans generally achieve. For comparison, OpenAI’s o1 model scored only 32%.

ARC-AGI is designed to test AI models for general intelligence, focusing on the ability to solve novel problems rather than relying on memorized patterns. So with the o3 model, OpenAI has indeed achieved a historic breakthrough in generalized intelligence. It may bring OpenAI closer to achieving AGI (Artificial General Intelligence), an AI system that can match or exceed human intelligence.

Images: OpenAI o3 results on Codeforces, GPQA Diamond, and AIME 2024

Besides ARC-AGI, OpenAI o3 scored 71.7% on SWE-bench Verified, reached a Codeforces rating of 2,727, and scored 96.7% on AIME 2024 and 87.7% on GPQA Diamond. All these tests are highly challenging, and the scores are significantly higher than what o1 achieved. Finally, on Epoch AI’s FrontierMath benchmark, whose problems can take expert mathematicians hours to solve, OpenAI o3 achieved 25.2% accuracy. The earlier best score was just 2.0%.

Images: o3-mini results on AIME 2024 and Codeforces

As for the o3-mini model, OpenAI says it is distilled from o3 and optimized for coding, fast performance, and cost efficiency. o3-mini has three compute settings: low, medium, and high. At the medium setting, o3-mini outperforms the larger o1 model while costing less and responding with lower latency.

In case you are wondering why it is called o3 and not o2: OpenAI decided to skip the o2 name altogether to avoid legal issues with O2, the UK-based mobile network operator.

Finally, on availability, OpenAI says it is performing safety testing on the o3 and o3-mini models. The company is also opening up the o3-mini model for public safety testing. OpenAI plans to release o3-mini by the end of January 2025, with the o3 model to follow after rigorous testing and approval by regulators.
