Anthropic Announces Claude 3 AI Models; Beats GPT-4 and Gemini 1.0 Ultra

In Short
  • Anthropic, backed by Google and Amazon, has released a new family of Claude 3 AI models -- Opus, Sonnet, and Haiku.
  • The largest Claude 3 Opus model beats OpenAI's GPT-4 model and Google's Gemini 1.0 Ultra model in all key benchmarks.
  • All three models support a context window of 200K tokens and come with vision capability as well.

Another week, another AI model has surpassed GPT-4, at least on benchmarks. This time, it’s Anthropic, the company formed by siblings and ex-OpenAI members Daniela and Dario Amodei. The company has launched a family of Claude 3 models: Opus (largest and most capable), Sonnet (mid-size), and Haiku (smallest). Anthropic says the Claude 3 Opus model beats GPT-4 and Gemini 1.0 Ultra on all popular benchmarks.

Claude 3 Benchmarks

Anthropic has tested all three models on popular benchmarks like MMLU, GPQA, GSM8K, MATH, HumanEval, HellaSwag, and more. On MMLU, Claude 3 Opus scored 86.8% whereas GPT-4 has a reported score of 86.4%. Gemini 1.0 Ultra got 83.7% with the same 5-shot prompting technique.

claude 3 vs gpt-4 vs gemini ultra benchmarks
Image Courtesy: Anthropic

On the HumanEval benchmark that tests coding ability, the largest Opus model scored 84.9%, much higher than GPT-4’s 67% and Gemini 1.0 Ultra’s 74.4%. The Claude 3 Opus model even edged out GPT-4 in the HellaSwag test, though only by a slim margin. It scored 95.4% whereas GPT-4 got 95.3% and Gemini 1.0 Ultra achieved 87.8%.
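To put those margins side by side, here is a minimal Python sketch that computes how far Opus leads each rival on the three benchmarks quoted above. The scores are the ones reported in this article; the `SCORES` table and the `opus_margin` helper are illustrative names of our own, not anything published by Anthropic.

```python
# Benchmark scores (%) as quoted in this article.
SCORES = {
    "MMLU": {"Claude 3 Opus": 86.8, "GPT-4": 86.4, "Gemini 1.0 Ultra": 83.7},
    "HumanEval": {"Claude 3 Opus": 84.9, "GPT-4": 67.0, "Gemini 1.0 Ultra": 74.4},
    "HellaSwag": {"Claude 3 Opus": 95.4, "GPT-4": 95.3, "Gemini 1.0 Ultra": 87.8},
}


def opus_margin(benchmark: str, rival: str) -> float:
    """Return Opus's lead over `rival` in percentage points (hypothetical helper)."""
    row = SCORES[benchmark]
    # Round to one decimal place to hide floating-point noise.
    return round(row["Claude 3 Opus"] - row[rival], 1)


for benchmark, row in SCORES.items():
    for rival in row:
        if rival != "Claude 3 Opus":
            print(f"{benchmark}: Opus leads {rival} by {opus_margin(benchmark, rival)} points")
```

Seen this way, the lead over GPT-4 is large on HumanEval (17.9 points) but very thin on MMLU (0.4) and HellaSwag (0.1).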

Claude 3 Capabilities

Overall, the largest Claude 3 Opus model looks very promising, and we will definitely test it against GPT-4, Gemini 1.5 Pro, and Mistral Large, so stay tuned. Apart from that, Anthropic says that all three models are strong at analysis and forecasting, nuanced content creation, and code generation, and are fluent in non-English languages like Spanish, Japanese, and French.

opus vision capability
Image Courtesy: Anthropic

Claude 3 models also have vision capability, although Anthropic is not marketing them as multimodal models. Anthropic says the vision capability in Claude 3 can help enterprise customers process charts, graphs, and technical diagrams. On vision benchmarks, Claude 3 does better than GPT-4V but slightly lags behind Gemini 1.0 Ultra.

200K Context Length

In terms of context length, Anthropic says that all three models will initially offer a context window of 200K tokens, which is quite large, I must say. In addition, the company says that Claude 3 family models can process more than 1 million tokens, but this capability will be available to select customers only.

opus niah test
Image Courtesy: Anthropic

On the Needle In A Haystack (NIAH) test with over 200K tokens, the Opus model performed exceptionally well with over 99% accurate retrieval, just like Gemini 1.5 Pro. Claude has been one of the best AI models for long context retrieval, and the performance has significantly improved with Claude 3.

Performance and Pricing

Coming to performance, Anthropic states that Claude 3 models are quite fast: the largest Opus model runs at about the same speed as Claude 2 and 2.1, but with better intelligence, while the mid-size Sonnet model is almost 2x faster than Claude 2 and 2.1. On top of that, Anthropic mentions that Claude 3 models are significantly less likely to refuse to answer, which was an issue in earlier models.

You can start using the flagship Opus model by subscribing to Claude Pro, which costs $23.60 after taxes. The mid-size Claude 3 Sonnet is already deployed on the free version of Claude. Finally, developers can immediately access APIs for the Opus and Sonnet models.

claude 3 API pricing
Image Courtesy: Anthropic

As for the API pricing, Claude 3 Opus with a 200K context window costs $15 per one million input tokens and $75 per one million output tokens. Compared to GPT-4 Turbo ($10 input / $30 output per one million tokens, with a 128K context window), Opus is quite expensive.
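To see what that difference means per request, here is a rough Python sketch that turns the per-million-token prices above into a cost estimate. The prices are the ones quoted in this article; the dictionary keys are plain labels rather than official API model identifiers, and the workload (100,000 input tokens, 2,000 output tokens) is a made-up example.

```python
# USD per one million tokens, as quoted in this article.
# The keys are illustrative labels, not official API model names.
PRICES = {
    "claude-3-opus": {"input": 15.00, "output": 75.00},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of a single request (hypothetical helper)."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical workload: a long document in, a short summary out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 2_000):.2f}")
```

For this example, the request comes to about $1.65 on Opus versus $1.06 on GPT-4 Turbo. The gap widens as the share of output tokens grows, since Opus charges 2.5x more for output than GPT-4 Turbo but only 1.5x more for input.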

So, what do you think about the new family of models released by Anthropic, especially the Opus model? Let us know in the comment section below.

SOURCE Anthropic

