- OpenAI has revealed it has been testing a new AI model on ChatGPT since last week.
- Turns out, it's an updated "chatgpt-4o-latest" model, which is said to bring improvements in coding, reasoning, and creative tasks.
- On the LMSYS leaderboard, the new ChatGPT model has reclaimed the top spot with a score of 1314 points.
Increasingly, AI companies are testing new and experimental models under strange names on the LMSYS Chatbot Arena and quietly deploying them without any release notes. Case in point, since last week, X users have been discussing improved performance on ChatGPT, whether for coding or creative tasks. Many believed it was a new OpenAI model, likely related to Project Strawberry — a new advanced reasoning engine.
Finally, OpenAI let the genie out of the bottle and revealed that ChatGPT is indeed running a new model. It’s not a new frontier-class model but an improved GPT-4o model. The release note says that it is an updated GPT-4o model optimized for chat, and its name is chatgpt-4o-latest. Based on qualitative feedback and experiment results, OpenAI has tuned the GPT-4o model for better performance.
OpenAI further says that it continues to remove bad data from the training dataset and add good data, along with “experimenting with new research methods.” This is where the intrigue begins. Project Strawberry is supposed to bring a new post-training method to improve reasoning. Is the new ChatGPT model already running the Strawberry engine?
I can’t say for sure, but many X users noticed that ChatGPT now uses multi-step reasoning to give correct answers. In this method, the model generates several step-by-step reasoning rationales and uses them to work toward a correct conclusion.
By the way, OpenAI also tested the new ChatGPT model on LMSYS under the name “anonymous-chatbot” and it received more than 11,000 votes. The new “chatgpt-4o-latest” model has again taken the first spot, outranking other AI models from Google, Anthropic, and Meta. It has become the first model to score 1314 points in LMSYS Arena.
Does the New ChatGPT Model Pass the Vibe Test?
To test the updated ChatGPT model, I tried a few reasoning prompts, and well, I did not find much difference between the older and the latest model. I asked it which number is bigger, 9.11 or 9.9, and it gave a correct response, just like before. I also ran other commonsense reasoning questions, and its answers were in line with the older model.
However, on some prompts, it still fails to get the answer right. For example, in response to the prompt below, it tells me to stack 9 eggs on top of the bottle, which is impossible.
Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.
In another test, it says that there are only two “R”s in the word strawberry, which is again incorrect.
how many Rs are in strawberry?
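For reference, the expected answers to the two objective prompts in this test (the letter count and the 9.11 vs 9.9 comparison) can be checked with a few lines of Python. This is only a minimal sketch of the ground truth, not how ChatGPT arrives at its answers; the helper names `count_letter` and `larger_decimal` are made up for illustration.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def larger_decimal(a: str, b: str) -> str:
    """Return whichever decimal string represents the larger value."""
    return a if Decimal(a) > Decimal(b) else b


# Ground truth for the two prompts discussed in this article.
print(count_letter("strawberry", "R"))  # 3
print(larger_decimal("9.11", "9.9"))    # 9.9
```

So the word has three "R"s, and 9.9 is the bigger number.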
It might be the case that the new ChatGPT model has not been rolled out widely. Either way, with OpenAI’s new model, we can expect improvements in other key areas. If you have any queries, let us know in the comments below.