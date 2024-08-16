Increasingly, AI companies are testing new and experimental models under strange names on the LMSYS Chatbot Arena and quietly deploying them without any release notes. Case in point: since last week, X users have been discussing ChatGPT's improved performance on both coding and creative tasks. Many believed it was a new OpenAI model, likely related to Project Strawberry, a new advanced reasoning engine.

Something might be going on w/ GPT-4o

For the first time in a long time, it provided better "vibes" on an output than 3.5 Sonnet

Really surprised… will keep using it today to see if it continues
- Matt Shumer (@mattshumer_), August 12, 2024

Finally, OpenAI let the cat out of the bag and revealed that ChatGPT is indeed running a new model. It's not a new frontier-class model but an improved GPT-4o model. The release note says that it is an updated GPT-4o model optimized for chat, and its name is chatgpt-4o-latest. OpenAI says it tuned the GPT-4o model for better performance based on qualitative feedback and experiment results.

there's a new GPT-4o model out in ChatGPT since last week. hope you all are enjoying it and check it out if you haven't! we think you'll like it 😃
- ChatGPT (@ChatGPTapp), August 12, 2024

OpenAI further says that it continues to remove bad data from the training dataset and add good data, along with “experimenting with new research methods.” This is where the intrigue begins. Project Strawberry is supposed to bring a new post-training method to improve reasoning. Is the new ChatGPT model already running the Strawberry engine?

Wow, GPT-4o now uses multi-step reasoning. impressive to see this in action. Turns out the update wasn’t a new model, but a new method. pic.twitter.com/kVF0ndA21T
- Ra (@misaligned_agi), August 13, 2024

I can’t say for sure, but many X users noticed that ChatGPT now uses multi-step reasoning to give correct answers. In this method, the model improves itself by generating several step-by-step reasoning rationales and working through them to arrive at a correct conclusion.

By the way, OpenAI also tested the new ChatGPT model on LMSYS under the name “anonymous-chatbot” and it received more than 11,000 votes. The new “chatgpt-4o-latest” model has again taken the first spot, outranking other AI models from Google, Anthropic, and Meta. It has become the first model to score 1314 points in LMSYS Arena.

Exciting Update from Chatbot Arena!

The latest @OpenAI ChatGPT-4o (20240808) API has been tested under "anonymous-chatbot" for the past week with over 11,000 community votes.

OpenAI has now successfully re-claimed the #1 position, surpassing Google's Gemini-1.5-Pro-Exp with an… https://t.co/9lJlASI9UW pic.twitter.com/gxCDuBOi9N
- lmsys.org (@lmsysorg), August 14, 2024

Does the New ChatGPT Model Pass the Vibe Test?

To test the updated ChatGPT model, I tried a few reasoning prompts, and well, I did not find much difference between the older and the latest model. I asked it which number is bigger, 9.11 or 9.9, and it gave a correct response, just like before. I also ran other commonsense reasoning questions, and its answers were in line with the older model.
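For reference, the 9.11 versus 9.9 comparison can be checked in a couple of lines of Python. The sketch below is only an illustration: the function names are hypothetical, and the "version number" reading is an assumption about one way the two values can be misread, not something OpenAI or the test above confirms.

```python
from decimal import Decimal


def bigger(a: str, b: str) -> str:
    """Return whichever decimal string is numerically larger."""
    return a if Decimal(a) > Decimal(b) else b


def bigger_as_version(a: str, b: str) -> str:
    """Compare dot-separated parts as integers, the way version numbers are read.

    Assumption for illustration: this mimics a reading where 9.11 means
    "nine point eleven" and therefore looks larger than 9.9.
    """
    def key(s: str) -> tuple:
        return tuple(int(part) for part in s.split("."))

    return a if key(a) > key(b) else b


print(bigger("9.11", "9.9"))             # 9.9
print(bigger_as_version("9.11", "9.9"))  # 9.11
```

As decimal numbers, 9.9 is the bigger value; only the version-style reading puts 9.11 ahead.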

However, on some prompts, it still fails to get the answer right. For example, in response to the prompt below, it tells me to stack the 9 eggs on top of the bottle, which is impossible.

Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.

In another test, it says that there are only two “R”s in the word strawberry, which is again incorrect: there are three.

how many Rs are in strawberry?
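The expected answer is easy to verify programmatically. This is a minimal sketch with hypothetical helper names (`count_letter`, `letter_positions`), included only to show the ground truth the model's reply is being compared against.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())


def letter_positions(word: str, letter: str) -> list:
    """Return the 1-based positions where the letter appears."""
    target = letter.lower()
    return [i for i, ch in enumerate(word.lower(), start=1) if ch == target]


print(count_letter("strawberry", "R"))      # 3
print(letter_positions("strawberry", "R"))  # [3, 8, 9]
```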

It is possible that the new ChatGPT model has not been rolled out widely yet. Either way, with OpenAI’s new model, we can expect improvements in other key areas. If you have any queries, let us know in the comments below.