Google’s Experimental Gemini Model Tops the Leaderboard, But Stumbles in My Tests

Google recently released its experimental ‘Gemini-exp-1114’ model in AI Studio for developers to test. Many speculate that it’s the next-gen Gemini 2.0 model, which Google is expected to release in the coming months. Meanwhile, the search giant put the model up on Chatbot Arena, where users vote on which of two models offers the better response.

After receiving more than 6,000 votes, Google’s Gemini-exp-1114 model has topped the LMArena leaderboard, outranking ChatGPT-4o and Claude 3.5 Sonnet. However, it drops to fourth place with Style Control enabled, a setting that separates the substance of a model’s response from the presentation and formatting that can sway voters.

Nevertheless, I was curious to test the Gemini-exp-1114 model, so I ran some of the reasoning prompts I have used in the past to compare Gemini 1.5 Pro and GPT-4. In my testing, I found that Gemini-exp-1114 failed to correctly answer the strawberry question.

It still says there are two r’s in the word ‘strawberry’. On the other hand, OpenAI’s o1-mini model correctly says there are three r’s after thinking for six seconds.

(Image: testing the strawberry question on the upcoming Gemini model)
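For reference, the strawberry question is trivial to verify outside of any LLM; a couple of lines of plain Python (just string counting, nothing model-specific) confirm the correct answer:

```python
# Count how many times 'r' appears in 'strawberry'.
word = "strawberry"
print(word.count("r"))  # prints 3 (s-t-r-a-w-b-e-r-r-y)
```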

One thing to note, though: the Gemini-exp-1114 model takes some time to respond, which gives the impression that it might be running chain-of-thought (CoT) reasoning in the background, but I can’t say for sure. Some recent reports suggest that LLM scaling has hit a wall, so Google and Anthropic, like OpenAI, are working on inference-time scaling to improve model performance.

Next, I asked the Gemini-exp-1114 model to count the number of ‘q’s in the word ‘vague’, and this time, it correctly answered zero (the same string-counting check above gives zero as well). OpenAI’s o1-mini model also gave the right answer. However, the next question, which has stumped many frontier models, trips up Gemini-exp-1114 too.

Reasoning Tests on the Upcoming Gemini Model

The question below comes from a 2023 Microsoft Research paper (“Sparks of Artificial General Intelligence”) that probed the reasoning abilities of AI models. In this test, the Gemini-exp-1114 model tells me to put a carton of 9 eggs on top of the bottle, which is both unstable and introduces an object that isn’t in the prompt. ChatGPT o1-preview, however, responds correctly and says to place the 9 eggs in a 3×3 grid on top of the book. For what it’s worth, o1-mini fails this test.

Here we have a book, 9 eggs, a laptop, a bottle and a nail. Please tell me how to stack them onto each other in a stable manner.

In another reasoning question, Gemini-exp-1114 again gets it wrong and says the answer is four brothers and one sister. ChatGPT o1-preview gets it right: two sisters and three brothers. Since each brother has two brothers, there must be three boys in total, and the narrator plus the sister she mentions make two girls.

I have 3 brothers. each of my brothers have 2 brothers. My sister also has 3 brothers. How many sisters and brothers are there?
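If you want to sanity-check the riddle yourself, a brute-force search settles it. Here’s a minimal sketch, under two assumptions of mine that the prompt doesn’t spell out: the narrator is one of the siblings, and “my sister” means the narrator has exactly one sister.

```python
# Brute-force the sibling riddle over small family sizes.
# Assumptions (mine): the narrator is a sibling, and "my sister"
# means the narrator has exactly one sister.
for boys in range(1, 10):
    for girls in range(1, 10):
        for narrator_is_boy in (True, False):
            my_brothers = boys - 1 if narrator_is_boy else boys
            my_sisters = girls if narrator_is_boy else girls - 1
            if (my_brothers == 3          # "I have 3 brothers"
                    and boys - 1 == 2     # each brother has 2 brothers
                    and my_sisters == 1   # "my sister" (exactly one)
                    and boys == 3):       # and she has 3 brothers
                who = "a boy" if narrator_is_boy else "a girl"
                print(f"{boys} brothers and {girls} sisters in total "
                      f"(the narrator is {who})")
```

The only consistent family is three boys and two girls, with the narrator being one of the girls, which matches o1-preview’s answer.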

I am surprised that Gemini-exp-1114 ranked first in Hard Prompts on Chatbot Arena. In terms of overall intelligence, OpenAI’s o1 models are the best out there, along with the improved Claude 3.5 Sonnet for coding tasks. So, are you disappointed by Google’s upcoming model, or do you still think Google can beat OpenAI in the AI race? Let us know in the comments below.
