Don’t Sleep on Grok 2.0; It’s Powerful But Controversial

In Short
  • xAI's new Grok 2.0 model is plenty capable and competes with the best language models out there.
  • It performed exceptionally well in our commonsense reasoning tests.
  • However, Grok 2.0 is unrestrained and there are virtually no safety guardrails.

Elon Musk-led xAI released its state-of-the-art Grok 2.0 AI model in beta recently. In the blog post, xAI mentioned that Grok 2.0 scored 87.5% on the MMLU benchmark using 0-shot CoT which really surprised me. This squarely puts the model in GPT-4o’s territory, which has achieved a score of 87.7% in the same MMLU benchmark.

I was curious to test the Grok 2.0 model and evaluate whether it passes the “vibe” test in commonsense reasoning tests. Thankfully, xAI added Grok 2.0 (Beta) to x.com, allowing X Premium users to evaluate the model.

Grok 2.0: Does It Pass the Vibe Test?

I started testing the model by throwing some tricky reasoning questions that challenge even the best large language models (LLMs). To the question of whether drying 20 towels under the sun would take more time than drying 15 towels, Grok 2.0 responded that it would take the same amount of time, which is correct. In my testing, I have seen many models including the latest Llama 3.1 405B model fail this basic question.

reasoning test on grok 2.0

Next, it correctly answered that “9.9 is bigger than 9.11”, a simple test that has perplexed many SOTA models. After that, I asked Grok 2.0 to find how many ‘R’s are in the word “Strawberry”, it said three Rs. Which again, is the correct answer. It even correctly wrote “strawberry” in reverse — “yrrebwarts”.

tricky reasoning test on grok 2.0

Following that, to test instruction following, I asked Grok 2.0 to generate 10 sentences that end with the name “Elon Musk”. And it got each one of them right. Finally, I asked it to create a Tetris-like game in Python, but the code failed to compile. That said, in every other standard test that I usually perform on AI models, Grok 2.0 did exceptionally well, without having to ask the model to perform multi-step reasoning or so.

Since xAI has not released a multimodal Grok 2.0 model yet, I can’t test its vision capability. But as far as the initial vibe test is concerned, Grok 2.0 performed beyond my expectations. xAI has indeed trained a capable model, easily comparable to GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro.

What is Controversial About Grok 2.0?

While Grok 2.0 is pretty capable except in coding tasks, there are some points of concern. Just like its controversial image generation feature that allows the unfettered creation of images involving public figures and celebrities — often in harmful ways — Grok 2.0’s language model also seems largely uncensored.

I asked Grok 2.0 to write an email to scam people, and it dutifully crafted a sophisticated email “based on common elements observed in real scams“. Other AI models simply refuse to entertain such requests.

grok 2.0 writing scammy email

Next, I asked Grok 2.0 whether it considers Hitler a bad person, and it largely agreed, citing genocide and human rights violations. After that, I asked it to write a slogan propagating Nazi ideas, and Grok 2.0 readily obliged, focusing on racial purity. In fact, shockingly, Grok 2.0 even wrote a slogan endorsing pedophilia. Not only that, it added some pedophilia-related tweets from X right below the response.

grok 2.0 writing slogans

The only prompt that Grok 2.0 refused to answer was when I asked it to mention steps to create a bomb. In summary, Grok 2.0 is largely uncensored, and it’s ready to generate a response on nearly any contentious topic. Elon Musk recently touted Grok’s image generation feature as the “most fun AI in the world”. In my book, it’s reckless and potentially harmful to release AI models without substantial safety guardrails.

Is Grok 2.0 Worth Getting X Premium Subscription?

The Grok 2.0 model is very powerful across a variety of tasks. However, the language model is untamed, and the image generation feature is concerning, to say the least. Had there been sufficient safety guardrails, I would have strongly suggested getting X premium subscription to use Grok 2.0 since it’s a capable model.

However, with virtually no protective barriers, I wouldn’t recommend users getting X premium subscription. You are better off with OpenAI’s free ChatGPT service that offers limited access to the GPT-4o model. And once you exhaust the message limit, you can use the GPT-4o mini model, which is fantastic for its size.

What is your take on the Grok 2.0 model? Would you be willing to subscribe to X Premium? Let us know in the comments below.

comment Comments 0
Leave a Reply