Gemma 2 vs Llama 3: Best Open-Source AI Model?

In Short
  • Google's newest Gemma 2 27B claims to be the best open-source model, despite being much smaller than Llama 3 70B.
  • In our tests, Gemma 2 shows great potential against Llama 3 but fizzles out in commonsense reasoning tests.
  • Despite being roughly 2.6x smaller, Gemma 2 27B indeed impressed me with its creative writing, multilingual ability, and perfect memory recall.

At I/O 2024, Google announced its next-generation Gemma 2 family of models, and the company has now released these lightweight models under an open-source license. The new Gemma 2 27B model is said to be very promising, outranking several larger models like Llama 3 70B and Qwen 1.5 32B. To test that claim, we compared Gemma 2 with Llama 3, two of the leading open-source models out there. On that note, let’s begin.

Gemma 2 vs Llama 3: Creative Writing

Let’s first check how good Gemma 2 and Llama 3 are when it comes to creative writing. I asked both models to write a short story about the moon’s relationship with the sun. Both did a great job, but Google’s Gemma 2 model knocked it out of the park with delightful prose and a beautiful story to boot.

  • gemma 2 at creative writing
  • llama 3 at creative writing

Llama 3, on the other hand, seemed a bit dull and robotic, almost AI-like, as we have seen with OpenAI’s models. Google has always been good at text generation, as we have seen with its Gemini models, and the same streak continues with the smaller Gemma 2 27B model.

Winner: Gemma 2

Multilingual Test

In the next round, I tried to understand how well both models handle non-English languages. Since Google touts that Gemma 2 is very good at multilingual understanding, I pitted it against Meta’s Llama 3 model. I asked both models to translate a paragraph written in Hindi. And well, both Gemma 2 and Llama 3 performed exceptionally well.

  • gemma 2 at multilingual test.
  • llama 3 at multilingual test

I also tried another language, Bengali, and the models performed along the same lines. At least for regional Indian languages, I would say that both Gemma 2 and Llama 3 have been trained on a large corpus of data. That said, Gemma 2 27B is roughly 2.6x smaller than Llama 3 70B, which makes its performance even more impressive. I am going to give this round to both models.

Winner: Gemma 2 and Llama 3

Gemma 2 vs Llama 3: Reasoning Test

While Gemma 2 and Llama 3 are not the most intelligent models out there, I took the liberty of running some of the commonsense reasoning tests that I usually run on much larger models. In our earlier comparison between Llama 3 and GPT-4, I came away impressed by Meta’s 70B model because it exhibited fairly decent intelligence even with a smaller footprint.

  • gemma 2 reasoning test
  • llama 3 reasoning test

Well, in this round, Llama 3 beats Gemma 2 by a wide margin. Llama 3 answered two out of three questions correctly, whereas Gemma 2 struggled to get even one right. Gemma 2 simply does not seem to be trained for solving complex reasoning questions.

Llama 3, on the other hand, has a strong reasoning foundation, most likely derived from the coding data it was trained on. Despite its small size, at least in comparison to trillion-parameter models like GPT-4, it showcases more than a decent level of intelligence. Ultimately, using more tokens to train the model does result in a stronger model.

Winner: Llama 3

Follow User Instructions

In the next round, I asked Gemma 2 and Llama 3 to generate 10 sentences that end with the word “NPU”. Llama 3 aced it with 10/10 correct responses. In contrast, Gemma 2 got only 7 out of 10 right. Over the past several releases, we have seen that Google’s models, including Gemini, don’t follow user instructions well, and the same trend continues with Gemma 2.

  • gemma 2 user following test
  • llama 3 user instruction following test

Following user instructions is crucial for AI models. It ensures reliable and accurate responses to what you asked for. On the safety side too, it helps keep the model grounded so that it adheres better to safety protocols.
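
A check like the one in this round can also be scored automatically rather than by eye. Below is a minimal Python sketch, assuming the model's output has already been split into one sentence per list item. The helper names and the sample sentences are hypothetical and are not the actual responses from either model.

```python
import string


def ends_with_word(sentence: str, word: str) -> bool:
    """Return True if the last word of `sentence` equals `word`.

    Surrounding punctuation (periods, quotes, etc.) and letter case are ignored.
    """
    tokens = sentence.strip().split()
    if not tokens:
        return False
    last = tokens[-1].strip(string.punctuation)
    return last.lower() == word.lower()


def score_responses(sentences: list[str], word: str) -> int:
    """Count how many sentences follow the 'end with this word' instruction."""
    return sum(ends_with_word(s, word) for s in sentences)


# Hypothetical sample output, used only to illustrate the scoring.
sample = [
    "The new laptop ships with a dedicated NPU.",
    "An NPU speeds up on-device AI workloads.",
    'Most flagship phones now include an "NPU".',
]

correct = score_responses(sample, "NPU")
print(f"{correct} of {len(sample)} sentences end with the target word")
```

Stripping punctuation before comparing is a design choice: a sentence that ends with “NPU.” still counts, while one that merely contains the word does not.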

Winner: Llama 3

Gemma 2 vs Llama 3: Find the Needle

Both Gemma 2 and Llama 3 have a context length of 8K tokens, so this test is quite an apples-to-apples comparison. I added a huge block of text, sourced directly from the book Pride and Prejudice, containing more than 17,000 characters and 3.8K tokens. As I always do, I placed a needle (a random statement) somewhere in the middle and asked both models to find it.

  • gemma 2 memory recall test
  • llama 3 memory recall test

Well, Gemma 2 quickly found the needle and pointed out that the statement was randomly inserted. Llama 3 also found the needle and suggested that the statement seemed out of place. As far as long-context memory is concerned, albeit limited to 8K tokens, I think both models are quite strong in this regard.
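
For readers who want to reproduce a needle test like this one, here is a rough Python sketch of how such a prompt could be assembled. It is an illustration under stated assumptions, not the exact procedure used in this article: the function names, the filler text, the needle, and the question wording are all hypothetical, the 8K limit is assumed to mean 8,192 tokens, and the token estimate is a heuristic based only on the figures quoted above (about 17,000 characters for about 3.8K tokens), not a real tokenizer.

```python
# Heuristic ratio from the article's own figures; not a real tokenizer.
CHARS_PER_TOKEN = 17_000 / 3_800
CONTEXT_LIMIT = 8_192  # assumption: "8K" means 8,192 tokens


def insert_needle(haystack: str, needle: str, position: float = 0.5) -> str:
    """Insert `needle` at a sentence boundary near the relative `position`."""
    sentences = haystack.split(". ")
    idx = int(len(sentences) * position)
    # Clamp so the needle never lands after the final sentence.
    idx = max(0, min(idx, len(sentences) - 1))
    sentences.insert(idx, needle.rstrip(". "))
    return ". ".join(sentences)


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return round(len(text) / CHARS_PER_TOKEN)


def build_prompt(haystack: str, needle: str) -> str:
    """Combine the text containing the needle with the retrieval question."""
    document = insert_needle(haystack, needle)
    question = "One statement in the text above does not belong. Which one is it?"
    return f"{document}\n\n{question}"


# Hypothetical stand-in for the book excerpt.
haystack = " ".join(f"This is filler sentence number {i}." for i in range(1, 201))
needle = "The blue kettle sings at midnight."

prompt = build_prompt(haystack, needle)
tokens = estimate_tokens(prompt)
print(f"~{tokens} tokens, fits in context: {tokens < CONTEXT_LIMIT}")
```

Placing the needle at a sentence boundary keeps the surrounding text readable, so the model has to notice that the statement is out of place rather than spot a broken sentence.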

Do note that I ran this test on HuggingChat (website) after the prompt was refused elsewhere, most likely due to copyrighted content.

Winner: Gemma 2 and Llama 3

Hallucination Test

Smaller models tend to exhibit hallucinations due to limited training data, often fabricating information when they encounter unfamiliar topics. So I threw a made-up country name at Gemma 2 and Llama 3 to check whether they hallucinate. And to my surprise, they did not, which means both Google and Meta have grounded their models pretty well.

  • gemma 2 hallucination test
  • llama 3 hallucination test

I threw another (wrong) question to check the models’ factuality, but again, they didn’t hallucinate. By the way, I tested Llama 3 on HuggingChat rather than on a service that browses the internet to find current information on relevant topics.

Winner: Gemma 2 and Llama 3

Gemma 2 vs Llama 3: Conclusion

While Google’s Gemma 2 27B model didn’t perform well in reasoning tests, I still find it capable at several other tasks. It’s very good at creative writing, supports a multitude of languages, has good memory recall, and best of all, doesn’t hallucinate like earlier models.

Of course, Llama 3 is better, but it’s also a significantly larger model with 70 billion parameters. I think developers would find the Gemma 2 27B model useful for many use cases. And for on-device inference, Gemma 2 9B is also available.
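
To put the size difference in perspective, here is a back-of-the-envelope Python sketch. It uses the nominal parameter counts from the model names (the exact counts differ slightly) and estimates the memory needed for the weights alone at a given precision. The function name is hypothetical, and the estimate ignores the KV cache, activations, and runtime overhead, so real memory use will be higher.

```python
# Nominal parameter counts taken from the model names (an approximation).
PARAMS = {
    "Gemma 2 9B": 9e9,
    "Gemma 2 27B": 27e9,
    "Llama 3 70B": 70e9,
}


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for the weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


ratio = PARAMS["Llama 3 70B"] / PARAMS["Gemma 2 27B"]
print(f"Llama 3 70B has about {ratio:.2f}x the parameters of Gemma 2 27B")

for name, count in PARAMS.items():
    fp16 = weight_memory_gb(count, 16)  # 16-bit weights
    int4 = weight_memory_gb(count, 4)   # 4-bit quantized weights
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

Even as a rough estimate, this shows why the 9B variant is the one suggested for on-device use: its weights are a fraction of the size of either larger model.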

Besides that, I would also recommend that users check out Gemini 1.5 Flash, which is again a much smaller model and supports multimodal input as well. Not to mention, it’s extremely fast and efficient.
