- Groq, a company founded by ex-Google TPU engineers, has built an LPU that can generate outputs at lightning speed.
- Cutting-edge Nvidia GPUs, which are used for AI inferencing in ChatGPT, top out at around 30 to 60 tokens per second.
- Groq claims its LPUs deliver 10x the performance at 1/10th the latency while consuming far less energy than Nvidia GPUs.
While using ChatGPT, especially with the GPT-4 model, you must have noticed how slowly the model responds to queries. Voice assistants built on large language models, like ChatGPT’s Voice Chat feature or the recently released Gemini AI that replaced Google Assistant on Android phones, are even slower due to the high latency of LLMs. But all of that is likely to change soon, thanks to Groq’s powerful new LPU (Language Processing Unit) inference engine.
Groq has taken the world by surprise. Mind you, this is not Elon Musk’s Grok, the AI model available on X (formerly Twitter). Groq’s LPU inference engine can generate a massive 500 tokens per second when running a 7B model, and around 250 tokens per second when running a 70B model. That is a far cry from OpenAI’s ChatGPT, which runs on Nvidia GPUs that deliver around 30 to 60 tokens per second.
Groq is Built by Ex-Google TPU Engineers
Groq is not an AI chatbot but an AI inference chip maker, and it’s competing against industry giants like Nvidia in the AI hardware space. The company was co-founded in 2016 by Jonathan Ross, who, while working at Google, co-founded the team that built Google’s first TPU (Tensor Processing Unit) chip for machine learning.
Later, many employees left Google’s TPU team and created Groq to build hardware for next-generation computing.
What is Groq’s LPU?
The reason Groq’s LPU inference engine is so fast compared to offerings from established players like Nvidia is that it’s built on an entirely different approach.
According to CEO Jonathan Ross, Groq first created the software stack and compiler and then designed the silicon. This software-first mindset was chosen to make performance “deterministic”, a key requirement for fast, accurate, and predictable results in AI inferencing.
As for Groq’s LPU architecture, it works much like an ASIC (application-specific integrated circuit) and is developed on a 14nm node. It’s not a general-purpose chip for all kinds of complex tasks; instead, it’s custom-designed for one specific task, which in this case is handling sequences of data in large language models. CPUs and GPUs, on the other hand, can do a lot more, but that flexibility comes with less predictable performance and higher latency.
With a tailored compiler that knows exactly how the chip’s instruction cycle works, latency is reduced significantly: the compiler assigns each instruction to exactly the right place ahead of time. On top of that, every Groq LPU chip comes with 230MB of on-die SRAM to deliver high performance and low latency with much better efficiency.
As for whether Groq chips can be used for training AI models: as mentioned above, they are purpose-built for AI inferencing. They don’t feature the high-bandwidth memory (HBM) required for training and fine-tuning models.
Groq also states that HBM leads to non-determinism in the overall system, which adds latency. So no, you can’t train AI models on Groq LPUs.
We Tested Groq’s LPU Inference Engine
You can head to Groq’s website to experience its blazing-fast performance without needing an account or subscription. It currently hosts two AI models, Llama 2 70B and Mixtral-8x7B. To check the LPU’s performance, we ran a few prompts on the Mixtral-8x7B-32K model, one of the best open-source models out there.
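For developers who would rather script such tests than use the web playground, the same models can also be reached through Groq’s API. Below is a minimal sketch using the Groq Python SDK; the mixtral-8x7b-32768 model ID and the GROQ_API_KEY environment variable are assumptions based on Groq’s documentation at the time of writing and may have changed since.

```python
# Minimal sketch: querying a Groq-hosted model through the Groq Python SDK
# and timing the request from the client side.
# Assumes `pip install groq` and an API key exported as GROQ_API_KEY;
# the model ID below may have been renamed or retired since this was written.
import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
completion = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # Mixtral-8x7B-32K as hosted by Groq
    messages=[{"role": "user", "content": "Explain what an LPU is in two sentences."}],
)
elapsed = time.perf_counter() - start

tokens = completion.usage.completion_tokens
print(completion.choices[0].message.content)
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.0f} tokens/second")
```

Keep in mind that client-side timing includes network overhead, so the tokens-per-second figure from a script like this will come out a little lower than the generation speed Groq’s own interface reports.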
Running the Mixtral-8x7B-32K model, Groq’s LPU generated output at 527 tokens per second, taking only 1.57 seconds to generate 868 tokens (3,846 characters). On the 70B model, its speed drops to 275 tokens per second, which is still far higher than the competition.
To compare Groq’s AI accelerator against the competition, we ran the same test on ChatGPT (GPT-3.5, a 175B model) and calculated the performance metrics manually. ChatGPT, which runs on Nvidia’s cutting-edge Tensor-core GPUs, generated output at 61 tokens per second, taking 9 seconds to generate 557 tokens (3,090 characters).
For a broader comparison, we ran the same test on the free version of Gemini (powered by Gemini Pro), which runs on Google’s Cloud TPU v5e accelerators. Google has not disclosed the size of the Gemini Pro model. Its speed was 56 tokens per second, taking 15 seconds to generate 845 tokens (4,428 characters).
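For transparency, the manual calculation behind these numbers is nothing more than dividing generated tokens by generation time. The snippet below reruns that arithmetic on the figures quoted above; small differences from the reported speeds come down to rounding and to how each interface measures time.

```python
# Throughput = generated tokens / generation time, using the figures from our tests above.
runs = {
    "Groq LPU (Mixtral-8x7B)": (868, 1.57),  # tokens generated, seconds taken
    "ChatGPT (GPT-3.5)": (557, 9.0),
    "Gemini (Gemini Pro)": (845, 15.0),
}

for name, (tokens, seconds) in runs.items():
    print(f"{name}: {tokens / seconds:.0f} tokens per second")
```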
As for other service providers, the ray-project team ran an extensive LLMPerf benchmark and found that Groq performed much better than the other providers it tested.
While we have not tested it ourselves, Groq says its LPUs also work with diffusion models, not just language models. According to its demo, they can generate images in different styles at 1024px in under a second, which is pretty remarkable.
Groq vs Nvidia: What Does Groq Say?
In a report, Groq says its LPUs are scalable and can be linked together over optical interconnect across 264 chips. They can be scaled further using switches, but that adds latency. According to CEO Jonathan Ross, the company is developing clusters that scale across 4,128 chips, built on Samsung’s 4nm process node and slated for release in 2025.
In a benchmark Groq performed using 576 LPUs on the 70B Llama 2 model, it completed AI inferencing in one-tenth of the time taken by a cluster of Nvidia H100 GPUs.
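That cluster size also lines up, roughly, with the 230MB of on-die SRAM per chip mentioned earlier. As a back-of-the-envelope sketch (our own estimate, not an official Groq figure), 576 LPUs hold about 132GB of SRAM in total: enough for 70 billion parameters at 8-bit precision (around 70GB) with room to spare, while FP16 weights (around 140GB) would slightly exceed it.

```python
# Back-of-the-envelope estimate (not an official Groq figure): how a 70B model's
# weights compare to the combined on-die SRAM of a 576-chip LPU cluster.
# Real deployments also need memory for activations, KV cache, and scheduling overhead.
CHIP_SRAM_GB = 0.230   # 230MB of on-die SRAM per LPU
CHIPS = 576            # cluster size from Groq's Llama 2 70B benchmark
PARAMS_BILLION = 70

total_sram_gb = CHIPS * CHIP_SRAM_GB
for precision, bytes_per_param in [("INT8", 1), ("FP16", 2)]:
    weights_gb = PARAMS_BILLION * bytes_per_param
    print(f"{precision}: ~{weights_gb} GB of weights vs ~{total_sram_gb:.0f} GB of total SRAM")
```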
Energy is another factor: Nvidia GPUs consumed 10 to 30 joules to generate the tokens in a response, whereas Groq consumed only 1 to 3 joules. In summation, the company says Groq LPUs offer 10x the speed for AI inferencing tasks at 1/10th the cost of Nvidia GPUs.
What Does It Mean For End Users?
Overall, it’s an exciting development in the AI space, and with the introduction of LPUs, users are going to experience near-instant interactions with AI systems. The significant reduction in inference time means users could interact with multimodal systems instantly, whether speaking to them, feeding them images, or generating images.
Groq is already offering API access to developers, so expect much better performance from AI models soon. So what do you think about the development of LPUs in the AI hardware space? Let us know your opinion in the comment section below.