Move Over Groq, Cerebras Now Has the World’s Fastest AI Inference

In Short
  • Cerebras' Wafer-Scale Engine has overtaken Groq to deliver the fastest AI inference.
  • Cerebras Inference clocks up to 1,800 tokens per second on the Llama 3.1 8B model and 450 tokens per second on the 70B model.
  • In comparison, Groq reaches up to 750 T/s and 250 T/s on the 8B and 70B models, respectively.

Cerebras has finally opened access to its Wafer-Scale Engine (WSE), and it’s achieving 1,800 tokens per second running inference on the Llama 3.1 8B model. On the larger Llama 3.1 70B model, Cerebras clocks up to 450 tokens per second. Until now, Groq was the fastest AI inference provider, but Cerebras has taken that crown.

Cerebras has developed its own wafer-scale processor that integrates close to 900,000 AI-optimized cores and packs 44GB of on-chip memory (SRAM). As a result, the model weights are stored directly on the chip itself, delivering far higher memory bandwidth than designs that have to fetch weights from off-chip memory. Not to mention, Cerebras is running Meta’s full 16-bit precision weights, meaning there is no compromise on accuracy.
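A quick back-of-envelope calculation shows why that 44GB of SRAM matters. At 16-bit precision, each parameter takes 2 bytes, so the 8B model fits entirely on a single wafer, while the 70B model has to be split across multiple chips. The sketch below is my own rough math, not Cerebras’ figures, and it ignores activations and KV-cache overhead:

```python
# Rough weight-memory footprint at 16-bit precision (2 bytes/param).
# Back-of-envelope math only; ignores activations and KV cache.
ON_CHIP_SRAM_GB = 44  # SRAM on a single Cerebras wafer

for name, params_billions in [("Llama 3.1 8B", 8), ("Llama 3.1 70B", 70)]:
    weights_gb = params_billions * 2  # billions of params * 2 bytes = GB
    wafers = -(-weights_gb // ON_CHIP_SRAM_GB)  # ceiling division
    print(f"{name}: ~{weights_gb} GB of weights "
          f"-> needs at least {wafers} wafer(s) of SRAM")
```

That lines up with the 8B model’s eye-popping numbers: at ~16 GB, the whole model lives in SRAM, so no time is wasted shuttling weights in from external memory.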

I tested Cerebras’ claim, and it generated responses at a breakneck pace. Running the smaller Llama 3.1 8B model, it achieved a speed of 1,830 tokens per second, and on the 70B model, Cerebras managed 446 tokens per second. In comparison, Groq pulled 750 T/s and 250 T/s on the 8B and 70B models, respectively.
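If you want to reproduce a throughput measurement yourself, Cerebras exposes an OpenAI-compatible API, so a short streaming script is enough. Treat the base URL and the model ID `llama3.1-8b` below as assumptions and confirm them against Cerebras’ current documentation; counting streamed chunks is also only a rough proxy for tokens:

```python
# Minimal sketch: measure approximate tokens/second from a streamed
# chat completion. Base URL and model ID are assumptions; check the docs.
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_CEREBRAS_API_KEY",
)

start = time.perf_counter()
tokens = 0
stream = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model ID
    messages=[{"role": "user", "content": "Explain wafer-scale chips in 300 words."}],
    stream=True,
)
for chunk in stream:
    # Each streamed chunk carries roughly one token of output.
    if chunk.choices and chunk.choices[0].delta.content:
        tokens += 1
elapsed = time.perf_counter() - start
print(f"~{tokens / elapsed:.0f} tokens/s over {tokens} tokens")
```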

Artificial Analysis independently reviewed Cerebras’ WSE and found that it does deliver unparalleled speed at AI inference. You can head over to Cerebras Inference and try it out yourself.
