Meta’s New Spirit LM Open-Source Model Can Mimic Human Expressions

Meta launches Spirit LM
Image Courtesy: Shutterstock
In Short
  • Meta's Spirit LM is a multimodal model that freely mixes speech and text, capturing human vocal expression better than ASR-based pipelines.
  • We get to hear some generation samples where the model uses tone and pitch to mimic those human expressions.
  • It's similar to how Google's Notebook LM's AI hosts express their opinions.

Multimodality for AI chatbots is definitely the new big thing, and we’ve already lost count of the number of such models that show up on GitHub every now and then. Now, Meta AI, in line with its open-source approach, has launched the new Spirit LM model in an attempt to address some multimodal challenges. And, from the looks of it, it’s quite impressive.

Currently, you can go wild with ChatGPT’s Advanced Voice Mode and get some pretty expressive human-like responses out of it. You have probably come across those viral videos of ChatGPT flirting with humans better than you ever could.

https://twitter.com/AIatMeta/status/1847383580269510670

While it’s still not quite where we expected it to be, it’s better than what Gemini Live can do right now. Well, it turns out Meta has been silently making observations, and Spirit LM is meant to take things up a notch and offer more natural-sounding speech.

As per Meta, Spirit LM is based on a “7B pretrained text language model.” Meta also notes in its X post that most of the multimodal AI models that exist right now use ASR (Automatic Speech Recognition) to identify voice inputs and convert them to text. However, according to Meta, this results in the AI losing a whole lot of expression. So, Meta notes:

Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification.
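To see why that matters, here is a minimal, purely illustrative Python sketch of the difference between the two approaches. The token names and structure are assumptions for illustration only, not Meta's actual vocabulary: the point is that an ASR step keeps only the words, while a Spirit-LM-style interleaved stream retains prosody tokens alongside them.

```python
# Illustrative sketch (not Meta's actual pipeline or token set):
# an utterance represented as a stream of (kind, value) tokens.

def asr_pipeline(utterance):
    """Classic ASR -> text step: keep only word tokens, drop prosody."""
    return [tok for tok in utterance if tok[0] == "word"]

def expressive_stream(utterance):
    """Spirit-LM-style idea: model the full interleaved token stream."""
    return list(utterance)  # nothing is discarded

# Hypothetical token stream mixing words with pitch/style markers.
utterance = [
    ("word", "hello"),
    ("pitch", "rising"),
    ("style", "excited"),
    ("word", "there"),
    ("pitch", "falling"),
]

print(asr_pipeline(utterance))       # only the two word tokens survive
print(expressive_stream(utterance))  # all five tokens are kept
```

The ASR route collapses the stream to plain text, so "excited" and the rising pitch are gone before the language model ever sees the input; modeling the tokens directly is what lets the output carry expression back out.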

The official Spirit LM release page details the research (PDF warning) that went into making Spirit LM see the light of day. At the bottom, there are some generation samples that give us an idea of what to expect.

Meta Spirit LM mechanism
Image Courtesy: Meta

From the sound of it, Spirit LM certainly does a good job of landing those vocal modulations by using tone and pitch tokens well. In fact, it’s very similar to how Google’s Notebook LM’s AI hosts run their surprisingly impressive show.

Meta’s Spirit LM is out for developers and researchers to try out and build upon. We have already dropped an access request, and hopefully, we’ll get to try out the tool soon enough. When we do, you know where to find us.

It will also be exciting to see it integrated into Meta AI, letting users easily access it and have hilarious and insightful conversations right within WhatsApp, Instagram, and Facebook. And it most likely will be, given the demonstration we got to see from Meta at Connect 2024.

Meanwhile, there’s no denying that we’re looking at a future where AI models that are more expressive than Jarvis will be surrounding and helping us get through our daily chores. Scarily exciting, isn’t it?

What do you think about Meta’s new Spirit LM? Cry your heart out in the comments down below!
