I Tried Out Gemini Live; It Can’t Compete with ChatGPT Advanced Voice Mode

In Short
  • I finally got access to Gemini Live and tested it by having a conversation with the virtual AI assistant on a variety of topics.
  • Gemini Live is good for casual conversations, but it doesn't bring a truly multimodal experience like ChatGPT's Advanced Voice Mode.
  • Gemini Live relies on a text-to-speech engine to convert its written responses into spoken words. It can't understand emotions, the speaker's mood, or speech intonation.

After months of waiting, Gemini Live is finally here. You can start using Gemini Live on your Android phone right away, assuming you have subscribed to Gemini Advanced. I tested Gemini Live on my OnePlus phone, and it did not come across as revolutionary as the demos at Google I/O 2024 suggested.

For one, Gemini Live currently does not support other modalities like images or real-time camera input, which were showcased with Project Astra. Right now, it only supports free-flowing audio conversations that work for the most part, though there are some fundamental issues with how the feature has been implemented. We will come to that later on. Let’s first go through our interactions with Gemini Live.

Putting Gemini Live Through Its Paces

Interruptions and Multilingual Capability

First things first, Gemini Live supports interruptions and it works fairly well except in some situations where it continues to blabber on despite you having interrupted it. You can also talk to Gemini Live in the background, even when your phone is locked.

Further, it can converse in multiple languages, freely switching from one language to another. I’ve tried talking to Gemini in English, Hindi, and Bengali, and it performed pretty well. You can check out the demo below:

Prepping for a Job Interview

I started my conversation with Gemini Live by asking it to help me prepare for a job interview as a developer in the AI field. It asked me whether I would be doing research work or working on the application side of things. Once I told Gemini Live that I would be working on a web app to generate AI images, it gave me a list of languages and frameworks, such as Python, PyTorch, and TensorFlow Lite, that I should be comfortable with.

It also gave me suggestions to brush up on Diffusion models since it’s the hot new thing for image generation in the AI field. Overall, I had a good conversation with Gemini Live, and it does give you several useful suggestions.

That said, it often gives broad, generic suggestions without delving deep into the topic’s technicalities. At times, you may have to steer Gemini to go beyond surface-level talk. Hello, I’m the Editor chiming in to share a bit of my experience. I tried talking with Gemini about the MHA and JJK manga and their spoilers, and I witnessed the same results. Gemini Live used the same surface-level, repetitive sentences to describe the context of the story.

Private Messaging Apps

Next, to test hallucination, I dived deep into the world of privacy and asked which messaging app is the best for talking to anonymous sources. It recommended Signal and told me to avoid WhatsApp since it is owned by Facebook. I then asked why Signal is better than WhatsApp when both apps use the same end-to-end encryption protocol.

Gemini replied, “Facebook is known for collecting a lot of data” and “they want to make more money by showing ads.” I further asked who developed the end-to-end encryption protocol, and it correctly answered Moxie Marlinspike, the man behind the Signal protocol.

Further, I told Gemini that there had recently been a security issue with Signal and asked it to find out what it was. It quickly browsed the internet and came back with the report, saying “there was a vulnerability in the desktop version of Signal that could let someone snoop on your files” but “it’s been fixed.” Up to this point, I didn’t find Gemini hallucinating on key facts.

Minecraft and Hallucination

However, on many other topics, Gemini Live kept hallucinating. Hello, it’s the Editor again. For me, Gemini hallucinated the most when talking about Minecraft and manga. When I inquired about the final MHA manga chapter (the good thing is it knew MHA is My Hero Academia), it told me it was Chapter 382, which is incorrect. I asked it to search the Internet to double-check, and it came back with the correct answer.

We cover Minecraft regularly at Beebom, so I was curious to see what Gemini knew about it. I first asked, “What was the last Minecraft update?” To this, Gemini Live replied with The Wild update, which came out in June 2022. So much for keeping up with the times. I then asked the AI for the version number, to which it responded correctly.

minecraft hallucinations - Gemini

But at the end of the response, it added something that puzzled me. It said the latest version right now is 1.20 (what!? Gemini, you just said the last big Minecraft update is 1.19). I then asked a bit more about Minecraft 1.20, and the release date Gemini gave me was incorrect yet again! The feature details were fine, but the dates were usually imprecise.

I wasn’t done yet. I knew the Minecraft 1.21 update came out recently in June, so I asked Gemini about it. The response baffled me even further. According to Gemini, Minecraft 1.21 wasn’t out yet. This quickly made me ask Gemini about its knowledge cut-off, and it hit me back with September 2023. But Gemini has access to the Internet, doesn’t it? Can’t it Google my question? I’m not sure what went wrong there, but I had to ask the assistant to double-check on the Internet to get the correct answer.

Role-playing

To test role-playing, I asked Gemini Live to act like an English butler. Initially, it spoke with a refined and formal tone, addressing me as ‘Sir’, but it quickly forgot its role and kept going back to its original self. I had to remind it multiple times not to forget the role. Not to mention, it can’t do accents yet… so it wasn’t really an English butler after all.

Gemini performed poorly on instruction following in our earlier tests, and it’s the same with Gemini Live. It forgets the role and context easily once you move to another topic in the same chat session.

Finding Information

I asked Gemini Live to find me restaurants where I can have the best biryani in Kolkata. And it said I should consider Arsalan or Karim’s, which are indeed popular outlets for biryani. I further asked Gemini Live to find shops to get my laptop repaired, and it responded with a few legit names. For finding information, Gemini Live did a good enough job.

Gemini Live vs ChatGPT Advanced Voice Mode

Taking inspiration from Cristiano Giardina’s ChatGPT Advanced Voice Mode demo, I asked Gemini Live to count from 1 to 10 extremely fast, but it kept counting at a standard pace because its responses are read out by a text-to-speech engine.

In the example X post, ChatGPT’s Advanced Voice Mode, on the other hand, stopped to catch a breath like a human while counting! That’s the kind of truly multimodal experience you get when speech is natively processed as both input and output.

Furthermore, Gemini Live couldn’t repeat tongue twisters without pausing in between, something ChatGPT’s Advanced Voice Mode does miraculously well. After that, I asked Gemini Live to talk to me in David Attenborough’s accent, but again, it can’t do accents yet as I have mentioned above.

To conclude, Gemini Live is good for casual conversations, but it’s not groundbreaking at all. We will have to wait for Google to unlock native input/output capability on Gemini Live to truly match ChatGPT’s Advanced Voice Mode.

Hey Google, Where are the Emotions in Gemini Live?

When Gemini launched in December 2023, Google announced that it is a truly native multimodal model. For audio processing, that means Gemini can identify the tone and tenor of speech, recognize pronunciation, and detect whether the speaker is happy, sad, or excited by processing the raw audio signals natively.

In the native input and output method, the raw speech is tokenized and processed directly by the multimodal model. The traditional approach, by contrast, routes speech through intermediate layers: a speech-to-text (STT) engine first transcribes it into text, losing all the nuances of the speech; a language model then generates a text response; and finally, the output is relayed through a text-to-speech (TTS) engine.

This traditional approach doesn’t take advantage of native end-to-end multimodal capabilities like understanding speech intonation, expressions, mood, etc. Besides that, it leads to more latency and the conversation feels more robotic than natural.
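To make the difference concrete, here is a minimal Python sketch of the two pipelines. This is not Google’s or OpenAI’s actual implementation; every component passed in (stt, llm, tts, and the audio tokenizer/decoder) is a hypothetical placeholder.

```python
# A rough sketch of the two audio pipelines discussed above. The component
# functions (stt, llm, tts, tokenize_audio, multimodal_model, decode_audio)
# are hypothetical placeholders, not real Gemini or OpenAI APIs.

def cascaded_pipeline(audio_input, stt, llm, tts):
    """Traditional STT -> LLM -> TTS chain (what Gemini Live appears to use)."""
    text_in = stt(audio_input)   # transcription: tone, mood, and pacing are lost here
    text_out = llm(text_in)      # the language model only ever sees plain text
    return tts(text_out)         # the reply is spoken at the TTS engine's fixed pace


def native_pipeline(audio_input, tokenize_audio, multimodal_model, decode_audio):
    """End-to-end multimodal flow (the ChatGPT Advanced Voice Mode approach)."""
    audio_tokens = tokenize_audio(audio_input)     # raw speech is tokenized directly
    reply_tokens = multimodal_model(audio_tokens)  # the model can "hear" emotion and intonation
    return decode_audio(reply_tokens)              # output speech can vary pace, accent, emphasis
```

Because the cascaded pipeline only ever passes plain text between stages, anything that can’t be written down, like whispering, breathlessness, or counting quickly, never reaches the model in the first place.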

With Gemini Live, we expected Google would bring a native input/output multimodal experience like ChatGPT’s Advanced Voice Mode. However, it’s pretty clear that Gemini Live is still using the traditional approach for processing audio.

To prove this, I asked Gemini Live to identify the sound of an animal, and it responded that it can’t process sound yet. Next, Gemini Live couldn’t identify whether I was happy or sad by processing my raw speech.

Ultimately, to sum up my experience, the current implementation of Gemini Live is not equipped to deliver truly natural conversations. At this point, Gemini Live simply feels like a glorified TTS engine backed by an LLM: Gemini 1.5 Flash generates a quick text response, while STT and TTS engines handle the audio on either end. A slightly disappointing experience, to say the least.

Have you got access to Gemini Live? How have your conversations been with Gemini? Share your experience with us and our readers in the comments section.
