I Tried Out an Open-source Multimodal LLM, And It Failed to Impress Me

A group of computer scientists from several universities has released an open-source multimodal LLM called LLaVA, and I stumbled upon it while scrolling through Twitter last week. Like GPT-4, this LLM can process both text and image inputs. The project pairs a general-purpose LLM with an image encoder to create a Large Language and Vision Assistant model. Since the touted features looked promising, I decided to test-run this large language model to see how accurate and reliable it is, and what we can expect from GPT-4's upcoming multimodal capabilities (especially on the visual side). On that note, let's go ahead and explore LLaVA.

What is LLaVA, a Multimodal Language Model?

LLaVA (Large Language-and-Vision Assistant) is a multimodal LLM, similar to OpenAI's GPT-4, that can handle both text and image inputs. While OpenAI has not yet opened up GPT-4's image-processing ability to the public, this open-source project already offers it by attaching a vision encoder to a language model.

Developed by computer scientists at the University of Wisconsin-Madison, Microsoft Research, and Columbia University, the project aims to demonstrate how a multimodal model would work and compare its capability with GPT-4.

It uses Vicuna as the underlying large language model (LLM) and OpenAI's CLIP ViT-L/14 as the visual encoder. The team generated high-quality multimodal instruction-following data with GPT-4, which translates into strong results: the project reports 92.53% accuracy on the ScienceQA benchmark.
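To make the architecture concrete, here is a minimal sketch (my own, not the authors' code) of the LLaVA idea using Hugging Face's transformers library: encode an image with CLIP ViT-L/14, then linearly project the patch features into the language model's embedding space so they can be fed to Vicuna as "visual tokens". The image path and the 5120-dimensional target (the hidden size of a 13B LLaMA-family model like Vicuna-13B) are my assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPVisionModel, CLIPImageProcessor

# Vision encoder used by LLaVA: OpenAI's CLIP ViT-L/14.
vision_tower = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Assumed target dimension: hidden size of a 13B LLaMA-family model (Vicuna-13B).
llm_hidden_size = 5120
projector = torch.nn.Linear(vision_tower.config.hidden_size, llm_hidden_size)

image = Image.open("example.jpg")  # placeholder image path
pixels = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    # Per-patch features from the vision transformer (drop the CLS token).
    patch_features = vision_tower(pixels).last_hidden_state[:, 1:, :]
    # Project into the LLM's embedding space; in LLaVA, these "visual tokens"
    # are prepended to the text embeddings of the prompt.
    visual_tokens = projector(patch_features)

print(visual_tokens.shape)  # (1, 256, 5120) for a 224x224 input with 14x14 patches
```

In LLaVA's training recipe, a projection like this is first trained on image-text data while the vision encoder stays frozen, which is part of what keeps the approach relatively cheap to reproduce.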

Apart from that, it has been fine-tuned on general-purpose visual chat and reasoning data, particularly from the science domain. Overall, LLaVA looked like a promising starting point for open-source multimodal models, and I was quite excited to test it out.

How to Use LLaVA’s Vision Assistant Right Now

1. To use LLaVA, you can head over to llava.hliu.cc and check out the demo. It uses the LLaVA-13B-v1 model right now.

2. Simply add an image in the top-left corner and select “Crop”. Make sure to add square images for the best output.

3. Now, add your question at the bottom and hit “Submit”. The LLM will then study the image and explain everything in detail. You can also ask follow-up questions about the image you upload.
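If you'd rather script the demo than click through it, Gradio apps like this one can usually be reached with the gradio_client package. The endpoint name and argument order below are assumptions on my part; client.view_api() reports the real ones, so check its output first.

```python
# Rough sketch of querying a Gradio demo such as llava.hliu.cc programmatically.
from gradio_client import Client

client = Client("https://llava.hliu.cc/")
client.view_api()  # prints the demo's actual endpoints and expected parameters

# Hypothetical call shape; adjust to whatever view_api() reports.
# result = client.predict(
#     "What is shown in this image?",  # text prompt
#     "photo.jpg",                     # path to a local image
#     api_name="/predict",
# )
# print(result)
```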

Multimodal LLM with Visual Capabilities: First Impressions

To check out LLaVA's vision capability, I started with some basic examples. I uploaded a painting and asked LLaVA to identify it, and it answered correctly. I also asked some follow-up questions, and it handled those well too.

In another example, I uploaded an image of food items and asked what kind of breakfast I could make with them and what the total calorie intake would be. It identified each item correctly and came up with recipe ideas and a rough calorie count. Though the recipes weren't very detailed, the multimodal LLM did suggest ways to incorporate the three food items into a single meal.

Then, I uploaded an image of a handwritten note asking it to write a Python script for the bubble sort algorithm. But it failed to recognize the text on the paper, let alone produce working code. Next, I added a simple handwritten math question and asked for the value of x, but again, it gave a wrong answer.
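For reference, the handwritten note was asking for something this simple; a plain bubble sort is all I was hoping to get back:

```python
# The kind of script the handwritten note asked for: a standard bubble sort.
def bubble_sort(items):
    """Sort a list in place by repeatedly swapping adjacent out-of-order pairs."""
    n = len(items)
    for i in range(n - 1):
        swapped = False
        for j in range(n - 1 - i):
            if items[j] > items[j + 1]:
                items[j], items[j + 1] = items[j + 1], items[j]
                swapped = True
        if not swapped:  # already sorted; stop early
            break
    return items

print(bubble_sort([5, 1, 4, 2, 8]))  # [1, 2, 4, 5, 8]
```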

To probe further, I added another math question, this time typed rather than handwritten in case my handwriting was the problem. Again, it hallucinated: it made up an equation of its own and gave a wrong answer. My understanding is that LLaVA doesn't perform OCR; the CLIP encoder turns the pixels into general visual features and matches them against concepts seen during training, which isn't enough to actually read text or symbols. Whether the note was handwritten or typed, the LLaVA model failed miserably at math questions.
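To illustrate what that kind of matching looks like (my own illustration, not LLaVA's code), here is CLIP scoring an image against a few candidate descriptions with the same ViT-L/14 checkpoint. It can rank whole-image concepts, but nothing in this pipeline transcribes characters, which would explain the guessed equations. The image path and candidate captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("handwritten_equation.jpg")  # placeholder path
candidates = ["a handwritten math equation", "a page of printed text", "a cat"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1)

# CLIP can tell the image *is* an equation, but it never reads the symbols,
# which is consistent with LLaVA guessing rather than solving.
for text, p in zip(candidates, probs[0].tolist()):
    print(f"{p:.2f}  {text}")
```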

Moving on, I asked it to explain a New Yorker cartoon and why it is funny, but it failed to grasp the reason behind the humor and simply described the scene. When I pointed out the gender aspect of the image (the source of the humor), the multimodal LLM finally understood the assignment and answered correctly.

Finally, I asked LLaVA to examine a medical report, but again, it hallucinated and gave an incorrect summary. Despite repeated attempts, it couldn’t find relevant data in the uploaded picture.

LLaVA Needs a Lot of Improvements

To sum up, it's still very early for the open-source space to produce a capable multimodal LLM. In the absence of a powerful foundational vision-language model, the open-source community may keep trailing the proprietary ones. Meta has released a number of open-source models, but it has not released any vision model for the community to build on, except Segment Anything, which is not applicable here.

Meanwhile, Google released PaLM-E, an embodied multimodal language model, in March 2023, and OpenAI has already demonstrated GPT-4's multimodal capabilities at launch. When asked what is funny about an image of a VGA connector plugged into a phone's charging port, GPT-4 called out the absurdity with clinical precision. In another demonstration during the GPT-4 developer livestream, OpenAI's multimodal model quickly created a functional website from a layout scribbled on a piece of paper.

Simply put, from what I have tested so far, it seems the open-source space will take much longer to catch up with OpenAI in the vision-language domain. Of course, with more progress, development, and innovation, things will get better. But for now, I am eagerly waiting to test out GPT-4's multimodal capabilities.
