- At Microsoft Build 2024, the company released an open-source multimodal model, Phi-3 Vision.
- Phi-3 Vision has 4.2B parameters and a context length of 128K tokens.
- Microsoft also released Phi-3 Small (7B) and Phi-3 Medium (14B) models at the event.
Earlier in April, Microsoft released the first AI model in its open-source Phi-3 family: Phi-3 Mini. Now, almost a month later, the Redmond giant has released a small multimodal model called Phi-3 Vision. At Build 2024, Microsoft also unveiled two more Phi-3 family models: Phi-3 Small (7B) and Phi-3 Medium (14B). All of these models are open-source under the MIT license.
As for the Phi-3 Vision model, it has 4.2 billion parameters, which makes it fairly lightweight. It has a context length of 128K tokens and accepts images as input alongside text. This is the first time a mega-corporation like Microsoft has open-sourced a multimodal model meant for conversational use: Google did release the PaliGemma model, but it’s not meant for conversational use.
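To put "fairly lightweight" in perspective, here is a rough back-of-the-envelope sketch of how much memory the weights of a 4.2-billion-parameter model would need at different numeric precisions. The precisions listed are illustrative assumptions, not formats Microsoft has confirmed for Phi-3 Vision, and the estimate covers weights only, ignoring activations, the attention cache for long contexts, and runtime overhead.

```python
# Rough weight-only memory estimate for a model with a given parameter count.
# Assumption: every parameter is stored at the same precision; activations,
# the KV cache, and framework overhead are NOT included.

def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Return the approximate size of the weights in GiB."""
    return num_params * bytes_per_param / (1024 ** 3)

# Parameter count reported for Phi-3 Vision in the article.
PHI3_VISION_PARAMS = 4.2e9

# Illustrative precisions (hypothetical deployment choices).
PRECISIONS = [
    ("32-bit float", 4.0),
    ("16-bit float", 2.0),
    ("8-bit integer", 1.0),
    ("4-bit quantized", 0.5),
]

if __name__ == "__main__":
    for name, nbytes in PRECISIONS:
        size = weight_memory_gib(PHI3_VISION_PARAMS, nbytes)
        print(f"{name}: ~{size:.1f} GiB")
```

At 16-bit precision this works out to roughly 7.8 GiB for the weights alone, which is why a model of this size can plausibly fit on a single consumer GPU, while actual memory use will be higher once images and long prompts are processed.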
Apart from that, Microsoft says that the Phi-3 Vision model was trained on publicly available, high-quality educational and code data. Microsoft has also generated synthetic data for math, reasoning, general knowledge, charts, tables, diagrams, and slides.
Despite its small size, the Phi-3 Vision model performs better than Claude 3 Haiku, LLaVA, and Gemini 1.0 Pro on many multimodal benchmarks. It even comes pretty close to OpenAI’s GPT-4V model. Microsoft says that developers can use the Phi-3 Vision model for OCR, chart and table understanding, general image understanding, and more.
If you want to check out the Phi-3 Vision model, head over to Azure AI Studio.