The integration of visual and textual data in artificial intelligence presents complex challenges. Traditional models struggle to accurately interpret structured visual documents such as tables, charts, infographics, and diagrams. This limitation hampers automated content extraction and understanding, which matters for applications in data analysis, information retrieval, and decision making. As organizations increasingly rely on AI-driven insights, demand for models that can effectively process both visual and textual information has grown dramatically.
IBM has addressed this challenge with the release of Granite-Vision-3.1-2B, a compact vision-language model designed for document understanding. The model can extract content from a variety of visual formats, including tables, charts, and diagrams. Trained on a well-curated dataset of both public and synthetic sources, it is built to handle a wide range of document-related tasks. Fine-tuned from a Granite large language model, Granite-Vision-3.1-2B integrates image and text modalities to improve interpretation capabilities, making it suitable for a variety of practical applications.
The model consists of three key components:
Vision Encoder: a SigLIP encoder that efficiently processes and encodes visual data.
Vision-Language Connector: a two-layer multilayer perceptron (MLP) with GELU activation, designed to bridge visual and textual representations.
Large Language Model: built on Granite-3.1-2B-Instruct, with a 128K context length to handle long and complex inputs.
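To make the vision-language connector concrete, here is a minimal NumPy sketch of a two-layer MLP with GELU activation that projects vision-encoder features into a language model's embedding space. The dimensions below are illustrative placeholders, not the model's actual sizes.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class VisionLanguageConnector:
    """Two-layer MLP that maps vision features to text-embedding space."""

    def __init__(self, vision_dim, hidden_dim, text_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((vision_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, text_dim)) * 0.02
        self.b2 = np.zeros(text_dim)

    def __call__(self, vision_features):
        # vision_features: (num_patches, vision_dim)
        h = gelu(vision_features @ self.w1 + self.b1)
        return h @ self.w2 + self.b2  # (num_patches, text_dim)

# Toy dimensions for illustration only
connector = VisionLanguageConnector(vision_dim=1152, hidden_dim=4096, text_dim=2048)
patch_features = np.random.default_rng(1).standard_normal((729, 1152))
tokens = connector(patch_features)
print(tokens.shape)  # (729, 2048)
```

In the real model the projected patch tokens are concatenated with the text tokens and fed to the language model; this sketch only shows the projection step.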
The training process builds on LLaVA and incorporates multi-layer encoder features along with denser grid resolutions. These enhancements improve the model's ability to understand detailed visual content. The architecture allows the model to perform a variety of visual document tasks, including analyzing tables and charts, performing optical character recognition (OCR), and answering document-based queries.
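Denser grid resolutions work by splitting a high-resolution page into fixed-size tiles that are each encoded separately, so fine print and small chart labels survive downscaling. The following is a simplified, assumed sketch of such tiling (the actual grid logic and tile size in Granite-Vision may differ):

```python
import numpy as np

def split_into_tiles(image, tile_size):
    """Split an image array (H, W, C) into a grid of equal tiles,
    zero-padding the bottom/right edges so every tile is full-size."""
    h, w, c = image.shape
    rows = -(-h // tile_size)  # ceiling division
    cols = -(-w // tile_size)
    padded = np.zeros((rows * tile_size, cols * tile_size, c), dtype=image.dtype)
    padded[:h, :w] = image
    tiles = []
    for r in range(rows):
        for col in range(cols):
            tiles.append(padded[r * tile_size:(r + 1) * tile_size,
                                col * tile_size:(col + 1) * tile_size])
    return tiles, (rows, cols)

# A 700x1000 "document page" split into 384-pixel tiles
page = np.ones((700, 1000, 3), dtype=np.uint8)
tiles, grid = split_into_tiles(page, 384)
print(grid, len(tiles), tiles[0].shape)  # (2, 3) 6 (384, 384, 3)
```

Each tile would then pass through the vision encoder, and a low-resolution thumbnail of the full page is typically encoded alongside the tiles to preserve global layout.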

Evaluations show that Granite-Vision-3.1-2B performs well across multiple benchmarks, particularly in document understanding. For example, it achieved a score of 0.86 on the ChartQA benchmark, surpassing other models in the 1B-4B parameter range. On the TextVQA benchmark it scored 0.76, demonstrating strong performance in interpreting and answering questions based on textual information embedded in images. These results highlight the model's potential for enterprise applications that require precise processing of both visual and textual data.
IBM’s Granite-Vision-3.1-2B represents a notable advance in vision-language models, providing a balanced approach to visual document understanding. Its architecture and training methodology allow for efficient interpretation and analysis of complex visual and textual data. With native support for transformers and vLLM, the model can be adapted to a variety of use cases and deployed in cloud-based environments such as Colab on a T4 GPU. This accessibility makes it a practical tool for researchers and practitioners looking to enhance AI-driven document processing.
See IBM-Granite/Granite-Vision-3.1-2B-Preview and IBM-Granite/Granite-3.1-2B-Instruct. All credit for this research goes to the researchers of this project.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform distinguished by its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understood by a wide audience. The platform receives over 2 million views each month, reflecting its popularity among readers.