IBM Releases Compact Vision AI Model for Enterprise Documents
IBM has released Granite 4.0 3B Vision, a compact multimodal AI model designed to extract structured data from business documents, charts and forms, the company announced on Hugging Face.
The 3-billion-parameter model, released under the permissive Apache 2.0 license, is engineered for enterprise document processing tasks including table extraction, chart-to-data conversion and key-value pair extraction from forms and invoices, according to the model’s technical blog post published March 31.
The release marks IBM’s latest push into open-source enterprise AI, positioning the Armonk, N.Y.-based company against proprietary document AI offerings from OpenAI, Anthropic and Google while keeping computational costs low enough for widespread corporate deployment.
Technical Approach
Rather than building a standalone vision model, IBM implemented Granite 4.0 3B Vision as a modular LoRA adapter on top of its existing Granite 4.0 Micro language model, according to the Hugging Face blog post. The approach lets organizations serve both multimodal and text-only workloads from a single deployment, with automatic fallback between the two modes.
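The single-deployment design can be sketched as a simple routing decision: requests that carry an image go through the base weights plus the LoRA vision adapter, while text-only requests fall back to the base language model. The function and mode names below are illustrative assumptions, not IBM's API.

```python
# Illustrative sketch of adapter routing with text-only fallback.
# All names here are hypothetical; IBM's actual serving logic is not public.

def route_request(prompt: str, image=None, adapter_available: bool = True) -> str:
    """Return which processing path a request would take."""
    if image is not None and adapter_available:
        return "vision"  # base model weights + LoRA vision adapter
    return "text"        # fallback: text-only base model (Granite 4.0 Micro)

# One deployment serves both kinds of request:
route_request("Summarize this contract.")                 # text-only path
route_request("Extract the table.", image=b"\x89PNG...")  # multimodal path
```

The point of the design is operational: because the adapter is additive, a text-only request costs no more than running the base model alone.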
The model uses what IBM calls a “DeepStack Injection Architecture” that routes abstract visual features to earlier processing layers for semantic understanding while sending high-resolution spatial features to later layers for detail preservation, the company said.
Benchmark Results
In chart understanding tasks, the model scored 86.4% on the Chart2Summary benchmark, the highest among all evaluated models, and 62.1% on Chart2CSV, trailing only the substantially larger Qwen3.5-9B model at 63.4%, according to IBM’s published benchmarks.
For table extraction, the model achieved a 92.1 TEDS (tree edit distance similarity) score on cropped images from the PubTables-v2 benchmark. On semantic key-value pair extraction, it posted 85.5% exact-match accuracy in zero-shot testing across 1,777 U.S. government forms with flat, nested and tabular structures, according to the blog post.
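Exact-match accuracy is a strict metric: a predicted field counts only if it matches the reference value character for character. A minimal sketch of how such a score could be computed over extracted key-value pairs (the field names and values are invented examples, not IBM's evaluation data):

```python
def exact_match_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    correct = sum(1 for key, value in gold.items() if predicted.get(key) == value)
    return correct / len(gold)

# Example: the date field differs, so 2 of 3 fields count as correct.
pred = {"name": "ACME Corp", "total": "$120.00", "date": "2024-01-05"}
gold = {"name": "ACME Corp", "total": "$120.00", "date": "2024-01-06"}
score = exact_match_accuracy(pred, gold)
```

Under this metric, near-misses such as "$120" versus "$120.00" score zero, which is why zero-shot exact-match numbers on free-form documents tend to run well below human-judged correctness.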
Training Data
IBM trained the model’s chart capabilities using ChartNet, a new dataset comprising 1.7 million chart samples spanning 24 chart types across six plotting libraries, the company said. Each sample includes plotting code, a rendered image, a data table, a natural language summary and question-answer pairs. The dataset is also available on Hugging Face.
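Based on the blog post's description, each ChartNet sample bundles the same chart in five aligned representations. A sketch of what one record could look like (the field names and values are assumptions for illustration, not the published dataset schema):

```python
# Hypothetical ChartNet record, mirroring the five components the post lists:
# plotting code, rendered image, data table, natural-language summary, QA pairs.
sample = {
    "chart_type": "bar",          # one of the 24 chart types
    "library": "matplotlib",      # one of the six plotting libraries
    "plotting_code": "plt.bar(years, sales)",
    "image": "chartnet_000001.png",
    "data_table": [["year", "sales"], ["2023", 410], ["2024", 495]],
    "summary": "Sales rose from 410 in 2023 to 495 in 2024.",
    "qa_pairs": [{"q": "Which year had higher sales?", "a": "2024"}],
}
```

Pairing each rendered image with its underlying code and data table is what lets the model learn bidirectional mappings such as chart-to-CSV and chart-to-summary from the same sample.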
A related research paper has been accepted at CVPR 2026, according to the blog post.
Enterprise Integration
The model supports two deployment modes: stand-alone image understanding for task-specific tools such as form parsers and chart analyzers, and integration with IBM’s Docling pipeline for end-to-end processing of multi-page PDF documents, according to the technical documentation.
Target use cases include form processing for invoices and receipts, financial report analysis with chart-to-CSV conversion, and research intelligence applications involving academic PDF parsing, the company said.
Market Context
The release reflects a growing enterprise AI trend toward smaller, specialized models that can handle specific business tasks at lower cost than general-purpose large language models. At 3 billion parameters, Granite 4.0 3B Vision is a fraction of the size of frontier models from OpenAI, Anthropic and Google, potentially making it attractive for organizations seeking to process documents at scale without the infrastructure costs associated with larger systems.
The model joins IBM’s broader Granite 4.0 family, which includes the Granite 4.0 Micro text model and a 1-billion-parameter speech model released in March.