TII Releases Falcon Perception, Open-Source Vision-Language Model

ABU DHABI, UAE — The Technology Innovation Institute on Thursday released Falcon Perception, a compact open-source vision-language model that brings object segmentation and visual understanding capabilities to the established Falcon model family, according to the Hugging Face blog (https://huggingface.co/blog/tiiuae/falcon-perception).

The 0.6-billion-parameter model uses a single early-fusion Transformer architecture to perform open-vocabulary grounding and instance segmentation from natural-language prompts — allowing users to describe objects in plain text and receive precise pixel-level masks in response. A companion model, Falcon OCR, ships at 0.3 billion parameters and handles document parsing, table extraction and handwriting recognition.

Falcon Perception outperformed Meta’s SAM 3 on the SA-Co open-vocabulary segmentation benchmark, scoring 68.0 macro-F1 compared to SAM 3’s 62.3, according to results published alongside the release. The performance gap widened on more complex tasks: the model scored 21.9 points higher on spatial reasoning queries and 15.8 points higher on relational understanding, according to results from PBench, a new diagnostic benchmark released concurrently by the TII team.
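
For readers unfamiliar with the headline metric: the reported scores put the SA-Co gap at 5.7 points (68.0 versus 62.3). In its common form, macro-F1 averages per-category F1 scores so that rare categories count as much as frequent ones. The sketch below shows that generic calculation only. The category names and counts are invented for illustration, and the exact SA-Co evaluation protocol is not described in this article and may differ.

```python
# Generic macro-F1 sketch. The counts below are hypothetical and are NOT
# taken from SA-Co; the benchmark's own protocol may differ in detail.

def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-category F1 scores."""
    return sum(f1_score(*c) for c in counts.values()) / len(counts)

# Hypothetical per-category (tp, fp, fn) counts.
example_counts = {
    "dog": (8, 2, 2),    # F1 = 16 / 20 = 0.8
    "chair": (3, 1, 5),  # F1 = 6 / 12 = 0.5
}

score = macro_f1(example_counts)
print(f"macro-F1 = {score:.2f}")
```

Because each category contributes equally to the mean, a model that does well only on common objects is penalized more heavily than under a pooled (micro-averaged) score.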

Falcon OCR also posted competitive results, scoring 88.64 on OmniDocBench, ahead of DeepSeek OCR v2, GPT 5.2 and Mistral OCR 3, the researchers reported. On an A100-80GB GPU, it processes 5,825 tokens per second, making it roughly three times more efficient than comparable 0.9-billion-parameter systems, according to the researchers.

Unlike pipeline-based approaches that chain separate vision encoders, text decoders and matching modules, Falcon Perception processes image and text tokens in a unified sequence from the first layer. The architecture uses a hybrid attention mask — bidirectional for image tokens and causal for text — eliminating the need for Hungarian matching or separate mask-query components.
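
The hybrid mask described above can be sketched in a few lines. This is a minimal illustration, not TII's implementation. It assumes image tokens come first in the sequence, that image tokens attend only to other image tokens, and that each text token attends to all image tokens plus earlier text. The function name `hybrid_attention_mask` and the token layout are hypothetical.

```python
# Minimal sketch of a hybrid attention mask: bidirectional over image tokens,
# causal over text tokens. Layout assumption (not confirmed by the source):
# the sequence is [image tokens..., text tokens...].

def hybrid_attention_mask(num_image: int, num_text: int) -> list[list[bool]]:
    """Return mask[q][k] == True where query position q may attend to key k."""
    n = num_image + num_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_image:
                # Image query: sees every image token, in both directions.
                mask[q][k] = k < num_image
            else:
                # Text query: sees all earlier positions and itself (causal).
                mask[q][k] = k <= q
    return mask

# Two image tokens followed by two text tokens.
mask = hybrid_attention_mask(num_image=2, num_text=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each printed row is one query position: the first two rows (image) show a full block over the image tokens, while the last two rows (text) widen one position at a time, which is the causal pattern.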

Training required 700 GPU-days across three stages and drew on 54 million web-sourced images with 195 million positive text descriptions and 488 million hard negatives, according to the technical paper published on arXiv (https://arxiv.org/abs/2603.27365). The team used multi-teacher distillation from DINOv3 and SigLIP2, with an ensemble of SAM 3, Qwen3-VL-30B and Moondream3 providing consensus labels.
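
To put those training figures in proportion, the short calculation below derives per-image averages and total GPU-hours from the reported numbers. The inputs are the figures quoted above. The derived averages are simple arithmetic and assume nothing about how descriptions were actually distributed across images.

```python
# Back-of-the-envelope arithmetic on the reported training figures.
# Inputs come from the article; the averages are derived here, not reported.

images = 54_000_000
positive_descriptions = 195_000_000
hard_negatives = 488_000_000
gpu_days = 700

positives_per_image = positive_descriptions / images
negatives_per_image = hard_negatives / images
negatives_per_positive = hard_negatives / positive_descriptions
gpu_hours = gpu_days * 24

print(f"positives per image:    {positives_per_image:.2f}")
print(f"negatives per image:    {negatives_per_image:.2f}")
print(f"negatives per positive: {negatives_per_positive:.2f}")
print(f"GPU-hours:              {gpu_hours:,}")
```

On these numbers, the dataset averages about 3.6 positive descriptions and about 9 hard negatives per image, or roughly 2.5 hard negatives for every positive.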

TII is a government-funded research body in Abu Dhabi. Its Falcon model series, distributed primarily through Hugging Face, the New York-based AI platform, has seen adoption in enterprise and research contexts.

The release comes as U.S.-based companies including Meta and Google, along with startups such as Moondream, have published open-source vision-language models. TII said Falcon Perception is positioned for developers and enterprises working with visual understanding tasks at smaller parameter counts.

The model weights, code and PBench benchmark dataset are available on Hugging Face under open licenses. An interactive playground is accessible at TII’s website.
