VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Abramovich, Ofir; Nayman, Niv; Fogel, Sharon; Lavi, Inbal; Litman, Ron; Tsiper, Shahar; Tichauer, Royee; Appalaraju, Srikar; Mazor, Shai; Manmatha, R.

Computer Science > Computer Vision and Pattern Recognition

arXiv:2407.12594 (cs)

[Submitted on 17 Jul 2024 (v1), last revised 25 Mar 2025 (this version, v2)]

Abstract: In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, which requires the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.
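
The abstract describes down-sampling layers that also receive the prompt, but it does not spell out the mechanism. As a rough illustration only, the sketch below assumes one plausible design: each 2x2 block of patch features cross-attends over the prompt token embeddings and is then merged into a single feature. The function names (`cross_attend`, `prompt_guided_merge`), the identity query/key/value projections, and the averaging merge are all assumptions made for this example and are not taken from the paper or its official code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(patches, prompt_tokens):
    """Let each patch vector attend over the prompt token vectors.

    Assumption: queries, keys and values use identity projections; a trained
    model would use learned linear maps instead.
    """
    dim = len(prompt_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for patch in patches:
        # Scaled dot-product score between this patch and every prompt token.
        scores = [scale * sum(p * t for p, t in zip(patch, tok))
                  for tok in prompt_tokens]
        weights = softmax(scores)
        # Attention-weighted mix of the prompt tokens.
        context = [sum(w * tok[d] for w, tok in zip(weights, prompt_tokens))
                   for d in range(dim)]
        # Residual connection: the patch keeps its own content plus prompt context.
        fused.append([p + c for p, c in zip(patch, context)])
    return fused


def prompt_guided_merge(grid, prompt_tokens):
    """Halve an H x W grid of patch vectors in each direction.

    Assumption: each 2x2 block is prompt-conditioned and then averaged; a
    hierarchical encoder would more typically concatenate and project.
    """
    height, width = len(grid), len(grid[0])
    merged = []
    for r in range(0, height - 1, 2):
        row = []
        for c in range(0, width - 1, 2):
            block = [grid[r][c], grid[r][c + 1],
                     grid[r + 1][c], grid[r + 1][c + 1]]
            fused = cross_attend(block, prompt_tokens)
            dim = len(fused[0])
            row.append([sum(vec[d] for vec in fused) / 4 for d in range(dim)])
        merged.append(row)
    return merged


# Hypothetical toy input: a 4x4 grid of 2-dimensional patch features
# and three 2-dimensional prompt token embeddings.
grid = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
prompt = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
merged = prompt_guided_merge(grid, prompt)
```

With this toy input, `merged` is a 2x2 grid of 2-dimensional vectors: the spatial resolution is halved while every merged feature carries prompt-dependent context, which is the coupling between vision encoder and prompt that the abstract describes at a high level.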

Comments:	ECCV 2024, official code at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.12594 [cs.CV]
	(or arXiv:2407.12594v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.12594

Submission history

From: Ofir Abramovich
[v1] Wed, 17 Jul 2024 14:16:46 UTC (29,148 KB)
[v2] Tue, 25 Mar 2025 19:24:19 UTC (29,148 KB)
