TextSquare: Scaling up Text-Centric Visual Instruction Tuning

Tang, Jingqun; Lin, Chunhui; Zhao, Zhen; Wei, Shu; Wu, Binghong; Liu, Qi; Feng, Hao; Li, Yang; Wang, Siqi; Liao, Lei; Shi, Wei; Liu, Yuliang; Liu, Hao; Xie, Yuan; Bai, Xiang; Huang, Can

arXiv:2404.12803 (cs)

[Submitted on 19 Apr 2024 (v1), last revised 22 Apr 2025 (this version, v2)]

Abstract: Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini, partly due to a lack of extensive, high-quality instruction-tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, which is generated using closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses previous open-source state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini on 6 of 10 text-centric benchmarks. 2) We demonstrate the critical role of VQA reasoning data in offering comprehensive contextual insights for specific questions. This not only improves accuracy but also significantly mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Scaling text-centric VQA datasets reveals a clear pattern: model performance improves in direct proportion to exponential increases in instruction-tuning data volume, validating both the necessity of the dataset's scale and the high quality of Square-10M.
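
One way to read the third finding is as a log-linear relationship: if each multiplicative step in data volume yields a fixed additive gain, then score ≈ a + b · log10(N) for N training samples. The sketch below fits that form by ordinary least squares. The function name `fit_log_linear` and the sample numbers are illustrative assumptions; the numbers are synthetic placeholders, not results from the paper.

```python
import math

def fit_log_linear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size); returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(scores) / count
    # Standard closed-form slope and intercept for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Synthetic placeholder values (NOT taken from the paper): each 10x increase
# in data volume adds the same 5 points, so the fitted slope b is 5.
sizes = [10**5, 10**6, 10**7]
scores = [40.0, 45.0, 50.0]
a, b = fit_log_linear(sizes, scores)
```

Under this reading, the slope `b` is the gain per tenfold increase in data, so a constant `b` across the measured range is what "directly proportional" would amount to.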

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2404.12803 [cs.CV]
	(or arXiv:2404.12803v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2404.12803

Submission history

From: Jingqun Tang
[v1] Fri, 19 Apr 2024 11:38:08 UTC (3,194 KB)
[v2] Tue, 22 Apr 2025 07:06:06 UTC (3,202 KB)
