TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Introduction

TextHawk is a Multimodal Large Language Model (MLLM) designed specifically for document-oriented tasks while preserving general capabilities. It explores efficient fine-grained perception through four dedicated components:

ReSampling and ReArrangement (ReSA)
Scalable Positional Embeddings (SPEs)
Query Proposal Network (QPN)
Multi-Level Cross-Attention (MLCA)

DocGemini

We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:

A brief summary of the document topics.
Short QA pairs, up to 10.
Insights behind each answer.
[Optional] An imaginary conversation between two researchers.

DocGemini consists of 30K images and 195K QA pairs with insights.
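
To make the sample layout above concrete, here is a minimal sketch of how one DocGemini sample could be represented in Python. The class names, field names, and the `image_path` field are assumptions made for illustration; the actual schema used by the released scripts may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The description above says each sample has "up to 10" short QA pairs.
MAX_QA_PAIRS = 10


@dataclass
class QAPair:
    """One short question-answer pair with the insight behind the answer."""
    question: str
    answer: str
    insight: str


@dataclass
class DocGeminiSample:
    """Hypothetical container for a single DocGemini data sample."""
    image_path: str                      # assumed: path to the document image
    summary: str                         # brief summary of the document topics
    qa_pairs: List[QAPair] = field(default_factory=list)
    conversation: Optional[str] = None   # optional imaginary conversation

    def __post_init__(self) -> None:
        # Enforce the documented upper bound on QA pairs per sample.
        if len(self.qa_pairs) > MAX_QA_PAIRS:
            raise ValueError(
                f"expected at most {MAX_QA_PAIRS} QA pairs, got {len(self.qa_pairs)}"
            )


sample = DocGeminiSample(
    image_path="example.png",
    summary="A placeholder summary of the document.",
    qa_pairs=[QAPair("Placeholder question?", "Placeholder answer.", "Placeholder insight.")],
)
```

The `conversation` field defaults to `None` because that part of a sample is optional, and the length check mirrors the stated limit of 10 QA pairs.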

Note: The generated dataset is undergoing legal assessment. Alternatively, you can produce data on your own using the scripts we provide.

Benchmarks

Model	ViT (Params.)	MME perception	MMB dev	SEED image	GQA	DocVQA	ChartQA	InfoVQA	TabFact	WTQ	RefCOCO val	RefCOCO test-A	RefCOCO test-B
$\text{Donut}$	$\text{Swin-B}$ (0.1B)	-	-	-	-	67.5	41.8	11.6	54.6	18.8	-	-	-
$\text{Pix2Struct}$	-	-	-	-	-	76.6	58.6	40.0	-	-	-	-	-
$\text{InternLM-XC}$	$\text{EVA-G}$ (1B)	1528.4	74.8	66.1	-	-	-	-	-	-	-	-	-
$\text{LLaVA-1.5-7B}$	$\text{CLIP-L}$ (0.3B)	1510.7	65.2	-	62.0	-	-	-	-	-	-	-	-
$\text{Shikra-7B}$	$\text{CLIP-L}$ (0.3B)	-	58.8	-	-	-	-	-	-	-	87.0	91.1	81.8
$\text{Qwen-VL-Chat}$	$\text{CLIP-G}$ (2B)	1487.6	60.6	65.4	57.5	62.6	66.3	-	-	-	88.6	92.3	84.5
$\text{Monkey}$	$\text{CLIP-G}$ (2B)	-	59.3	-	60.7	66.5	65.1	36.1	-	25.3	-	-	-
$\text{UReader}$	$\text{CLIP-L}$ (0.3B)	-	-	-	-	65.4	59.3	42.2	67.6	29.4	-	-	-
$\text{TextMonkey}$	$\text{CLIP-G}$ (2B)	-	-	-	-	73.0	66.9	-	-	31.9	-	-	-
$\textbf{TextHawk}^*$	$\text{SigLIP-SO}$ (0.4B)	1520.9	73.0	69.2	64.7	73.6	64.0	47.3	70.7	33.5	87.3	90.9	83.3
$\textbf{TextHawk}$	$\text{SigLIP-SO}$ (0.4B)	1500.0	74.6	69.2	64.6	76.4	66.6	50.6	71.1	34.7	87.2	90.8	82.5

Note: $\textbf{TextHawk}^*$ is fine-tuned without DocGemini.
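
As a rough way to read the two TextHawk rows, the sketch below computes the per-benchmark difference between the model fine-tuned with DocGemini and the one fine-tuned without it on the five document-oriented benchmarks. The scores are copied from the table above; the variable names are illustrative only, and the differences are simple subtractions, not a controlled ablation.

```python
# Scores copied from the benchmark table (document-oriented columns only).
texthawk_without_docgemini = {
    "DocVQA": 73.6, "ChartQA": 64.0, "InfoVQA": 47.3, "TabFact": 70.7, "WTQ": 33.5,
}
texthawk_with_docgemini = {
    "DocVQA": 76.4, "ChartQA": 66.6, "InfoVQA": 50.6, "TabFact": 71.1, "WTQ": 34.7,
}

# Difference per benchmark, rounded to one decimal to match the table's precision.
deltas = {
    name: round(texthawk_with_docgemini[name] - texthawk_without_docgemini[name], 1)
    for name in texthawk_with_docgemini
}

for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}")
```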
