yuyq96/TextHawk

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Introduction

TextHawk is a Multimodal Large Language Model (MLLM) designed specifically for document-oriented tasks while preserving general capabilities. It explores efficient fine-grained perception through four dedicated components:

  • ReSampling and ReArrangement (ReSA)
  • Scalable Positional Embeddings (SPEs)
  • Query Proposal Network (QPN)
  • Multi-Level Cross-Attention (MLCA)

DocGemini

We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:

  • A brief summary of the document topics.
  • Short QA pairs, up to 10.
  • Insights behind each answer.
  • [Optional] An imaginary conversation between two researchers.

DocGemini consists of 30K images and 195K QA pairs with insights.
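
As a rough illustration of the sample layout described above, the fields could be modeled as follows. This is only a sketch: the class and field names (`QAPair`, `DocGeminiSample`, `summary`, `qa_pairs`, `conversation`) are hypothetical and are not taken from the released scripts, and the actual on-disk format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# "Up to 10" short QA pairs per sample, as stated in the list above.
MAX_QA_PAIRS = 10


@dataclass
class QAPair:
    """One short question-answer pair plus the insight behind the answer."""
    question: str
    answer: str
    insight: str


@dataclass
class DocGeminiSample:
    """Hypothetical container for one DocGemini sample (names are assumptions)."""
    summary: str                                   # brief summary of the document topics
    qa_pairs: List[QAPair] = field(default_factory=list)
    conversation: Optional[str] = None             # optional imaginary conversation

    def __post_init__(self) -> None:
        # Enforce the documented upper bound on QA pairs per sample.
        if len(self.qa_pairs) > MAX_QA_PAIRS:
            raise ValueError(
                f"expected at most {MAX_QA_PAIRS} QA pairs, got {len(self.qa_pairs)}"
            )


# Usage with made-up placeholder content (not real DocGemini data).
sample = DocGeminiSample(
    summary="Placeholder summary of a document.",
    qa_pairs=[QAPair("Placeholder question?", "Placeholder answer.", "Placeholder insight.")],
)
```

With 195K QA pairs over 30K images, a sample holds about 6.5 QA pairs on average, which fits within the stated limit of 10.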

Note: The generated dataset is undergoing legal assessment. Alternatively, you can produce data on your own using the scripts we provide.

Benchmarks

| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| $\text{Donut}$ | $\text{Swin-B}$ (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| $\text{Pix2Struct}$ | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| $\text{InternLM-XC}$ | $\text{EVA-G}$ (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| $\text{LLaVA-1.5-7B}$ | $\text{CLIP-L}$ (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| $\text{Shikra-7B}$ | $\text{CLIP-L}$ (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| $\text{Qwen-VL-Chat}$ | $\text{CLIP-G}$ (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| $\text{Monkey}$ | $\text{CLIP-G}$ (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| $\text{UReader}$ | $\text{CLIP-L}$ (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| $\text{TextMonkey}$ | $\text{CLIP-G}$ (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| $\textbf{TextHawk}^*$ | $\text{SigLIP-SO}$ (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| $\textbf{TextHawk}$ | $\text{SigLIP-SO}$ (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |

Note: $\textbf{TextHawk}^*$ is fine-tuned without DocGemini.
