🤗 GitHub   |   🤖 Demo   |   📑 Technical Report
## Introduction
*Example parsing results on four page types: report, chemistry, paper, and handwritten.*
Logics-Parsing is a powerful, end-to-end document parsing model built upon a general Vision-Language Model (VLM) through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It excels at accurately analyzing and structuring highly complex documents.
## Key Features
* **Effortless End-to-End Processing**
* Our single-model architecture eliminates the need for complex, multi-stage pipelines. Deployment and inference are straightforward: the model maps a document image directly to structured output.
* It demonstrates exceptional performance on documents with challenging layouts.
* **Advanced Content Recognition**
* It accurately recognizes and structures difficult content, including intricate scientific formulas.
* Chemical structures are intelligently identified and can be represented in the standard **SMILES** format.
* **Rich, Structured HTML Output**
* The model generates a clean HTML representation of the document, preserving its logical structure.
* Each content block (e.g., paragraph, table, figure, formula) is tagged with its **category**, **bounding box coordinates**, and **OCR text** (see the illustrative sketch after this list).
* It automatically identifies and filters out irrelevant elements like headers and footers, focusing only on the core content.
* **State-of-the-Art Performance**
* Logics-Parsing achieves the best performance on our in-house benchmark, which is specifically designed to comprehensively evaluate a model’s parsing capability on complex-layout documents and STEM content.
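To make the HTML output concrete, here is a minimal, hypothetical sketch of a parsed page. The tag and attribute names (`data-category`, `data-bbox`) are illustrative assumptions rather than the model's exact schema; see the technical report for the authoritative format.
```html
<!-- Hypothetical sketch only: the real tag/attribute names may differ. -->
<div data-category="title" data-bbox="72,64,1020,128">1. Introduction</div>
<div data-category="paragraph" data-bbox="72,150,1020,320">Document parsing converts page images into structured text ...</div>
<div data-category="formula" data-bbox="72,340,1020,410">$E = mc^2$</div>
<!-- Chemical structures can be emitted as SMILES strings: -->
<div data-category="chemistry" data-bbox="72,430,520,640">c1ccccc1</div> <!-- benzene -->
```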
## Benchmark
Existing document-parsing benchmarks often provide limited coverage of complex layouts and STEM content. To address this, we constructed an in-house benchmark comprising 1,078 page-level images across nine major categories and over twenty sub-categories. Our model achieves the best performance on this benchmark.
| Model Type | Methods | Overall Edit ↓ (EN) | Overall Edit ↓ (ZH) | Text Edit ↓ (EN) | Text Edit ↓ (ZH) | Formula Edit ↓ (EN) | Formula Edit ↓ (ZH) | Table TEDS ↑ (EN) | Table TEDS ↑ (ZH) | Table Edit ↓ (EN) | Table Edit ↓ (ZH) | Read Order Edit ↓ (EN) | Read Order Edit ↓ (ZH) | Chemistry Edit ↓ | Handwriting Edit ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Pipeline Tools | doc2x | 0.209 | 0.188 | 0.128 | 0.194 | 0.377 | 0.321 | 81.1 | 85.3 | 0.148 | 0.115 | 0.146 | 0.122 | 1.0 | 0.307 |
| | Textin | 0.153 | 0.158 | 0.132 | 0.190 | 0.185 | 0.223 | 76.7 | 86.3 | 0.176 | 0.113 | 0.118 | 0.104 | 1.0 | 0.344 |
| | mathpix\* | 0.128 | 0.146 | 0.128 | 0.152 | 0.06 | 0.142 | 86.2 | 86.6 | 0.120 | 0.127 | 0.204 | 0.164 | 0.552 | 0.263 |
| | PP_StructureV3 | 0.220 | 0.226 | 0.172 | 0.29 | 0.272 | 0.276 | 66 | 71.5 | 0.237 | 0.193 | 0.201 | 0.143 | 1.0 | 0.382 |
| | Mineru2 | 0.212 | 0.245 | 0.134 | 0.195 | 0.280 | 0.407 | 67.5 | 71.8 | 0.228 | 0.203 | 0.205 | 0.177 | 1.0 | 0.387 |
| | Marker | 0.324 | 0.409 | 0.188 | 0.289 | 0.285 | 0.383 | 65.5 | 50.4 | 0.593 | 0.702 | 0.23 | 0.262 | 1.0 | 0.50 |
| | Pix2text | 0.447 | 0.547 | 0.485 | 0.577 | 0.312 | 0.465 | 64.7 | 63.0 | 0.566 | 0.613 | 0.424 | 0.534 | 1.0 | 0.95 |
| Expert VLMs | Dolphin | 0.208 | 0.256 | 0.149 | 0.189 | 0.334 | 0.346 | 72.9 | 60.1 | 0.192 | 0.35 | 0.160 | 0.139 | 0.984 | 0.433 |
| | dots.ocr | 0.186 | 0.198 | 0.115 | 0.169 | 0.291 | 0.358 | 79.5 | 82.5 | 0.172 | 0.141 | 0.165 | 0.123 | 1.0 | 0.255 |
| | MonkeyOcr | 0.193 | 0.259 | 0.127 | 0.236 | 0.262 | 0.325 | 78.4 | 74.7 | 0.186 | 0.294 | 0.197 | 0.180 | 1.0 | 0.623 |
| | OCRFlux | 0.252 | 0.254 | 0.134 | 0.195 | 0.326 | 0.405 | 58.3 | 70.2 | 0.358 | 0.260 | 0.191 | 0.156 | 1.0 | 0.284 |
| | Gotocr | 0.247 | 0.249 | 0.181 | 0.213 | 0.231 | 0.318 | 59.5 | 74.7 | 0.38 | 0.299 | 0.195 | 0.164 | 0.969 | 0.446 |
| | Olmocr | 0.341 | 0.382 | 0.125 | 0.205 | 0.719 | 0.766 | 57.1 | 56.6 | 0.327 | 0.389 | 0.191 | 0.169 | 1.0 | 0.294 |
| | SmolDocling | 0.657 | 0.895 | 0.486 | 0.932 | 0.859 | 0.972 | 18.5 | 1.5 | 0.86 | 0.98 | 0.413 | 0.695 | 1.0 | 0.927 |
| | **Logics-Parsing** | 0.124 | 0.145 | 0.089 | 0.139 | 0.106 | 0.165 | 76.6 | 79.5 | 0.165 | 0.166 | 0.136 | 0.113 | 0.519 | 0.252 |
| General VLMs | Qwen2VL-72B | 0.298 | 0.342 | 0.142 | 0.244 | 0.431 | 0.363 | 64.2 | 55.5 | 0.425 | 0.581 | 0.193 | 0.182 | 0.792 | 0.359 |
| | Qwen2.5VL-72B | 0.233 | 0.263 | 0.162 | 0.24 | 0.251 | 0.257 | 69.6 | 67 | 0.313 | 0.353 | 0.205 | 0.204 | 0.597 | 0.349 |
| | Doubao-1.6 | 0.188 | 0.248 | 0.129 | 0.219 | 0.273 | 0.336 | 74.9 | 69.7 | 0.180 | 0.288 | 0.171 | 0.148 | 0.601 | 0.317 |
| | GPT-5 | 0.242 | 0.373 | 0.119 | 0.36 | 0.398 | 0.456 | 67.9 | 55.8 | 0.26 | 0.397 | 0.191 | 0.28 | 0.88 | 0.46 |
| | Gemini2.5 pro | 0.185 | 0.20 | 0.115 | 0.155 | 0.288 | 0.326 | 82.6 | 80.3 | 0.154 | 0.182 | 0.181 | 0.136 | 0.535 | 0.26 |

\* Tested on the v3/PDF Conversion API (August 2025 deployment).
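For intuition on the "Edit" columns: they are normalized edit distances, where lower is better. Below is a minimal sketch of one common normalization (Levenshtein distance divided by the longer sequence length, in the spirit of OmniDocBench-style evaluation); the benchmark's exact protocol may differ.
```python
# Minimal sketch of a normalized edit distance. Assumes the common
# convention of Levenshtein distance divided by the longer string length;
# the benchmark's exact normalization may differ.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance, O(len(a) * len(b)).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_edit(pred: str, ref: str) -> float:
    # 0.0 is a perfect match; 1.0 means nothing matched.
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))

print(normalized_edit("E = mc^2", "E = mc2"))  # 0.125
```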
## Quick Start
### 1. Installation
```shell
conda create -n logics-parsing python=3.10
conda activate logics-parsing
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
```
### 2. Download Model Weights
```shell
# Download our model from ModelScope.
pip install modelscope
python download_model.py -t modelscope

# Download our model from Hugging Face.
pip install huggingface_hub
python download_model.py -t huggingface
```
### 3. Inference
```shell
python3 inference.py --image_path PATH_TO_INPUT_IMG --output_path PATH_TO_OUTPUT --model_path PATH_TO_MODEL
```
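Once inference completes, the HTML output can be post-processed with standard tooling. Below is a minimal sketch that collects each block's category, bounding box, and text using only Python's standard library; the `data-category`/`data-bbox` attribute names follow the hypothetical format sketched under "Key Features" and may not match the model's actual schema.
```python
# Minimal sketch: walk the output HTML and collect each content block.
# Assumes the hypothetical <div data-category=... data-bbox=...> format
# sketched earlier; adjust names to the model's actual schema.
from html.parser import HTMLParser

class BlockCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []      # collected {"category", "bbox", "text"} dicts
        self._current = None  # block currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "data-category" in attrs:
            bbox = [int(v) for v in attrs.get("data-bbox", "").split(",") if v]
            self._current = {"category": attrs["data-category"],
                             "bbox": bbox, "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is not None:
            self.blocks.append(self._current)
            self._current = None

with open("PATH_TO_OUTPUT", encoding="utf-8") as f:
    parser = BlockCollector()
    parser.feed(f.read())

for block in parser.blocks:
    print(block["category"], block["bbox"], block["text"][:60])
```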
## Acknowledgments
We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
- [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
- [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
- [Mathpix](https://mathpix.com/)