# Update README.md
The horizontal axis shows the cumulative number of frames in the video haystack. The vertical axis shows the position of the needle image within that sequence; for example, a frame depth of 0% places the needle image at the very start of the video. The black dotted line marks the training context length of the backbone language model, with each frame comprising 144 tokens.

`OmniLong-Qwen2.5-VL-7B` scored an average of `97.55%` on this NIAH benchmark across the different frame depths and frame counts shown in this plot.

**[2. MME: A Comprehensive Evaluation Benchmark for Image Understanding](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation)**

MME is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities across a total of 14 subtasks: existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning. `OmniLong-Qwen2.5-VL-7B` achieves a higher cognition score than the base `Qwen2.5-VL-7B-Instruct` while remaining competitive on perception, as shown below.

| Models | mme_cognition_score | mme_percetion_score |
|--------------------|----------------------|---------------------|
| **OmniLong-Qwen2.5-VL-7B** | **642.85** | 1599.28 |
| [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) | 629.64 | **1691.36** |
Video-MME is the first-ever full-spectrum, multi-modal evaluation benchmark of MLLMs in video analysis. It covers a wide range of short videos (< 2 min), medium videos (4 min ~ 15 min), and long videos (30 min ~ 60 min). 900 videos totaling 254 hours were manually selected and annotated by repeatedly viewing all the video content, resulting in 2,700 question-answer pairs. Subtitles are also provided with the videos for evaluation.

`OmniLong-Qwen2.5-VL-7B` scored an overall `67.9%` without subtitles and `73.4%` with subtitles, as shown in this table (*adapted from the [VideoMME Leaderboard](https://video-mme.github.io/home_page.html)*), which makes it the SOTA among `7B` models.

| Models | LLM Params | Overall (%) - w/o subs | Overall (%) - w subs |
|--------------------|------------|-------------------------|------------------------|

### Start the server
```shell
vllm serve aws-prototyping/OmniLong-Qwen2.5-VL-7B --tensor-parallel-size 4
```
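Once the server is running, you can query it through vLLM's OpenAI-compatible API. The snippet below is a minimal sketch assuming the server's default port `8000`; the prompt text and image URL are placeholders, not part of the original instructions.

```shell
# Minimal sketch: query the OpenAI-compatible endpoint that vLLM exposes by default on port 8000.
# The image URL below is a placeholder; replace it with your own image (or video) input.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aws-prototyping/OmniLong-Qwen2.5-VL-7B",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image."},
          {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}}
        ]
      }
    ]
  }'
```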
## Deploy the model on a SageMaker LMI Endpoint
## Limitations
Before using the `OmniLong-Qwen2.5-VL-7B` model, it is important to perform your own independent assessment and take measures to ensure that your use complies with your own specific quality control practices and standards, and with the local rules, laws, regulations, licenses and terms that apply to you and your content.
## Citation
If you find our work helpful, feel free to cite us.
```
@misc{OmniLong-Qwen2.5-VL-7B-2025,
  author = { {Yin Song and Chen Wu} },
  title = { {aws-prototyping/OmniLong-Qwen2.5-VL-7B} },
  year = 2025,
  url = { https://huggingface.co/aws-prototyping/OmniLong-Qwen2.5-VL-7B },
  publisher = { Hugging Face }
}
```