|
|
--- |
|
|
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov |
|
|
datasets: |
|
|
- Dataseeds/DataSeeds-Sample-Dataset-DSD |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- vision-language |
|
|
- multimodal |
|
|
- llava |
|
|
- llava-onevision |
|
|
- lora |
|
|
- fine-tuned |
|
|
- photography |
|
|
- scene-analysis |
|
|
- image-captioning |
|
|
model-index: |
|
|
- name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune |
|
|
results: |
|
|
- task: |
|
|
type: image-captioning |
|
|
name: Image Captioning |
|
|
dataset: |
|
|
name: DataSeeds.AI Sample Dataset |
|
|
type: Dataseeds/DataSeeds-Sample-Dataset-DSD |
|
|
metrics: |
|
|
- type: bleu-4 |
|
|
value: 0.0246 |
|
|
name: BLEU-4 |
|
|
- type: rouge-l |
|
|
value: 0.214 |
|
|
name: ROUGE-L |
|
|
- type: bertscore |
|
|
value: 0.2789 |
|
|
name: BERTScore F1 |
|
|
- type: clipscore |
|
|
value: 0.326 |
|
|
name: CLIPScore |
|
|
--- |
|
|
|
|
|
# LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset |
|
|
|
|
|
This model is a LoRA (Low-Rank Adaptation) fine-tune of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov), specialized for photography scene analysis and description generation. It was introduced in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673) and fine-tuned on the [DataSeeds.AI Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD) to generate more detailed, accurate descriptions of photographic content.
|
|
|
|
|
Usage code: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
|
|
|
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Base Model**: [LLaVA-OneVision-Qwen2-0.5b](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) |
|
|
- **Vision Encoder**: [SigLIP-SO400M-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) |
|
|
- **Language Model**: Qwen2-0.5B
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with PEFT |
|
|
- **Total Parameters**: ~917M (513M trainable during fine-tuning, 56% of total) |
|
|
- **Multimodal Projector**: 1.84M parameters (100% trainable) |
|
|
- **Precision**: BFloat16 |
|
|
- **Task**: Photography scene analysis and detailed image description |
|
|
|
|
|
### LoRA Configuration |
|
|
|
|
|
- **LoRA Rank (r)**: 32 |
|
|
- **LoRA Alpha**: 32 |
|
|
- **LoRA Dropout**: 0.1 |
|
|
- **Target Modules**: `v_proj`, `k_proj`, `q_proj`, `up_proj`, `gate_proj`, `down_proj`, `o_proj` |
|
|
- **Tunable Components**: `mm_mlp_adapter`, `mm_language_model` |
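
For reference, the configuration above maps onto a PEFT `LoraConfig` roughly as sketched below; this is an illustrative reconstruction, not the exact training code (see the linked repository for that).

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# `base_model` is the loaded LLaVA-OneVision model (see Usage below)
# peft_model = get_peft_model(base_model, lora_config)
```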
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on: |
|
|
- Compositional elements and camera perspectives |
|
|
- Lighting conditions and visual ambiance |
|
|
- Product identification and technical details |
|
|
- Photographic style and mood analysis |
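
The dataset is available on the Hub and can be loaded directly; the snippet below is a minimal sketch, and the split name and field names are assumptions to check against the dataset card.

```python
from datasets import load_dataset

# Load the DSD sample dataset from the Hugging Face Hub
dsd = load_dataset("Dataseeds/DataSeeds-Sample-Dataset-DSD", split="train")  # split name assumed
print(dsd[0].keys())  # inspect the available fields before building prompts
```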
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| **Learning Rate** | 1e-5 | |
|
|
| **Optimizer** | AdamW | |
|
|
| **Learning Rate Schedule** | Cosine decay | |
|
|
| **Warmup Ratio** | 0.03 | |
|
|
| **Weight Decay** | 0.01 | |
|
|
| **Batch Size** | 2 | |
|
|
| **Gradient Accumulation Steps** | 8 (effective batch size: 16) | |
|
|
| **Training Epochs** | 3 | |
|
|
| **Max Sequence Length** | 8192 | |
|
|
| **Max Gradient Norm** | 0.5 | |
|
|
| **Precision** | BFloat16 | |
|
|
| **Hardware** | Single NVIDIA A100 40GB | |
|
|
| **Training Time** | 30 hours | |
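
As an illustration only, the hyperparameters above correspond roughly to the following Hugging Face `TrainingArguments`; the actual training script (linked above) may use different argument names and additional options, and the output directory is hypothetical.

```python
from transformers import TrainingArguments

# Approximate reconstruction of the training setup described in the table
training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",   # hypothetical output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=50,                      # validate every 50 steps
    save_steps=50,
    load_best_model_at_end=True,
)
# Note: the 8192-token max sequence length is enforced by the tokenizer/data collator,
# not by TrainingArguments.
```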
|
|
|
|
|
### Training Strategy |
|
|
- **Validation Frequency**: Every 50 steps for precise checkpoint selection |
|
|
- **Best Checkpoint**: Step 1,750 (epoch 2.9) with validation loss of 1.83 |
|
|
- **Mixed Precision**: BFloat16 with gradient checkpointing for memory efficiency |
|
|
- **System Prompt**: Consistent template requesting scene descriptions across all samples |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Quantitative Results |
|
|
|
|
|
The fine-tuned model shows consistent improvements over the base model across all evaluation metrics:
|
|
|
|
|
| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ | |
|
|
|--------|------------|------------|------------|------------| |
|
|
| **BLEU-4** | 0.0199 | **0.0246** | +0.0048 | **+24.09%** | |
|
|
| **ROUGE-L** | 0.2089 | **0.2140** | +0.0051 | **+2.44%** | |
|
|
| **BERTScore F1** | 0.2751 | **0.2789** | +0.0039 | **+1.40%** | |
|
|
| **CLIPScore** | 0.3247 | **0.3260** | +0.0013 | **+0.41%** | |
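
The exact evaluation pipeline lives in the linked repository; as a rough sketch, reference-based metrics of this kind can be computed with the `evaluate` library (the captions below are placeholders, and CLIPScore, which needs a CLIP image-text scorer, is omitted).

```python
import evaluate

# predictions: model-generated captions; references: ground-truth DSD descriptions
predictions = ["a close-up photograph of a red flower in soft morning light"]
references = ["a macro shot of a red flower lit by soft morning light"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"]
rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]
bert_f1 = bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]

print(f"BLEU-4: {bleu4:.4f}  ROUGE-L: {rouge_l:.4f}  BERTScore F1: {sum(bert_f1) / len(bert_f1):.4f}")
```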
|
|
|
|
|
### Key Improvements |
|
|
- **Enhanced N-gram Precision**: A 24% relative improvement in BLEU-4 indicates noticeably better word-sequence accuracy
|
|
- **Better Sequential Information**: ROUGE-L improvement shows enhanced capture of longer matching sequences |
|
|
- **Improved Semantic Understanding**: BERTScore gains demonstrate better contextual relationships |
|
|
- **Maintained Visual-Semantic Alignment**: CLIPScore is preserved, with a slight improvement
|
|
|
|
|
### Inference Performance |
|
|
- **Processing Speed**: 2.30 seconds per image (NVIDIA A100 40GB) |
|
|
- **Memory Requirements**: Optimized for single GPU inference |
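
The latency figure above can be reproduced approximately with a simple timing loop; this is a sketch that assumes the `model` and `processor` from the Usage section below, and exact numbers depend on hardware, prompt, and generation length.

```python
import time

def average_latency(model, processor, images, prompt, max_new_tokens=512):
    """Return the mean per-image generation time in seconds for a list of PIL images."""
    latencies = []
    for image in images:
        inputs = processor(prompt, image, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)
```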
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch peft pillow accelerate
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load the base model in bfloat16; device_map="auto" requires the accelerate package
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained("lmms-lab/llava-onevision-qwen2-0.5b-ov")

# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune"
)

# Load and process the image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."

inputs = processor(prompt, image, return_tensors="pt").to(model.device)

# Generate a description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
|
|
``` |
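
For deployment you can optionally merge the LoRA weights into the base model so inference no longer routes through the PEFT wrapper. This uses PEFT's standard `merge_and_unload`; the output path below is purely illustrative.

```python
# Fold the LoRA weights into the base model for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llava-ov-dsd-merged")  # hypothetical local path
```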
|
|
|
|
|
### Advanced Usage with Custom Prompts |
|
|
|
|
|
```python
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment."
]

for prompt in prompts:
    inputs = processor(prompt, image, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")
```
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The model maintains the LLaVA-OneVision architecture with the following components: |
|
|
|
|
|
- **Vision Encoder**: SigLIP-SO400M with hierarchical feature extraction |
|
|
- **Language Model**: Qwen2-0.5B with 24 layers, 14 attention heads |
|
|
- **Multimodal Projector**: 2-layer MLP with GELU activation (mlp2x_gelu) |
|
|
- **Image Processing**: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints |
|
|
- **Context Length**: 32,768 tokens with sliding window attention |
|
|
|
|
|
### Technical Specifications |
|
|
|
|
|
- **Hidden Size**: 896 |
|
|
- **Intermediate Size**: 4,864 |
|
|
- **Attention Heads**: 14 (2 key-value heads) |
|
|
- **RMS Norm Epsilon**: 1e-6 |
|
|
- **RoPE Theta**: 1,000,000 |
|
|
- **Image Token Index**: 151646 |
|
|
- **Max Image Grid**: Up to 2304×2304 pixels with dynamic tiling |
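
These values can be sanity-checked against the published configuration. The sketch below assumes the standard Qwen2/LLaVA attribute names on the top-level config; the repository's custom code may organize them differently.

```python
from transformers import AutoConfig

# Inspect the configuration shipped with the base checkpoint
config = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    trust_remote_code=True,
)
print(config.hidden_size)          # expected: 896
print(config.intermediate_size)    # expected: 4864
print(config.num_attention_heads)  # expected: 14
print(config.num_key_value_heads)  # expected: 2
print(config.rms_norm_eps)         # expected: 1e-06
print(config.rope_theta)           # expected: 1000000
```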
|
|
|
|
|
## Training Data |
|
|
|
|
|
The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including: |
|
|
|
|
|
- **Scene Descriptions**: Detailed textual descriptions of visual content |
|
|
- **Technical Metadata**: Camera settings, composition details |
|
|
- **Style Analysis**: Photographic techniques and artistic elements |
|
|
- **Quality Annotations**: Professional photography standards |
|
|
|
|
|
The dataset focuses on enhancing the model's ability to: |
|
|
- Identify specific products and technical details accurately |
|
|
- Describe lighting conditions and photographic ambiance |
|
|
- Analyze compositional elements and camera perspectives |
|
|
- Generate contextually aware scene descriptions |
|
|
|
|
|
## Limitations and Considerations |
|
|
|
|
|
### Model Limitations |
|
|
- **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks |
|
|
- **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model |
|
|
- **Training Data Bias**: May reflect biases present in the DataSeeds.AI dataset |
|
|
- **Language Support**: Primarily trained and evaluated on English descriptions |
|
|
|
|
|
### Recommended Use Cases |
|
|
- ✅ Photography scene analysis and description |
|
|
- ✅ Product photography captioning |
|
|
- ✅ Technical photography analysis |
|
|
- ✅ Visual content generation for photography applications |
|
|
- ⚠️ General-purpose vision-language tasks (may have reduced performance) |
|
|
- ❌ Non-photographic image analysis (not optimized for this use case) |
|
|
|
|
|
### Ethical Considerations |
|
|
- The model may perpetuate biases present in photography datasets |
|
|
- Generated descriptions should be reviewed for accuracy in critical applications |
|
|
- Consider potential cultural biases in photographic style interpretation |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research or applications, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{abdoli2025peerranked, |
|
|
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery}, |
|
|
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz}, |
|
|
journal={arXiv preprint arXiv:2506.05673}, |
|
|
year={2025}, |
|
|
} |
|
|
|
|
|
@misc{llava-onevision-dsd-finetune-2024, |
|
|
title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis}, |
|
|
author={DataSeeds.AI}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune}, |
|
|
note={LoRA fine-tuned model for enhanced photography description generation} |
|
|
} |
|
|
|
|
|
@article{li2024llavaonevision, |
|
|
title={LLaVA-OneVision: Easy Visual Task Transfer}, |
|
|
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
|
|
journal={arXiv preprint arXiv:2408.03326}, |
|
|
year={2024} |
|
|
} |
|
|
|
|
|
@article{hu2022lora, |
|
|
title={LoRA: Low-Rank Adaptation of Large Language Models}, |
|
|
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu}, |
|
|
journal={arXiv preprint arXiv:2106.09685}, |
|
|
year={2021} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model |
|
|
- **Vision Encoder**: Thanks to Google Research for the SigLIP model |
|
|
- **Dataset**: GuruShots photography community for the source imagery |
|
|
- **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities |
|
|
|
|
|
--- |
|
|
|
|
|
*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.* |