---
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
datasets:
- Dataseeds/DataSeeds-Sample-Dataset-DSD
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- llava
- llava-onevision
- lora
- fine-tuned
- photography
- scene-analysis
- image-captioning
model-index:
- name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune
  results:
  - task:
      type: image-captioning
      name: Image Captioning
    dataset:
      name: DataSeeds.AI Sample Dataset
      type: Dataseeds/DataSeeds-Sample-Dataset-DSD
    metrics:
    - type: bleu-4
      value: 0.0246
      name: BLEU-4
    - type: rouge-l
      value: 0.214
      name: ROUGE-L
    - type: bertscore
      value: 0.2789
      name: BERTScore F1
    - type: clipscore
      value: 0.326
      name: CLIPScore
---
# LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset
This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov), specialized for photography scene analysis and description generation. It was presented in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673) and fine-tuned on the [DataSeeds.AI Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD) to improve its ability to generate detailed, accurate descriptions of photographic content.

Usage code: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
## Model Description
- **Base Model**: [LLaVA-OneVision-Qwen2-0.5b](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov)
- **Vision Encoder**: [SigLIP-SO400M-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Language Model**: Qwen2-0.5B (896M parameters)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with PEFT
- **Total Parameters**: ~917M (513M trainable during fine-tuning, 56% of total)
- **Multimodal Projector**: 1.84M parameters (100% trainable)
- **Precision**: BFloat16
- **Task**: Photography scene analysis and detailed image description
### LoRA Configuration
- **LoRA Rank (r)**: 32
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: `v_proj`, `k_proj`, `q_proj`, `up_proj`, `gate_proj`, `down_proj`, `o_proj`
- **Tunable Components**: `mm_mlp_adapter`, `mm_language_model`
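For reference, the configuration above maps roughly onto a PEFT `LoraConfig` as sketched below. This is an illustrative sketch only: the actual fine-tuning was driven by the LLaVA-OneVision training scripts, which also handle the tunable `mm_mlp_adapter` and `mm_language_model` components outside of `LoraConfig`.

```python
# Illustrative sketch: the LoRA hyperparameters above expressed as a PEFT LoraConfig.
# The tunable components (mm_mlp_adapter, mm_language_model) are handled by the
# LLaVA-OneVision training scripts rather than by this config object.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,                     # LoRA rank
    lora_alpha=32,            # scaling factor
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```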
## Training Details
### Dataset
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
- Compositional elements and camera perspectives
- Lighting conditions and visual ambiance
- Product identification and technical details
- Photographic style and mood analysis
### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Learning Rate** | 1e-5 |
| **Optimizer** | AdamW |
| **Learning Rate Schedule** | Cosine decay |
| **Warmup Ratio** | 0.03 |
| **Weight Decay** | 0.01 |
| **Batch Size** | 2 |
| **Gradient Accumulation Steps** | 8 (effective batch size: 16) |
| **Training Epochs** | 3 |
| **Max Sequence Length** | 8192 |
| **Max Gradient Norm** | 0.5 |
| **Precision** | BFloat16 |
| **Hardware** | Single NVIDIA A100 40GB |
| **Training Time** | 30 hours |
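As a rough guide, the table above corresponds to the following Hugging Face `TrainingArguments`. This is a hedged sketch assuming a standard `Trainer`-style setup; the reported run used the LLaVA-OneVision training scripts, so treat the argument names (and the hypothetical `output_dir`) as an approximation rather than the exact launch configuration.

```python
# Sketch only: the hyperparameters from the table expressed as transformers
# TrainingArguments. The actual run used the LLaVA-OneVision repo's scripts.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",   # hypothetical output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",              # "evaluation_strategy" on older transformers
    eval_steps=50,                      # validate every 50 steps
    save_steps=50,                      # keep checkpoints for best-step selection
)
```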
### Training Strategy
- **Validation Frequency**: Every 50 steps for precise checkpoint selection
- **Best Checkpoint**: Step 1,750 (epoch 2.9) with validation loss of 1.83
- **Mixed Precision**: BFloat16 with gradient checkpointing for memory efficiency
- **System Prompt**: Consistent template requesting scene descriptions across all samples
## Performance
### Quantitative Results
The fine-tuned model shows consistent improvements over the base model across all evaluation metrics, with the largest gain in BLEU-4:

| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ |
|--------|------------|------------|------------|------------|
| **BLEU-4** | 0.0199 | **0.0246** | +0.0048 | **+24.09%** |
| **ROUGE-L** | 0.2089 | **0.2140** | +0.0051 | **+2.44%** |
| **BERTScore F1** | 0.2751 | **0.2789** | +0.0039 | **+1.40%** |
| **CLIPScore** | 0.3247 | **0.3260** | +0.0013 | **+0.41%** |
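The reference-based metrics can be reproduced approximately with the `evaluate` library, as sketched below. The captions shown are made-up placeholders, and the exact preprocessing and the CLIPScore implementation behind the reported numbers live in the linked GitHub repository.

```python
# Minimal sketch of the reference-based metrics using the `evaluate` library.
# The example captions are placeholders; see the GitHub repo for the actual
# evaluation pipeline behind the numbers in the table above.
import evaluate

predictions = ["a ceramic mug on a wooden table lit by soft window light"]    # model outputs
references = ["a ceramic coffee mug photographed on a rustic wooden table"]   # ground-truth captions

bleu = evaluate.load("bleu")        # BLEU with max_order=4 (BLEU-4) by default
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print(bleu.compute(predictions=predictions, references=[[r] for r in references])["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])
```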
### Key Improvements
- **Enhanced N-gram Precision**: 24% improvement in BLEU-4 indicates significantly better word sequence accuracy
- **Better Sequential Information**: ROUGE-L improvement shows enhanced capture of longer matching sequences
- **Improved Semantic Understanding**: BERTScore gains demonstrate better contextual relationships
- **Maintained Visual-Semantic Alignment**: CLIPScore preservation with slight improvement
### Inference Performance
- **Processing Speed**: 2.30 seconds per image (NVIDIA A100 40GB)
- **Memory Requirements**: Optimized for single GPU inference
## Usage
### Installation
```bash
pip install transformers torch peft pillow
```
### Basic Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load the base model and processor
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("lmms-lab/llava-onevision-qwen2-0.5b-ov")

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune",
)

# Load and process the image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."

# Depending on the processor/transformers version, the text may need an explicit
# <image> placeholder or the chat template (see Advanced Usage below).
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

# Generate the description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
```
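For serving, the LoRA weights can optionally be merged into the base model with PEFT's `merge_and_unload`, which removes the adapter indirection at inference time; the local output path below is hypothetical.

```python
# Optional: merge the LoRA weights into the base model for simpler serving
# (no PEFT wrapper at runtime). The output directory is a hypothetical path.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llava-ov-dsd-merged")
processor.save_pretrained("./llava-ov-dsd-merged")
```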
### Advanced Usage with Custom Prompts
```python
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment.",
]

for prompt in prompts:
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")
```
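Depending on the transformers version and checkpoint format, the processor may expect a conversation in chat-template form rather than a raw string. The sketch below assumes a processor that ships a chat template (as the HF-format LLaVA-OneVision checkpoints do), which inserts the image placeholder token automatically.

```python
# Sketch, assuming the processor exposes a chat template. The template inserts
# the <image> placeholder and the assistant turn marker for you.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Analyze the photographic composition and lighting in this image."},
        ],
    }
]
templated = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=templated, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```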
## Model Architecture
The model maintains the LLaVA-OneVision architecture with the following components:
- **Vision Encoder**: SigLIP-SO400M with hierarchical feature extraction
- **Language Model**: Qwen2-0.5B with 24 layers, 14 attention heads
- **Multimodal Projector**: 2-layer MLP with GELU activation (mlp2x_gelu)
- **Image Processing**: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints
- **Context Length**: 32,768 tokens with sliding window attention
### Technical Specifications
- **Hidden Size**: 896
- **Intermediate Size**: 4,864
- **Attention Heads**: 14 (2 key-value heads)
- **RMS Norm Epsilon**: 1e-6
- **RoPE Theta**: 1,000,000
- **Image Token Index**: 151646
- **Max Image Grid**: Up to 2304×2304 pixels with dynamic tiling
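These values can be sanity-checked against the loaded configuration. The sketch below reuses `base_model` from the usage example and hedges on attribute locations, since language-model fields may sit at the top level of the config or under a nested `text_config` depending on the checkpoint format.

```python
# Sketch: inspect the architecture values listed above from the loaded model.
# Attribute locations vary by checkpoint format, hence the getattr fallbacks.
cfg = base_model.config
text_cfg = getattr(cfg, "text_config", cfg)
print("hidden_size:", getattr(text_cfg, "hidden_size", None))                  # expect 896
print("intermediate_size:", getattr(text_cfg, "intermediate_size", None))      # expect 4864
print("num_attention_heads:", getattr(text_cfg, "num_attention_heads", None))  # expect 14
print("num_key_value_heads:", getattr(text_cfg, "num_key_value_heads", None))  # expect 2
print("rope_theta:", getattr(text_cfg, "rope_theta", None))                    # expect 1,000,000
print("image_token_index:", getattr(cfg, "image_token_index", None))           # expect 151646
```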
## Training Data
The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including:
- **Scene Descriptions**: Detailed textual descriptions of visual content
- **Technical Metadata**: Camera settings, composition details
- **Style Analysis**: Photographic techniques and artistic elements
- **Quality Annotations**: Professional photography standards
The dataset focuses on enhancing the model's ability to:
- Identify specific products and technical details accurately
- Describe lighting conditions and photographic ambiance
- Analyze compositional elements and camera perspectives
- Generate contextually aware scene descriptions
## Limitations and Considerations
### Model Limitations
- **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks
- **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model
- **Training Data Bias**: May reflect biases present in the DataSeeds.AI dataset
- **Language Support**: Primarily trained and evaluated on English descriptions
### Recommended Use Cases
- ✅ Photography scene analysis and description
- ✅ Product photography captioning
- ✅ Technical photography analysis
- ✅ Visual content generation for photography applications
- ⚠️ General-purpose vision-language tasks (may have reduced performance)
- ❌ Non-photographic image analysis (not optimized for this use case)
### Ethical Considerations
- The model may perpetuate biases present in photography datasets
- Generated descriptions should be reviewed for accuracy in critical applications
- Consider potential cultural biases in photographic style interpretation
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@article{abdoli2025peerranked,
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
journal={arXiv preprint arXiv:2506.05673},
year={2025},
}
@misc{llava-onevision-dsd-finetune-2024,
title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis},
author={DataSeeds.AI},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune},
note={LoRA fine-tuned model for enhanced photography description generation}
}
@article{li2024llavaonevision,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={arXiv preprint arXiv:2408.03326},
year={2024}
}
@article{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
journal={arXiv preprint arXiv:2106.09685},
year={2021}
}
```
## License
This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms.
## Acknowledgments
- **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model
- **Vision Encoder**: Thanks to Google Research for the SigLIP model
- **Dataset**: GuruShots photography community for the source imagery
- **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities
---
*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*