|
|
--- |
|
|
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov |
|
|
datasets: |
|
|
- Dataseeds/DataSeeds-Sample-Dataset-DSD |
|
|
language: |
|
|
- en |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
pipeline_tag: image-text-to-text |
|
|
tags: |
|
|
- vision-language |
|
|
- multimodal |
|
|
- llava |
|
|
- llava-onevision |
|
|
- lora |
|
|
- fine-tuned |
|
|
- photography |
|
|
- scene-analysis |
|
|
- image-captioning |
|
|
model-index: |
|
|
- name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune |
|
|
results: |
|
|
- task: |
|
|
type: image-captioning |
|
|
name: Image Captioning |
|
|
dataset: |
|
|
name: DataSeeds.AI Sample Dataset |
|
|
type: Dataseeds/DataSeeds-Sample-Dataset-DSD |
|
|
metrics: |
|
|
- type: bleu-4 |
|
|
value: 0.0246 |
|
|
name: BLEU-4 |
|
|
- type: rouge-l |
|
|
value: 0.214 |
|
|
name: ROUGE-L |
|
|
- type: bertscore |
|
|
value: 0.2789 |
|
|
name: BERTScore F1 |
|
|
- type: clipscore |
|
|
value: 0.326 |
|
|
name: CLIPScore |
|
|
--- |
|
|
|
|
|
# LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset |
|
|
|
|
|
This model is a LoRA (Low-Rank Adaptation) fine-tune of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov), specialized for photography scene analysis and description generation. It was introduced in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673) and fine-tuned on the [DataSeeds.AI Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD) to generate more detailed, accurate descriptions of photographic content.
|
|
|
|
|
Usage code: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
|
|
|
|
|
|
|
|
## Model Description |
|
|
|
|
|
- **Base Model**: [LLaVA-OneVision-Qwen2-0.5b](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) |
|
|
- **Vision Encoder**: [SigLIP-SO400M-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384) |
|
|
- **Language Model**: Qwen2-0.5B
|
|
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with PEFT |
|
|
- **Total Parameters**: ~917M (513M trainable during fine-tuning, 56% of total) |
|
|
- **Multimodal Projector**: 1.84M parameters (100% trainable) |
|
|
- **Precision**: BFloat16 |
|
|
- **Task**: Photography scene analysis and detailed image description |
|
|
|
|
|
### LoRA Configuration |
|
|
|
|
|
- **LoRA Rank (r)**: 32 |
|
|
- **LoRA Alpha**: 32 |
|
|
- **LoRA Dropout**: 0.1 |
|
|
- **Target Modules**: `v_proj`, `k_proj`, `q_proj`, `up_proj`, `gate_proj`, `down_proj`, `o_proj` |
|
|
- **Tunable Components**: `mm_mlp_adapter`, `mm_language_model` |
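
For reference, the configuration above maps onto a PEFT `LoraConfig` roughly as sketched below; this is an illustrative reconstruction, not the exact training code (see the linked repository for that).

```python
from peft import LoraConfig, get_peft_model

# Illustrative LoRA configuration mirroring the values listed above
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# `base_model` is the loaded LLaVA-OneVision model (see Usage below)
# peft_model = get_peft_model(base_model, lora_config)
```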
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Dataset |
|
|
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on: |
|
|
- Compositional elements and camera perspectives |
|
|
- Lighting conditions and visual ambiance |
|
|
- Product identification and technical details |
|
|
- Photographic style and mood analysis |
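
The dataset is available on the Hub and can be loaded directly; the snippet below is a minimal sketch, and the split name and field names are assumptions to check against the dataset card.

```python
from datasets import load_dataset

# Load the DSD sample dataset from the Hugging Face Hub
dsd = load_dataset("Dataseeds/DataSeeds-Sample-Dataset-DSD", split="train")  # split name assumed
print(dsd[0].keys())  # inspect the available fields before building prompts
```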
|
|
|
|
|
### Training Configuration |
|
|
|
|
|
| Parameter | Value | |
|
|
|-----------|-------| |
|
|
| **Learning Rate** | 1e-5 | |
|
|
| **Optimizer** | AdamW | |
|
|
| **Learning Rate Schedule** | Cosine decay | |
|
|
| **Warmup Ratio** | 0.03 | |
|
|
| **Weight Decay** | 0.01 | |
|
|
| **Batch Size** | 2 | |
|
|
| **Gradient Accumulation Steps** | 8 (effective batch size: 16) | |
|
|
| **Training Epochs** | 3 | |
|
|
| **Max Sequence Length** | 8192 | |
|
|
| **Max Gradient Norm** | 0.5 | |
|
|
| **Precision** | BFloat16 | |
|
|
| **Hardware** | Single NVIDIA A100 40GB | |
|
|
| **Training Time** | 30 hours | |
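
As an illustration only, the hyperparameters above correspond roughly to the following Hugging Face `TrainingArguments`; the actual training script (linked above) may use different argument names and additional options, and the output directory is hypothetical.

```python
from transformers import TrainingArguments

# Approximate reconstruction of the training setup described in the table
training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",   # hypothetical output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",
    eval_steps=50,                      # validate every 50 steps
    save_steps=50,
    load_best_model_at_end=True,
)
# Note: the 8192-token max sequence length is enforced by the tokenizer/data collator,
# not by TrainingArguments.
```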
|
|
|
|
|
### Training Strategy |
|
|
- **Validation Frequency**: Every 50 steps for precise checkpoint selection |
|
|
- **Best Checkpoint**: Step 1,750 (epoch 2.9) with validation loss of 1.83 |
|
|
- **Mixed Precision**: BFloat16 with gradient checkpointing for memory efficiency |
|
|
- **System Prompt**: Consistent template requesting scene descriptions across all samples |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Quantitative Results |
|
|
|
|
|
The fine-tuned model shows consistent improvements over the base model across all evaluation metrics:
|
|
|
|
|
| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ | |
|
|
|--------|------------|------------|------------|------------| |
|
|
| **BLEU-4** | 0.0199 | **0.0246** | +0.0048 | **+24.09%** | |
|
|
| **ROUGE-L** | 0.2089 | **0.2140** | +0.0051 | **+2.44%** | |
|
|
| **BERTScore F1** | 0.2751 | **0.2789** | +0.0039 | **+1.40%** | |
|
|
| **CLIPScore** | 0.3247 | **0.3260** | +0.0013 | **+0.41%** | |
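
The exact evaluation pipeline lives in the linked repository; as a rough sketch, reference-based metrics of this kind can be computed with the `evaluate` library (the captions below are placeholders, and CLIPScore, which needs a CLIP image-text scorer, is omitted).

```python
import evaluate

# predictions: model-generated captions; references: ground-truth DSD descriptions
predictions = ["a close-up photograph of a red flower in soft morning light"]
references = ["a macro shot of a red flower lit by soft morning light"]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

bleu4 = bleu.compute(predictions=predictions, references=references, max_order=4)["bleu"]
rouge_l = rouge.compute(predictions=predictions, references=references)["rougeL"]
bert_f1 = bertscore.compute(predictions=predictions, references=references, lang="en")["f1"]

print(f"BLEU-4: {bleu4:.4f}  ROUGE-L: {rouge_l:.4f}  BERTScore F1: {sum(bert_f1) / len(bert_f1):.4f}")
```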
|
|
|
|
|
### Key Improvements |
|
|
- **Enhanced N-gram Precision**: A 24% relative improvement in BLEU-4 indicates noticeably better word-sequence accuracy
|
|
- **Better Sequential Information**: ROUGE-L improvement shows enhanced capture of longer matching sequences |
|
|
- **Improved Semantic Understanding**: BERTScore gains demonstrate better contextual relationships |
|
|
- **Maintained Visual-Semantic Alignment**: CLIPScore is preserved, with a slight improvement
|
|
|
|
|
### Inference Performance |
|
|
- **Processing Speed**: 2.30 seconds per image (NVIDIA A100 40GB) |
|
|
- **Memory Requirements**: Optimized for single GPU inference |
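
The latency figure above can be reproduced approximately with a simple timing loop; this is a sketch that assumes the `model` and `processor` from the Usage section below, and exact numbers depend on hardware, prompt, and generation length.

```python
import time

def average_latency(model, processor, images, prompt, max_new_tokens=512):
    """Return the mean per-image generation time in seconds for a list of PIL images."""
    latencies = []
    for image in images:
        inputs = processor(prompt, image, return_tensors="pt").to(model.device)
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens)
        latencies.append(time.perf_counter() - start)
    return sum(latencies) / len(latencies)
```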
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch peft pillow accelerate
|
|
``` |
|
|
|
|
|
### Basic Usage |
|
|
|
|
|
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load the base model in bfloat16; device_map="auto" requires the accelerate package
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained("lmms-lab/llava-onevision-qwen2-0.5b-ov")

# Attach the LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune"
)

# Load and process the image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."

inputs = processor(prompt, image, return_tensors="pt").to(model.device)

# Generate a description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
|
|
``` |
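
For deployment you can optionally merge the LoRA weights into the base model so inference no longer routes through the PEFT wrapper. This uses PEFT's standard `merge_and_unload`; the output path below is purely illustrative.

```python
# Fold the LoRA weights into the base model for standalone inference
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llava-ov-dsd-merged")  # hypothetical local path
```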
|
|
|
|
|
### Advanced Usage with Custom Prompts |
|
|
|
|
|
```python
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment."
]

for prompt in prompts:
    inputs = processor(prompt, image, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")
```
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The model maintains the LLaVA-OneVision architecture with the following components: |
|
|
|
|
|
- **Vision Encoder**: SigLIP-SO400M with hierarchical feature extraction |
|
|
- **Language Model**: Qwen2-0.5B with 24 layers, 14 attention heads |
|
|
- **Multimodal Projector**: 2-layer MLP with GELU activation (mlp2x_gelu) |
|
|
- **Image Processing**: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints |
|
|
- **Context Length**: 32,768 tokens with sliding window attention |
|
|
|
|
|
### Technical Specifications |
|
|
|
|
|
- **Hidden Size**: 896 |
|
|
- **Intermediate Size**: 4,864 |
|
|
- **Attention Heads**: 14 (2 key-value heads) |
|
|
- **RMS Norm Epsilon**: 1e-6 |
|
|
- **RoPE Theta**: 1,000,000 |
|
|
- **Image Token Index**: 151646 |
|
|
- **Max Image Grid**: Up to 2304×2304 pixels with dynamic tiling |
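
These values can be sanity-checked against the published configuration. The sketch below assumes the standard Qwen2/LLaVA attribute names on the top-level config; the repository's custom code may organize them differently.

```python
from transformers import AutoConfig

# Inspect the configuration shipped with the base checkpoint
config = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    trust_remote_code=True,
)
print(config.hidden_size)          # expected: 896
print(config.intermediate_size)    # expected: 4864
print(config.num_attention_heads)  # expected: 14
print(config.num_key_value_heads)  # expected: 2
print(config.rms_norm_eps)         # expected: 1e-06
print(config.rope_theta)           # expected: 1000000
```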
|
|
|
|
|
## Training Data |
|
|
|
|
|
The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including: |
|
|
|
|
|
- **Scene Descriptions**: Detailed textual descriptions of visual content |
|
|
- **Technical Metadata**: Camera settings, composition details |
|
|
- **Style Analysis**: Photographic techniques and artistic elements |
|
|
- **Quality Annotations**: Professional photography standards |
|
|
|
|
|
The dataset focuses on enhancing the model's ability to: |
|
|
- Identify specific products and technical details accurately |
|
|
- Describe lighting conditions and photographic ambiance |
|
|
- Analyze compositional elements and camera perspectives |
|
|
- Generate contextually aware scene descriptions |
|
|
|
|
|
## Limitations and Considerations |
|
|
|
|
|
### Model Limitations |
|
|
- **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks |
|
|
- **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model |
|
|
- **Training Data Bias**: May reflect biases present in the DataSeeds.AI dataset |
|
|
- **Language Support**: Primarily trained and evaluated on English descriptions |
|
|
|
|
|
### Recommended Use Cases |
|
|
- ✅ Photography scene analysis and description |
|
|
- ✅ Product photography captioning |
|
|
- ✅ Technical photography analysis |
|
|
- ✅ Visual content generation for photography applications |
|
|
- ⚠️ General-purpose vision-language tasks (may have reduced performance) |
|
|
- ❌ Non-photographic image analysis (not optimized for this use case) |
|
|
|
|
|
### Ethical Considerations |
|
|
- The model may perpetuate biases present in photography datasets |
|
|
- Generated descriptions should be reviewed for accuracy in critical applications |
|
|
- Consider potential cultural biases in photographic style interpretation |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research or applications, please cite: |
|
|
|
|
|
```bibtex |
|
|
@article{abdoli2025peerranked, |
|
|
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery}, |
|
|
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz}, |
|
|
journal={arXiv preprint arXiv:2506.05673}, |
|
|
year={2025}, |
|
|
} |
|
|
|
|
|
@misc{llava-onevision-dsd-finetune-2024, |
|
|
title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis}, |
|
|
author={DataSeeds.AI}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune}, |
|
|
note={LoRA fine-tuned model for enhanced photography description generation} |
|
|
} |
|
|
|
|
|
@article{li2024llavaonevision, |
|
|
title={LLaVA-OneVision: Easy Visual Task Transfer}, |
|
|
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
|
|
journal={arXiv preprint arXiv:2408.03326}, |
|
|
year={2024} |
|
|
} |
|
|
|
|
|
@article{hu2022lora, |
|
|
title={LoRA: Low-Rank Adaptation of Large Language Models}, |
|
|
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu}, |
|
|
journal={arXiv preprint arXiv:2106.09685}, |
|
|
year={2021} |
|
|
} |
|
|
``` |
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms. |
|
|
|
|
|
## Acknowledgments |
|
|
|
|
|
- **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model |
|
|
- **Vision Encoder**: Thanks to Google Research for the SigLIP model |
|
|
- **Dataset**: GuruShots photography community for the source imagery |
|
|
- **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities |
|
|
|
|
|
--- |
|
|
|
|
|
*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.* |