---
library_name: transformers
license: mit
base_model:
- meta-llama/Llama-3.2-11B-Vision-Instruct
tags:
- vision-language
- product-descriptions
- e-commerce
- fine-tuned
- lora
- llama
datasets:
- philschmid/amazon-product-descriptions-vlm
language:
- en
pipeline_tag: image-text-to-text
---

# Finetuned Llama 3.2 Vision for Product Description Generation

A fine-tuned version of Meta's Llama-3.2-11B-Vision-Instruct model specialized for generating SEO-optimized product descriptions from product images, names, and categories.

## Model Details

### Model Description

This model generates concise, SEO-optimized product descriptions for e-commerce applications. Given a product image, name, and category, it produces mobile-friendly descriptions suitable for online marketplaces and product catalogs.

- **Developed by:** Aayush672
- **Model type:** Vision-Language Model (Multimodal)
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** meta-llama/Llama-3.2-11B-Vision-Instruct

### Model Sources

- **Repository:** [Aayush672/Finetuned-llama3.2-Vision-Model](https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model)
- **Base Model:** [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)

## Uses

### Direct Use

The model is designed for generating product descriptions in e-commerce scenarios:

- Product catalog automation
- SEO-optimized content generation
- Mobile-friendly product descriptions
- Marketplace listing optimization

### Example Usage

```python
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

# Load the fine-tuned model and its processor
model = AutoModelForVision2Seq.from_pretrained(
    "Aayush672/Finetuned-llama3.2-Vision-Model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Aayush672/Finetuned-llama3.2-Vision-Model")

# Prepare your inputs
image = Image.open("product_image.jpg")
product_name = "Wireless Bluetooth Headphones"
category = "Electronics | Audio | Headphones"

prompt = f"""Create a Short Product description based on the provided ##PRODUCT NAME## and ##CATEGORY## and image.
Only return description. The description should be SEO optimized and for a better mobile search experience.

##PRODUCT NAME##: {product_name}
##CATEGORY##: {category}"""

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": prompt},
    ]
}]

# Build the chat-formatted prompt, then tokenize text and image together
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=[image], return_tensors="pt").to(model.device)

# do_sample=True is required for the temperature setting to take effect
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
description = processor.tokenizer.decode(output[0], skip_special_tokens=True)
print(description)
```

### Out-of-Scope Use

- General conversation or chat applications
- Complex reasoning tasks
- Non-commercial product descriptions
- Content outside the e-commerce domain

## Training Details

### Training Data

The model was fine-tuned on the [philschmid/amazon-product-descriptions-vlm](https://huggingface.co/datasets/philschmid/amazon-product-descriptions-vlm) dataset, which contains Amazon product images with corresponding names, categories, and descriptions.
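For reference, the sketch below shows one way a row of this dataset could be assembled into the chat format used by the inference prompt above. It is an illustration only; the column names (`image`, `Product Name`, `Category`, `description`) are assumptions about the dataset schema and should be checked against the actual dataset features.

```python
from datasets import load_dataset

# Load the fine-tuning dataset from the Hugging Face Hub
dataset = load_dataset("philschmid/amazon-product-descriptions-vlm", split="train")

def to_chat_example(sample):
    """Convert one dataset row into a chat-style training example.

    NOTE: the column names used below ("Product Name", "Category",
    "description", "image") are assumptions; adjust them to match the
    dataset's actual features.
    """
    prompt = (
        "Create a Short Product description based on the provided "
        "##PRODUCT NAME## and ##CATEGORY## and image. Only return description. "
        "The description should be SEO optimized and for a better mobile search experience.\n\n"
        f"##PRODUCT NAME##: {sample['Product Name']}\n"
        f"##CATEGORY##: {sample['Category']}"
    )
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image"},
                {"type": "text", "text": prompt},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["description"]},
            ]},
        ],
        "images": [sample["image"]],
    }

# Inspect one converted example
print(to_chat_example(dataset[0])["messages"][0]["content"][1]["text"])
```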
### Training Procedure

#### Fine-tuning Method

- **Technique:** LoRA (Low-Rank Adaptation) with PEFT
- **Target modules:** q_proj, v_proj
- **LoRA rank (r):** 8
- **LoRA alpha:** 16
- **LoRA dropout:** 0.05

#### Training Hyperparameters

- **Training regime:** bf16 mixed precision with 4-bit quantization (QLoRA)
- **Number of epochs:** 1
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 4
- **Learning rate:** 2e-4
- **Optimizer:** AdamW (torch fused)
- **LR scheduler:** Constant
- **Warmup ratio:** 0.03
- **Max gradient norm:** 0.3
- **Quantization:** 4-bit with double quantization (nf4)

An illustrative configuration sketch based on these settings is included at the end of this card.

#### Hardware & Software

- **Quantization:** BitsAndBytesConfig with 4-bit precision
- **Gradient checkpointing:** Enabled
- **Memory optimization:** QLoRA technique
- **Framework:** Transformers, TRL, PEFT

## Bias, Risks, and Limitations

### Limitations

- Trained specifically on Amazon product data, so it may not generalize well to other e-commerce platforms
- Limited to English-language descriptions
- Optimized for a mobile/SEO format, which may not suit all description styles
- Performance depends on image quality and product visibility

### Recommendations

- Test thoroughly on your specific product categories before production use
- Consider additional fine-tuning for domain-specific products
- Implement content moderation for generated descriptions
- Validate SEO effectiveness for your target keywords

## Environmental Impact

Training used a 4-bit quantized base model (QLoRA), which reduces computational requirements and carbon footprint compared to full-precision fine-tuning.

## Technical Specifications

### Model Architecture

- **Base Architecture:** Llama 3.2 Vision (11B parameters)
- **Vision Encoder:** Integrated multimodal architecture
- **Fine-tuning:** LoRA adapters (trainable parameters: ~16M)
- **Quantization:** 4-bit with double quantization

### Compute Infrastructure

- **Training:** Optimized with gradient checkpointing and mixed precision
- **Memory:** Reduced via 4-bit quantization and LoRA
- **Inference:** Supports both quantized and full-precision modes

## Citation

```bibtex
@misc{finetuned-llama32-vision-product,
  title={Fine-tuned Llama 3.2 Vision for Product Description Generation},
  author={Aayush672},
  year={2025},
  howpublished={\url{https://huggingface.co/Aayush672/Finetuned-llama3.2-Vision-Model}}
}
```

## Model Card Contact

For questions or issues, please open an issue in the model repository or contact the model author.
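## Appendix: Illustrative Training Configuration

The sketch below shows one way the LoRA and quantization settings listed under Training Procedure could be expressed with Transformers, PEFT, and TRL. It is a minimal illustration reconstructed from the reported hyperparameters, not the exact training script; values not listed on this card (e.g. `output_dir`) are placeholders.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit NF4 quantization with double quantization (QLoRA), as reported above
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapter settings reported for this model
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Trainer arguments mirroring the reported hyperparameters;
# output_dir is a placeholder, not the original value
training_args = SFTConfig(
    output_dir="llama32-vision-product-descriptions",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    optim="adamw_torch_fused",
    lr_scheduler_type="constant",
    warmup_ratio=0.03,
    max_grad_norm=0.3,
    bf16=True,
    gradient_checkpointing=True,
)
```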