# VINE: Video Understanding with Natural Language [![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine) [![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER) VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships. ## Quick Start ```python from transformers import AutoModel from vine_hf import VineConfig, VineModel, VinePipeline # Load VINE model from HuggingFace model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) # Create pipeline with your checkpoint paths vine_pipeline = VinePipeline( model=model, tokenizer=None, sam_config_path="/path/to/sam2_config.yaml", sam_checkpoint_path="/path/to/sam2_checkpoint.pt", gd_config_path="/path/to/grounding_dino_config.py", gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth", device="cuda", trust_remote_code=True ) # Process a video results = vine_pipeline( 'path/to/video.mp4', categorical_keywords=['human', 'dog', 'frisbee'], unary_keywords=['running', 'jumping'], binary_keywords=['chasing', 'behind'], return_top_k=3 ) ``` ## Installation ### Option 1: Automated Setup (Recommended) ```bash # Download the setup script wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh # Run the setup bash setup_vine_demo.sh # Activate environment conda activate vine_demo ``` ### Option 2: Manual Installation ```bash # 1. Create conda environment conda create -n vine_demo python=3.10 -y conda activate vine_demo # 2. Install PyTorch with CUDA support pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126 # 3. Install core dependencies pip install transformers huggingface-hub safetensors # 4. Clone and install required repositories git clone https://github.com/video-fm/video-sam2.git git clone https://github.com/video-fm/GroundingDINO.git git clone https://github.com/kevinxuez/LASER.git git clone https://github.com/kevinxuez/vine_hf.git # Install in editable mode pip install -e ./video-sam2 pip install -e ./GroundingDINO pip install -e ./LASER pip install -e ./vine_hf # Build GroundingDINO extensions cd GroundingDINO && python setup.py build_ext --force --inplace && cd .. ``` ## Required Checkpoints VINE requires SAM2 and GroundingDINO checkpoints for segmentation. Download these separately: ### SAM2 Checkpoint ```bash wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2.1/sam2.1_hiera_t.yaml ``` ### GroundingDINO Checkpoint ```bash wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py ``` ## Architecture ``` video-fm/vine (HuggingFace Hub) ├── VINE Model Weights (~1.8GB) │ ├── Categorical CLIP model (fine-tuned) │ ├── Unary CLIP model (fine-tuned) │ └── Binary CLIP model (fine-tuned) └── Architecture Files ├── vine_config.py ├── vine_model.py ├── vine_pipeline.py └── utilities User Provides: ├── Dependencies (via pip/conda) │ ├── laser (video processing utilities) │ ├── sam2 (segmentation) │ └── groundingdino (object detection) └── Checkpoints (downloaded separately) ├── SAM2 model files └── GroundingDINO model files ``` ## Why This Architecture? This separation of concerns provides several benefits: 1. **Lightweight Distribution**: Only VINE-specific weights (~1.8GB) are on HuggingFace 2. **Version Control**: Users can choose their preferred SAM2/GroundingDINO versions 3. **Licensing**: Keeps different model licenses separate 4. **Flexibility**: Easy to swap segmentation backends 5. **Standard Practice**: Similar to models like LLaVA, BLIP-2, etc. ## Full Usage Example ```python import os from pathlib import Path from transformers import AutoModel from vine_hf import VinePipeline # Set up paths checkpoint_dir = Path("/path/to/checkpoints") sam_config = checkpoint_dir / "sam2_hiera_t.yaml" sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt" gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py" gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth" # Load VINE from HuggingFace model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) # Create pipeline vine_pipeline = VinePipeline( model=model, tokenizer=None, sam_config_path=str(sam_config), sam_checkpoint_path=str(sam_checkpoint), gd_config_path=str(gd_config), gd_checkpoint_path=str(gd_checkpoint), device="cuda:0", trust_remote_code=True ) # Process video results = vine_pipeline( "path/to/video.mp4", categorical_keywords=['person', 'dog', 'ball'], unary_keywords=['running', 'jumping', 'sitting'], binary_keywords=['chasing', 'next to', 'holding'], object_pairs=[(0, 1), (0, 2)], # person-dog, person-ball return_top_k=5, include_visualizations=True ) # Access results print(f"Detected {results['summary']['num_objects_detected']} objects") print(f"Top categories: {results['summary']['top_categories']}") print(f"Top actions: {results['summary']['top_actions']}") print(f"Top relations: {results['summary']['top_relations']}") # Access detailed predictions for obj_id, predictions in results['categorical_predictions'].items(): print(f"\nObject {obj_id}:") for prob, category in predictions: print(f" {category}: {prob:.3f}") ``` ## Output Format ```python { "categorical_predictions": { object_id: [(probability, category), ...] }, "unary_predictions": { (frame_id, object_id): [(probability, action), ...] }, "binary_predictions": { (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...] }, "confidence_scores": { "categorical": float, "unary": float, "binary": float }, "summary": { "num_objects_detected": int, "top_categories": [(category, probability), ...], "top_actions": [(action, probability), ...], "top_relations": [(relation, probability), ...] }, "visualizations": { # if include_visualizations=True "vine": { "all": {"frames": [...], "video_path": "..."}, ... } } } ``` ## Configuration Options ```python from vine_hf import VineConfig config = VineConfig( model_name="openai/clip-vit-base-patch32", # CLIP backbone segmentation_method="grounding_dino_sam2", # or "sam2" box_threshold=0.35, # GroundingDINO threshold text_threshold=0.25, # GroundingDINO threshold target_fps=5, # Video sampling rate visualize=True, # Enable visualizations visualization_dir="outputs/", # Output directory debug_visualizations=False, # Debug mode device="cuda:0" # Device ) ``` ## Deployment Examples ### Local Script ```python # test_vine.py from transformers import AutoModel from vine_hf import VinePipeline model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) pipeline = VinePipeline(model=model, ...) results = pipeline("video.mp4", ...) ``` ### HuggingFace Spaces ```python # app.py for Gradio Space import gradio as gr from transformers import AutoModel from vine_hf import VinePipeline model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) # ... set up pipeline and Gradio interface ``` ### API Server ```python # FastAPI server from fastapi import FastAPI from transformers import AutoModel from vine_hf import VinePipeline app = FastAPI() model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True) pipeline = VinePipeline(model=model, ...) @app.post("/process") async def process_video(video_path: str): return pipeline(video_path, ...) ``` ## Troubleshooting ### Import Errors ```bash # Make sure all dependencies are installed pip list | grep -E "laser|sam2|groundingdino" # Reinstall if needed pip install -e ./LASER pip install -e ./video-sam2 pip install -e ./GroundingDINO ``` ### CUDA Errors ```python # Check CUDA availability import torch print(torch.cuda.is_available()) print(torch.version.cuda) # Use CPU if needed pipeline = VinePipeline(model=model, device="cpu", ...) ``` ### Checkpoint Not Found ```bash # Verify checkpoint paths ls -lh /path/to/sam2_hiera_tiny.pt ls -lh /path/to/groundingdino_swint_ogc.pth ``` ## System Requirements - **Python**: 3.10+ - **CUDA**: 11.8+ (for GPU) - **GPU**: 8GB+ VRAM recommended (T4, V100, A100, etc.) - **RAM**: 16GB+ recommended - **Storage**: ~3GB for checkpoints ## Citation ```bibtex @article{laser2024, title={LASER: Language-guided Object Grounding and Relation Understanding in Videos}, author={Your Authors}, journal={Your Conference/Journal}, year={2024} } ``` ## License This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses. ## Links - **Model**: https://huggingface.co/video-fm/vine - **Code**: https://github.com/kevinxuez/LASER - **vine_hf Package**: https://github.com/kevinxuez/vine_hf - **SAM2**: https://github.com/facebookresearch/sam2 - **GroundingDINO**: https://github.com/IDEA-Research/GroundingDINO ## Support For issues or questions: - **Model/Architecture**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions) - **LASER Framework**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues) - **vine_hf Package**: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)