# VINE: Video Understanding with Natural Language

[Model](https://huggingface.co/video-fm/vine) · [Code](https://github.com/kevinxuez/LASER)

VINE is a video understanding model that processes videos along with categorical, unary, and binary keywords to return probability distributions over those keywords for detected objects and their relationships.
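
To make "probability distributions over keywords" concrete, the sketch below turns per-keyword scores into probabilities with a softmax and keeps the top k. It assumes CLIP-style scoring in which each keyword receives one raw score per object; the scores and the helper name `top_k_keywords` are invented for illustration, and this is not VINE's actual code.

```python
import math

def top_k_keywords(scores, k=3):
    """Softmax raw keyword scores; return the k most probable (probability, keyword) pairs."""
    # Subtract the max score for numerical stability before exponentiating.
    max_score = max(scores.values())
    exps = {kw: math.exp(s - max_score) for kw, s in scores.items()}
    total = sum(exps.values())
    probs = [(e / total, kw) for kw, e in exps.items()]
    # Highest probability first.
    return sorted(probs, reverse=True)[:k]

# Hypothetical raw scores for one detected object (not real model output).
example_scores = {"human": 2.0, "dog": 0.5, "frisbee": -1.0}
for prob, keyword in top_k_keywords(example_scores):
    print(f"{keyword}: {prob:.3f}")
```

The `(probability, keyword)` ordering mirrors the prediction tuples described under Output Format.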
## Quick Start

```python
from transformers import AutoModel
from vine_hf import VineConfig, VineModel, VinePipeline

# Load VINE model from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline with your checkpoint paths
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/path/to/sam2_config.yaml",
    sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
    gd_config_path="/path/to/grounding_dino_config.py",
    gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
    device="cuda",
    trust_remote_code=True
)

# Process a video
results = vine_pipeline(
    'path/to/video.mp4',
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'behind'],
    return_top_k=3
)
```
## Installation

### Option 1: Automated Setup (Recommended)

```bash
# Download the setup script
wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh

# Run the setup
bash setup_vine_demo.sh

# Activate environment
conda activate vine_demo
```

### Option 2: Manual Installation

```bash
# 1. Create conda environment
conda create -n vine_demo python=3.10 -y
conda activate vine_demo

# 2. Install PyTorch with CUDA support
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126

# 3. Install core dependencies
pip install transformers huggingface-hub safetensors

# 4. Clone and install required repositories
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git

# Install in editable mode
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf

# Build GroundingDINO extensions
cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
```
## Required Checkpoints

VINE uses GroundingDINO for object detection and SAM2 for segmentation. Download both checkpoints and their config files separately:

### SAM2 Checkpoint

```bash
# The config must match the checkpoint version (SAM 2 tiny here, not SAM 2.1)
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2/sam2_hiera_t.yaml
```

### GroundingDINO Checkpoint

```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```
## Architecture

```
video-fm/vine (HuggingFace Hub)
├── VINE Model Weights (~1.8GB)
│   ├── Categorical CLIP model (fine-tuned)
│   ├── Unary CLIP model (fine-tuned)
│   └── Binary CLIP model (fine-tuned)
└── Architecture Files
    ├── vine_config.py
    ├── vine_model.py
    ├── vine_pipeline.py
    └── utilities

User Provides:
├── Dependencies (via pip/conda)
│   ├── laser (video processing utilities)
│   ├── sam2 (segmentation)
│   └── groundingdino (object detection)
└── Checkpoints (downloaded separately)
    ├── SAM2 model files
    └── GroundingDINO model files
```
## Why This Architecture?

This separation of concerns provides several benefits:

1. **Lightweight Distribution**: Only VINE-specific weights (~1.8GB) are hosted on HuggingFace
2. **Version Control**: Users can choose their preferred SAM2/GroundingDINO versions
3. **Licensing**: Keeps different model licenses separate
4. **Flexibility**: Easy to swap segmentation backends
5. **Standard Practice**: Similar to models like LLaVA, BLIP-2, etc.
## Full Usage Example

```python
from pathlib import Path

from transformers import AutoModel
from vine_hf import VinePipeline

# Set up paths
checkpoint_dir = Path("/path/to/checkpoints")
sam_config = checkpoint_dir / "sam2_hiera_t.yaml"
sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(sam_config),
    sam_checkpoint_path=str(sam_checkpoint),
    gd_config_path=str(gd_config),
    gd_checkpoint_path=str(gd_checkpoint),
    device="cuda:0",
    trust_remote_code=True
)

# Process video
results = vine_pipeline(
    "path/to/video.mp4",
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping', 'sitting'],
    binary_keywords=['chasing', 'next to', 'holding'],
    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
    return_top_k=5,
    include_visualizations=True
)

# Access results
print(f"Detected {results['summary']['num_objects_detected']} objects")
print(f"Top categories: {results['summary']['top_categories']}")
print(f"Top actions: {results['summary']['top_actions']}")
print(f"Top relations: {results['summary']['top_relations']}")

# Access detailed predictions: each entry is a (probability, category) pair
for obj_id, predictions in results['categorical_predictions'].items():
    print(f"\nObject {obj_id}:")
    for prob, category in predictions:
        print(f"  {category}: {prob:.3f}")
```
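
The `object_pairs` argument above is written by hand. With more objects, the pairs can be generated instead. The sketch below builds every ordered pair of distinct object IDs with `itertools.permutations`. Whether VINE treats `(0, 1)` and `(1, 0)` as different pairs is an assumption here (relations such as "chasing" are directional), and `all_object_pairs` is a hypothetical helper, not part of `vine_hf`.

```python
from itertools import permutations

def all_object_pairs(object_ids):
    """Return every ordered (subject, object) pair of distinct object IDs."""
    return list(permutations(object_ids, 2))

# Three objects give six ordered pairs.
pairs = all_object_pairs([0, 1, 2])
print(pairs)
```

The result could then be passed as `object_pairs=pairs`. If direction does not matter for your relations, `itertools.combinations` yields each unordered pair once.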
## Output Format

The pipeline returns a dictionary with the following structure:

```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": float,
        "unary": float,
        "binary": float
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    },
    "visualizations": {  # if include_visualizations=True
        "vine": {
            "all": {"frames": [...], "video_path": "..."},
            ...
        }
    }
}
```
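
As an illustration of how these nested keys can be consumed, the sketch below walks a mock result that follows the layout above and prints the most likely label per object, per frame, and per pair. All values are invented, `best_label` is a hypothetical helper, and reading a pair as "subject relation object" is an assumption about how the pipeline orders `(obj1_id, obj2_id)`.

```python
from collections import defaultdict

# Mock results following the documented layout (values are invented).
results = {
    "categorical_predictions": {
        0: [(0.91, "person"), (0.06, "dog")],
        1: [(0.88, "dog"), (0.09, "person")],
    },
    "unary_predictions": {
        (0, 0): [(0.70, "running"), (0.20, "jumping")],
        (1, 0): [(0.65, "running"), (0.30, "jumping")],
        (0, 1): [(0.80, "jumping"), (0.15, "running")],
    },
    "binary_predictions": {
        (0, (0, 1)): [(0.75, "chasing"), (0.20, "next to")],
    },
}

def best_label(predictions):
    """Return the (probability, label) pair with the highest probability."""
    return max(predictions, key=lambda pair: pair[0])

# Most likely category for each object.
labels = {obj_id: best_label(preds)[1]
          for obj_id, preds in results["categorical_predictions"].items()}

# Group each object's top action by frame.
actions_by_object = defaultdict(dict)
for (frame_id, obj_id), preds in results["unary_predictions"].items():
    actions_by_object[obj_id][frame_id] = best_label(preds)[1]

for obj_id, frames in actions_by_object.items():
    print(labels[obj_id], dict(sorted(frames.items())))

# Top relation for each (frame, pair) key.
for (frame_id, (subj, obj)), preds in results["binary_predictions"].items():
    prob, relation = best_label(preds)
    print(f"frame {frame_id}: {labels[subj]} {relation} {labels[obj]} ({prob:.2f})")
```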
## Configuration Options

```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,                         # GroundingDINO threshold
    text_threshold=0.25,                        # GroundingDINO threshold
    target_fps=5,                               # Video sampling rate
    visualize=True,                             # Enable visualizations
    visualization_dir="outputs/",               # Output directory
    debug_visualizations=False,                 # Debug mode
    device="cuda:0"                             # Device
)
```
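
One way to picture `target_fps` is as a downsampling of the source video. The sketch below lists which source frames would be kept under uniform sampling. This is only an assumption about the sampling strategy, which may differ inside the pipeline, and `sampled_frame_indices` is a hypothetical helper rather than a `vine_hf` function.

```python
def sampled_frame_indices(num_frames, source_fps, target_fps):
    """Indices of source frames kept when downsampling uniformly to target_fps."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    if target_fps >= source_fps:
        return list(range(num_frames))  # nothing to drop
    step = source_fps / target_fps  # source frames per kept frame
    indices = []
    i = 0
    while int(i * step) < num_frames:
        indices.append(int(i * step))
        i += 1
    return indices

# A 2-second clip at 30 fps sampled at 5 fps keeps every 6th frame.
print(sampled_frame_indices(60, 30, 5))
```

A lower `target_fps` means fewer frames to segment and classify, at the cost of coarser temporal resolution for unary and binary predictions.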
## Deployment Examples

### Local Script

```python
# test_vine.py
from transformers import AutoModel
from vine_hf import VinePipeline

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(
    model=model,
    # ... SAM2/GroundingDINO paths and device, as in Quick Start
)
results = pipeline(
    "video.mp4",
    # ... keyword lists, as in Quick Start
)
```

### HuggingFace Spaces

```python
# app.py for Gradio Space
import gradio as gr
from transformers import AutoModel
from vine_hf import VinePipeline

model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
# ... set up pipeline and Gradio interface
```

### API Server

```python
# FastAPI server
from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()

# Load the model and pipeline once at startup, not per request
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(
    model=model,
    # ... SAM2/GroundingDINO paths and device, as in Quick Start
)

@app.post("/process")
async def process_video(video_path: str):
    return pipeline(
        video_path,
        # ... keyword lists, as in Quick Start
    )
```
## Troubleshooting

### Import Errors

```bash
# Make sure all dependencies are installed
# (-i and "sam-?2" because pip may list names such as "SAM-2")
pip list | grep -iE "laser|sam-?2|groundingdino"

# Reinstall if needed
pip install -e ./LASER
pip install -e ./video-sam2
pip install -e ./GroundingDINO
```
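
Because pip distribution names can differ from import names, it can also help to check importability directly from Python. The sketch below uses the standard-library `importlib.util.find_spec`. The module names `laser`, `sam2`, and `groundingdino` are taken from the dependency list in the Architecture section and `vine_hf` from the import statements; treat them as assumptions if your installation differs. `missing_modules` is a hypothetical helper.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]

# Assumed import names; adjust if your packages expose different modules.
required = ["torch", "transformers", "laser", "sam2", "groundingdino", "vine_hf"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```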
### CUDA Errors

```python
# Check CUDA availability
import torch

print(torch.cuda.is_available())
print(torch.version.cuda)

# Use CPU if needed
pipeline = VinePipeline(
    model=model,
    device="cpu",
    # ... SAM2/GroundingDINO paths, as in Quick Start
)
```
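
The manual check above can be folded into a small helper that falls back to CPU automatically. This is a minimal sketch: `pick_device` is a hypothetical name, and it only decides on a device string without touching the pipeline itself.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if it is a CUDA device and CUDA is usable, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is no GPU path
    if preferred.startswith("cuda") and torch.cuda.is_available():
        return preferred
    return "cpu"

device = pick_device()
print(f"Using device: {device}")
```

The returned string can be passed as `device=device` when constructing `VinePipeline`.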
### Checkpoint Not Found

```bash
# Verify checkpoint paths
ls -lh /path/to/sam2_hiera_tiny.pt
ls -lh /path/to/groundingdino_swint_ogc.pth
```
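
The same check can run from Python before the pipeline is constructed, so that a missing file is reported up front. The sketch below uses the file names from the Full Usage Example. The directory is the same placeholder path, and `missing_files` is a hypothetical helper.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths (as strings) that do not point to an existing regular file."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]

checkpoint_dir = Path("/path/to/checkpoints")  # placeholder, replace with your directory
required_files = [
    checkpoint_dir / "sam2_hiera_t.yaml",
    checkpoint_dir / "sam2_hiera_tiny.pt",
    checkpoint_dir / "GroundingDINO_SwinT_OGC.py",
    checkpoint_dir / "groundingdino_swint_ogc.pth",
]

missing = missing_files(required_files)
if missing:
    print("Missing files:")
    for path in missing:
        print(f"  {path}")
else:
    print("All checkpoint files found.")
```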
## System Requirements

- **Python**: 3.10+
- **CUDA**: 11.8+ (for GPU; the manual installation pins CUDA 12.6 wheels)
- **GPU**: 8GB+ VRAM recommended (T4, V100, A100, etc.)
- **RAM**: 16GB+ recommended
- **Storage**: ~3GB for checkpoints
## Citation

```bibtex
@article{laser2024,
  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
  author={Your Authors},
  journal={Your Conference/Journal},
  year={2024}
}
```
## License

This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

## Links

- **Model**: https://huggingface.co/video-fm/vine
- **Code**: https://github.com/kevinxuez/LASER
- **vine_hf Package**: https://github.com/kevinxuez/vine_hf
- **SAM2**: https://github.com/facebookresearch/sam2
- **GroundingDINO**: https://github.com/IDEA-Research/GroundingDINO

## Support

For issues or questions:

- **Model/Architecture**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **LASER Framework**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
- **vine_hf Package**: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)