# VINE: Video Understanding with Natural Language
[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)
VINE is a video understanding model: given a video plus categorical, unary, and binary keywords, it detects objects and returns probability distributions over those keywords for each object's category, its actions, and its relationships to other objects.
## Quick Start
```python
from transformers import AutoModel
from vine_hf import VinePipeline

# Load the VINE model from the HuggingFace Hub
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create the pipeline with your checkpoint paths
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path="/path/to/sam2_config.yaml",
    sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
    gd_config_path="/path/to/grounding_dino_config.py",
    gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
    device="cuda",
    trust_remote_code=True
)

# Process a video
results = vine_pipeline(
    'path/to/video.mp4',
    categorical_keywords=['human', 'dog', 'frisbee'],
    unary_keywords=['running', 'jumping'],
    binary_keywords=['chasing', 'behind'],
    return_top_k=3
)
```
## Installation
### Option 1: Automated Setup (Recommended)
```bash
# Download the setup script
wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh
# Run the setup
bash setup_vine_demo.sh
# Activate environment
conda activate vine_demo
```
### Option 2: Manual Installation
```bash
# 1. Create conda environment
conda create -n vine_demo python=3.10 -y
conda activate vine_demo
# 2. Install PyTorch with CUDA support
pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126
# 3. Install core dependencies
pip install transformers huggingface-hub safetensors
# 4. Clone and install required repositories
git clone https://github.com/video-fm/video-sam2.git
git clone https://github.com/video-fm/GroundingDINO.git
git clone https://github.com/kevinxuez/LASER.git
git clone https://github.com/kevinxuez/vine_hf.git
# Install in editable mode
pip install -e ./video-sam2
pip install -e ./GroundingDINO
pip install -e ./LASER
pip install -e ./vine_hf
# Build GroundingDINO extensions
cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
```
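After installing, a quick import check catches missing pieces early. A minimal sanity script, assuming the module names the troubleshooting section below greps for (`laser`, `sam2`, `groundingdino`):

```python
# sanity_check.py -- verify that the core dependencies import cleanly
import torch
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())

# Module names assumed from the troubleshooting section; adjust if your installs differ
for module in ("transformers", "sam2", "groundingdino", "laser", "vine_hf"):
    try:
        __import__(module)
        print(f"OK: {module}")
    except ImportError as exc:
        print(f"MISSING: {module} ({exc})")
```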
## Required Checkpoints
VINE relies on SAM2 (segmentation) and GroundingDINO (object detection) checkpoints, which are not bundled with the model. Download them separately:
### SAM2 Checkpoint
```bash
wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
# The config version must match the checkpoint (SAM2 here, not SAM2.1)
wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2/sam2_hiera_t.yaml
```
### GroundingDINO Checkpoint
```bash
wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
```
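To keep the paths in the later examples consistent, one option is to collect all four files in a single checkpoints directory (the location is just a convention, not required by VINE):

```bash
mkdir -p /path/to/checkpoints
mv sam2_hiera_tiny.pt sam2_hiera_t.yaml \
   groundingdino_swint_ogc.pth GroundingDINO_SwinT_OGC.py \
   /path/to/checkpoints/
ls -lh /path/to/checkpoints/
```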
## Architecture
```
video-fm/vine (HuggingFace Hub)
├── VINE Model Weights (~1.8GB)
│   ├── Categorical CLIP model (fine-tuned)
│   ├── Unary CLIP model (fine-tuned)
│   └── Binary CLIP model (fine-tuned)
└── Architecture Files
    ├── vine_config.py
    ├── vine_model.py
    ├── vine_pipeline.py
    └── utilities

User Provides:
├── Dependencies (via pip/conda)
│   ├── laser (video processing utilities)
│   ├── sam2 (segmentation)
│   └── groundingdino (object detection)
└── Checkpoints (downloaded separately)
    ├── SAM2 model files
    └── GroundingDINO model files
```
## Why This Architecture?
This separation of concerns provides several benefits:
1. **Lightweight Distribution**: Only VINE-specific weights (~1.8GB) are on HuggingFace
2. **Version Control**: Users can choose their preferred SAM2/GroundingDINO versions
3. **Licensing**: Keeps different model licenses separate
4. **Flexibility**: Easy to swap segmentation backends
5. **Standard Practice**: Similar to models like LLaVA, BLIP-2, etc.
## Full Usage Example
```python
from pathlib import Path
from transformers import AutoModel
from vine_hf import VinePipeline

# Set up checkpoint paths
checkpoint_dir = Path("/path/to/checkpoints")
sam_config = checkpoint_dir / "sam2_hiera_t.yaml"
sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"

# Load VINE from HuggingFace
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

# Create pipeline
vine_pipeline = VinePipeline(
    model=model,
    tokenizer=None,
    sam_config_path=str(sam_config),
    sam_checkpoint_path=str(sam_checkpoint),
    gd_config_path=str(gd_config),
    gd_checkpoint_path=str(gd_checkpoint),
    device="cuda:0",
    trust_remote_code=True
)

# Process video
results = vine_pipeline(
    "path/to/video.mp4",
    categorical_keywords=['person', 'dog', 'ball'],
    unary_keywords=['running', 'jumping', 'sitting'],
    binary_keywords=['chasing', 'next to', 'holding'],
    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
    return_top_k=5,
    include_visualizations=True
)

# Access summary results
print(f"Detected {results['summary']['num_objects_detected']} objects")
print(f"Top categories: {results['summary']['top_categories']}")
print(f"Top actions: {results['summary']['top_actions']}")
print(f"Top relations: {results['summary']['top_relations']}")

# Access detailed per-object category predictions
for obj_id, predictions in results['categorical_predictions'].items():
    print(f"\nObject {obj_id}:")
    for prob, category in predictions:
        print(f"  {category}: {prob:.3f}")
```
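The per-frame unary and binary predictions can be walked the same way. A short sketch, assuming the key structure documented under Output Format below and that each prediction list is sorted best-first:

```python
# Per-frame action predictions, keyed by (frame_id, object_id)
for (frame_id, obj_id), preds in results['unary_predictions'].items():
    prob, action = preds[0]  # assumed sorted best-first
    print(f"frame {frame_id}, object {obj_id}: {action} ({prob:.3f})")

# Per-frame relation predictions, keyed by (frame_id, (obj1_id, obj2_id))
for (frame_id, pair), preds in results['binary_predictions'].items():
    prob, relation = preds[0]
    print(f"frame {frame_id}, objects {pair}: {relation} ({prob:.3f})")
```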
## Output Format
```python
{
    "categorical_predictions": {
        object_id: [(probability, category), ...]
    },
    "unary_predictions": {
        (frame_id, object_id): [(probability, action), ...]
    },
    "binary_predictions": {
        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
    },
    "confidence_scores": {
        "categorical": float,
        "unary": float,
        "binary": float
    },
    "summary": {
        "num_objects_detected": int,
        "top_categories": [(category, probability), ...],
        "top_actions": [(action, probability), ...],
        "top_relations": [(relation, probability), ...]
    },
    "visualizations": {  # present only if include_visualizations=True
        "vine": {
            "all": {"frames": [...], "video_path": "..."},
            ...
        }
    }
}
```
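Note that the tuple keys in `unary_predictions` and `binary_predictions` are not valid JSON keys, so the results dict cannot be passed to `json.dumps` directly. One way to serialize it, sketched under the structure above:

```python
import json

def stringify_keys(d):
    # Tuple keys such as (frame_id, object_id) must become strings for JSON
    return {str(k): v for k, v in d.items()}

payload = {
    "categorical": stringify_keys(results["categorical_predictions"]),
    "unary": stringify_keys(results["unary_predictions"]),
    "binary": stringify_keys(results["binary_predictions"]),
    "summary": results["summary"],
}
print(json.dumps(payload, indent=2, default=str))
```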
## Configuration Options
```python
from vine_hf import VineConfig

config = VineConfig(
    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
    segmentation_method="grounding_dino_sam2",  # or "sam2"
    box_threshold=0.35,                         # GroundingDINO box threshold
    text_threshold=0.25,                        # GroundingDINO text threshold
    target_fps=5,                               # Video sampling rate
    visualize=True,                             # Enable visualizations
    visualization_dir="outputs/",               # Output directory
    debug_visualizations=False,                 # Debug mode
    device="cuda:0"                             # Device
)
```
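Assuming `VineConfig` follows the standard `transformers.PretrainedConfig` interface (suggested by the `trust_remote_code` loading path, though not confirmed here), it round-trips like any other config:

```python
# Hypothetical round-trip, assuming PretrainedConfig semantics
config.save_pretrained("my_vine_config/")
reloaded = VineConfig.from_pretrained("my_vine_config/")
print(reloaded.target_fps)  # 5
```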
## Deployment Examples
### Local Script
```python
# test_vine.py
from transformers import AutoModel
from vine_hf import VinePipeline
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)
results = pipeline("video.mp4", ...)
```
### HuggingFace Spaces
```python
# app.py for Gradio Space
import gradio as gr
from transformers import AutoModel
from vine_hf import VinePipeline
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
# ... set up pipeline and Gradio interface
```
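A minimal sketch of the Gradio wiring, assuming `vine_pipeline` is constructed as in the Quick Start; the comma-separated keyword textboxes and labels here are illustrative choices, not part of `vine_hf`:

```python
def analyze(video_path, cats, unaries, binaries):
    # gr.Video passes the uploaded file's path as a string by default
    results = vine_pipeline(
        video_path,
        categorical_keywords=[s.strip() for s in cats.split(',')],
        unary_keywords=[s.strip() for s in unaries.split(',')],
        binary_keywords=[s.strip() for s in binaries.split(',')],
        return_top_k=3,
    )
    return results['summary']

demo = gr.Interface(
    fn=analyze,
    inputs=[gr.Video(), gr.Textbox(label="Categories"),
            gr.Textbox(label="Actions"), gr.Textbox(label="Relations")],
    outputs=gr.JSON(),
)
demo.launch()
```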
### API Server
```python
# FastAPI server
from fastapi import FastAPI
from transformers import AutoModel
from vine_hf import VinePipeline

app = FastAPI()
model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
pipeline = VinePipeline(model=model, ...)

@app.post("/process")
async def process_video(video_path: str):
    return pipeline(video_path, ...)
```
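With FastAPI, a bare `video_path: str` parameter on a POST route is read from the query string, so a test call looks like the following (note that the tuple-keyed prediction dicts need the string-key conversion shown under Output Format before they can be returned as JSON):

```bash
# Assuming the snippet above is saved as server.py
uvicorn server:app --host 0.0.0.0 --port 8000

# In another shell: video_path is passed as a query parameter
curl -X POST "http://localhost:8000/process?video_path=/data/video.mp4"
```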
## Troubleshooting
### Import Errors
```bash
# Make sure all dependencies are installed
pip list | grep -E "laser|sam2|groundingdino"
# Reinstall if needed
pip install -e ./LASER
pip install -e ./video-sam2
pip install -e ./GroundingDINO
```
### CUDA Errors
```python
# Check CUDA availability
import torch
print(torch.cuda.is_available())
print(torch.version.cuda)
# Use CPU if needed
pipeline = VinePipeline(model=model, device="cpu", ...)
```
### Checkpoint Not Found
```bash
# Verify checkpoint paths
ls -lh /path/to/sam2_hiera_tiny.pt
ls -lh /path/to/groundingdino_swint_ogc.pth
```
## System Requirements
- **Python**: 3.10+
- **CUDA**: 12.x for GPU use (the install commands above use cu126 PyTorch wheels)
- **GPU**: 8GB+ VRAM recommended (T4, V100, A100, etc.)
- **RAM**: 16GB+ recommended
- **Storage**: ~3GB for checkpoints
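A quick check that the local GPU meets the VRAM guideline:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1e9:.1f} GB VRAM")
else:
    print("No CUDA device found; fall back to device='cpu' (slower)")
```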
## Citation
```bibtex
@article{laser2024,
title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
author={Your Authors},
journal={Your Conference/Journal},
year={2024}
}
```
## License
This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.
## Links
- **Model**: https://huggingface.co/video-fm/vine
- **Code**: https://github.com/kevinxuez/LASER
- **vine_hf Package**: https://github.com/kevinxuez/vine_hf
- **SAM2**: https://github.com/facebookresearch/sam2
- **GroundingDINO**: https://github.com/IDEA-Research/GroundingDINO
## Support
For issues or questions:
- **Model/Architecture**: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
- **LASER Framework**: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
- **vine_hf Package**: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)