	# VINE: Video Understanding with Natural Language

	[![HuggingFace](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-video--fm%2Fvine-blue)](https://huggingface.co/video-fm/vine)
	[![GitHub](https://img.shields.io/badge/GitHub-LASER-green)](https://github.com/kevinxuez/LASER)

	VINE is a video understanding model. Given a video and three sets of keywords, categorical (object classes), unary (per-object actions), and binary (pairwise relations), it returns probability distributions over those keywords for the detected objects and for the relationships between them.

	## Quick Start

	```python
	from transformers import AutoModel
	from vine_hf import VineConfig, VineModel, VinePipeline

	# Load VINE model from HuggingFace
	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

	# Create pipeline with your checkpoint paths
	vine_pipeline = VinePipeline(
	    model=model,
	    tokenizer=None,
	    sam_config_path="/path/to/sam2_config.yaml",
	    sam_checkpoint_path="/path/to/sam2_checkpoint.pt",
	    gd_config_path="/path/to/grounding_dino_config.py",
	    gd_checkpoint_path="/path/to/grounding_dino_checkpoint.pth",
	    device="cuda",
	    trust_remote_code=True
	)

	# Process a video
	results = vine_pipeline(
	    'path/to/video.mp4',
	    categorical_keywords=['human', 'dog', 'frisbee'],
	    unary_keywords=['running', 'jumping'],
	    binary_keywords=['chasing', 'behind'],
	    return_top_k=3
	)
	```

	## Installation

	### Option 1: Automated Setup (Recommended)

	```bash
	# Download the setup script
	wget https://raw.githubusercontent.com/kevinxuez/vine_hf/main/setup_vine_demo.sh

	# Run the setup
	bash setup_vine_demo.sh

	# Activate environment
	conda activate vine_demo
	```

	### Option 2: Manual Installation

	```bash
	# 1. Create conda environment
	conda create -n vine_demo python=3.10 -y
	conda activate vine_demo

	# 2. Install PyTorch with CUDA support
	pip install torch==2.7.1 torchvision==0.22.1 --index-url https://download.pytorch.org/whl/cu126

	# 3. Install core dependencies
	pip install transformers huggingface-hub safetensors

	# 4. Clone and install required repositories
	git clone https://github.com/video-fm/video-sam2.git
	git clone https://github.com/video-fm/GroundingDINO.git
	git clone https://github.com/kevinxuez/LASER.git
	git clone https://github.com/kevinxuez/vine_hf.git

	# Install in editable mode
	pip install -e ./video-sam2
	pip install -e ./GroundingDINO
	pip install -e ./LASER
	pip install -e ./vine_hf

	# Build GroundingDINO extensions
	cd GroundingDINO && python setup.py build_ext --force --inplace && cd ..
	```

	## Required Checkpoints

	VINE relies on SAM2 (segmentation) and GroundingDINO (object detection) checkpoints, which are not bundled with the VINE weights. Download them separately:

	### SAM2 Checkpoint
	```bash
	wget https://dl.fbaipublicfiles.com/segment_anything_2/072824/sam2_hiera_tiny.pt
	# Config matching the SAM 2 tiny checkpoint above (saved as sam2_hiera_t.yaml)
	wget https://raw.githubusercontent.com/facebookresearch/sam2/main/sam2/configs/sam2/sam2_hiera_t.yaml
	```

	### GroundingDINO Checkpoint
	```bash
	wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
	wget https://raw.githubusercontent.com/IDEA-Research/GroundingDINO/main/groundingdino/config/GroundingDINO_SwinT_OGC.py
	```

	## Architecture

	```
	video-fm/vine (HuggingFace Hub)
	├── VINE Model Weights (~1.8GB)
	│   ├── Categorical CLIP model (fine-tuned)
	│   ├── Unary CLIP model (fine-tuned)
	│   └── Binary CLIP model (fine-tuned)
	└── Architecture Files
	    ├── vine_config.py
	    ├── vine_model.py
	    ├── vine_pipeline.py
	    └── utilities

	User Provides:
	├── Dependencies (via pip/conda)
	│   ├── laser (video processing utilities)
	│   ├── sam2 (segmentation)
	│   └── groundingdino (object detection)
	└── Checkpoints (downloaded separately)
	    ├── SAM2 model files
	    └── GroundingDINO model files
	```

	## Why This Architecture?

	This separation of concerns provides several benefits:

	1. Lightweight Distribution: Only VINE-specific weights (~1.8GB) are on HuggingFace
	2. Version Control: Users can choose their preferred SAM2/GroundingDINO versions
	3. Licensing: Keeps different model licenses separate
	4. Flexibility: Easy to swap segmentation backends
	5. Standard Practice: Similar to models like LLaVA, BLIP-2, etc.

	## Full Usage Example

	```python
	import os
	from pathlib import Path
	from transformers import AutoModel
	from vine_hf import VinePipeline

	# Set up paths
	checkpoint_dir = Path("/path/to/checkpoints")
	sam_config = checkpoint_dir / "sam2_hiera_t.yaml"
	sam_checkpoint = checkpoint_dir / "sam2_hiera_tiny.pt"
	gd_config = checkpoint_dir / "GroundingDINO_SwinT_OGC.py"
	gd_checkpoint = checkpoint_dir / "groundingdino_swint_ogc.pth"

	# Load VINE from HuggingFace
	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)

	# Create pipeline
	vine_pipeline = VinePipeline(
	    model=model,
	    tokenizer=None,
	    sam_config_path=str(sam_config),
	    sam_checkpoint_path=str(sam_checkpoint),
	    gd_config_path=str(gd_config),
	    gd_checkpoint_path=str(gd_checkpoint),
	    device="cuda:0",
	    trust_remote_code=True
	)

	# Process video
	results = vine_pipeline(
	    "path/to/video.mp4",
	    categorical_keywords=['person', 'dog', 'ball'],
	    unary_keywords=['running', 'jumping', 'sitting'],
	    binary_keywords=['chasing', 'next to', 'holding'],
	    object_pairs=[(0, 1), (0, 2)],  # person-dog, person-ball
	    return_top_k=5,
	    include_visualizations=True
	)

	# Access results
	print(f"Detected {results['summary']['num_objects_detected']} objects")
	print(f"Top categories: {results['summary']['top_categories']}")
	print(f"Top actions: {results['summary']['top_actions']}")
	print(f"Top relations: {results['summary']['top_relations']}")

	# Access detailed predictions
	for obj_id, predictions in results['categorical_predictions'].items():
	    print(f"\nObject {obj_id}:")
	    for prob, category in predictions:
	        print(f"  {category}: {prob:.3f}")
	```
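
The `object_pairs` argument in the example above is written out by hand. When the number of detected objects is not known in advance, the pairs can be generated instead. The following is a minimal sketch, assuming `object_pairs` takes `(subject_index, object_index)` tuples as in the example and that order matters for directed relations such as "chasing". `all_ordered_pairs` is a hypothetical helper and not part of `vine_hf`.

```python
from itertools import permutations


def all_ordered_pairs(num_objects: int) -> list[tuple[int, int]]:
    """Return every ordered (subject, object) index pair with subject != object."""
    return list(permutations(range(num_objects), 2))


# Three objects, e.g. person=0, dog=1, ball=2 (indices are illustrative).
pairs = all_ordered_pairs(3)
print(pairs)
```

The resulting list could then be passed as `object_pairs=pairs`. If only undirected relations such as "next to" are of interest, `itertools.combinations` would halve the number of pairs.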

	## Output Format

	```python
	{
	    "categorical_predictions": {
	        object_id: [(probability, category), ...]
	    },
	    "unary_predictions": {
	        (frame_id, object_id): [(probability, action), ...]
	    },
	    "binary_predictions": {
	        (frame_id, (obj1_id, obj2_id)): [(probability, relation), ...]
	    },
	    "confidence_scores": {
	        "categorical": float,
	        "unary": float,
	        "binary": float
	    },
	    "summary": {
	        "num_objects_detected": int,
	        "top_categories": [(category, probability), ...],
	        "top_actions": [(action, probability), ...],
	        "top_relations": [(relation, probability), ...]
	    },
	    "visualizations": {  # if include_visualizations=True
	        "vine": {
	            "all": {"frames": [...], "video_path": "..."},
	            ...
	        }
	    }
	}
	```
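
To make the schema above more concrete, here is a small sketch of post-processing a result. The `results` dict is hand-written sample data that follows the documented layout (the values are made up), and it assumes probabilities arrive as plain Python floats. `top_category` and `actions_by_object` are hypothetical helpers, not part of `vine_hf`.

```python
# Hypothetical results dict following the documented schema (values are made up).
results = {
    "categorical_predictions": {
        0: [(0.91, "person"), (0.06, "dog"), (0.03, "ball")],
        1: [(0.12, "person"), (0.80, "dog"), (0.08, "ball")],
    },
    "unary_predictions": {
        (0, 0): [(0.70, "running"), (0.30, "sitting")],
        (1, 0): [(0.40, "running"), (0.60, "sitting")],
        (0, 1): [(0.85, "jumping"), (0.15, "running")],
    },
}


def top_category(predictions):
    """Pick the most probable label for each object id."""
    # Tuples are (probability, label), so max() compares by probability first.
    return {obj_id: max(preds)[1] for obj_id, preds in predictions.items()}


def actions_by_object(unary_predictions):
    """Group the top action per frame by object id, ordered by frame id."""
    grouped = {}
    for (frame_id, obj_id), preds in sorted(unary_predictions.items()):
        prob, action = max(preds)
        grouped.setdefault(obj_id, []).append((frame_id, action, prob))
    return grouped


labels = top_category(results["categorical_predictions"])
timeline = actions_by_object(results["unary_predictions"])
print(labels)
print(timeline)
```

Note that the per-object predictions are `(probability, label)` tuples, while the entries under `"summary"` are `(label, probability)`, so code that reads both needs to unpack them in the right order.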

	## Configuration Options

	```python
	from vine_hf import VineConfig

	config = VineConfig(
	    model_name="openai/clip-vit-base-patch32",  # CLIP backbone
	    segmentation_method="grounding_dino_sam2",  # or "sam2"
	    box_threshold=0.35,                         # GroundingDINO threshold
	    text_threshold=0.25,                        # GroundingDINO threshold
	    target_fps=5,                               # Video sampling rate
	    visualize=True,                             # Enable visualizations
	    visualization_dir="outputs/",               # Output directory
	    debug_visualizations=False,                 # Debug mode
	    device="cuda:0"                             # Device
	)
	```
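
`target_fps` sets how densely the video is sampled, which directly affects how many frames go through segmentation. How `vine_hf` selects frames internally is not described here, so the sketch below only illustrates the arithmetic under the assumption of uniform stride sampling. `sampled_frame_indices` is a hypothetical helper.

```python
def sampled_frame_indices(num_frames: int, source_fps: float, target_fps: float) -> list[int]:
    """Indices kept when downsampling from source_fps to target_fps with a uniform stride."""
    if source_fps <= 0 or target_fps <= 0:
        raise ValueError("frame rates must be positive")
    # Never sample more densely than the source provides.
    stride = max(1, round(source_fps / target_fps))
    return list(range(0, num_frames, stride))


# A 2-second clip at 30 fps sampled at target_fps=5 keeps every 6th frame.
indices = sampled_frame_indices(num_frames=60, source_fps=30, target_fps=5)
print(indices)
print(len(indices))
```

Under this assumption, raising `target_fps` from 5 to 10 would roughly double the number of processed frames for the same clip.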

	## Deployment Examples

	### Local Script
	```python
	# test_vine.py
	from transformers import AutoModel
	from vine_hf import VinePipeline

	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
	pipeline = VinePipeline(model=model, ...)
	results = pipeline("video.mp4", ...)
	```

	### HuggingFace Spaces
	```python
	# app.py for Gradio Space
	import gradio as gr
	from transformers import AutoModel
	from vine_hf import VinePipeline

	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
	# ... set up pipeline and Gradio interface
	```

	### API Server
	```python
	# FastAPI server
	from fastapi import FastAPI
	from transformers import AutoModel
	from vine_hf import VinePipeline

	app = FastAPI()
	model = AutoModel.from_pretrained('video-fm/vine', trust_remote_code=True)
	pipeline = VinePipeline(model=model, ...)

	@app.post("/process")
	async def process_video(video_path: str):
	    return pipeline(video_path, ...)
	```

	## Troubleshooting

	### Import Errors
	```bash
	# Make sure all dependencies are installed
	pip list | grep -iE "laser|sam2|groundingdino"

	# Reinstall if needed
	pip install -e ./LASER
	pip install -e ./video-sam2
	pip install -e ./GroundingDINO
	```

	### CUDA Errors
	```python
	# Check CUDA availability
	import torch
	print(torch.cuda.is_available())
	print(torch.version.cuda)

	# Use CPU if needed
	pipeline = VinePipeline(model=model, device="cpu", ...)
	```
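
Rather than hard-coding the device, the check and the fallback can be combined. This is a minimal sketch, assuming the pipeline accepts the same device strings used elsewhere in this README (`"cuda:0"`, `"cpu"`). `pick_device` is a hypothetical helper, not part of `vine_hf`.

```python
def pick_device(preferred: str = "cuda:0") -> str:
    """Return `preferred` if CUDA is usable, otherwise fall back to "cpu"."""
    try:
        import torch
    except ImportError:
        # torch is missing entirely; report CPU so the caller can fail later with a clear error.
        return "cpu"
    if preferred.startswith("cuda") and not torch.cuda.is_available():
        return "cpu"
    return preferred


device = pick_device()
print(f"Using device: {device}")
```

The returned string can then be passed as the `device` argument when constructing the pipeline. Expect CPU inference to be considerably slower than GPU inference.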

	### Checkpoint Not Found
	```bash
	# Verify checkpoint paths
	ls -lh /path/to/sam2_hiera_tiny.pt
	ls -lh /path/to/groundingdino_swint_ogc.pth
	```
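
The same check can be done from Python before constructing the pipeline, which gives a clearer message than a failure deep inside model loading. This is a sketch only: the paths are placeholders, and `missing_files` is a hypothetical helper, not part of `vine_hf`.

```python
from pathlib import Path


def missing_files(paths: dict[str, str]) -> dict[str, str]:
    """Return the subset of named paths that do not point to an existing file."""
    return {name: p for name, p in paths.items() if not Path(p).is_file()}


# Placeholder paths; replace with the actual checkpoint locations.
required = {
    "sam_config_path": "/path/to/sam2_hiera_t.yaml",
    "sam_checkpoint_path": "/path/to/sam2_hiera_tiny.pt",
    "gd_config_path": "/path/to/GroundingDINO_SwinT_OGC.py",
    "gd_checkpoint_path": "/path/to/groundingdino_swint_ogc.pth",
}

missing = missing_files(required)
for name, path in missing.items():
    print(f"{name}: file not found at {path}")
```

The keys mirror the keyword arguments of `VinePipeline` shown earlier, so a non-empty `missing` dict points directly at the argument that needs fixing.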

	## System Requirements

	- Python: 3.10+
	- CUDA: 11.8+ (for GPU)
	- GPU: 8GB+ VRAM recommended (T4, V100, A100, etc.)
	- RAM: 16GB+ recommended
	- Storage: ~3GB for checkpoints

	## Citation

	```bibtex
	@article{laser2024,
	  title={LASER: Language-guided Object Grounding and Relation Understanding in Videos},
	  author={Your Authors},
	  journal={Your Conference/Journal},
	  year={2024}
	}
	```

	## License

	This model and code are released under the MIT License. Note that SAM2 and GroundingDINO have their own respective licenses.

	## Links

	- Model: https://huggingface.co/video-fm/vine
	- Code: https://github.com/kevinxuez/LASER
	- vine_hf Package: https://github.com/kevinxuez/vine_hf
	- SAM2: https://github.com/facebookresearch/sam2
	- GroundingDINO: https://github.com/IDEA-Research/GroundingDINO

	## Support

	For issues or questions:
	- Model/Architecture: [HuggingFace Discussions](https://huggingface.co/video-fm/vine/discussions)
	- LASER Framework: [GitHub Issues](https://github.com/kevinxuez/LASER/issues)
	- vine_hf Package: [GitHub Issues](https://github.com/kevinxuez/vine_hf/issues)