FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation
A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing
Paper | GitHub | Project Page
Overview
We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.
Model Architecture
The model consists of three main components:
- Text Encoder: UMT5-XXL encoder for text feature extraction
- Latent Diffusion Model: Transformer-based diffusion model operating in latent space
- VAE Decoder: 1D convolutional VAE for decoding latent features to motion sequences
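To sanity-check which components are instantiated, you can list the loaded model's top-level submodules. This is a minimal sketch assuming the object returned by AutoModel.from_pretrained (see Quick Start below) is a standard torch.nn.Module; the exact child names are implementation details of the repository code.
from transformers import AutoModel

model = AutoModel.from_pretrained("ShandaAI/FloodDiffusion", trust_remote_code=True)

# Assuming the custom model class is a torch.nn.Module, this prints its
# top-level submodules (text encoder, diffusion transformer, VAE decoder, ...)
for name, module in model.named_children():
    print(name, type(module).__name__)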
Technical Specifications:
- Input: Natural language text
- Output: Motion sequences in two formats:
- 263-dimensional HumanML3D features (default)
- 22×3 joint coordinates (optional)
- Latent dimension: 4
- Upsampling factor: 4× (VAE decoder)
- Frame rate: 20 FPS
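The numbers above give a simple conversion between latent tokens, output frames, and clip duration. The helper below is a sketch based only on the listed specifications (4× VAE upsampling, 20 FPS); it is not part of the model's API.
FPS = 20        # output frame rate from the specs above
UPSAMPLE = 4    # VAE decoder upsampling factor

def tokens_to_seconds(num_latent_tokens: int) -> float:
    """Approximate clip duration produced by a given number of latent tokens."""
    frames = num_latent_tokens * UPSAMPLE
    return frames / FPS

print(tokens_to_seconds(60))  # 60 tokens -> 240 frames -> 12.0 seconds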
Installation
Prerequisites
- Python 3.8+
- CUDA-capable GPU with 16GB+ VRAM (recommended)
- 16GB+ system RAM
Dependencies
Step 1: Install basic dependencies
pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
Step 2: Install Flash Attention (Required)
Flash attention requires CUDA and may need to compile from source:
pip install flash-attn --no-build-isolation
Note: Flash attention is required for this model. If installation fails, please refer to the official flash-attention installation guide.
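A quick way to confirm the installation succeeded before loading the model (this assumes the flash-attn package exposes a __version__ attribute, which recent releases do):
import flash_attn
print(flash_attn.__version__)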
Quick Start
Basic Usage
from transformers import AutoModel
# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True,
)
# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}") # (~240, 263)
# Generate motion as joint coordinates (22 joints × 3 coords)
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}") # (~240, 22, 3)
Batch Generation
# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly",
    "a person jumping up and down",
]
lengths = [60, 50, 40]  # Different lengths for each motion
motions = model(texts, length=lengths)
for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")
Multi-Text Motion Transitions
# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)
# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")
API Reference
model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False)
Generate motion sequences from text descriptions.
Parameters:
- text (str, List[str], or List[List[str]]): Text description(s)
  - Single string: generate one motion
  - List of strings: batch generation
  - Nested list: multiple text prompts per motion (for transitions)
- length (int or List[int], default=60): Number of latent tokens to generate
  - Output frames ≈ length × 4 (due to VAE upsampling)
  - Example: length=60 → 240 frames (12 seconds at 20 FPS)
- text_end (List[int] or List[List[int]], optional): Latent token positions for text transitions
  - Only used when text is a nested list
  - Specifies when to switch between different text descriptions
  - IMPORTANT: must have the same length as the corresponding text list
  - Must be in ascending order
  - Example: text=[["walk", "turn", "sit"]] requires text_end=[[20, 40, 60]] (3 endpoints for 3 texts)
- num_denoise_steps (int, optional): Number of denoising iterations
  - Higher values produce better quality but slower generation
  - Recommended range: 10-50
- output_joints (bool, default=False): Output format selector
  - False: returns 263-dimensional HumanML3D features
  - True: returns 22×3 joint coordinates for direct visualization
Returns:
- Single motion:
  - output_joints=False: numpy.ndarray of shape (frames, 263)
  - output_joints=True: numpy.ndarray of shape (frames, 22, 3)
- Batch: List[numpy.ndarray] with shapes as above
Example:
# Single generation (263-dim features)
motion = model("walk forward", length=60) # Returns (240, 263)
# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True) # Returns (240, 22, 3)
# Batch generation
motions = model(["walk", "run"], length=[60, 50]) # Returns list of 2 arrays
# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
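num_denoise_steps trades speed for quality; the calls below simply illustrate values within the documented 10-50 range (the default used when it is omitted is model-defined):
# Faster draft vs. higher-quality generation
draft = model("a person dancing", length=60, num_denoise_steps=10)
best = model("a person dancing", length=60, num_denoise_steps=50)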
Citation
If you use this model in your research, please cite:
@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}
Troubleshooting
Common Issues
ImportError with trust_remote_code:
# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)
Out of Memory:
# Solution: Generate shorter sequences
motion = model("walk", length=30) # Shorter = less memory
Slow first load: The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads use the local cache and are much faster.
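To avoid the wait at first use, the weights can be fetched ahead of time with huggingface_hub; later AutoModel.from_pretrained calls then reuse the local cache:
from huggingface_hub import snapshot_download

# Pre-download all model files (~14GB) into the local Hugging Face cache
snapshot_download(repo_id="ShandaAI/FloodDiffusion")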
Module import errors: Ensure all dependencies are installed:
pip install lightning diffusers omegaconf ftfy numpy