FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing

Paper | GitHub | Project Page

Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

Model Architecture

The model consists of three main components:

  1. Text Encoder: UMT5-XXL encoder for text feature extraction
  2. Latent Diffusion Model: Transformer-based diffusion model operating in latent space
  3. VAE Decoder: 1D convolutional VAE for decoding latent features to motion sequences

Technical Specifications:

  • Input: Natural language text
  • Output: Motion sequences in two formats:
    • 263-dimensional HumanML3D features (default)
    • 22×3 joint coordinates (optional)
  • Latent dimension: 4
  • Upsampling factor: 4× (VAE decoder)
  • Frame rate: 20 FPS
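
To make the data flow concrete, the sketch below traces tensor shapes through the three stages using the specifications above. It is a conceptual illustration with placeholder variables, not the model's internal API.

# Conceptual shape trace only; the stage names in the comment are descriptive, not real API calls.
# text prompt -> UMT5-XXL text features -> latent diffusion -> latents -> VAE decoder -> motion
length = 60                    # latent tokens requested from the diffusion model
latent_shape = (length, 4)     # latent dimension is 4
frames = length * 4            # VAE decoder upsamples 4x along time
output_shape = (frames, 263)   # HumanML3D feature format at 20 FPS
print(latent_shape, output_shape, frames / 20)   # (60, 4) (240, 263) 12.0 seconds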

Installation

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU with 16GB+ VRAM (recommended)
  • 16GB+ system RAM

Dependencies

Step 1: Install basic dependencies

pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
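
To confirm the environment is set up before moving on, a quick import check such as the following can help (it assumes a CUDA-capable PyTorch build):

# Sanity check: all core dependencies import and a CUDA device is visible
import torch
import transformers, huggingface_hub, lightning, diffusers, omegaconf, ftfy, numpy
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())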

Step 2: Install Flash Attention (Required)

Flash Attention requires CUDA and may need to compile from source. The standard installation is:

pip install flash-attn --no-build-isolation

Note: Flash attention is required for this model. If installation fails, please refer to the official flash-attention installation guide.
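
After installation, you can verify that the compiled extension loads. The version attribute below is standard for the flash-attn package, but treat this as an optional check:

# Verify flash-attn is installed and its extension imports cleanly
import flash_attn
print("flash-attn version:", flash_attn.__version__)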

Quick Start

Basic Usage

from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords)
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)

Batch Generation

# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly", 
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")

Multi-Text Motion Transitions

# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")

API Reference

model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False)

Generate motion sequences from text descriptions.

Parameters:

  • text (str, List[str], or List[List[str]]): Text description(s)

    • Single string: Generate one motion
    • List of strings: Batch generation
    • Nested list: Multiple text prompts per motion (for transitions)
  • length (int or List[int], default=60): Number of latent tokens to generate

    • Output frames ≈ length × 4 (due to VAE upsampling)
    • Example: length=60 → 240 frames (12 seconds at 20 FPS)
  • text_end (List[int] or List[List[int]], optional): Latent token positions for text transitions

    • Only used when text is a nested list
    • Specifies when to switch between different text descriptions
    • IMPORTANT: Must have the same length as the corresponding text list
      • Example: text=[["walk", "turn", "sit"]] requires text_end=[[20, 40, 60]] (3 endpoints for 3 texts)
    • Must be in ascending order
  • num_denoise_steps (int, optional): Number of denoising iterations

    • Higher values produce better quality but slower generation
    • Recommended range: 10-50
  • output_joints (bool, default=False): Output format selector

    • False: Returns 263-dimensional HumanML3D features
    • True: Returns 22×3 joint coordinates for direct visualization

Returns:

  • Single motion:
    • output_joints=False: numpy.ndarray of shape (frames, 263)
    • output_joints=True: numpy.ndarray of shape (frames, 22, 3)
  • Batch: List[numpy.ndarray] with shapes as above

Example:

# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns (240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns (240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
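
The num_denoise_steps parameter is not exercised above; a simple quality-versus-latency comparison might look like the following sketch, with step counts taken from the recommended 10-50 range:

import time

# Trade generation quality against latency by varying the denoising step count
for steps in (10, 30, 50):
    start = time.time()
    motion = model("a person dancing", length=60, num_denoise_steps=steps)
    print(f"{steps} steps: {motion.shape} in {time.time() - start:.1f}s")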

Citation

If you use this model in your research, please cite:

@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}

Troubleshooting

Common Issues

ImportError with trust_remote_code:

# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)

Out of Memory:

# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory

Slow first load: The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads use locally cached files and are much faster.
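
To avoid the download blocking the first inference call, the weights can be fetched ahead of time with huggingface_hub (already installed in Step 1):

from huggingface_hub import snapshot_download

# Pre-download the ~14GB of model files into the local Hugging Face cache
snapshot_download("ShandaAI/FloodDiffusion")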

Module import errors: Ensure all dependencies are installed:

pip install lightning diffusers omegaconf ftfy numpy