FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation

A state-of-the-art text-to-motion generation model based on Latent Diffusion Forcing

Paper | GitHub | Project Page

Overview

We present FloodDiffusion, a new framework for text-driven, streaming human motion generation. Given time-varying text prompts, FloodDiffusion generates text-aligned, seamless motion sequences with real-time latency.

Model Architecture

The model consists of three main components:

  1. Text Encoder: UMT5-XXL encoder for text feature extraction
  2. Latent Diffusion Model: Transformer-based diffusion model operating in latent space
  3. VAE Decoder: 1D convolutional VAE for decoding latent features to motion sequences

Technical Specifications:

  • Input: Natural language text
  • Output: Motion sequences in two formats:
    • 263-dimensional HumanML3D features (default)
    • 22×3 joint coordinates (optional)
  • Latent dimension: 4
  • Upsampling factor: 4× (VAE decoder)
  • Frame rate: 20 FPS
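
To make the data flow concrete, the sketch below traces tensor shapes through the three stages using the specifications above. It is a conceptual illustration with placeholder variables, not the model's internal API.

# Conceptual shape trace only; the stage names in the comment are descriptive, not real API calls.
# text prompt -> UMT5-XXL text features -> latent diffusion -> latents -> VAE decoder -> motion
length = 60                    # latent tokens requested from the diffusion model
latent_shape = (length, 4)     # latent dimension is 4
frames = length * 4            # VAE decoder upsamples 4x along time
output_shape = (frames, 263)   # HumanML3D feature format at 20 FPS
print(latent_shape, output_shape, frames / 20)   # (60, 4) (240, 263) 12.0 seconds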

Installation

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU with 16GB+ VRAM (recommended)
  • 16GB+ system RAM

Dependencies

Step 1: Install basic dependencies

pip install torch transformers huggingface_hub
pip install lightning diffusers omegaconf ftfy numpy
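
To confirm the environment is set up before moving on, a quick import check such as the following can help (it assumes a CUDA-capable PyTorch build):

# Sanity check: all core dependencies import and a CUDA device is visible
import torch
import transformers, huggingface_hub, lightning, diffusers, omegaconf, ftfy, numpy
print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())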

Step 2: Install Flash Attention (Required)

Flash Attention requires CUDA and may need to compile from source. The standard installation is:

pip install flash-attn --no-build-isolation

Note: Flash attention is required for this model. If installation fails, please refer to the official flash-attention installation guide.
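
After installation, you can verify that the compiled extension loads. The version attribute below is standard for the flash-attn package, but treat this as an optional check:

# Verify flash-attn is installed and its extension imports cleanly
import flash_attn
print("flash-attn version:", flash_attn.__version__)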

Quick Start

Basic Usage

from transformers import AutoModel

# Load model
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True
)

# Generate motion from text (263-dim HumanML3D features)
motion = model("a person walking forward", length=60)
print(f"Generated motion: {motion.shape}")  # (~240, 263)

# Generate motion as joint coordinates (22 joints × 3 coords)
motion_joints = model("a person walking forward", length=60, output_joints=True)
print(f"Generated joints: {motion_joints.shape}")  # (~240, 22, 3)

Batch Generation

# Generate multiple motions efficiently
texts = [
    "a person walking forward",
    "a person running quickly", 
    "a person jumping up and down"
]
lengths = [60, 50, 40]  # Different lengths for each motion

motions = model(texts, length=lengths)

for i, motion in enumerate(motions):
    print(f"Motion {i}: {motion.shape}")

Multi-Text Motion Transitions

# Generate a motion sequence with smooth transitions between actions
motion = model(
    text=[["walk forward", "turn around", "run back"]],
    length=[120],
    text_end=[[40, 80, 120]]  # Transition points in latent tokens
)

# Output: ~480 frames showing all three actions smoothly connected
print(f"Transition motion: {motion[0].shape}")

API Reference

model(text, length=60, text_end=None, num_denoise_steps=None, output_joints=False)

Generate motion sequences from text descriptions.

Parameters:

  • text (str, List[str], or List[List[str]]): Text description(s)

    • Single string: Generate one motion
    • List of strings: Batch generation
    • Nested list: Multiple text prompts per motion (for transitions)
  • length (int or List[int], default=60): Number of latent tokens to generate

    • Output frames ≈ length × 4 (due to VAE upsampling)
    • Example: length=60 → 240 frames (12 seconds at 20 FPS)
  • text_end (List[int] or List[List[int]], optional): Latent token positions for text transitions

    • Only used when text is a nested list
    • Specifies when to switch between different text descriptions
    • IMPORTANT: Must have the same length as the corresponding text list
      • Example: text=[["walk", "turn", "sit"]] requires text_end=[[20, 40, 60]] (3 endpoints for 3 texts)
    • Must be in ascending order
  • num_denoise_steps (int, optional): Number of denoising iterations

    • Higher values produce better quality but slower generation
    • Recommended range: 10-50
  • output_joints (bool, default=False): Output format selector

    • False: Returns 263-dimensional HumanML3D features
    • True: Returns 22×3 joint coordinates for direct visualization

Returns:

  • Single motion:
    • output_joints=False: numpy.ndarray of shape (frames, 263)
    • output_joints=True: numpy.ndarray of shape (frames, 22, 3)
  • Batch: List[numpy.ndarray] with shapes as above

Example:

# Single generation (263-dim features)
motion = model("walk forward", length=60)  # Returns (240, 263)

# Single generation (joint coordinates)
joints = model("walk forward", length=60, output_joints=True)  # Returns (240, 22, 3)

# Batch generation
motions = model(["walk", "run"], length=[60, 50])  # Returns list of 2 arrays

# Multi-text transitions
motion = model(
    [["walk", "turn"]],
    length=[60],
    text_end=[[30, 60]]
)  # Returns list with 1 array of shape (240, 263)
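
The num_denoise_steps parameter is not exercised above; a simple quality-versus-latency comparison might look like the following sketch, with step counts taken from the recommended 10-50 range:

import time

# Trade generation quality against latency by varying the denoising step count
for steps in (10, 30, 50):
    start = time.time()
    motion = model("a person dancing", length=60, num_denoise_steps=steps)
    print(f"{steps} steps: {motion.shape} in {time.time() - start:.1f}s")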

Citation

If you use this model in your research, please cite:

@article{cai2025flooddiffusion,
  title={FloodDiffusion: Tailored Diffusion Forcing for Streaming Motion Generation},
  author={Yiyi Cai and Yuhan Wu and Kunhang Li and You Zhou and Bo Zheng and Haiyang Liu},
  journal={arXiv preprint arXiv:2512.03520},
  year={2025}
}

Troubleshooting

Common Issues

ImportError with trust_remote_code:

# Solution: Add trust_remote_code=True
model = AutoModel.from_pretrained(
    "ShandaAI/FloodDiffusion",
    trust_remote_code=True  # Required!
)

Out of Memory:

# Solution: Generate shorter sequences
motion = model("walk", length=30)  # Shorter = less memory

Slow first load: The first load downloads ~14GB of model files and may take 5-30 minutes depending on internet speed. Subsequent loads use locally cached files and are much faster.
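
To avoid the download blocking the first inference call, the weights can be fetched ahead of time with huggingface_hub (already installed in Step 1):

from huggingface_hub import snapshot_download

# Pre-download the ~14GB of model files into the local Hugging Face cache
snapshot_download("ShandaAI/FloodDiffusion")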

Module import errors: Ensure all dependencies are installed:

pip install lightning diffusers omegaconf ftfy numpy