
Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model


*Corresponding Author, Project Lead
Hefei University of Technology • Tsinghua University • Zhipu AI

Kaleido is an open-sourced multi-subject reference video generation model, enabling controllable, high-fidelity video synthesis from multiple image references.

Examples


Updates and News

  • 2025.10.28: 🔥 We release the checkpoints of Kaleido-14B-S2V.
  • 2025.10.22: 🔥 We propose Kaleido, a novel multi-subject reference video generation model. Both the training and inference code have been open-sourced to facilitate further research and reproduction.

Quick Start

Prompt Optimization

Before running the model, please refer to this guide to see how we use large models such as GLM-4.5 (or comparable products, such as GPT-5) to optimize prompts. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the video generation.
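
As a rough sketch of this workflow (the endpoint, model name, and system instruction below are placeholders, not the exact setup from the guide above), any OpenAI-compatible chat API can be used to expand a short prompt into the long form the model expects:

# Hypothetical prompt-expansion sketch; the endpoint, model name, and system
# instruction are placeholders rather than the official Kaleido pipeline.
from openai import OpenAI

client = OpenAI(base_url="https://open.bigmodel.cn/api/paas/v4", api_key="YOUR_API_KEY")

SYSTEM_INSTRUCTION = (
    "Expand the user's short video idea into a long, detailed prompt: "
    "describe each referenced subject, the scene, camera motion, and lighting."
)

def optimize_prompt(short_prompt: str) -> str:
    # Ask the LLM to rewrite the short prompt into a long, detailed one.
    resp = client.chat.completions.create(
        model="glm-4.5",  # or any comparable chat model
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(optimize_prompt("A woman in a red dress walks along a beach at sunset."))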

Diffusers

Please make sure your Python version is between 3.10 and 3.12 (inclusive), then install the dependencies:

pip install -r requirements.txt

Checkpoints Download

ckpts         Download Link     Notes
Kaleido-14B   🤗 Hugging Face   Supports 512P

Use the following commands to download the model weights (we have integrated both the Wan VAE and T5 modules into this checkpoint for convenience):

# Download the repository (skip automatic LFS file downloads)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/zai-org/Kaleido-14B-S2V

# Enter the repository folder
cd Kaleido-14B-S2V

# Merge the checkpoint files
python merge_kaleido.py

Arrange the model files into the following structure:

.
├── Kaleido-14B-S2V
│   ├── model
│   │   └── ....
│   ├── Wan2.1_VAE.pth
│   │
│   └── umt5-xxl
│       └── ....
├── configs
├── sat
└── sgm

Usage

Inference

python sample_video.py --base configs/video_model/dit_crossattn_14B_wanvae.yaml configs/sampling/sample_wanvae_concat_14b.yaml

You can also use multiple GPUs to accelerate the inference process:

bash torchrun_multi_gpu.sh

To further accelerate inference, you can also enable Sequence Parallelism in the YAML configuration file:

args:
  s2v_concat: True
  ....
  sequence_parallel_size: 8
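
Here, sequence_parallel_size controls how many GPUs each sample's token sequence is sharded across. In general it should match (or at least evenly divide) the number of GPUs launched by torchrun; the value 8 above assumes an 8-GPU run.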

Note: The condition input txt file should contain lines in the following format:

prompt@@image1.png@@image2.png@@image3.png
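
As an illustration of how such a line breaks down (the parser below is a hypothetical sketch, not part of the repository):

# Hypothetical parser for the condition input format shown above:
# each line is a prompt followed by reference image paths, joined by "@@".
def parse_condition_line(line: str) -> tuple[str, list[str]]:
    parts = line.strip().split("@@")
    prompt, image_paths = parts[0], parts[1:]
    return prompt, image_paths

with open("conditions.txt") as f:  # "conditions.txt" is a placeholder name
    for line in f:
        if line.strip():
            prompt, refs = parse_condition_line(line)
            print(prompt, refs)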

Training

Preparing the Dataset

The dataset should be structured as follows:

.
├── labels
│   ├── 1.txt
│   ├── 2.txt
│   ├── 3.txt
│   ├── ...
├── videos
│   ├── 1.mp4
│   ├── 2.mp4
│   ├── 3.mp4
│   ├── ...
└── references
    ├── 1
    │   ├── ref1.png
    │   ├── ref2.png
    │   └── ref3.png
    ├── 2
    │   ├── ref1.png
    │   ├── ref2.png
    │   └── ref3.png
    ├── ...
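
Assuming each videos/<id>.mp4 pairs with labels/<id>.txt and a references/<id>/ folder, as in the layout above, a small sanity check over the dataset could look like the following (this script is illustrative and not part of the repository):

# Illustrative dataset sanity check, assuming the pairing shown above:
# videos/<id>.mp4 <-> labels/<id>.txt <-> references/<id>/ref*.png
from pathlib import Path

root = Path("dataset")  # adjust to your dataset directory

for video in sorted((root / "videos").glob("*.mp4")):
    sample_id = video.stem
    label = root / "labels" / f"{sample_id}.txt"
    ref_dir = root / "references" / sample_id
    refs = sorted(ref_dir.glob("ref*.png")) if ref_dir.is_dir() else []
    if not label.is_file():
        print(f"[missing label] {sample_id}")
    if not refs:
        print(f"[missing references] {sample_id}")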

After you have prepared the dataset, you can execute the following command to launch training. Note: Please update the dataset directory paths in the YAML configuration file to match your local setup before running.

bash multi_gpu_training.sh 

Note: Our training strategy is based on the CogVideoX model. For detailed information about the training process, please refer to the CogVideoX repository. In addition to the DeepSpeed training approach, we also provide an implementation using FSDP2 for distributed training.

Gallery

Our model can broadly reference various types of images, including humans, objects, and diverse scenarios such as try-on. This demonstrates its versatility and generalization ability across different tasks.

[Gallery: each example pairs one to three reference images with the corresponding Kaleido result video.]

Todo List

  • Inference and training code for Kaleido
  • Checkpoint of Kaleido
  • Data pipeline of Kaleido

Citation

If you find our work helpful, please cite our paper:

@article{DBLP:journals/corr/abs-2510-18573,
  author       = {Zhenxing Zhang and
                  Jiayan Teng and
                  Zhuoyi Yang and
                  Tiankun Cao and
                  Cheng Wang and
                  Xiaotao Gu and
                  Jie Tang and
                  Dan Guo and
                  Meng Wang},
  title        = {Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model},
  journal      = {CoRR},
  volume       = {abs/2510.18573},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2510.18573},
  doi          = {10.48550/ARXIV.2510.18573},
  eprinttype   = {arXiv},
  eprint       = {2510.18573},
  timestamp    = {Sat, 15 Nov 2025 15:31:50 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2510-18573.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}
