*Corresponding Author, †Project Lead
Hefei University of Technology • Tsinghua University • Zhipu AI
- 2025.10.28: 🔥 We release the checkpoints of Kaleido-14B-S2V.
- 2025.10.22: 🔥 We propose Kaleido, a novel multi-subject reference video generation model. Both the training and inference code have been open-sourced to facilitate further research and reproduction.
Before running the model, please refer to this guide to see how we use large language models such as GLM-4.5 (or comparable products such as GPT-5) to optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the generated video.
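As a rough illustration only (the endpoint URL, model name, and system instruction below are assumptions, not taken from this repository or the linked guide), prompt expansion with an OpenAI-compatible client could look like this:

```python
# Hypothetical prompt-expansion sketch: rewrite a short user idea into a
# long, detailed prompt before feeding it to Kaleido. The endpoint, model
# name, and system instruction are placeholders -- adapt them to the guide.
from openai import OpenAI

client = OpenAI(
    base_url="https://open.bigmodel.cn/api/paas/v4",  # assumed GLM endpoint
    api_key="YOUR_API_KEY",
)

SYSTEM = (
    "Rewrite the user's short video idea into a single long, detailed prompt. "
    "Describe the subjects, their appearance, motion, camera movement, lighting, and scene."
)

def expand_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="glm-4.5",  # or any comparable model
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(expand_prompt("a woman in a red dress walks a golden retriever on a beach"))
```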
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
| Checkpoint | Download Link | Notes |
|---|---|---|
| Kaleido-14B | 🤗 Hugging Face | Supports 512P |
Use the following commands to download the model weights (we have integrated both the Wan VAE and T5 modules into this checkpoint for convenience).
# Download the repository (skip automatic LFS file downloads)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/zai-org/Kaleido-14B-S2V
# Enter the repository folder
cd Kaleido-14B-S2V
# Merge the checkpoint files
python merge_kaleido.py
Arrange the model files into the following structure:
.
├── Kaleido-14B-S2V
│ ├── model
│ │ └── ....
│ ├── Wan2.1_VAE.pth
│ │
│ └── umt5-xxl
│ └── ....
├── configs
├── sat
└── sgm
python sample_video.py --base configs/video_model/dit_crossattn_14B_wanvae.yaml configs/sampling/sample_wanvae_concat_14b.yaml
You can also use multiple GPUs to accelerate the inference process:
bash torchrun_multi_gpu.sh
Additionally, you can enable Sequence Parallelism in the YAML configuration file to further speed up inference:
args:
  s2v_concat: True
  ....
  sequence_parallel_size: 8
Note: The condition input txt file should contain lines in the following format:
prompt@@image1.png@@image2.png@@image3.png
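For clarity, here is a minimal sketch of how such a line can be split into the prompt and its reference image paths (the helper below is illustrative, not part of the released code):

```python
# Illustrative parser for the condition txt format (not part of the repo):
# each line is a prompt followed by reference image paths, joined by "@@".
from pathlib import Path

def parse_condition_file(path: str):
    samples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        prompt, *ref_images = line.strip().split("@@")
        samples.append({"prompt": prompt, "ref_images": ref_images})
    return samples

# Example output:
# [{'prompt': 'prompt', 'ref_images': ['image1.png', 'image2.png', 'image3.png']}]
print(parse_condition_file("conditions.txt"))
```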
The dataset should be structured as follows:
.
├── labels
│ ├── 1.txt
│ ├── 2.txt
│ ├── 3.txt
│ ├── ...
├── videos
│ ├── 1.mp4
│ ├── 2.mp4
│ ├── 3.mp4
│ ├── ...
└── references
├── 1
│ ├── ref1.png
│ ├── ref2.png
│ └── ref3.png
├── 2
│ ├── ref1.png
│ ├── ref2.png
│ └── ref3.png
├── ...
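Before launching training, a quick sanity check along these lines (a hypothetical helper, not part of the released code) can confirm that every video has a matching label file and reference folder:

```python
# Hypothetical sanity check for the dataset layout described above
# (labels/, videos/, references/); not part of the released code.
from pathlib import Path

def check_dataset(root: str) -> None:
    root = Path(root)
    for video in sorted((root / "videos").glob("*.mp4")):
        sample_id = video.stem
        label = root / "labels" / f"{sample_id}.txt"
        refs = root / "references" / sample_id
        if not label.is_file():
            print(f"[missing label] {label}")
        if not refs.is_dir() or not any(refs.glob("*.png")):
            print(f"[missing references] {refs}")

check_dataset("/path/to/dataset")
```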
After you have prepared the dataset, you can execute the following command to start training. Note: Please update the dataset directory paths in the YAML configuration file to match your local setup before running.
bash multi_gpu_training.sh
Note: Our training strategy is based on the CogVideoX model. For detailed information about the training process, please refer to the CogVideoX repository. In addition to the DeepSpeed training approach, we also provide an FSDP2-based implementation for distributed training.
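As a rough sketch of what FSDP2-style sharding looks like (assuming PyTorch >= 2.6, where `fully_shard` is exposed under `torch.distributed.fsdp`; the toy model below stands in for the real DiT and is not the repository's training script):

```python
# Minimal FSDP2-style sharding sketch (assumes PyTorch >= 2.6).
# Launch with: torchrun --nproc_per_node=8 this_script.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

class ToyDiT(nn.Module):
    def __init__(self, depth: int = 4, dim: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = ToyDiT().cuda()
    # Shard each block first, then the root module, so parameters are
    # gathered layer by layer during forward/backward instead of all at once.
    for blk in model.blocks:
        fully_shard(blk)
    fully_shard(model)
    out = model(torch.randn(2, 16, 512, device="cuda"))
    print(out.shape)

if __name__ == "__main__":
    main()
```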
Our model can reference a wide variety of images, including humans, objects, and diverse scenarios such as try-on, demonstrating its versatility and generalization across different tasks.
Gallery: reference images and the corresponding Kaleido results.
- Inference and training code for Kaleido
- Checkpoint of Kaleido
- Data pipeline of Kaleido
If you find our work helpful, please cite our paper:
@article{DBLP:journals/corr/abs-2510-18573,
author = {Zhenxing Zhang and
Jiayan Teng and
Zhuoyi Yang and
Tiankun Cao and
Cheng Wang and
Xiaotao Gu and
Jie Tang and
Dan Guo and
Meng Wang},
title = {Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model},
journal = {CoRR},
volume = {abs/2510.18573},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2510.18573},
doi = {10.48550/ARXIV.2510.18573},
eprinttype = {arXiv},
eprint = {2510.18573},
timestamp = {Sat, 15 Nov 2025 15:31:50 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2510-18573.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}