*Corresponding Author, †Project Lead
Hefei University of Technology • Tsinghua University • Zhipu AI
- 2025.10.28: 🔥 We release the checkpoints of Kaleido-14B-S2V.
- 2025.10.22: 🔥 We propose Kaleido, a novel multi-subject reference video generation model. Both the training and inference code have been open-sourced to facilitate further research and reproduction.
Before running the model, please refer to this guide to see how we use large language models such as GLM-4.5 (or comparable products such as GPT-5) to optimize the prompt. This is crucial because the model is trained with long prompts, and a good prompt directly impacts the quality of the generated video.
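As a rough illustration only (the endpoint URL, model name, and system instruction below are assumptions, not taken from this repository or the linked guide), prompt expansion with an OpenAI-compatible client could look like this:

```python
# Hypothetical prompt-expansion sketch: rewrite a short user idea into a
# long, detailed prompt before feeding it to Kaleido. The endpoint, model
# name, and system instruction are placeholders -- adapt them to the guide.
from openai import OpenAI

client = OpenAI(
    base_url="https://open.bigmodel.cn/api/paas/v4",  # assumed GLM endpoint
    api_key="YOUR_API_KEY",
)

SYSTEM = (
    "Rewrite the user's short video idea into a single long, detailed prompt. "
    "Describe the subjects, their appearance, motion, camera movement, lighting, and scene."
)

def expand_prompt(short_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="glm-4.5",  # or any comparable model
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": short_prompt},
        ],
    )
    return resp.choices[0].message.content

print(expand_prompt("a woman in a red dress walks a golden retriever on a beach"))
```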
Please make sure your Python version is between 3.10 and 3.12 (inclusive).
pip install -r requirements.txt
| Checkpoint | Download Link | Notes |
|---|---|---|
| Kaleido-14B | 🤗 Hugging Face | Supports 512P |
Use the following commands to download the model weights (we have integrated both the Wan VAE and T5 modules into this checkpoint for convenience).
# Download the repository (skip automatic LFS file downloads)
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/zai-org/Kaleido-14B-S2V
# Enter the repository folder
cd Kaleido-14B-S2V
# Merge the checkpoint files
python merge_kaleido.py
Arrange the model files into the following structure:
.
├── Kaleido-14B-S2V
│ ├── model
│ │ └── ....
│ ├── Wan2.1_VAE.pth
│ │
│ └── umt5-xxl
│ └── ....
├── configs
├── sat
└── sgm
python sample_video.py --base configs/video_model/dit_crossattn_14B_wanvae.yaml configs/sampling/sample_wanvae_concat_14b.yaml
You can also use multiple GPUs to accelerate the inference process:
bash torchrun_multi_gpu.sh
Additionally, you can enable Sequence Parallelism in the YAML configuration file to further speed up inference:
args:
  s2v_concat: True
  ....
  sequence_parallel_size: 8
Note: The condition input txt file should contain lines in the following format:
prompt@@image1.png@@image2.png@@image3.png
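For clarity, here is a minimal sketch of how such a line can be split into the prompt and its reference image paths (the helper below is illustrative, not part of the released code):

```python
# Illustrative parser for the condition txt format (not part of the repo):
# each line is a prompt followed by reference image paths, joined by "@@".
from pathlib import Path

def parse_condition_file(path: str):
    samples = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        prompt, *ref_images = line.strip().split("@@")
        samples.append({"prompt": prompt, "ref_images": ref_images})
    return samples

# Example output:
# [{'prompt': 'prompt', 'ref_images': ['image1.png', 'image2.png', 'image3.png']}]
print(parse_condition_file("conditions.txt"))
```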
The dataset should be structured as follows:
.
├── labels
│ ├── 1.txt
│ ├── 2.txt
│ ├── 3.txt
│ ├── ...
├── videos
│ ├── 1.mp4
│ ├── 2.mp4
│ ├── 3.mp4
│ ├── ...
└── references
├── 1
│ ├── ref1.png
│ ├── ref2.png
│ └── ref3.png
├── 2
│ ├── ref1.png
│ ├── ref2.png
│ └── ref3.png
├── ...
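Before launching training, a quick sanity check along these lines (a hypothetical helper, not part of the released code) can confirm that every video has a matching label file and reference folder:

```python
# Hypothetical sanity check for the dataset layout described above
# (labels/, videos/, references/); not part of the released code.
from pathlib import Path

def check_dataset(root: str) -> None:
    root = Path(root)
    for video in sorted((root / "videos").glob("*.mp4")):
        sample_id = video.stem
        label = root / "labels" / f"{sample_id}.txt"
        refs = root / "references" / sample_id
        if not label.is_file():
            print(f"[missing label] {label}")
        if not refs.is_dir() or not any(refs.glob("*.png")):
            print(f"[missing references] {refs}")

check_dataset("/path/to/dataset")
```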
After you have prepared the dataset, you can execute the following command to start training. Note: Please update the dataset directory paths in the YAML configuration file to match your local setup before running.
bash multi_gpu_training.sh
Note: Our training strategy is based on the CogVideoX model. For detailed information about the training process, please refer to the CogVideoX repository. In addition to the DeepSpeed training approach, we also provide an FSDP2-based implementation for distributed training.
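As a rough sketch of what FSDP2-style sharding looks like (assuming PyTorch >= 2.6, where `fully_shard` is exposed under `torch.distributed.fsdp`; the toy model below stands in for the real DiT and is not the repository's training script):

```python
# Minimal FSDP2-style sharding sketch (assumes PyTorch >= 2.6).
# Launch with: torchrun --nproc_per_node=8 this_script.py
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import fully_shard

class Block(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.attn(x, x, x, need_weights=False)[0]
        return x + self.mlp(x)

class ToyDiT(nn.Module):
    def __init__(self, depth: int = 4, dim: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(Block(dim) for _ in range(depth))

    def forward(self, x):
        for blk in self.blocks:
            x = blk(x)
        return x

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    model = ToyDiT().cuda()
    # Shard each block first, then the root module, so parameters are
    # gathered layer by layer during forward/backward instead of all at once.
    for blk in model.blocks:
        fully_shard(blk)
    fully_shard(model)
    out = model(torch.randn(2, 16, 512, device="cuda"))
    print(out.shape)

if __name__ == "__main__":
    main()
```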
Our model can reference a wide variety of images, including humans, objects, and diverse scenarios such as try-on, demonstrating its versatility and generalization across different tasks.
Gallery: reference images and the corresponding Kaleido results.
- Inference and training code for Kaleido
- Checkpoint of Kaleido
- Data pipeline of Kaleido
If you find our work helpful, please cite our paper:
@article{DBLP:journals/corr/abs-2510-18573,
author = {Zhenxing Zhang and
Jiayan Teng and
Zhuoyi Yang and
Tiankun Cao and
Cheng Wang and
Xiaotao Gu and
Jie Tang and
Dan Guo and
Meng Wang},
title = {Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model},
journal = {CoRR},
volume = {abs/2510.18573},
year = {2025},
url = {https://doi.org/10.48550/arXiv.2510.18573},
doi = {10.48550/ARXIV.2510.18573},
eprinttype = {arXiv},
eprint = {2510.18573},
timestamp = {Sat, 15 Nov 2025 15:31:50 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2510-18573.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}