Señorita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists
Bojia Zi 1,*, Penghui Ruan 2,*, Marco Chen 3, Xianbiao Qi 4,†, Shaozhe Hao 5, Shihao Zhao 5, Youze Huang 6, Bin Liang 1, Rong Xiao 4, Kam-Fai Wong 1
1The Chinese University of Hong Kong
2The Hong Kong Polytechnic University
3Tsinghua University
4IntelliFusion Inc.
5The University of Hong Kong
6University of Electronic Science and Technology of China
*Equal contribution. †Corresponding author.
[Teaser gallery: example video edits with instructions such as "Add a hat on her head.", "Make it oil painting style.", "Make the narcissus pink.", "Remove the girl.", "Make it anime style.", "Remove the bird.", "Transform dog into lion.", "Make it watercolor style.", and "Add rainbow."]
Recent advancements in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still suffer from several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. End-to-end methods, which rely on edited video pairs for training, offer faster inference but often produce poor editing results due to a lack of high-quality training pairs. In this paper, to close this gap for end-to-end methods, we introduce Señorita-2M, a high-quality video editing dataset of approximately 2 million video editing pairs. It is built with four specialized video editing models, each trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline to eliminate poorly edited video pairs. Furthermore, we explore common video editing architectures to identify the most effective structure given current pre-trained generative models. Extensive experiments show that our dataset helps yield remarkably high-quality video editing results.
We build the dataset using high-quality video editing experts. Specifically, we trained four such experts based on CogVideoX: a global stylizer, a local stylizer, an inpainting model, and a remover. These experts, together with other specialized models, are used to construct a large-scale collection of high-quality video editing samples. We also designed a filtering pipeline that effectively removes failed edits, and used a large language model to convert raw video editing prompts into clear, effective instructions. As a result, our dataset, Señorita-2M, contains approximately 2 million high-quality video editing pairs. Finally, we trained video editors based on several editing architectures on this dataset to evaluate the effectiveness of each framework, ultimately achieving impressive editing capabilities.
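In pseudocode, the construction loop looks roughly like the sketch below. This is a minimal illustration under stated assumptions, not the released pipeline: `edit_with_expert`, `quality_score`, and `rewrite_instruction` are hypothetical stand-ins for the expert models, the filtering metric, and the LLM instruction rewriter, and the 0.5 threshold is an assumed placeholder.

```python
from dataclasses import dataclass

# The four expert editors trained on CogVideoX (named in the text above).
EXPERTS = ["global_stylizer", "local_stylizer", "inpainting_model", "remover"]

@dataclass
class EditPair:
    source_video: str   # path to the original clip
    edited_video: str   # path to the expert-edited clip
    instruction: str    # LLM-cleaned natural-language instruction

def edit_with_expert(video: str, expert: str, prompt: str) -> str:
    """Hypothetical stub: run one expert model, return the edited clip path."""
    return f"{video}.{expert}.mp4"

def quality_score(source: str, edited: str) -> float:
    """Hypothetical stub for the filtering pipeline's quality metric."""
    return 1.0

def rewrite_instruction(raw_prompt: str) -> str:
    """Hypothetical stub: an LLM converts a raw prompt into a clear instruction."""
    return raw_prompt

def build_pairs(videos, prompts, expert, threshold=0.5):
    """Edit each clip, drop low-quality results, attach a cleaned instruction."""
    pairs = []
    for video, prompt in zip(videos, prompts):
        edited = edit_with_expert(video, expert, prompt)
        if quality_score(video, edited) < threshold:
            continue  # the filtering pipeline removes failed edits
        pairs.append(EditPair(video, edited, rewrite_instruction(prompt)))
    return pairs
```

In the real pipeline each expert covers its own task subset and the filter combines several automatic checks; the single scalar threshold here is only a stand-in.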
Our dataset covers 17 editing tasks. Five of these tasks are handled by our trained video editing experts, while the remaining 12 are produced with computer vision models. The expert-edited subset accounts for around 76.8% of the video pairs in the dataset; the remaining 12 tasks make up the other 23.2%.
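For a concrete sense of scale, a quick back-of-the-envelope calculation, assuming exactly 2 million pairs (the true count is approximate):

```python
TOTAL_PAIRS = 2_000_000                  # approximate size of Señorita-2M
expert_share, cv_share = 0.768, 0.232    # 5 expert tasks vs. 12 CV-based tasks

expert_pairs = round(TOTAL_PAIRS * expert_share)  # ~1,536,000 pairs
cv_pairs = round(TOTAL_PAIRS * cv_share)          # ~464,000 pairs

print(f"expert-edited: ~{expert_pairs:,}, CV-based: ~{cv_pairs:,}")
```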
Example editing instructions by task:

Object swap:
Swap the bear for a cat.
Replace the girl with a boy.
Transform the fountain into a sculpture.
Turn the old man into an old lady.

Object removal:
Remove the girl.
Omit the bird.
Wipe out the plant.
Remove the giraffe.

Object addition:
Add a cloud.
Add a flower.
Add a girl.
Add the flower.

Color change:
Make this controller pink.
Turn the champagne yellow.
Make the trees green.
Make the plant yellow.

Style transfer:
Make it Van Gogh style.
Make it French art style.
Make it cyberpunk style.
Make it watercolor style.
Citation
@article{zi2025senorita,
  title={Se\~norita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists},
  author={Bojia Zi and Penghui Ruan and Marco Chen and Xianbiao Qi and Shaozhe Hao and Shihao Zhao and Youze Huang and Bin Liang and Rong Xiao and Kam-Fai Wong},
  journal={arXiv preprint arXiv:2502.06734},
  year={2025}
}