Señorita-2M : A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists

Bojia Zi 1,*, Penghui Ruan 2,*, Marco Chen 3, Xianbiao Qi 4,†, Shaozhe Hao 5, Shihao Zhao 5, Youze Huang 6, Bin Liang 1, Rong Xiao 4, Kam-Fai Wong 1

1The Chinese University of Hong Kong 2The Hong Kong Polytechnic University 3Tsinghua University 4IntelliFusion Inc. 5The University of Hong Kong 6University of Electronic Science and Technology of China
* Equal contribution. † Corresponding author.

Abstract

Recent advances in video generation have spurred the development of video editing techniques, which can be divided into inversion-based and end-to-end methods. However, current video editing methods still face several challenges. Inversion-based methods, though training-free and flexible, are time-consuming during inference, struggle with fine-grained editing instructions, and produce artifacts and jitter. End-to-end methods, which rely on edited video pairs for training, offer faster inference but often produce poor editing results due to a lack of high-quality training pairs. To close this gap in end-to-end methods, we introduce Señorita-2M, a high-quality instruction-based video editing dataset of approximately 2 million video editing pairs. It is built with four high-quality, specialized video editing models, each designed and trained by our team to achieve state-of-the-art editing results. We also propose a filtering pipeline that eliminates poorly edited video pairs, and we explore common video editing architectures to identify the most effective structure based on current pre-trained generative models. Extensive experiments show that our dataset yields remarkably high-quality video editing results.

Method

We build our dataset using high-quality video editing experts. Specifically, we trained four such experts on top of CogVideoX: a global stylizer, a local stylizer, an inpainting model, and a remover. These experts, together with other specialized models, are used to construct a large-scale collection of high-quality video editing samples. We also designed a filtering pipeline that effectively removes failed samples, and we used a large language model to rewrite the editing prompts into clear, effective instructions. The resulting dataset, Señorita-2M, contains approximately 2 million high-quality video editing pairs. Finally, we trained multiple video editors with different editing architectures on this dataset to evaluate the effectiveness of each framework, ultimately achieving impressive editing capabilities.
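As a concrete illustration of the filtering step, below is a minimal sketch of a CLIP-based quality check in Python. The paper's exact criteria are not reproduced here; the rule used (edited frames should match the target caption more closely than the source caption) and the names passes_filter and margin are illustrative assumptions, not the released pipeline.

import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Off-the-shelf CLIP stands in for whatever scorer the real pipeline uses.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def passes_filter(edited_frames, source_caption, target_caption, margin=0.0):
    """Keep a pair only if the edited frames match the target caption
    more closely than the source caption, averaged over sampled frames."""
    inputs = processor(
        text=[source_caption, target_caption],
        images=list(edited_frames),            # (H, W, 3) uint8 frames
        return_tensors="pt",
        padding=True,
    )
    logits = model(**inputs).logits_per_image   # (num_frames, 2)
    probs = logits.softmax(dim=-1).mean(dim=0)  # average over frames
    return (probs[1] - probs[0]).item() > margin

# Dummy usage: four black frames, just to show the keep/discard interface.
frames = [np.zeros((224, 224, 3), dtype=np.uint8) for _ in range(4)]
print(passes_filter(frames, "a dog running", "a cat running"))

A real deployment would sample frames sparsely and likely combine this with temporal-consistency checks, but the per-pair keep/discard interface would look much the same.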

Our dataset covers 17 editing tasks: 5 are produced by our trained experts, while the remaining 12 are produced by conventional computer vision models. The expert-driven tasks account for around 76.8% of the video pairs in the dataset, and the other 12 tasks contribute the remaining 23.2%.
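As a back-of-the-envelope check, these shares translate into rough pair counts as follows (the 2 million total is approximate, as stated above):

# Approximate split of the ~2M pairs using the shares quoted above.
TOTAL_PAIRS = 2_000_000

expert_pairs = round(TOTAL_PAIRS * 0.768)  # 5 expert-driven tasks
cv_pairs = round(TOTAL_PAIRS * 0.232)      # 12 vision-pipeline tasks

print(f"expert tasks: ~{expert_pairs:,} pairs")  # ~1,536,000
print(f"cv tasks:     ~{cv_pairs:,} pairs")      # ~464,000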

Visualization of Señorita-2M

Object Swap

Object Removal

Object Addition

Object Stylization

Style Transfer

Citation

@article{zi2025senorita,
  title={Se\~norita-2M: A High-Quality Instruction-based Dataset for General Video Editing by Video Specialists},
  author={Bojia Zi and Penghui Ruan and Marco Chen and Xianbiao Qi and Shaozhe Hao and Shihao Zhao and Youze Huang and Bin Liang and Rong Xiao and Kam-Fai Wong},
  journal={arXiv preprint arXiv:2502.06734},
  year={2025},
}