This paper presents \texttt{SkyReels-A2}, a controllable video generation framework capable of assembling arbitrary visual elements (e.g., characters, objects, backgrounds) into synthesized videos based on textual prompts, while maintaining strict consistency with the reference image for each element. We term this task \emph{elements-to-video (E2V)}, whose primary challenges lie in preserving per-element fidelity to references, ensuring coherent scene composition, and achieving natural outputs. To address these, we first design a comprehensive data pipeline to construct prompt-reference-video triplets for model training. Next, we propose a novel image-text joint embedding model to inject multi-element representations into the generative process, balancing element-specific consistency with global coherence and text alignment. We also optimize the inference pipeline for both speed and output stability. Moreover, we introduce a carefully curated benchmark for systematic evaluation, i.e., \texttt{A2-Bench}. Experiments demonstrate that our framework can generate diverse, high-quality videos with precise element control. \texttt{SkyReels-A2} is the first commercial-grade open-source model for \emph{E2V} generation, performing favorably against advanced commercial closed-source models. We anticipate \texttt{SkyReels-A2} will advance creative applications such as drama and virtual e-commerce, pushing the boundaries of controllable video generation.
Overview of the SkyReels-A2 framework. Our approach begins by encoding all reference images through two distinct branches. The first, termed the spatial feature branch (red, top row), leverages a fine-grained VAE encoder to process the composed reference images. The second, the semantic feature branch (bottom row), applies a CLIP vision encoder followed by an MLP projection to encode semantic references. The spatial features are then concatenated with the noised video tokens along the channel dimension before being passed through the diffusion transformer blocks. Meanwhile, the semantic features extracted from the reference images are incorporated into the diffusion transformers via supplementary cross-attention layers, ensuring that the semantic context is effectively integrated during diffusion.
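The two injection paths described above can be sketched as follows. This is a minimal numpy illustration, not the actual model: all shapes, the single-head attention, and the random projection weights are hypothetical stand-ins for the real VAE, CLIP encoder, MLP, and diffusion transformer.

```python
import numpy as np

# Hypothetical latent shapes (the real model uses a video VAE and a DiT).
T, H, W = 4, 8, 8          # latent video: frames x height x width
C_vid, C_ref = 16, 16      # channels of noised video latents / reference latents
D = 32                     # transformer hidden size
N_ref, D_clip = 3, 24      # number of reference images, CLIP embedding size

rng = np.random.default_rng(0)

# --- Spatial branch: VAE latents of the composed reference images ---
noised_video = rng.standard_normal((T, H, W, C_vid))
ref_latents  = rng.standard_normal((T, H, W, C_ref))
# Channel-wise concatenation before the diffusion transformer blocks.
x = np.concatenate([noised_video, ref_latents], axis=-1)   # (T, H, W, C_vid + C_ref)
tokens = x.reshape(T * H * W, C_vid + C_ref) @ rng.standard_normal((C_vid + C_ref, D))

# --- Semantic branch: CLIP vision features projected by an MLP ---
clip_feats = rng.standard_normal((N_ref, D_clip))
sem = clip_feats @ rng.standard_normal((D_clip, D))        # (N_ref, D)

# --- Supplementary cross-attention: video tokens attend to semantic tokens ---
def cross_attention(q, kv):
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ kv

out = tokens + cross_attention(tokens, sem)   # residual semantic injection
print(out.shape)                              # (256, 32)
```

Note the asymmetry: spatial features enter once at the input via channel concatenation, while semantic features are re-injected inside every transformer block through added cross-attention layers.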
SkyReels-A2 composes character, object, and background reference images into natural videos.
SkyReels-A2 also supports multi-human reference composition, producing high-quality interactive videos.
An important application of SkyReels-A2 is generating corresponding product-recommendation scenarios from anchor and product images.
SkyReels-A2 also demonstrates effectiveness in music-video creation scenarios.
The pipeline begins with preprocessing, where raw videos are filtered by resolution, labels, types, and sources, followed by temporal segmentation based on key-frames. Next, a proprietary multi-expert video captioning model generates both holistic descriptions for video clips and structured concept annotations. Subsequently, detection and segmentation models extract visual elements (e.g., humans, objects, environments). To mitigate duplication, reference images are retrieved from other clips based on CLIP/facial similarity scores. Further refinement includes face detection and human parsing to obtain facial/attire elements. Finally, the extracted elements are matched with the structured descriptions to form training triplets (visual elements, video clips, and textual captions).
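The cross-clip retrieval step above can be sketched in Python. This is a hedged sketch under stated assumptions: the `Clip` dataclass, `build_triplets`, and the pluggable `similarity` callable are hypothetical names standing in for the paper's captioning, detection, and CLIP/facial similarity components.

```python
from dataclasses import dataclass, field

@dataclass
class Clip:
    video_id: str
    caption: str                       # holistic description from the captioning model
    elements: dict = field(default_factory=dict)  # concept -> cropped reference image

def build_triplets(clips, similarity, threshold=0.9):
    """Form (visual elements, video clip, caption) training triplets.

    To mitigate copy-paste duplication, each concept's reference is retrieved
    from a *different* clip when a sufficiently similar candidate exists;
    `similarity(a, b)` stands in for the CLIP/facial similarity scoring.
    """
    triplets = []
    for clip in clips:
        refs = {}
        for concept, element in clip.elements.items():
            candidates = [
                other.elements[concept]
                for other in clips
                if other.video_id != clip.video_id and concept in other.elements
            ]
            best = max(candidates, key=lambda c: similarity(element, c), default=None)
            if best is not None and similarity(element, best) >= threshold:
                refs[concept] = best          # cross-clip reference of the same entity
            else:
                refs[concept] = element       # fall back to the clip's own crop
        triplets.append((refs, clip.video_id, clip.caption))
    return triplets
```

In practice the similarity model, not exact matching, decides whether two crops depict the same entity; the fallback keeps every clip usable even when no cross-clip reference is found.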
The dimensions covered in A2-Bench. Our evaluation considers both automatic metrics and a user study, covering multiple perspectives that precisely reflect the quality of the elements-to-video (E2V) task.
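Combining the two evaluation sources might look like the following. This is a minimal sketch only: the dimension names, example scores, and the equal-weight averaging are illustrative assumptions, not A2-Bench's actual metric set or weighting.

```python
def aggregate_scores(auto_scores, user_scores, auto_weight=0.5):
    """Blend automatic metrics with user-study ratings, per dimension.

    Both inputs map dimension name -> score in [0, 1]; `auto_weight`
    (hypothetical) controls the balance between the two sources.
    """
    assert set(auto_scores) == set(user_scores), "dimensions must match"
    return {
        dim: auto_weight * auto_scores[dim] + (1 - auto_weight) * user_scores[dim]
        for dim in auto_scores
    }

# Illustrative dimensions mirroring the E2V challenges: per-element fidelity,
# coherent composition, and text alignment (scores are made up).
auto = {"element_fidelity": 0.82, "scene_coherence": 0.75, "text_alignment": 0.88}
user = {"element_fidelity": 0.90, "scene_coherence": 0.70, "text_alignment": 0.85}
combined = aggregate_scores(auto, user)
```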