We introduce SkyReels-A3, an end-to-end multimodality-conditioned framework for synthesizing high-fidelity, temporally coherent human videos. Built on pretrained video diffusion transformers, the framework enables long-form video generation with diverse and controllable conditioning from multimodal inputs. It accepts image inputs of any aspect ratio (portrait, half-body, or full-body shots) and delivers highly realistic, superior-quality results across diverse scenarios. We employ a learning-based interpolation strategy to support minute-level controllable video generation, and introduce reinforcement learning to enhance the naturalness of interactions. Comprehensive benchmarks demonstrate that SkyReels-A3 excels in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
Figure: Architecture of SkyReels-A3.