We introduce SkyReels-A3, an end-to-end multimodality-conditioned framework for synthesizing high-fidelity, temporally coherent human videos. Built on pretrained video diffusion transformers, the framework enables long-form video generation with diverse and controllable conditioning from multimodal inputs. It accepts image inputs of any aspect ratio (portrait, half-body, or full-body shots) and delivers highly realistic, superior-quality results across diverse scenarios. We employ a learning-based interpolation strategy to support minute-level controllable video generation, and introduce reinforcement learning to enhance the naturalness of interactions. Comprehensive benchmarks demonstrate that SkyReels-A3 excels in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
Figure: Architecture of SkyReels-A3.