We present SkyReels-A1, a simple yet effective framework for portrait image animation built on a video diffusion transformer (DiT) architecture. Prior approaches still suffer from identity leakage, unstable backgrounds, and unrealistic facial expressions, especially when limited to head-only animation. Extending these methods to diverse body compositions often produces artifacts or unnatural movements. To address these challenges, SkyReels-A1 leverages the robust generative capabilities of a DiT-based framework, improving facial motion transfer accuracy, identity preservation, and temporal consistency. We integrate an expression-aware conditioning mechanism that enables continuous video generation driven by expression-aware landmarks, while facial image-text alignment facilitates the deep fusion of facial features with video dynamics, further enhancing identity consistency. Additionally, SkyReels-A1 incorporates a multistage training strategy that progressively improves expression-motion adherence and identity consistency. Extensive experiments demonstrate that our method produces high-quality results that adapt seamlessly to diverse compositions, making it suitable for applications such as virtual avatars, video conferencing, and digital content creation.
Overview of the SkyReels-A1 framework. Given an input video sequence and a reference portrait image, we extract facial expression-aware landmarks from the video, which serve as motion descriptors for transferring expressions onto the portrait. Utilizing a conditional video generation framework based on DiT, our approach directly integrates these facial expression-aware landmarks into the input latent space. In alignment with prior research, we employ a pose guidance mechanism constructed within a VAE architecture. This component encodes facial expression-aware landmarks as conditional input for the DiT framework, thereby enabling the model to capture essential low-dimensional visual attributes while preserving the semantic integrity of facial features.
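As a rough illustration of the conditioning path described above, the following shape-level sketch traces how VAE-encoded landmark frames could be combined with the noisy video latent before entering the DiT. It rests on several assumptions that are not stated in the text: that the landmarks are rendered as RGB frames, that "integrating into the input latent space" means channel-wise concatenation, and that the VAE uses 16 latent channels with 8x spatial and 4x temporal compression. All names (`Shape`, `vae_encode_shape`, `concat_channels`, `dit_input_shape`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Shape of a video tensor: (frames, channels, height, width)."""
    frames: int
    channels: int
    height: int
    width: int


def vae_encode_shape(pixels: Shape, latent_channels: int = 16,
                     spatial_stride: int = 8, temporal_stride: int = 4) -> Shape:
    """Latent shape a hypothetical causal video VAE would produce for `pixels`.

    The strides and channel count are illustrative assumptions, not values
    taken from SkyReels-A1.
    """
    if pixels.height % spatial_stride or pixels.width % spatial_stride:
        raise ValueError("height and width must be multiples of the spatial stride")
    # Assumed causal scheme: the first frame is kept on its own and the
    # remaining frames are grouped by the temporal stride.
    latent_frames = 1 + (pixels.frames - 1) // temporal_stride
    return Shape(latent_frames, latent_channels,
                 pixels.height // spatial_stride, pixels.width // spatial_stride)


def concat_channels(a: Shape, b: Shape) -> Shape:
    """Shape after concatenating two latents along the channel axis."""
    if (a.frames, a.height, a.width) != (b.frames, b.height, b.width):
        raise ValueError("latents must match in frames, height, and width")
    return Shape(a.frames, a.channels + b.channels, a.height, a.width)


def dit_input_shape(num_frames: int, height: int, width: int) -> Shape:
    """Shape of the DiT input under the channel-concatenation assumption."""
    # Landmark maps are assumed to be rendered as RGB frames at the output size.
    landmark_latent = vae_encode_shape(Shape(num_frames, 3, height, width))
    # The noisy latent being denoised has the shape of an encoded RGB video.
    noisy_latent = vae_encode_shape(Shape(num_frames, 3, height, width))
    return concat_channels(noisy_latent, landmark_latent)


print(dit_input_shape(num_frames=49, height=480, width=720))
# prints: Shape(frames=13, channels=32, height=60, width=90)
```

Under these assumptions the landmark condition stays spatially and temporally aligned with the video latent, so each DiT token sees the motion descriptor for its own location. Other integration schemes, such as adding the two latents or feeding the condition through separate tokens, would change the channel count shown here.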
SkyReels-A1 animates a still character image using a driving video, transferring not only talking-head motion but also natural body dynamics to produce realistic, lifelike animations.
The presented cases showcase animations with various head poses, including frontal and profile views, demonstrating our method's consistency and realism across perspectives.
The presented cases demonstrate the effectiveness of our method across various aspect ratios in both generated and real images, ensuring consistent and realistic animations.
These cases also highlight the effectiveness of our method in handling animations with large-scale facial expressions.