We compare against Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss [Chen et al. 2019] and Realistic Speech-Driven Animation with GANs [Vougioukas et al. 2019], which animate a single input frame using audio input. Note that their results are of lower resolution and are tightly cropped around a centered head, making them ill-suited for a video editing workflow.
Chen et al. [2019] | Vougioukas et al. [2019] | Ours |
---|---|---|