Iterative Text-based Editing of Talking-heads Using Neural Retargeting

Supplemental Materials

Comparison to Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss [Chen et al. 2019] and Realistic Speech-Driven Animation with GANs [Vougioukas et al. 2019].

We compare our method to Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss [Chen et al. 2019] and Realistic Speech-Driven Animation with GANs [Vougioukas et al. 2019], both of which animate a single input frame using audio input. Note that their results are lower in resolution and tightly cropped around a centered head, which makes them ill-suited to a video editing workflow.

[Five side-by-side comparisons, each showing, left to right: Chen et al. [2019], Vougioukas et al. [2019], Ours]