We compare to Text-based Editing of Talking-head Video [Fried et al. 2019], run both with the same amount of target video as our method (< 5 minutes) and with more than 12 times as much (> 1 hour).
Notice the jumpiness and erroneous mouth motions in the Fried et al. results.
Fried et al. (< 5 min) | Fried et al. (> 1 hr) | Ours (< 5 min) |
---|---|---|