MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

Note: Input audio is trimmed to a maximum of 9 seconds, and a run uses 580 GPU credits.
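If you would rather choose which 9 seconds are kept, you can trim the audio yourself before uploading. The following is a minimal sketch using only Python's standard `wave` module. It assumes an uncompressed WAV file; the function name `trim_wav` and the file paths are hypothetical and not part of this demo.

```python
import wave

MAX_SECONDS = 9  # limit stated in the note above


def trim_wav(src_path, dst_path, max_seconds=MAX_SECONDS):
    """Copy a WAV file, keeping at most max_seconds from the start.

    Returns the duration of the written file in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frame_rate = src.getframerate()
        # Number of frames that fit inside the time limit.
        max_frames = int(frame_rate * max_seconds)
        frames_to_keep = min(src.getnframes(), max_frames)
        frames = src.readframes(frames_to_keep)

    with wave.open(dst_path, "wb") as dst:
        # Same channels, sample width, and rate as the source;
        # the frame count in the header is corrected on close.
        dst.setparams(params)
        dst.writeframes(frames)

    return frames_to_keep / frame_rate
```

For example, `trim_wav("speech.wav", "speech_9s.wav")` writes a copy that is no longer than 9 seconds. Files that are already shorter are copied unchanged.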

Consider adding a word at the beginning of the clip that can be trimmed later, since lip sync does not always start immediately.

Upload Input Image

Upload Input Audio

Seed (0 for Random)

Generated Video