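Lip sync is the bridge between audio and face. A lip-sync model takes a speech waveform, extracts the sequence of phonemes (the smallest units of sound, such as /a/, /b/, /th/), maps each phoneme to a corresponding mouth shape (a viseme), and regenerates the face for each video frame so that the mouth matches the audio. The mapping is many-to-one: /p/, /b/, and /m/, for example, all share the same closed-lips viseme.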
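The phoneme-to-viseme step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PHONEME_TO_VISEME` table, the viseme names, the `PhonemeSpan` type, and the `visemes_per_frame` helper are all hypothetical and not taken from any particular model, and the sketch assumes that phoneme timings have already been extracted from the waveform.

```python
from dataclasses import dataclass

# Hypothetical, heavily simplified lookup table. Real systems use larger
# phoneme sets and their own viseme inventories.
PHONEME_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed",    # lips pressed together
    "f": "lip_teeth", "v": "lip_teeth",             # lower lip on upper teeth
    "th": "tongue_teeth",                           # tongue visible
    "a": "open",                                    # jaw dropped
    "o": "rounded", "u": "rounded", "w": "rounded", # lips rounded
    "sil": "rest",                                  # silence
}


@dataclass
class PhonemeSpan:
    phoneme: str
    start: float  # seconds
    end: float    # seconds (exclusive)


def visemes_per_frame(spans, fps, duration):
    """Return one viseme label per video frame.

    Each frame is sampled at its midpoint. Frames not covered by any
    phoneme span, and unknown phonemes, fall back to the "rest" viseme.
    """
    n_frames = int(round(duration * fps))
    frames = ["rest"] * n_frames
    for i in range(n_frames):
        t = (i + 0.5) / fps  # midpoint of frame i, in seconds
        for span in spans:
            if span.start <= t < span.end:
                frames[i] = PHONEME_TO_VISEME.get(span.phoneme, "rest")
                break
    return frames


# Usage: the syllable "ba" followed by silence, at 10 frames per second.
spans = [PhonemeSpan("b", 0.0, 0.1), PhonemeSpan("a", 0.1, 0.3)]
print(visemes_per_frame(spans, fps=10, duration=0.4))
```

A generative model would then condition the face in each frame on that frame's viseme. Many learned systems skip the explicit table and predict mouth shapes directly from audio features, so treat the lookup as a conceptual aid rather than a description of a specific architecture.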
Good lip sync is more than mouth shape: it also covers jaw drop, tongue visibility, lip rounding, and natural micro-pauses between words. Bad lip sync looks 'rubbery' or shows the mouth moving when the audio is silent.
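The "mouth moving during silence" failure can be checked automatically, at least roughly. The sketch below assumes two per-frame signals are already available: an audio loudness value (for example RMS energy) and a normalized mouth-opening measurement (for example from facial landmarks). The function name `frames_moving_in_silence` and both thresholds are hypothetical and would need tuning on real footage.

```python
def frames_moving_in_silence(audio_rms, mouth_open,
                             silence_thresh=0.01, motion_thresh=0.05):
    """Return indices of frames where the mouth moves while audio is silent.

    audio_rms  -- per-frame audio loudness, one value per video frame
    mouth_open -- per-frame mouth opening, normalized to roughly [0, 1]
    Both thresholds are illustrative assumptions, not calibrated values.
    """
    if len(audio_rms) != len(mouth_open):
        raise ValueError("audio_rms and mouth_open must have the same length")

    flagged = []
    for i in range(1, len(mouth_open)):
        silent = audio_rms[i] < silence_thresh
        motion = abs(mouth_open[i] - mouth_open[i - 1])
        if silent and motion > motion_thresh:
            flagged.append(i)
    return flagged


# Usage: speech in frames 0-1, silence afterwards, but the mouth opens
# further between frames 2 and 3.
audio_rms = [0.2, 0.2, 0.0, 0.0, 0.0]
mouth_open = [0.1, 0.5, 0.5, 0.8, 0.8]
print(frames_moving_in_silence(audio_rms, mouth_open))
```

A check like this will also flag legitimate motion, such as the mouth closing just after a word ends, so it works better as a screening signal than as a pass/fail test.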