Learning Spatial-Temporal Coherent Correlations for Speech-Preserving Facial Expression Manipulation

Tianshui Chen1,†, Jianman Lin2,†, Zhijing Yang1,*, Chunmei Qing2, Guangrun Wang3, Liang Lin3
1Guangdong University of Technology, 2South China University of Technology, 3Sun Yat-sen University
† Equal contribution.   * Corresponding author.

Abstract

Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples of the same person, where two aligned frames exhibit the same speech content yet differ in emotional expression, limiting SPFEM applications in real-world scenarios. In this work, we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations in both the spatial and temporal dimensions, providing valuable supervision for SPFEM. To capitalize on this insight, we propose a novel spatial-temporal coherent correlation learning (STCCL) algorithm, which models these correlations as explicit metrics and integrates the metrics to supervise facial expression manipulation while better preserving the facial animation of the spoken content. To this end, STCCL first learns a spatial coherent correlation metric, ensuring that the visual correlations of adjacent local regions within an image associated with one emotion closely resemble those of the corresponding regions in an image associated with a different emotion. Simultaneously, it develops a temporal coherent correlation metric, ensuring that the visual correlations of specific regions across adjacent image frames associated with one emotion are similar to those of the corresponding regions in frames associated with another emotion. Recognizing that visual correlations are not uniform across all regions, we further craft a correlation-aware adaptive strategy that prioritizes regions presenting greater challenges. During SPFEM model training, we compute the spatial-temporal coherent correlation metric between corresponding local regions of the input and output image frames as an additional loss to supervise the generation process. We conduct extensive experiments on various datasets, and the results demonstrate the effectiveness of the proposed STCCL algorithm.
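To make the two correlation metrics concrete, the sketch below is a minimal PyTorch illustration, not the released implementation: the grid pooling, cosine-similarity formulation, and all tensor names are our assumptions. It computes spatial correlations between local regions within each frame and temporal correlations of the same region across adjacent frames, then penalizes the discrepancy between a source-emotion clip and a generated clip.

```python
import torch
import torch.nn.functional as F

def region_features(frames, grid=8):
    """Pool a clip of feature maps into a grid of local-region descriptors.
    frames: (T, C, H, W) feature maps of a video clip -> (T, R, C), R = grid*grid."""
    regions = F.adaptive_avg_pool2d(frames, grid)      # (T, C, grid, grid)
    return regions.flatten(2).transpose(1, 2)          # (T, R, C)

def spatial_correlation(regions):
    """Cosine similarity between every pair of local regions within each frame.
    regions: (T, R, C) -> (T, R, R)."""
    z = F.normalize(regions, dim=-1)
    return z @ z.transpose(1, 2)

def temporal_correlation(regions):
    """Cosine similarity of the same region across adjacent frames.
    regions: (T, R, C) -> (T-1, R)."""
    z = F.normalize(regions, dim=-1)
    return (z[:-1] * z[1:]).sum(-1)

def stc_loss(src_feats, gen_feats):
    """Spatial-temporal coherent correlation loss between a source clip
    (one emotion) and the generated clip (another emotion)."""
    rs, rg = region_features(src_feats), region_features(gen_feats)
    l_spatial = F.l1_loss(spatial_correlation(rs), spatial_correlation(rg))
    l_temporal = F.l1_loss(temporal_correlation(rs), temporal_correlation(rg))
    return l_spatial + l_temporal

# toy usage: 5-frame clips of 64-channel, 32x32 feature maps
src, gen = torch.randn(5, 64, 32, 32), torch.randn(5, 64, 32, 32)
print(stc_loss(src, gen))
```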

Integrate STCCL into NED

STCCL Framework
The overall pipeline for incorporating the proposed STCCL algorithm into the state-of-the-art NED method to supervise the generation of both the intermediate 3DMM meshes and the final rendered images. It computes the visual correlations of corresponding and non-corresponding local regions between the source and generated images, then applies the correlation-aware adaptive strategy to obtain the final loss that supervises image generation. An identical process is performed on the source and generated 3DMM meshes to supervise the intermediate 3DMM mesh generation.
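The correlation-aware adaptive strategy in the figure can be sketched as follows. This is a hypothetical re-weighting scheme under our assumptions (the paper's exact weighting function may differ): regions whose correlations deviate more between the source and generated frames receive larger weights, so harder regions dominate the loss. It composes with the spatial correlation matrices from the sketch above.

```python
import torch

def correlation_aware_loss(corr_src, corr_gen, temperature=1.0):
    """Re-weight per-region correlation errors so harder regions count more.
    corr_src, corr_gen: (T, R, R) spatial correlation matrices of the
    source and generated clips (e.g. from spatial_correlation above)."""
    # per-region error: mean absolute difference of each region's correlation row
    err = (corr_src - corr_gen).abs().mean(dim=-1)                # (T, R)
    # harder regions -> larger weights; detach so the weights carry no gradient
    weights = torch.softmax(err.detach() / temperature, dim=-1)   # (T, R)
    return (weights * err).sum(dim=-1).mean()
```

In the full pipeline, the same weighted loss would be applied once to the rendered image frames and once to the intermediate 3DMM meshes.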

STCCL Algorithm Overview

STCCL
Left half: an illustration of spatial-temporal coherent correlation metric learning based on visual disparity. It treats the corresponding adjacent local regions of the input and output images (sequences) as positive samples and the non-corresponding counterparts as negative samples. The process operates on feature maps to construct dense positive and negative samples for training the metric. Right half: the same as the left half, but based on correlation matrices.
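As a rough illustration of this metric-learning step, the snippet below uses an InfoNCE-style contrastive objective of our own devising (the region extraction, descriptor dimensions, and temperature are assumptions, not the released training code): corresponding local regions of the two images form positive pairs, while all non-corresponding regions serve as negatives.

```python
import torch
import torch.nn.functional as F

def region_contrastive_loss(feat_a, feat_b, temperature=0.07):
    """Train the coherent correlation metric with dense positives/negatives.
    feat_a, feat_b: (R, C) region descriptors of two images that share speech
    content but differ in emotion. Region r in feat_a is positive with region r
    in feat_b and negative with every other region."""
    za = F.normalize(feat_a, dim=-1)
    zb = F.normalize(feat_b, dim=-1)
    logits = za @ zb.t() / temperature                       # (R, R) similarity matrix
    targets = torch.arange(za.size(0), device=za.device)     # diagonal = corresponding regions
    return F.cross_entropy(logits, targets)

# toy usage: 64 regions with 128-dim descriptors
loss = region_contrastive_loss(torch.randn(64, 128), torch.randn(64, 128))
```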

Video Examples

Integration with EAT (Transformer-based)
STCCL significantly reduces lip blurriness and improves articulatory precision in EAT.
Integration with DICE (Diffusion-based)
STCCL significantly reduces lip blurriness and improves articulatory precision in DICE.
SPFEM Results (Integrated into NED)