Talk2Face: A Unified Sequence-based Framework for
Diverse Face Generation and Analysis Tasks
ACM Multimedia (MM)
Yudong Li¹, Xianxu Hou¹, Zhe Zhao², Linlin Shen¹*, Xuefeng Yang², Kimmo Yan²
²Tencent AI Lab
Figure 1: Faces generated under different textual conditions. We use captions (left) and attributes (top) to jointly control diverse face generation.
Facial analysis is an important domain in computer vision and has received extensive research attention. For the many downstream tasks with different input/output formats and modalities, existing methods usually design task-specific architectures and train them on face datasets collected in the particular task domain. In this work, we propose a single model, Talk2Face, that simultaneously tackles a large number of face generation and analysis tasks, e.g., text-guided face synthesis, face captioning, and age estimation. Specifically, we cast different tasks into a sequence-to-sequence format with the same architecture, parameters, and objective. While text and facial images are tokenized into sequences, the annotation labels of faces for different tasks are also converted to natural language for a unified representation. We collect 2.3M face-text pairs from available datasets across different tasks to train the proposed model. Uniform templates are then designed to enable the model to perform different downstream tasks, according to the task context and target. Experiments on different tasks show that our model achieves better face generation and captioning performance than SOTA approaches. On age estimation and multi-attribute classification, our model reaches performance competitive with models specifically designed and trained for these particular tasks. In practice, our model is much easier to deploy across different facial analysis tasks. Code and dataset will be available at https://github.com/ydli-ai/Talk2Face.
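The abstract's key unifying idea is that per-task annotation labels (an age, an expression class, etc.) are verbalized into natural-language target sequences, so every task shares one text vocabulary. A minimal sketch of such a conversion follows; the template wording and the `label_to_text` helper are assumptions for illustration, not the paper's released code.

```python
# Hypothetical label-to-text conversion: map a task's raw label to a
# natural-language target sentence so all tasks share one output space.
def label_to_text(task, label):
    templates = {
        "age": "the age of this person is {}",
        "expression": "the expression of this person is {}",
        "race": "the race of this person is {}",
    }
    return templates[task].format(label)

print(label_to_text("age", 25))
# -> the age of this person is 25
```

At inference, the model generates such a sentence token by token and the label is read back out of the text, which is what lets one sequence model cover classification and regression tasks alike.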
Figure 2: Our approach represents different face-related tasks with a unified sequence-to-sequence model and uses it for text/face generation.
Figure 3: Overview of our approach. Text and faces are represented as discrete sequences conditioned on prompts. Input sequences are left-shifted for the language modeling objective. A self-attention mask controls each token's access to context.
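The caption's training setup (left-shifted targets for next-token prediction, plus an attention mask limiting context) can be sketched as below. This is a generic illustration under assumed details: a plain lower-triangular (causal) mask is shown, whereas the paper's exact masking scheme for prompt tokens may differ.

```python
import numpy as np

def lm_inputs_targets(token_ids):
    """Next-token prediction: targets are the inputs shifted left by one."""
    return token_ids[:-1], token_ids[1:]

def causal_mask(seq_len):
    """1 where attention is allowed (position j <= position i), else 0."""
    return np.tril(np.ones((seq_len, seq_len), dtype=np.int32))

tokens = [101, 7, 42, 9, 102]
inputs, targets = lm_inputs_targets(tokens)
print(inputs)    # [101, 7, 42, 9]
print(targets)   # [7, 42, 9, 102]
print(causal_mask(3))
```

Each position `i` is trained to predict `targets[i]` while the mask prevents it from attending to positions after `i`.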
Figure 4: Inference templates for (a) text-guided face synthesis, (b) face captioning, (c) age estimation and (d) expression recognition.
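In the spirit of Figure 4, each downstream task is selected at inference time purely by its prompt template. The template strings below are hypothetical placeholders written for illustration; the actual prompt wording used by Talk2Face is in the released code, not reproduced here.

```python
# Hypothetical inference templates: the task is chosen by the prompt
# prefix, and the model completes the sequence (image tokens or text).
TEMPLATES = {
    "text_guided_synthesis":  "caption: {caption} image:",
    "face_captioning":        "image: {image_tokens} caption:",
    "age_estimation":         "image: {image_tokens} the age of this person is",
    "expression_recognition": "image: {image_tokens} the expression of this person is",
}

prompt = TEMPLATES["age_estimation"].format(image_tokens="<img>")
print(prompt)
# -> image: <img> the age of this person is
```

Because all tasks share one model and one vocabulary, switching tasks requires no architectural change, only a different prompt.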
Figure 5: Qualitative comparison of text-guided face synthesis.
Table 1: Performance on facial analysis tasks. Note that each baseline is designed for a single task, whereas our model generalizes to all of them. AE: age estimation; ER: expression recognition; RC: race classification; MAC: multi-attribute classification.