When Hearing the Voice, Who Will Come to Your Mind

Abstract

Speech carries rich biological information, including speaker identity attributes such as age, gender, and race. In this paper, we explore a self-supervised method that extracts speaker identity information from high-dimensional speech representations in order to generate face images. Because the biological information carried by a piece of speech can also be expressed in other forms (such as images), we design a cross-modal knowledge distillation method that transfers feature information from the visual domain to the speech domain. The feature vectors obtained through self-supervised learning and knowledge distillation are fed into a GAN-based generative model to produce facial images that contain speaker information. Subjective experiments show that our model performs well on the speaker identification task. Experiments also show that the proposed method effectively establishes a connection between modalities and generates faces rich in biological information.

Publication
International Joint Conference on Neural Networks