Speech is a carrier of rich biological information, including speaker identity attributes such as age, gender, and ethnicity. In this paper, we explore a self-supervised method for extracting speaker identity information from high-dimensional speech representations in order to generate face images. Moreover, since the biological information contained in a given utterance can also be expressed in other modalities (such as images), we design a cross-modal knowledge distillation method that transfers feature information from the visual domain to the speech domain. The feature vectors obtained through self-supervised learning and knowledge distillation are fed into a GAN-based generative model to produce facial images that preserve speaker information. Subjective experiments show that our model performs well on the speaker identification task. Further experiments show that the proposed method effectively establishes connections between modalities and generates faces with rich biological information.
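To make the cross-modal distillation idea concrete, the following is a minimal sketch of one common formulation: a student (speech) embedding is pulled toward a teacher (face) embedding by minimizing a cosine-distance loss. The function names and the choice of cosine distance are illustrative assumptions, not the paper's exact loss.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length (guard against the zero vector).
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def distillation_loss(speech_emb, face_emb):
    # Hypothetical cross-modal distillation loss: 1 - cosine similarity
    # between the student (speech) and teacher (face) embeddings.
    s = l2_normalize(speech_emb)
    f = l2_normalize(face_emb)
    cos = sum(a * b for a, b in zip(s, f))
    return 1.0 - cos

# Embeddings pointing in the same direction incur zero loss,
# so minimizing this loss aligns the two modalities' features.
print(round(distillation_loss([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]), 6))  # 0.0
```

In practice the teacher network would be a pretrained face encoder whose weights are frozen, and the loss above would be combined with the GAN objective during training.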