In virtual reality, talking face generation aims to synthesize realistic talking-face videos from a voice signal and face images, improving the communication experience when user information exchange is limited. In real videos, blinking frequently accompanies speech and is indispensable to a convincing talking face. However, current methods either neglect the generation of eye movements or cannot control blinking in the generated results. To this end, this paper proposes a novel system that produces vivid talking faces with controllable eye blinks, driven by joint features comprising an identity feature, an audio feature, and a blink feature. To disentangle the blinking action, we design three independent features that individually drive the main components of the generated frame: facial appearance, mouth movements, and eye movements. Through adversarial training of the identity encoder, we filter eye-state information out of the identity feature, thereby strengthening the independence of the blink feature. We introduce a blink score as the leading component of the blink feature; after training, its value aligns with human perception, enabling complete and independent control of the eyes. Experimental results on multiple datasets show that our method not only reproduces realistic talking faces but also keeps the blinking pattern and timing fully controllable.
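As a rough illustration of the joint-feature design described above, the following PyTorch sketch shows one plausible way the three features could be concatenated to drive a frame decoder, and how an adversarial eye-state predictor might be attached to the identity feature. All module names, layer sizes, and the 64x64 output resolution are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TalkingFaceGenerator(nn.Module):
    """Sketch of a joint-feature-driven generator: identity, audio, and
    blink features are concatenated and decoded into a video frame.
    All dimensions here are assumed for illustration."""
    def __init__(self, id_dim=128, audio_dim=64, blink_dim=16):
        super().__init__()
        self.decoder = nn.Sequential(
            nn.Linear(id_dim + audio_dim + blink_dim, 512),
            nn.ReLU(),
            nn.Linear(512, 3 * 64 * 64),  # a 64x64 RGB frame, for illustration
            nn.Tanh(),
        )

    def forward(self, f_id, f_audio, f_blink):
        # Joint feature: identity (appearance) + audio (mouth) + blink (eyes)
        z = torch.cat([f_id, f_audio, f_blink], dim=-1)
        return self.decoder(z).view(-1, 3, 64, 64)

class EyeStateDiscriminator(nn.Module):
    """Adversary that tries to predict the eye state (e.g., a blink score)
    from the identity feature alone. Training the identity encoder to fool
    this adversary filters eye-state information out of the identity feature."""
    def __init__(self, id_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(id_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),  # predicted blink score
        )

    def forward(self, f_id):
        return self.net(f_id)

# Illustrative usage with random features (batch of 2):
G = TalkingFaceGenerator()
D_eye = EyeStateDiscriminator()
f_id, f_audio, f_blink = torch.randn(2, 128), torch.randn(2, 64), torch.randn(2, 16)
frame = G(f_id, f_audio, f_blink)  # -> tensor of shape (2, 3, 64, 64)
eye_pred = D_eye(f_id)             # identity encoder is trained to make this uninformative
```

In such a setup, the identity encoder would be optimized to maximize the adversary's error on the eye state, which is what pushes eye information out of the identity feature and leaves blinking under the sole control of the blink feature.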