Do Deepfakes Feel Emotions? A Semantic Approach to Detecting Deepfakes via Emotional Inconsistencies

Salvi Davide; Antonacci Fabio; Bestagini Paolo; Tubaro Stefano
2021

Abstract

Recent advances in deep learning and computer vision have spawned a new class of media forgeries known as deepfakes, which typically consist of artificially generated human faces or voices. The creation and distribution of deepfakes raise many legal and ethical concerns. As a result, the ability to distinguish between deepfakes and authentic media is vital. While deepfakes can create plausible video and audio, it may be challenging for them to generate content that is consistent in terms of high-level semantic features, such as emotions. Unnatural displays of emotion, measured by features such as valence and arousal, can provide significant evidence that a video has been synthesized. In this paper, we propose a novel method for detecting deepfakes of a human speaker using the emotion predicted from the speaker's face and voice. The proposed technique leverages LSTM networks that predict emotion from audio and video low-level descriptors (LLDs). The emotion predicted over time is then used to classify videos as authentic or deepfake through an additional supervised classifier.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops
Files in this item:
Hosler_Do_Deepfakes_Feel_Emotions_A_Semantic_Approach_to_Detecting_Deepfakes_CVPRW_2021_paper.pdf (open access, publisher's version, Adobe PDF, 8.49 MB)

Use this identifier to cite or link to this document: http://hdl.handle.net/11311/1183572