As speech synthesis technology advances, existing synthetic speech detection (SSD) methods generalize poorly to unknown synthesis algorithms. This speech forensics task therefore poses a significant challenge and has attracted considerable attention. We observe that speech synthesis always involves resampling and pooling/smoothing operations, which alter the local autoregressive (AR) properties and the statistical distribution of the speech. In this paper, based on AR modeling and standard deviation statistics, we propose novel front-end speech features, abbreviated ARS, as the input to an SSD classifier. In addition, we construct a new back-end classifier based on dense convolution and short connections, which we name scDenseNet. Experimental results on the ASVspoof2019 logical access (LA) dataset demonstrate that ARS is a strong representation that is sensitive to spoofing attacks and achieves promising SSD performance. The proposed scDenseNet outperforms the earlier DenseNet on both EER and t-DCF, and achieves the best performance among the state-of-the-art classifiers studied in this paper. Furthermore, with scDenseNet as the back end, fusing ARS with popular features such as linear frequency cepstral coefficients (LFCC) significantly improves performance and yields an EER of 0.98%.
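To make the idea of combining local AR modeling with standard deviation statistics concrete, the following is a minimal sketch only, not the ARS extraction procedure of the paper, whose details are not given in this abstract. It assumes a plain least-squares AR fit per frame with the frame's standard deviation appended. The function names, the AR order of 8, and the 400-sample frame with a 160-sample hop are all hypothetical choices made for illustration.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def ar_coefficients(frame, order=8):
    """Least-squares AR fit: frame[n] ~ sum_k a[k] * frame[n - 1 - k]."""
    # Column k of X holds the sample that lies k + 1 steps before each target.
    X = np.stack(
        [frame[order - 1 - k : len(frame) - 1 - k] for k in range(order)],
        axis=1,
    )
    y = frame[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a

def ar_std_features(x, order=8, frame_len=400, hop=160):
    """Per-frame AR coefficients with the frame standard deviation appended.

    Illustrative assumption only: the paper's actual ARS definition may differ.
    """
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    ar = np.array([ar_coefficients(f, order) for f in frames])
    std = frames.std(axis=1, keepdims=True)
    return np.hstack([ar, std])  # shape: (n_frames, order + 1)
```

Under these assumptions, each row of the returned matrix describes one frame, and the matrix could be passed to a classifier in the same way as a spectrogram-like feature map.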
               