Abstract Voice Gender Recognition is a non-trivial task that has been studied extensively in the literature; however, the task becomes more challenging when the voice is surrounded by noise in unconstrained environments. This paper presents two Self-Attention-based models that deliver an end-to-end voice gender recognition system for unconstrained environments. The first model consists of a stack of six self-attention layers and a dense layer. The second model extends the first by placing a set of convolution layers and six inception-residual blocks before the self-attention layers. Both models use Mel-frequency cepstral coefficients (MFCC) as the representation of the audio data and Logistic Regression for classification. The experiments were conducted under unconstrained conditions, including background noise and variation in the speakers' language, accent, age, and emotional state. The results demonstrate that the proposed models achieve accuracies of 95.11% and 96.23%, respectively. These models achieved superior performance in all criteria and are believed to be state-of-the-art for Voice Gender Recognition under unconstrained environments.
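
To make the structure of the first model concrete, the following is a minimal NumPy sketch of a forward pass: six stacked self-attention layers over an MFCC matrix, followed by a dense output unit with a sigmoid (a logistic-regression-style classifier). It is an illustration under stated assumptions, not the authors' implementation: the single attention head, the mean pooling over frames, the 13 MFCC coefficients, the 100-frame input, and the random untrained weights are all hypothetical choices that the abstract does not specify, and the function names are invented for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x has shape (frames, features); each weight matrix has shape
    (features, features). Single-head attention is an assumption.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (frames, frames)
    return softmax(scores, axis=-1) @ v      # (frames, features)


def predict_probability(mfcc, layers, w_out, b_out):
    """Run the attention stack, pool over time, apply a logistic output."""
    h = mfcc
    for w_q, w_k, w_v in layers:
        h = self_attention(h, w_q, w_k, w_v)
    pooled = h.mean(axis=0)                  # mean pooling: an assumption
    logit = pooled @ w_out + b_out           # dense layer with one output unit
    return 1.0 / (1.0 + np.exp(-logit))      # sigmoid, as in logistic regression


# Hypothetical sizes and random (untrained) weights, for illustration only.
rng = np.random.default_rng(0)
n_frames, n_mfcc, n_layers = 100, 13, 6
mfcc = rng.standard_normal((n_frames, n_mfcc))  # stand-in for real MFCC features
layers = [
    tuple(0.1 * rng.standard_normal((n_mfcc, n_mfcc)) for _ in range(3))
    for _ in range(n_layers)
]
w_out = 0.1 * rng.standard_normal(n_mfcc)
b_out = 0.0

prob = predict_probability(mfcc, layers, w_out, b_out)
print(f"predicted probability of the positive class: {prob:.3f}")
```

In practice the MFCC matrix would be computed from an audio clip and the weights would be learned by minimizing a binary cross-entropy loss; the second model described above would additionally pass the input through convolution layers and inception-residual blocks before the attention stack.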
               