Voice Activity Detection (VAD) is a widely used technique for separating speech regions from non-speech regions in an audio signal, with applications in speech coding, noise reduction, and other domains. Although various approaches have been proposed to improve VAD performance, such as ACAM, DCU-10, and Tr-VAD, they often share common limitations: they handle long audio poorly and are time-consuming. To address these issues, a new method called AAT-VAD is proposed, which integrates an adaptive-width attention learning mechanism into the classic transformer framework. The approach extracts Mel-frequency cepstral coefficients (MFCCs) from the audio, adds a masking function to each transformer attention head, and feeds the features produced by the transformer encoder layers into a classifier. Experimental results show that, under different noise interferences, the method achieves an F1-score 12.8% higher than DCU-10 and 0.6% higher than Tr-VAD. Furthermore, its average detection cost function (DCF) value is only 14.3% of DCU-10's and 92.4% of Tr-VAD's, and the test time of AAT-VAD is only 37.4% of Tr-VAD's on the same noisy speech mixtures.
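The abstract does not give the exact form of the per-head masking function, so the sketch below is only a plausible illustration of adaptive-width attention: a soft ramp mask with a learnable span per head, in the style of Sukhbaatar et al.'s adaptive attention span, written in PyTorch. The class name, the `ramp` parameter, and the ramp formulation are assumptions for illustration, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSpanMask(nn.Module):
    """Hypothetical soft masking function for one attention head.

    A minimal sketch of the adaptive-width attention idea described in
    the abstract; the paper's exact formulation may differ.
    """

    def __init__(self, max_span: int, ramp: int = 32):
        super().__init__()
        self.max_span = max_span  # longest distance a head may attend over
        self.ramp = ramp          # width of the soft decay region (assumed)
        # Learnable fraction of the maximum span, one scalar per head.
        self.span = nn.Parameter(torch.tensor(0.5))

    def forward(self, attn_scores: torch.Tensor) -> torch.Tensor:
        # attn_scores: (batch, query_len, key_len) raw scores for one head.
        q_len, k_len = attn_scores.shape[-2:]
        # Absolute distance between each query and key position.
        pos_q = torch.arange(q_len, device=attn_scores.device).unsqueeze(1)
        pos_k = torch.arange(k_len, device=attn_scores.device).unsqueeze(0)
        dist = (pos_q - pos_k).abs().float()
        # Effective span of this head, learned during training.
        z = self.span.clamp(0, 1) * self.max_span
        # Soft mask: 1 inside the span, linear decay to 0 over `ramp` steps.
        mask = ((z + self.ramp - dist) / self.ramp).clamp(0, 1)
        weights = F.softmax(attn_scores, dim=-1) * mask
        # Renormalise so the masked weights still sum to 1 per query.
        return weights / weights.sum(dim=-1, keepdim=True).clamp(min=1e-8)
```

Because the mask shrinks each head's receptive field to the width it actually needs, distant frames are dropped from the attention computation, which is one plausible way such a mechanism could make transformer-based VAD cheaper on long audio, consistent with the reported speed-up over Tr-VAD.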