To make full use of the key information in frame-level features, a DNN-based model for speech enhancement is proposed using self-attention on the feature dimension. Two improvement strategies are adopted to strengthen the attention of the fully connected layers to the effective information in the features. First, the model introduces feature-domain fusion on the input features, using a 136-dimensional combination of features including the MFCC, AMS, RASTA-PLP, cochleagram, and PNCC. The fusion complements information from different domains, including the mel domain and the gammatone domain, thus providing more effective information for self-attention. Second, a feature-level self-attention mechanism is applied to the output of the fully connected layer to emphasize task-relevant information. The feature-level attention enables the fully connected layers to capture the internal correlations between different features and to reduce the redundancy introduced by combining multiple features. The experimental results show that, compared to the noisy signals, the proposed algorithm increased the PESQ, fwsegSNR, and STOI by 40.92%, 60.2%, and 8.31%, respectively, under the matched noisy condition, and by 23.64%, 32.55%, and 3.4%, respectively, under the mismatched noisy condition. Comparisons between different neural networks indicate that the proposed algorithm is superior to the compared algorithms in both the matched and mismatched situations while using fewer context frames. Therefore, the proposed model can effectively utilize the key information of the features to suppress noise, thereby improving speech quality and generalizing to mismatched noise conditions.
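The two strategies in the abstract can be sketched in NumPy. The per-domain feature dimensions below are illustrative assumptions (the abstract only states a 136-dimensional fused vector), and the attention form shown, a query-key reweighting of each feature within a frame, is a plausible stand-in for the paper's feature-level self-attention, not its confirmed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-frame features from each acoustic domain.
# The individual dimensions are assumptions chosen to sum to 136.
n_frames = 4
mfcc        = rng.standard_normal((n_frames, 31))
ams         = rng.standard_normal((n_frames, 15))
rasta_plp   = rng.standard_normal((n_frames, 13))
cochleagram = rng.standard_normal((n_frames, 64))
pncc        = rng.standard_normal((n_frames, 13))

# Strategy 1: feature-domain fusion by concatenation along the
# feature dimension, giving the 136-D input vector per frame.
fused = np.concatenate([mfcc, ams, rasta_plp, cochleagram, pncc], axis=1)

# Strategy 2: feature-level self-attention on an FC-layer output h.
# Each feature of a frame receives a weight from a query-key score,
# and the weights reweight h to emphasize informative features.
d = fused.shape[1]
Wq = rng.standard_normal((d, d)) * 0.05   # hypothetical learned projection
Wk = rng.standard_normal((d, d)) * 0.05   # hypothetical learned projection
h = fused                                  # stand-in for the FC-layer output

q = h @ Wq
k = h @ Wk
# Softmax over the feature dimension: one weight per feature, per frame.
weights = softmax(q * k / np.sqrt(d), axis=1)
h_att = weights * h                        # attention-reweighted features

print(fused.shape, h_att.shape)            # (4, 136) (4, 136)
```

In this sketch the attention produces a per-feature gating rather than a full feature-by-feature attention matrix; either reading is consistent with "attention on the feature dimension" as stated in the abstract.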