Rainstorms, insect swarms, and galloping horses produce “sound textures”: natural sounds that arise from the superposition of many similar acoustic events. Amid the steady stream of advances in generative modeling, deep convolutional neural networks (CNNs) have proven tremendously successful for image and sound synthesis. However, existing state-of-the-art sound texture generative models simply treat sound texture signals as one-dimensional images, ignoring the differences between the human visual and auditory systems. This paper considers mel-frequency statistical features, which are designed according to the human auditory system and are widely regarded as the dominant features for sound identification. We first construct a CNN structure, termed the mel-frequency CNN (MF-CNN), that extracts mel-frequency features from sounds losslessly. We then propose a novel sound texture generative model that incorporates the MF-CNN into a convolutional generative network composed of cascading upsampling groups. A jointly alternating back-propagation algorithm is proposed to train the overall network: feedback from the MF-CNN guides the gradients in both the inferential and the learning back-propagation passes, driving the mel-frequency features of the synthesized sounds closer to those of natural ones. Moreover, the proposed generative model can be extended to other sound synthesis tasks.
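
As a rough illustration of the MF-CNN idea, the sketch below expresses mel-frequency feature extraction entirely with fixed convolutional operations: the STFT becomes a 1-D convolution with DFT-basis kernels and the mel filterbank a fixed linear map, so the extractor is differentiable and can sit inside a larger generative network. The class name, layer sizes, and hyperparameters are illustrative assumptions, not the authors' exact architecture; the mel filterbank is taken from torchaudio.

```python
import torch
import torch.nn as nn
import torchaudio

class MelFrequencyCNN(nn.Module):
    """Fixed-weight CNN mapping a waveform to log-mel features.

    The STFT is written as a 1-D convolution with windowed DFT-basis
    kernels, and the mel filterbank as a fixed linear map over the
    frequency bins, so the whole extractor is differentiable.
    """
    def __init__(self, n_fft=512, hop=128, n_mels=64, sr=16000):
        super().__init__()
        # DFT basis (real and imaginary parts) as convolution kernels.
        k = torch.arange(n_fft, dtype=torch.float32)
        freqs = torch.arange(n_fft // 2 + 1, dtype=torch.float32)
        angles = 2 * torch.pi * freqs[:, None] * k[None, :] / n_fft
        window = torch.hann_window(n_fft)
        self.register_buffer("real", (torch.cos(angles) * window).unsqueeze(1))
        self.register_buffer("imag", (-torch.sin(angles) * window).unsqueeze(1))
        # Mel filterbank as a fixed (n_freqs, n_mels) matrix.
        mel = torchaudio.functional.melscale_fbanks(
            n_freqs=n_fft // 2 + 1, f_min=0.0, f_max=sr / 2,
            n_mels=n_mels, sample_rate=sr)
        self.register_buffer("mel", mel)
        self.hop = hop

    def forward(self, wav):                       # wav: (B, 1, T)
        re = nn.functional.conv1d(wav, self.real, stride=self.hop)
        im = nn.functional.conv1d(wav, self.imag, stride=self.hop)
        power = re ** 2 + im ** 2                 # (B, n_freqs, frames)
        mel = torch.einsum("bft,fm->bmt", power, self.mel)
        return torch.log(mel + 1e-6)              # log-mel features
```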
               
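The jointly alternating back-propagation scheme can likewise be sketched only under stated assumptions: a toy generator stands in for the cascading-upsampling network, plain gradient steps stand in for whatever inference dynamics the paper actually uses, and random tensors stand in for natural recordings. Only the alternation itself, latent inference followed by weight learning, with both passes steered by MF-CNN feature feedback, is taken from the abstract. The sketch reuses the MelFrequencyCNN class defined above.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Illustrative stand-in for the cascading-upsampling generator."""
    def __init__(self, z_dim=64, out_len=16000):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, out_len), nn.Tanh())
    def forward(self, z):
        return self.net(z).unsqueeze(1)           # (B, 1, T) waveform

gen = ToyGenerator()
mf_cnn = MelFrequencyCNN()                        # fixed extractor from above
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)

real = torch.randn(4, 1, 16000)                   # placeholder "natural" sounds
z = torch.randn(4, 64, requires_grad=True)        # latent codes to be inferred

def mel_loss(fake, ref):
    # MF-CNN feedback: match mel-frequency features of fake and real sounds.
    return ((mf_cnn(fake) - mf_cnn(ref)) ** 2).mean()

for step in range(100):
    # Inferential back-propagation: refine latents z, generator frozen.
    for _ in range(5):
        grad_z, = torch.autograd.grad(mel_loss(gen(z), real), z)
        z = (z - 0.1 * grad_z).detach().requires_grad_(True)
    # Learning back-propagation: update generator weights given inferred z.
    opt.zero_grad()
    mel_loss(gen(z), real).backward()
    opt.step()
```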