Abstract Among various Sound Event Detection (SED) systems, Recurrent Neural Networks (RNNs), such as the long short-term memory (LSTM) unit and the gated recurrent unit (GRU), are used to capture temporal dependencies, but the length of the dependencies they can capture is limited, so they fail to model long-duration sound events. Moreover, RNNs cannot process sequences in parallel, which lowers their efficiency and industrial value. Given these shortcomings, we propose to capture temporal dependencies with dilated convolution (and causal dilated convolution), which preserves high time resolution and obtains longer temporal dependencies while keeping the filter size and network depth unchanged. In addition, dilated convolution can be parallelized, giving it higher efficiency and industrial value. Based on this, we propose Single-Scale Fully Convolutional Networks (SS-FCN), composed of convolutional neural networks and dilated convolutional networks, with the former providing frequency invariance and the latter capturing temporal dependencies. Because dilated convolution lets us control the length of the temporal dependencies, we observe that SS-FCN, which models a single dependency length, achieves superior detection performance only for a limited range of event types. For better performance, we propose Multi-Scale Fully Convolutional Networks (MS-FCN), which introduce a feature fusion module that captures both long- and short-term dependencies by fusing features with different temporal dependency lengths. The proposed methods achieve competitive performance on three major datasets with higher efficiency. The results show that SED systems based on fully convolutional networks merit further research.
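
The abstract does not include the authors' implementation, so the following is a minimal illustrative sketch of the core idea: a stack of 1-D dilated convolutions whose receptive field over time grows exponentially while filter size, depth, and time resolution stay fixed. PyTorch, the class name DilatedTemporalBlock, and all layer sizes (64 channels, kernel size 3, 4 dilation levels) are assumptions chosen for demonstration, not the paper's actual configuration.

    # Illustrative sketch only; not the authors' published code.
    import torch
    import torch.nn as nn

    class DilatedTemporalBlock(nn.Module):
        """Stack of 1-D dilated convolutions over the time axis.

        With kernel size k and dilations 1, 2, 4, ..., 2**(L-1), the
        receptive field grows to 1 + (k - 1) * (2**L - 1) frames while
        filter size and depth stay fixed -- the property the abstract
        attributes to dilated convolution.
        """

        def __init__(self, channels: int = 64, kernel_size: int = 3, levels: int = 4):
            super().__init__()
            layers = []
            for i in range(levels):
                d = 2 ** i  # dilation doubles at each level
                layers += [
                    nn.Conv1d(channels, channels, kernel_size,
                              padding=d * (kernel_size - 1) // 2, dilation=d),
                    nn.ReLU(),
                ]
            self.net = nn.Sequential(*layers)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time); time resolution is preserved,
            # so frame-wise event activity can still be predicted.
            return self.net(x)

    x = torch.randn(8, 64, 500)            # 8 clips, 500 time frames
    y = DilatedTemporalBlock()(x)
    assert y.shape == x.shape              # same length in and out

With these assumed values (kernel size 3, dilations 1, 2, 4, 8), the receptive field is 1 + 2 * 15 = 31 frames from only four layers; a plain convolution stack of the same depth would cover just 9 frames. Unlike an RNN, every frame is computed independently of its predecessors, so the whole sequence can be processed in parallel.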