Compared with traditional handcrafted features, deep learning has greatly improved the performance of scene parsing. However, it remains challenging under various environmental conditions caused by imaging limitations. Thermal imaging cameras have several advantages over visible-spectrum cameras, such as operation in total darkness, robustness to shadow effects, insensitivity to illumination variations, and a strong ability to penetrate smog and haze. These advantages make thermal imaging cameras ideal for parsing semantic objects in both daytime and nighttime scenes. In this paper, we propose a novel multiscale feature fusion and enhancement network (MFFENet) for accurate parsing of RGB–thermal urban road scenes even when the quality of the available RGB data is compromised. The proposed MFFENet consists of two encoders, a feature fusion layer, and a multi-label supervision layer. We concatenate the multiscale features with the features that contain global semantic information. Furthermore, we explore the cross-modal fusion of RGB and thermal features at multiple stages, rather than fusing them only once at a low or high stage. Then, we propose a spatial attention mechanism module that assigns higher weights to (focuses more on) the foreground area, allowing MFFENet to emphasize foreground objects. Finally, multi-label supervision is introduced to optimize the parameters of the proposed MFFENet. Experimental results confirm that the proposed MFFENet outperforms similar high-performing methods.
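The abstract describes cross-modal fusion of RGB and thermal features followed by a spatial attention map that up-weights foreground regions. The minimal numpy sketch below illustrates that general idea under stated assumptions: element-wise summation as the fusion step and a sigmoid over the channel-mean as the attention map. The function name, the fusion operator, and the attention formulation are hypothetical illustrations, not the authors' actual MFFENet module.

```python
import numpy as np

def spatial_attention_fuse(rgb_feat, thermal_feat):
    """Fuse RGB and thermal feature maps of shape (C, H, W), then
    re-weight each spatial location with a single-channel attention map.
    Hypothetical sketch -- not the exact module proposed in the paper."""
    # cross-modal fusion: element-wise sum of the two modalities
    fused = rgb_feat + thermal_feat
    # collapse channels to one spatial map, squash to (0, 1) with a sigmoid
    attn = 1.0 / (1.0 + np.exp(-fused.mean(axis=0, keepdims=True)))
    # locations with stronger fused responses (e.g. foreground) get higher weight
    return fused * attn

# usage: two toy 2-channel, 3x3 feature maps
rgb = np.ones((2, 3, 3))
thermal = np.zeros((2, 3, 3))
out = spatial_attention_fuse(rgb, thermal)
print(out.shape)  # (2, 3, 3)
```

In a real network the attention weights would be learned (e.g. by a small convolution over the fused features) rather than computed from a fixed channel-mean; this sketch only shows how a per-location weight modulates a fused feature map.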