Sign language recognition (SLR) enables the deaf and speech-impaired community to integrate and communicate effectively with the rest of society. Word-level, or isolated, SLR is a fundamental yet complex task whose main objective is to correctly recognize signed words. Sign language involves fast, intricate hand, body, and face movements, as well as mouthing cues, which make the task very challenging. Several input modalities, including RGB, optical flow, RGB-D, and pose/skeleton, have been proposed for SLR. However, these modalities are complex, and the state-of-the-art (SOTA) methodologies built on them tend to be exceedingly sophisticated and over-parameterized. In this paper, we focus on hand and body poses as the input modality. One major problem in pose-based SLR is extracting the most valuable and distinctive features for all skeleton joints. To this end, we propose an accurate, efficient, and lightweight pose-based pipeline leveraging a graph convolutional network (GCN) along with residual connections and a bottleneck structure. The proposed architecture not only facilitates efficient learning during training, yielding significantly improved accuracy, but also reduces computational complexity. With this architecture in place, we achieve improved accuracy on three subsets of the WLASL dataset and on the LSA-64 dataset. Our model outperforms previous SOTA pose-based methods with relative improvements of 8.91%, 27.62%, and 26.97% on the WLASL-100, WLASL-300, and WLASL-1000 subsets, respectively. It also outperforms previous SOTA appearance-based methods, with relative improvements of 2.65% and 5.15% on the WLASL-300 and WLASL-1000 subsets, respectively. On the LSA-64 dataset, our model achieves 100% test recognition accuracy. We obtain this performance at a far lower computational cost than existing appearance-based methods.
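
To make the idea of a bottleneck residual GCN block concrete, the following is a minimal PyTorch sketch. The tensor layout (batch, channels, frames, joints), the channel-reduction ratio, the learnable adjacency initialization, and all names (`BottleneckGCNBlock`, `reduction`, `num_joints`) are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch: a residual GCN block with a channel bottleneck for
# pose-based SLR. Shapes and hyperparameters are assumptions for
# illustration, not the authors' published architecture.
import torch
import torch.nn as nn


class BottleneckGCNBlock(nn.Module):
    """Residual graph-convolution block with a channel bottleneck.

    Input:  (N, C, T, V) = (batch, channels, frames, skeleton joints)
    Output: same shape, so blocks can be stacked.
    """

    def __init__(self, channels: int, num_joints: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction  # bottleneck width (assumed ratio)
        # Learnable adjacency, initialized to identity; in practice this
        # would start from a normalized skeleton adjacency matrix.
        self.adj = nn.Parameter(torch.eye(num_joints))
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)
        self.expand = nn.Conv2d(hidden, channels, kernel_size=1)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        h = self.relu(self.reduce(x))                    # 1x1 conv: shrink channels
        h = torch.einsum("nctv,vw->nctw", h, self.adj)   # aggregate over joints
        h = self.bn(self.expand(h))                      # 1x1 conv: restore channels
        return self.relu(h + residual)                   # residual connection


if __name__ == "__main__":
    # Example: 64-channel keypoint embeddings, 32 frames, 27 joints.
    x = torch.randn(8, 64, 32, 27)
    block = BottleneckGCNBlock(channels=64, num_joints=27)
    print(block(x).shape)  # torch.Size([8, 64, 32, 27])
```

The bottleneck (1x1 reduce, graph aggregation, 1x1 expand) is what keeps the parameter count low relative to full-width graph convolutions, while the residual path preserves gradient flow when many such blocks are stacked.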