According to the World Health Organization, several factors have affected the accurate reporting of SARS-CoV-2 outbreak status, such as limited data collection resources, cultural and educational diversity, and inconsistent outbreak reporting from different sectors. Driven by this challenging situation, this study investigates the feasibility of using social network data to develop a reliable early information surveillance and warning system for pandemic outbreaks. To this end, an enhanced framework of three inherently interlinked subsystems is proposed. The first subsystem includes data collection and integration mechanisms, data preprocessing, and hybrid sentiment analysis tools to identify tweet sentiment taxonomies and quantitatively estimate public awareness. The second subsystem comprises two units: a feature extraction unit that identifies, selects, embeds, and balances feature vectors, and a classifier fitting and training unit. This subsystem is designed to capture the most effective linguistic feature combinations with more spatial evidence by using a variety of approaches, including linear classifiers, MLPs, RNNs, and CNNs, as well as pre-trained word embedding algorithms. The third is the modeling and situational awareness evaluation subsystem, which measures temporal associations between pandemic-relevant social network activities and officially announced infection counts in the most hazardous geolocations. The proposed framework was developed and tested using a combination of static datasets and real-time scraped Twitter data. The experimental results showed that the framework performed remarkably well in assessing the temporal associations between public awareness and outbreak status. They also showed that the Decision Tree Classifier with Unigram+TF-IDF feature vectors outperformed other conventional models for sentiment classification and geolocation classification, with accuracies of 94.3% and 80.8%, respectively.
As indicated, conventional machine learning algorithms did not achieve a precision of more than 80%, while, for instance, MLP with a self-embedding layer, Word2Vec, and GloVe pre-trained word embeddings yielded very poor accuracies of 10%, 36%, and 32%, respectively. However, adding the PoS tag one-hot encoding embedding increased the validation accuracy from 36% to approximately 89%, and the best performance for the second subsystem was achieved by Bi-LSTM with RoBERTa word embedding, with an accuracy of 96%. These results indicate that the proposed framework can proactively capture the potential hazards associated with the prevalence of infectious diseases, serving as an effective early detection and info-surveillance awareness system.
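
The temporal association measured by the third subsystem can be illustrated with a small sketch. The abstract does not state which association measure the study uses, so the example below assumes a lagged Pearson correlation between daily tweet counts and daily reported infection counts. The function names (`pearson`, `lagged_correlations`, `best_lag`) and the toy series are hypothetical and serve only as illustration; they are not data or code from the study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no association signal
    return cov / sqrt(var_x * var_y)


def lagged_correlations(tweets, cases, max_lag):
    """Correlate tweets[t] with cases[t + lag] for lag = 0..max_lag.

    Assumption (not from the source): social activity may lead
    official counts by some number of days.
    """
    result = {}
    for lag in range(max_lag + 1):
        x = tweets[: len(tweets) - lag]  # drop the last `lag` days
        y = cases[lag:]                  # drop the first `lag` days
        result[lag] = pearson(x, y)
    return result


def best_lag(tweets, cases, max_lag):
    """Lag (in days) at which the correlation is highest."""
    corr = lagged_correlations(tweets, cases, max_lag)
    return max(corr, key=corr.get)


# Toy, made-up series: case counts follow tweet volume two days later.
toy_tweets = [1, 3, 2, 5, 4, 8, 6, 9, 7, 10]
toy_cases = [0, 0] + [10 * t for t in toy_tweets[:-2]]

if __name__ == "__main__":
    for lag, r in lagged_correlations(toy_tweets, toy_cases, 4).items():
        print(f"lag={lag} days  r={r:.3f}")
```

In a setting like the one described, a lag with a consistently high correlation would suggest how far in advance social network activity tends to move relative to officially announced counts, though correlation alone does not establish that relationship.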
               