Recognizing author identity from digital footprints without a large corpus of documents from an individual is of keen interest to security researchers and government agencies. Users reveal aspects of their personality through the content they share with their social media followers and through the patterns of their interactions on online networking platforms. This study examines the potency of emerging natural language processing (NLP) methods in analyzing social network activity. A linguostylistic personality traits assessment (LPTA) system is developed to estimate Twitter users’ personality traits from their tweets using the Myers-Briggs Type Indicator (MBTI) and Big Five personality scales. A novel input representation mechanism is proposed that converts tweets into real-valued vectors using frequency, co-occurrence, and context (FCC) measures. Other prevalent text representation schemes, such as one-hot encoding, count-based vectorization, and pretrained language model representations, are used as comparators. A genetic algorithm (GA) approach is proposed to reduce the feature set and increase the efficacy of the extracted features. The developed system outperforms state-of-the-art approaches by reliably estimating users’ latent personality traits while using 50 or fewer tweets per user.
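
The abstract does not specify how the GA encodes feature subsets, which operators it uses, or how fitness is measured. As a rough illustration of GA-based feature reduction in general, and not the authors' method, the sketch below evolves a 0/1 mask over features. The function name `ga_select`, the per-feature `relevance` scores, and the fixed per-feature `penalty` are all hypothetical; in practice the fitness would more plausibly be a classifier's validation score on the masked features.

```python
import random

def ga_select(relevance, penalty=0.2, pop_size=30, generations=40,
              mutation_rate=0.05, seed=0):
    """Return a 0/1 feature mask found by a simple genetic algorithm.

    Illustrative only: `relevance` holds made-up per-feature scores, and
    at least two features are required for one-point crossover.
    """
    rng = random.Random(seed)  # local RNG keeps runs reproducible
    n = len(relevance)

    def fitness(mask):
        # Toy objective (assumption): reward relevant features and charge
        # a fixed cost per selected feature, so small subsets are preferred.
        return sum(r - penalty for r, bit in zip(relevance, mask) if bit)

    def tournament(pop):
        # Binary tournament: pick two individuals, keep the fitter one.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    # Start from random masks.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        best = max(pop, key=fitness)
        children = [best[:]]  # elitism: carry the best mask over unchanged
        while len(children) < pop_size:
            p1, p2 = tournament(pop), tournament(pop)
            cut = rng.randint(1, n - 1)           # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]            # bit-flip mutation
            children.append(child)
        pop = children
    return max(pop, key=fitness)

# Hypothetical relevance scores for six features.
relevance = [0.9, 0.1, 0.8, 0.05, 0.7, 0.0]
mask = ga_select(relevance)
print(mask)
```

With this toy objective, features whose relevance exceeds the penalty tend to be kept and the rest dropped, which mirrors the stated goal of shrinking the feature set while keeping the useful features.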
               