The mix of native language with other languages (code-mixing) in social media has posed a severe challenge for language identification (LID) systems. It has encouraged research on code-mixed LID solutions.… Click to show full abstract
The mix of native language with other languages (code-mixing) in social media has posed a severe challenge for language identification (LID) systems. It has encouraged research on code-mixed LID solutions. Four things have been identified in this study, such as techniques, challenges, and dataset availability with corresponding quality criteria and developed a comprehensive framework for code-mixed LID. Also, we identified gaps and future work opportunities in tackling code-mixed LID challenges. Based on our analysis of reviewed studies, we outlined key points for future research in code-mixed LID. We demonstrated a taxonomy of applied techniques for code-mixed LID and highlighted the different technique variants. In code-mixed LID tasks, we discovered four significant challenges: ambiguity, lexical borrowing, non-standard words, and intra-word code-mixing. This systematic literature review recognised 32 code-mixed datasets available for LID. We proposed five features to describe the quality criteria datasets, such as the number of instances or sentences, percentage of code-mixed types in the data, number of tokens, number of unique tokens, and average sentence length. Finally, we synthesised the methodologies and proposed a conceptual framework for subsequent studies through our literature analysis.
               
Click one of the above tabs to view related content.