Asian languages share common morphological analysis errors in word segmentation, which normally propagate to higher-level processing, namely POS tagging, syntactic parsing, word extraction, and NER, as we discuss in this research. We introduce the Thai character cluster (TCC) to reduce the errors propagated from word segmentation and POS tagging by incorporating it into the character representation layer of a BiLSTM for NER. The initial NER model is created from the original THAI-NEST named-entity tagged corpus by applying the best-performing BiLSTM-CNN-CRF model with word, part-of-speech, and character cluster embeddings. We identify the errors and improve the consistency of the NE annotation through our holdout method, retraining the model with the corrected training set. After the iteration, the overall annotation F1-score reaches 89.22%, an improvement of 16.21% over the model trained on the original corpus. Our iterative verification is a promising method for low-resource language modeling. As a result, a new NE silver-standard corpus is generated for the Thai NER task, called the BKD Corpus (Bangkok Data NE tagged Corpus). The consistency of annotation is checked and revised in line with the improved scope of NE detection provided by TCC, which can recover from errors in word segmentation.
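The holdout verification idea can be illustrated with a minimal sketch. This is not the authors' implementation: the BiLSTM-CNN-CRF model is replaced here by a toy majority-tag lookup, and the function names `train_majority_tagger` and `flag_disagreements`, the fold assignment, and the English example tokens are all assumptions made for illustration. The sketch only shows the general pattern of training on part of the corpus, predicting on the held-out part, and flagging tokens where the prediction disagrees with the existing annotation so they can be reviewed and corrected before retraining.

```python
from collections import Counter, defaultdict


def train_majority_tagger(sentences):
    """Toy stand-in for the NER model (assumption): most frequent tag per token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for token, tag in sentence:
            counts[token][tag] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def flag_disagreements(corpus, n_folds=3):
    """Flag annotations that a model trained on the other folds disagrees with.

    Returns a list of (sentence_index, token_index, annotated_tag, predicted_tag).
    Tokens unseen in the training folds are skipped rather than flagged.
    """
    flagged = []
    for fold in range(n_folds):
        # Hypothetical fold assignment: sentence i belongs to fold i % n_folds.
        train = [s for i, s in enumerate(corpus) if i % n_folds != fold]
        model = train_majority_tagger(train)
        for i, sentence in enumerate(corpus):
            if i % n_folds != fold:
                continue
            for j, (token, annotated) in enumerate(sentence):
                predicted = model.get(token)
                if predicted is not None and predicted != annotated:
                    flagged.append((i, j, annotated, predicted))
    return flagged
```

In an iterative setting, the flagged positions would be inspected, the annotations corrected where the corpus is inconsistent, and the procedure repeated on the corrected training set.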
               