A large part of big data consists of text documents such as papers, patents or articles. To analyze text data, we have to preprocess the text documents and build a… Click to show full abstract
A large part of big data consists of text documents such as papers, patents or articles. To analyze text data, we have to preprocess the text documents and build a structured data based on a document-word matrix using various text mining techniques. This is because statistics and machine learning algorithms used in text analysis require structured train data. The row and column of the matrix are document and word, respectively. The element of the matrix represents the frequency value of the word occurring in each document. In general, because the number of words is much larger than the number of documents, most elements have zero values. Due to the sparsity problem caused by inflated zeros, the performance of the predictive model has decreased. In this paper, we propose a method to solve the sparsity problem and improve the model performance in text data analysis. We perform compound Poisson linear modeling to make the proposed method. To show the performance of our proposed method, we collect and analyze the patent documents from patent databases. In our experimental results, we compared the value of the Akaike information criterion (AIC) of the proposed model with traditional models, such as linear model, generalized linear model and zero-inflated Poisson model. Additionally, we illustrated that the AIC value of our proposed model is smaller than others. Therefore, we verify the validity of this paper.
               
Click one of the above tabs to view related content.