EDCM, the exponential-family approximation to the Dirichlet Compound Multinomial (DCM) proposed by Elkan (2006), is an efficient statistical model for high-dimensional, sparse count data. EDCM models capture the burstiness phenomenon correctly while being many times faster than DCM. This work proposes using the Minimum Message Length (MML) criterion to determine the number of components that best describes the data in a finite EDCM mixture model. Parameter estimation is based on the previously proposed Deterministic Annealing Expectation-Maximization (DAEM) approach. The proposed unsupervised algorithm is validated on several real applications: text document modeling, topic novelty detection, and hierarchical image clustering. A comparison with results obtained using other information-theoretic selection criteria is provided.
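To make the burstiness claim concrete, the following minimal sketch evaluates the log-probability of a single count vector under an EDCM component. The abstract does not state the density, so the form used here is an assumption: q(x) = n! Γ(s)/Γ(s+n) ∏_{w: x_w ≥ 1} β_w / x_w, with n = Σ x_w and s = Σ β_w, as commonly attributed to Elkan (2006). The names `edcm_log_prob`, `counts`, and `beta` are illustrative only and do not come from the paper.

```python
import math


def edcm_log_prob(counts, beta):
    """Log-probability of one count vector under an (assumed) EDCM density.

    counts: non-negative integer counts x_w, one per vocabulary entry.
    beta:   positive EDCM parameters beta_w, same length as counts.
    """
    if len(counts) != len(beta):
        raise ValueError("counts and beta must have the same length")
    if any(b <= 0 for b in beta):
        raise ValueError("all beta values must be positive")

    n = sum(counts)  # total number of observed events (e.g. words)
    s = sum(beta)    # sum of the parameters

    # log n! + log Gamma(s) - log Gamma(s + n)
    log_p = math.lgamma(n + 1) + math.lgamma(s) - math.lgamma(s + n)

    # Only entries with a nonzero count contribute a beta_w / x_w factor,
    # which is what keeps evaluation cheap on sparse data.
    for x, b in zip(counts, beta):
        if x >= 1:
            log_p += math.log(b) - math.log(x)
    return log_p


# Two documents of length 2 over a 2-word vocabulary, small beta values.
bursty = edcm_log_prob([2, 0], [0.1, 0.1])  # one word repeated
spread = edcm_log_prob([1, 1], [0.1, 0.1])  # two different words
```

Under these assumed parameters, `bursty` exceeds `spread`: with small β values, repeating a word already seen is more probable than drawing a new one, which is the burstiness behaviour the model is meant to capture. The loop touches only nonzero counts, which reflects why the approximation suits sparse data.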
               