Topic‐Sentiment Hybrid Networks for Explainable Document Clustering: A Probabilistic Multi‐Dimensional Similarity Analysis

This study introduces a statistical methodology for document clustering that integrates multiple dimensions of textual similarity through network topology analysis. The proposed methodology, which we call Multi‐dimensional Similarity Network Analysis (MSNA), extends traditional document‐clustering approaches by combining semantic embeddings, topic probability distributions, and emotion probability distributions into a unified similarity measure. We formalize this through a weighted combination of Jensen‐Shannon divergences across the different probability spaces, which yields a comprehensive similarity network. Clustering is achieved through a community detection algorithm that optimizes a multi‐objective modularity function accounting for the different similarity dimensions. We prove the statistical consistency of our approach and derive bounds for the clustering performance under mild regularity conditions. The methodology is validated on a large‐scale data set of Airbnb reviews from Sardinia, Italy, containing text content, topic distributions, and emotional features. Results show significant improvements in both clustering quality (a higher average silhouette score) and interpretability compared to traditional single‐dimension approaches. From an empirical perspective, the synthetic data validation demonstrates robust performance with topic strength in the range and emotion strength in , achieving a mean Adjusted Rand Index of 0.44. The application to real‐world data identifies five distinct clusters through PROCSIMA (PRObabilistic Clustering SIMilarity Analysis), with subsequent SMARTS (SeMantic Analysis of Review Topics and Sentiment) analysis revealing interpretable community structures within each cluster. The framework's ability to simultaneously capture the semantic, thematic, and emotional aspects of text makes it particularly valuable for applications in customer experience analysis and service quality monitoring.
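
To make the idea of a weighted multi‐dimensional similarity concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes each document is a dictionary with hypothetical keys "embedding", "topics", and "emotions", uses equal weights by default, maps cosine similarity and Jensen‐Shannon divergence onto [0, 1], and groups documents by connected components of a thresholded similarity network. The connected‐components step is a deliberately simple stand‐in for the multi‐objective modularity optimization described in the abstract, and the threshold of 0.8 and the toy data are illustrative assumptions.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) of two discrete distributions; in [0, 1]."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mix of semantic, topic, and emotion similarity (weights are an assumption)."""
    w_sem, w_topic, w_emo = weights
    sem = (1 + cosine_similarity(doc_a["embedding"], doc_b["embedding"])) / 2
    topic = 1 - js_divergence(doc_a["topics"], doc_b["topics"])
    emo = 1 - js_divergence(doc_a["emotions"], doc_b["emotions"])
    return w_sem * sem + w_topic * topic + w_emo * emo


def threshold_components(docs, threshold=0.8):
    """Link documents whose similarity exceeds the threshold; return connected components."""
    n = len(docs)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if combined_similarity(docs[i], docs[j]) > threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        components.append(sorted(component))
    return components


# Toy data (made up for illustration): two pairs of similar documents.
docs = [
    {"embedding": [1.0, 0.0], "topics": [0.9, 0.1], "emotions": [0.8, 0.2]},
    {"embedding": [0.9, 0.1], "topics": [0.8, 0.2], "emotions": [0.7, 0.3]},
    {"embedding": [0.0, 1.0], "topics": [0.1, 0.9], "emotions": [0.2, 0.8]},
    {"embedding": [0.1, 0.9], "topics": [0.2, 0.8], "emotions": [0.3, 0.7]},
]
groups = threshold_components(docs)
print(groups)
```

On this toy input the first two and the last two documents end up in separate groups. In practice the weights would be tuned or estimated, and a modularity‐based community detection method would replace the threshold step.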

Keywords: document clustering; methodology; analysis; topic; similarity

Journal Title: Applied Stochastic Models in Business and Industry
Year Published: 2025
