Language independent algorithm for clustering text documents with respect to their sentiment

Jerzy Korzeniewski; Adam Idczak

doi:https://doi.org/10.59170/stattrans-2024-034


Jerzy Korzeniewski
Department of Demography, Faculty of Economics and Sociology, University of Lodz, Lodz, Poland
ORCID: https://orcid.org/0000-0001-6526-5921

Adam Idczak
Department of Statistical Methods, Faculty of Economics and Sociology, University of Lodz, Lodz, Poland
ORCID: https://orcid.org/0000-0001-9676-2410

Statistics in Transition new series, vol. 25, 2024, 3, pages: 175-185
Published online: 4 September 2024
https://doi.org/10.59170/stattrans-2024-034

Citation: Idczak A., Korzeniewski J., 2024. Language independent algorithm for clustering text documents with respect to their sentiment. Statistics in Transition new series, 25(3), pp. 175-185. https://doi.org/10.59170/stattrans-2024-034


ABSTRACT

Determining the sentiment of a written text is an important task in text research. This task can be performed in either a supervised or an unsupervised manner. In this paper, we propose a novel unsupervised algorithm for documents written in any language, using documents written in Polish as an example. The clustering of Polish language texts with respect to their sentiment is poorly developed in the literature on the subject. The novelty of the proposed algorithm lies in abandoning stoplists and lemmatisation. Instead, we propose translating all documents into English and performing a two-stage document grouping. In the first step of the algorithm, selected documents are assigned to a class of positive or negative documents based on a set of lexical and grammatical rules as well as a set of key-terms. Key-terms do not have to be entered by the user; the algorithm finds them. In the second step, the remaining documents are attached to one of the classes according to rules based on the vocabulary found in the documents grouped in the first step. The algorithm was tested on three corpora of documents and achieved very good results.
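The two-stage grouping described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the seed terms, the single negation rule, the `threshold` parameter, and all function names are hypothetical choices made for illustration. In particular, the paper's algorithm finds its key-terms itself, whereas this sketch hard-codes a small seed list. The sketch also assumes the documents have already been translated into English.

```python
import re
from collections import Counter

# Hypothetical seed key-terms (assumption: the paper's algorithm discovers
# its key-terms automatically; here they are hard-coded for illustration).
POSITIVE_SEEDS = {"good", "great", "excellent", "recommend"}
NEGATIVE_SEEDS = {"bad", "terrible", "poor", "awful"}
NEGATIONS = {"not", "no", "never"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (no stoplist, no lemmatisation)."""
    return re.findall(r"[a-z']+", text.lower())


def rule_score(tokens):
    """Stage-1 score: +1 per positive seed, -1 per negative seed.

    A seed directly preceded by a negation word has its polarity flipped
    (a stand-in for the paper's lexical and grammatical rules).
    """
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE_SEEDS) - (tok in NEGATIVE_SEEDS)
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    return score


def cluster(documents, threshold=2):
    """Assign each document to 'pos' or 'neg' in two stages."""
    tokenized = [tokenize(doc) for doc in documents]
    labels = [None] * len(documents)

    # Stage 1: label only documents with a clear rule-based score.
    for i, tokens in enumerate(tokenized):
        score = rule_score(tokens)
        if score >= threshold:
            labels[i] = "pos"
        elif score <= -threshold:
            labels[i] = "neg"

    # Collect the vocabulary of the documents grouped in stage 1.
    vocab = {"pos": Counter(), "neg": Counter()}
    for tokens, label in zip(tokenized, labels):
        if label is not None:
            vocab[label].update(set(tokens))

    # Stage 2: attach each remaining document to the class whose
    # stage-1 vocabulary it overlaps with more (ties go to 'pos').
    for i, tokens in enumerate(tokenized):
        if labels[i] is None:
            words = set(tokens)
            pos_overlap = sum(vocab["pos"][w] for w in words)
            neg_overlap = sum(vocab["neg"][w] for w in words)
            labels[i] = "pos" if pos_overlap >= neg_overlap else "neg"
    return labels


docs = [
    "Great hotel, excellent staff, I recommend it",
    "Terrible room, awful breakfast, poor service",
    "The staff made our stay pleasant",
    "The breakfast and the room were a letdown",
]
print(cluster(docs))  # ['pos', 'neg', 'pos', 'neg']
```

In this toy run the first two documents are labelled in the first stage by the seed terms, and the last two, which contain no seed terms, are attached in the second stage through words they share with the already grouped documents ("staff" on the positive side, "breakfast" and "room" on the negative side).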

KEYWORDS

text mining, document sentiment, document clustering.
