MODIFICATION OF THE METHOD OF LARGE TEXT SETS CLUSTERING

Authors

DOI:

https://doi.org/10.35546/kntu2078-4481.2024.4.47

Keywords:

clustering, clusterization, classification, dbscan, k-means, text processing, preprocessing, text, accuracy, cluster

Abstract

This paper presents a comparative analysis of common clustering methods: k-means, Latent Dirichlet Allocation (LDA), hierarchical clustering (HC), density-based spatial clustering of applications with noise (DBSCAN), and the Gaussian mixture model (GMM). The methods were compared by scalability, computational complexity, whether the number of clusters must be specified in advance, and the assignment approach (absolute, with each object clearly assigned to one cluster, or relative, using probabilities). Based on this analysis, DBSCAN was chosen for further study because of its advantages, and a modification, mod_DBSCAN, is proposed. It reduces the number of calculations performed at each iteration, which lowers computational complexity and improves system performance when resources are limited. The modification consists of two changes: the vectorization stage uses the abstract and the author-specified keywords of a text instead of its full text, and distance estimation for so-called noise points is performed in two steps. Popular datasets for the clustering task were also analyzed. The proposed modification was tested on the authors' own Academ Lib Set dataset, built from materials in the electronic catalog of the NURE scientific library. The results showed an improvement in Precision of 5.6%, in Recall of 12.5%, and in F-score of 9.65%, which demonstrates the effectiveness of the proposed modification. Future work includes combining methods and modules into larger functional blocks and testing them to identify and eliminate potential problems, followed by further optimization of these blocks. A separate study will investigate re-clustering after the dataset is updated; the quality of the new partition is to be assessed using the Rand index.
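The first change described in the abstract, vectorizing only the abstract and keywords of each text before density-based clustering, can be illustrated with a minimal sketch. This is not the authors' mod_DBSCAN: it is plain DBSCAN over TF-IDF vectors with cosine distance. The sample records, the helper names (`tokenize`, `tfidf_vectors`, `cosine_distance`, `dbscan`), and the values `eps=0.7` and `min_pts=2` are assumptions chosen for illustration. The two-step distance estimation for noise points is not detailed in the abstract and is therefore not reproduced here.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_vectors(docs):
    """Return one sparse, L2-normalised TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        vec = {t: c * (math.log((1 + n) / (1 + df[t])) + 1) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine_distance(a, b):
    """Cosine distance between two L2-normalised sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return 1.0 - sum(w * b.get(t, 0.0) for t, w in a.items())

def dbscan(vectors, eps, min_pts):
    """Plain DBSCAN; returns one label per vector, with -1 marking noise."""
    n = len(vectors)
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:
                queue.extend(neighbours[j])  # core point: keep expanding
    return labels

# Hypothetical (abstract, keywords) records; the full text is never used.
records = [
    ("density based clustering of text documents", "clustering, dbscan, text"),
    ("grouping text documents by density", "dbscan, clustering, documents"),
    ("text clustering using dbscan density estimation", "text, clustering, dbscan"),
    ("convolutional networks for image recognition", "image, neural network, recognition"),
    ("image recognition via deep neural networks", "neural network, image, deep learning"),
    ("deep neural network models for image classification", "image, neural network, classification"),
    ("soil moisture measurement in wheat fields", "agriculture, soil, wheat"),
]

vectors = tfidf_vectors([f"{abstract} {keywords}" for abstract, keywords in records])
labels = dbscan(vectors, eps=0.7, min_pts=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Because each vector is built from a short abstract and a handful of keywords, it has far fewer non-zero terms than a full-text vector, so each pairwise distance is cheaper to compute. The last record shares no terms with the others and is labelled as noise (-1).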

References

Ahmed M. H., Tiun S., Omar N., Sani N. S. Short Text Clustering Algorithms, Application and Challenges: A Survey. Applied Sciences. 2023. Vol. 13, No. 1. P. 342. https://doi.org/10.3390/app13010342.

Dhar A., Mukherjee H., Dash N. S. et al. Text categorization: past and present. Artificial Intelligence Review. 2021. Vol. 54. P. 3007–3054. https://doi.org/10.1007/s10462-020-09919-1.

Barkovska O., Kholiev V., Pyvovarova D., Ivashchenko H., Rosinskyi D. A knowledge exchange system for young scientists from different countries. Advanced Information Systems. 2021. No. 5(1). P. 69–74. https://doi.org/10.20998/2522-9052.2021.1.09.

Ester M., Kriegel H., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD'96). AAAI Press, 1996. P. 226–231.

Blei D. M., Ng A. Y., Jordan M. I. Latent Dirichlet allocation. Journal of Machine Learning Research. 2003. Vol. 3. P. 993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993.

Suyal H., Panwar A., Singh Negi A. Text Clustering Algorithms: A Review. International Journal of Computer Applications. 2014. Vol. 96, No. 24. P. 36–40. https://doi.org/10.5120/16946-7075.

Hotho A., Nürnberger A., Paaß G. A Brief Survey of Text Mining. Journal for Language Technology and Computational Linguistics. 2005. Vol. 20, No. 1. P. 19–62. https://doi.org/10.21248/jlcl.20.2005.68.

Zheng Y., Cheng X., Huang R., Man Y. A Comparative Study on Text Clustering Methods. In: Advanced Data Mining and Applications (ADMA 2006). Vol. 4093. Springer, Berlin, Heidelberg, 2006. https://doi.org/10.1007/11811305_71.

Afzali M., Kumar S. Text Document Clustering: Issues and Challenges. 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, 14–16 Feb. 2019. 2019. https://doi.org/10.1109/comitcon.2019.8862247.

Electronic catalog – NURE Scientific Library. Home – NURE Scientific Library. URL: https://lib.nure.ua/el-katalog.

Published

2024-12-30