IMPROVING SENTIMENT CLASSIFICATION FOR THE UKRAINIAN LANGUAGE: TRANSITIONING FROM RULE-BASED ALGORITHMS TO SUPERVISED LEARNING MODELS

Authors

DOI:

https://doi.org/10.35546/kntu2078-4481.2025.3.2.4

Keywords:

Ukrainian language, sentiment analysis, rule-based algorithm, supervised learning, Random Forest, hybrid NLP, emoji sentiment, dependency parsing

Abstract

This paper proposes an improved approach to sentiment classification of Ukrainian-language content that accounts for the language’s complex morphology, flexible syntax, and scarcity of NLP tools. The study extends an existing rule-based sentiment analysis algorithm with a larger Ukrainian lexicon, polarity scores, emoji sentiment mapping, phrase-level sentiment scoring, and dependency parsing. These features capture subtle sentiment signals that most English-optimized tools miss. To improve performance further, the extracted linguistic features are converted into structured numerical vectors and fed into a hybrid pipeline that combines rule-based processing with supervised machine learning models. Four classifiers were trained and tested on a set of labeled Ukrainian-language tweets: K-Nearest Neighbours, Support Vector Machine, Decision Tree, and Random Forest. In comparative experiments, the Random Forest model achieved the highest accuracy (90%) and the best F1-score of all the classifiers tested. It also handled variations in sentiment and context better than the other models. The findings indicate that combining handcrafted linguistic insights with supervised learning is an effective method for sentiment analysis in low-resource languages such as Ukrainian. This research illustrates the importance of language-specific resources and customized pipelines for accurate sentiment detection. The results have practical applications in social media monitoring, customer review analysis, and gathering opinions from Ukrainian speakers. Future work will extend the approach to more domains, add real-time analysis, and compare it against deep learning models, which is expected to further improve sentiment classification for low-resource languages.
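
The hybrid idea described in the abstract, rule-based features converted into numerical vectors and passed to a supervised classifier, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy lexicon, the emoji map, the negation rule, the four-element feature layout, and the sample tweets are hypothetical and are not the paper's actual resources or data. The classifier is a small K-Nearest Neighbours implementation in plain Python, chosen only because it fits in a few lines; the paper reports Random Forest as the best of the four models it tested.

```python
import math
import re
from collections import Counter

# Hypothetical toy resources; the paper's real lexicon and emoji map are not shown here.
LEXICON = {"чудовий": 1.0, "люблю": 0.8, "добре": 0.6,
           "жахливий": -1.0, "ненавиджу": -0.9, "погано": -0.6}
EMOJI = {"😊": 1.0, "❤": 0.8, "😡": -1.0, "😢": -0.7}
NEGATORS = {"не", "ні"}


def extract_features(text):
    """Map a tweet to [lexicon score, emoji score, negation count, token count]."""
    tokens = re.findall(r"\w+", text.lower())
    lex_score, negations, negate_next = 0.0, 0, False
    for tok in tokens:
        if tok in NEGATORS:
            negations += 1
            negate_next = True  # flip the polarity of the next word
            continue
        polarity = LEXICON.get(tok, 0.0)
        lex_score += -polarity if negate_next else polarity
        negate_next = False
    # Emoji are scored character by character on the raw text.
    emoji_score = sum(EMOJI.get(ch, 0.0) for ch in text)
    return [lex_score, emoji_score, float(negations), float(len(tokens))]


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled tweets standing in for the paper's dataset.
TRAIN_TWEETS = [
    ("Чудовий день 😊", "positive"),
    ("Люблю це місто ❤", "positive"),
    ("Все добре", "positive"),
    ("Жахливий сервіс 😡", "negative"),
    ("Ненавиджу черги", "negative"),
    ("Дуже погано 😢", "negative"),
]
train = [(extract_features(text), label) for text, label in TRAIN_TWEETS]

if __name__ == "__main__":
    for tweet in ["Не добре 😢", "Чудовий фільм 😊"]:
        print(tweet, "->", knn_predict(train, extract_features(tweet)))
```

In a fuller pipeline, the same feature vectors could be passed to any supervised classifier, including the Random Forest model the abstract identifies as the most accurate; the feature extraction step would stay the same.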

References

Syed, A., Aslam, M., & Saeed, F. (2020). A hybrid sentiment analysis approach for Urdu language social media data. Journal of King Saud University – Computer and Information Sciences, 32(4), 453–459. https://doi.org/10.1016/j.jksuci.2020.04.004

Elmadany, A., Abdul-Mageed, M., & Hashemi, H. (2021). Lexicon-augmented neural networks for dialectal Arabic sentiment analysis. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 3788–3802). Association for Computational Linguistics. https://aclanthology.org/2021.emnlp-main.306/

Kharde, V., & Sonawane, P. (2021). Hybrid techniques for sentiment analysis: A review. Journal of Theoretical and Applied Information Technology, 99(8), 1840–1852.

Vázquez, S., Balage Filho, P. P., & Pardo, T. A. S. (2022). Emoji and intensifier handling in multilingual sentiment classification. Language Resources and Evaluation, 56(2), 527–550. https://doi.org/10.1007/s10579-021-09561-1

Xia, R., Liu, Q., & Chen, S. (2023). Sentiment analysis of code-switched texts using hybrid pipelines. Information Processing & Management, 60(2), 103183. https://doi.org/10.1016/j.ipm.2022.103183

Zhou, L., Li, Q., & Zhang, X. (2024). A comparative study of rule-based, neural, and hybrid sentiment analysis models for low-resource languages. Natural Language Engineering, 30(1), 45–70. https://doi.org/10.1017/S1351324923000345

Syed, S., & Spruit, M. (2020). Full-text or abstract? Examining topic coherence scores using Latent Dirichlet Allocation. In 2020 IEEE International Conference on Data Science and Advanced Analytics (DSAA) (pp. 528–537). IEEE. https://doi.org/10.1109/DSAA49011.2020.00056

Islam, M. R., Islam, M. M., Rahman, M. M., & Islam, M. S. (2020). Sentiment analysis of low-resource Bengali text using hybrid learning models. In 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1–6). IEEE. https://doi.org/10.1109/ICCCNT49239.2020.9225540

Bhaskar, M., & Saini, H. K. (2021). Hybrid sentiment analysis using machine learning techniques for Hindi tweets. Procedia Computer Science, 192, 3732–3741. https://doi.org/10.1016/j.procs.2021.09.148

Abozinadah, E. A., & Jones, J. (2021). A hybrid deep learning approach for sentiment analysis of Arabic tweets. IEEE Access, 9, 10241–10258. https://doi.org/10.1109/ACCESS.2021.3050634

Almanea, M., & Habash, N. (2020). Investigating the use of transformers for Arabic dialect sentiment analysis. In Proceedings of the Fifth Arabic Natural Language Processing Workshop (WANLP 2020) (pp. 63–77). Association for Computational Linguistics. https://aclanthology.org/2020.wanlp-1.7/

Mohammad, S. M., & Bravo-Marquez, F. (2020). EmoLex: An expanded emotion lexicon for social media analysis. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020) (pp. 3678–3684). European Language Resources Association (ELRA). https://aclanthology.org/2020.lrec-1.453/

Koto, F., Rahman, M., & Baldwin, T. (2020). IndoBERTweet: A pre-trained language model for Indonesian Twitter with emotion and sentiment understanding. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings (pp. 2347–2352). Association for Computational Linguistics. https://aclanthology.org/2020.findings-emnlp.212/

Published

2025-11-28