MACHINE LEARNING ENSEMBLE METHODS IMPLEMENTATION FOR DECEPTIVE TEXT DETECTION

Authors

DOI:

https://doi.org/10.35546/kntu2078-4481.2024.1.36

Keywords:

ensemble methods of machine learning, classification algorithms, analysis of texts for falsity, TF-IDF, Python

Abstract

The article examines ensemble techniques for improving prediction accuracy and evaluates several classifiers on distinct datasets. It assesses the Naive Bayes, Passive-Aggressive, Linear Support Vector, Logistic Regression, k-Nearest Neighbors, and Random Forest classifiers, as well as ensembles that combine these classifiers in various ways. The experiments were run in Python (sklearn, pandas, numpy) on an AMD Ryzen 5 4500U 6-core processor with 16 GB of RAM. The findings demonstrate that while individual classifiers achieve commendable accuracy, their performance is further enhanced through ensemble approaches. The study details the classification outcomes and underscores the value of ensemble strategies in identifying false news text, offering directions for subsequent research. With TF-IDF vectorization, the Support Vector Machine (SVM) classifier is the most accurate, with an average accuracy of 95.74%. With hashing vectorization, the SVM again outperforms the other classifiers, reaching an average accuracy of 97.26%. Among ensemble methods, Voting Ensemble 3 (Ens3, combining SVM, Passive-Aggressive, and Logistic Regression) stands out, especially with hashing vectorization, achieving an average accuracy of 96.93%. The core concept of the methodology is the analysis of purely journalistic text, stripped of extraneous details such as publication dates, website names, or additional media content. The text is classified according to three separate criteria: whether the news is true, whether it is satirical, and whether it constitutes hate speech.
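As a rough illustration of the voting ensemble described above, the following minimal sketch combines a linear SVM, a Passive-Aggressive classifier, and Logistic Regression by majority vote, once on TF-IDF features and once on hashed features. It is not the authors' code: the helper name `build_ens3`, the hyperparameters, the choice of hard voting, and the toy corpus are all assumptions made for the example.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_ens3(vectorizer):
    """Hypothetical helper: vectorizer followed by a majority vote of SVM + PA + LR."""
    voter = VotingClassifier(
        estimators=[
            ("svm", LinearSVC()),
            ("pa", PassiveAggressiveClassifier(max_iter=1000, random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        # Hard voting is assumed here because LinearSVC and the
        # Passive-Aggressive classifier do not expose predict_proba.
        voting="hard",
    )
    return make_pipeline(vectorizer, voter)


# Invented toy corpus (1 = false news, 0 = true news); not the study's data.
texts = [
    "Scientists confirm the moon is made of cheese",
    "Miracle pill cures every disease overnight",
    "Aliens secretly run the city council, insiders say",
    "City council approves budget for road repairs",
    "Central bank leaves interest rates unchanged",
    "Local school opens new library wing",
]
labels = [1, 1, 1, 0, 0, 0]

# Same ensemble, two vectorization schemes.
tfidf_model = build_ens3(TfidfVectorizer())
hashing_model = build_ens3(HashingVectorizer(n_features=2**16))

tfidf_model.fit(texts, labels)
hashing_model.fit(texts, labels)

tfidf_pred = tfidf_model.predict(texts)
hashing_pred = hashing_model.predict(texts)
print(tfidf_pred, hashing_pred)
```

On a real dataset the two pipelines would be compared on held-out data, for example with `sklearn.model_selection.cross_val_score`, rather than on the training texts as in this toy run.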
To train the models, datasets from the Kaggle platform were selected for each of these criteria, and a selection of arbitrarily chosen news texts and comments was tested under "real-world conditions". Each dataset consists of a text column and a second column holding a binary label for the respective criterion. The news veracity dataset contains 6,335 news texts labeled as true or false. The satire dataset combines two distinct datasets, one from the BBC news service and another from the satirical news site The Onion.
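The dataset layout described above (a text column plus a binary label column, with the satire set merged from two sources) could be assembled roughly as follows. This is a sketch under assumptions: the column names `text` and `label`, the label encoding, and the placeholder headlines are invented for illustration and are not taken from the actual Kaggle files.

```python
import pandas as pd

# Invented placeholder rows standing in for the two source datasets.
bbc = pd.DataFrame(
    {"text": ["Central bank holds interest rates steady",
              "Parliament debates new transport bill"]}
)
onion = pd.DataFrame(
    {"text": ["Area man heroically finishes entire to-do list",
              "Local cat elected head of neighborhood watch"]}
)

# Assumed encoding: 0 = regular news, 1 = satire.
bbc["label"] = 0
onion["label"] = 1

# Merge both sources into one table, then shuffle so that the classes
# are interleaved before any train/test split.
satire = pd.concat([bbc, onion], ignore_index=True)
satire = satire.sample(frac=1, random_state=0).reset_index(drop=True)
print(satire)
```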

Published

2024-05-01