TEXT DATA STREAM ANALYSIS SYSTEM

Authors

  • Yu.O. OLIINYK

DOI:

https://doi.org/10.32782/2618-0340/2020.1-3.15

Keywords:

text stream, online data processing, text mining, Apache Spark

Abstract

The study is devoted to the development of a system for analyzing text data streams. The problem statement addresses text data stream processing and the lack of software for simultaneous processing of text data streams in Ukrainian and Russian. An analysis of recent research established that data stream processing requires specialized stream processing software. It was found that only a few tools exist for processing Ukrainian-language texts, and that no tools support processing Ukrainian-language and Russian-language texts simultaneously. The purpose of this study is to develop the software architecture and implement software for analyzing data streams. A mathematical model of a text data stream based on a sliding window is described. The tasks for processing text data streams are defined, ranging from basic text transformations and pre-processing to intelligent analysis of text data streams. The problem of determining the sentiment (emotional coloring) of text data streams is formulated mathematically on the basis of the sliding window model. Four subsystems are identified: a subsystem for collecting and transporting data stream messages, a subsystem for analyzing text streams, a subsystem for storing the results of data stream analysis, and a visualization subsystem. A distinctive feature of the system is support for processing Ukrainian-language texts, for which the UANLP library was specially developed. This library also supports processing Russian-language texts. Text data streams are processed with the Spark Streaming component, which supports sliding windows. The Spark MLlib and ML libraries make it possible to apply machine learning tools to the analytical processing of text data streams, such as sentiment analysis and the detection of anomalies, elements of propaganda, and misinformation.
The main software components are the Kafka messaging service, the Apache Spark distributed data processing technology, the Elasticsearch database, and the Kibana visualization service. The data processing flow is described, from data stream generation to visualization of the analysis results.
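
To make the sliding window idea in the abstract concrete, the following is a minimal sketch in plain Python, not the system's actual Spark Streaming code. It averages a sentiment score over a window that advances by a fixed slide interval. The toy lexicon, the `Message` type, the function names, and the window and slide values are all assumptions introduced for illustration; the paper's UANLP-based scoring is not reproduced here.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Message:
    """One element of the text stream (hypothetical structure)."""
    timestamp: float  # arrival time in seconds
    text: str


# Hypothetical toy lexicon; a real system would use a trained model.
LEXICON: Dict[str, float] = {
    "good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0,
}


def score(text: str, lexicon: Dict[str, float] = LEXICON) -> float:
    """Sum the lexicon weights of the whitespace-separated tokens."""
    return sum(lexicon.get(token, 0.0) for token in text.lower().split())


def sliding_sentiment(
    messages: Iterable[Message], window: float, slide: float
) -> Iterator[Tuple[float, float]]:
    """Yield (window_end, mean score) for windows (end - window, end].

    Window ends fall on multiples of `slide`. Messages must arrive in
    timestamp order. Empty windows are skipped.
    """
    buffer: Deque[Tuple[float, float]] = deque()
    next_end = slide

    def emit(end: float) -> Iterator[Tuple[float, float]]:
        # Evict messages that fell out of the window ending at `end`.
        while buffer and buffer[0][0] <= end - window:
            buffer.popleft()
        if buffer:
            yield end, sum(s for _, s in buffer) / len(buffer)

    for msg in messages:
        # Close every window that ends before this message arrived.
        while msg.timestamp > next_end:
            yield from emit(next_end)
            next_end += slide
        buffer.append((msg.timestamp, score(msg.text)))
    # Flush the window that contains the last messages.
    yield from emit(next_end)


stream = [
    Message(1, "good service"),
    Message(4, "terrible delay"),
    Message(7, "great great news"),
    Message(12, "bad"),
]
results = list(sliding_sentiment(stream, window=10, slide=5))
```

With a 10-second window and a 5-second slide, consecutive windows overlap, so each message contributes to two window averages. This is the same window-length and slide-interval idea that Spark Streaming's windowed operations expose.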

Published

2023-09-25