CRITERIA FOR SIMILARITY OF LINGUISTIC MODELS
DOI:
https://doi.org/10.32782/2618-0340/2019.2-2.2
Keywords:
time series, text metrics, linguistic modeling, linguistic model
Abstract
Anomaly detection, as a form of data analytics, finds more applications in various fields of human activity every year. With the growth of the IT sector and global integration, there is an increasing need for tools that monitor cyber systems and that respond to and remedy disruptions and failures. Detecting intrusions, unauthorized access, and malfunctions in critical security and infrastructure management systems is among the priorities of modern information technology. Anomaly detection is the task of finding patterns in data that do not match expected behavior. Anomalies are atypical observations, or outliers, that contradict the nature of the process under study. The purpose of this work is to increase the speed of anomaly detection, compared with the algorithms used in automated control systems, by applying linguistic modeling and a syntactic approach to time series analysis.

The presence of an anomaly is established by comparing two models: one for a training dataset that is guaranteed to contain no anomalies, and one for the actual data in which an anomaly is sought. Various metrics and expert evaluation methods are used to assess similarity. The first approach is to create a model that characterizes the normal course of the process. For this purpose, all anomalous sections are excluded from the series, or a section of the series that contains no deviations from the norm is selected; this forms the so-called reference series, for which a linguistic model is then constructed. An important component of the linguistic approach to detecting anomalies in time series is the criterion by which the similarity of two models is evaluated. Whether the linguistic approach can be applied to time series of different nature depends on the choice of this criterion. The basic text similarity metrics of Hamming, Levenshtein, Jaro-Winkler and others are considered.
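As an illustration of two of the text metrics named above, the following minimal sketch computes the Hamming and Levenshtein distances between two symbol chains. It assumes that a time series has already been encoded as a string in which each letter stands for an interval of values; that encoding, the function names, and the sample chains are hypothetical and are not taken from the paper.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # delete x from a
                curr[j - 1] + 1,     # insert y into a
                prev[j - 1] + cost,  # substitute x with y (or keep it)
            ))
        prev = curr
    return prev[-1]


# Hypothetical symbol chains (assumed encoding, for illustration only).
reference_chain = "abcabcabc"
observed_chain = "abcabdabc"

print(hamming(reference_chain, observed_chain))
print(levenshtein(reference_chain, observed_chain))
```

The Hamming distance is defined only for chains of equal length, whereas the Levenshtein distance also tolerates insertions and deletions, which may matter when the compared sections of a series differ in length.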
During the development of the algorithm, four different metrics for evaluating linguistic models were considered. Given their strengths and weaknesses, the root mean square criterion was chosen as the most appropriate.
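One possible reading of a root mean square criterion is sketched below. The abstract does not specify how a linguistic model is represented, so the sketch assumes, purely for illustration, that each model is a table of relative frequencies of symbol-to-symbol transitions in a chain; the function names, alphabet, and sample chains are likewise hypothetical.

```python
import math
from collections import Counter


def transition_frequencies(chain: str, alphabet: str) -> dict:
    """Relative frequency of every ordered symbol pair (x, y) in the chain.

    Assumed stand-in for a linguistic model; not the paper's definition.
    """
    pairs = Counter(zip(chain, chain[1:]))
    total = sum(pairs.values())
    return {
        (x, y): (pairs[(x, y)] / total if total else 0.0)
        for x in alphabet
        for y in alphabet
    }


def rms_difference(model_a: dict, model_b: dict) -> float:
    """Root mean square of the element-wise differences of two models."""
    keys = model_a.keys() | model_b.keys()
    if not keys:
        return 0.0
    squared = sum(
        (model_a.get(k, 0.0) - model_b.get(k, 0.0)) ** 2 for k in keys
    )
    return math.sqrt(squared / len(keys))


# Hypothetical chains over a two-letter alphabet.
alphabet = "ab"
reference_model = transition_frequencies("abab", alphabet)
observed_model = transition_frequencies("aaaa", alphabet)

score = rms_difference(reference_model, observed_model)
print(score)
```

Under this assumption, a score of zero means the two transition tables coincide, and larger values indicate a greater difference between the models. Any threshold that separates normal from anomalous data would have to be chosen for the specific series.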