ASSESSMENT OF THE ACCURACY OF NOTION AND CONCEPT EXTRACTION BASED ON MEASURES OF ASSOCIATION

Authors

  • K.S. HAIDUK
  • O.H. SHEVCHENKO
  • V.A. SVIATNYI

DOI:

https://doi.org/10.32782/KNTU2618-0340/2020.3.2-2.7

Keywords:

extraction of notions and concepts; collocations; measures of association; classification; log-likelihood function; KDE method

Abstract

The paper presents the results of assessing the quality of binary classification of word pairs (bigrams) based on various measures of association, in which the bigrams were divided into the classes 'concepts and notions' and 'other bigrams'. It is shown that the usual ranking of objects by the value of an association measure, followed by threshold filtering (or by selecting a fixed number of top elements of the sorted list), yields only the top of the ranking and does not solve the classification problem effectively. The approach proposed by the authors applies threshold filtering not to the values of the association measure but to the probability that a bigram belongs to the class 'concepts and notions' for a given value of the association measure. This probability is calculated from the values of the probability density functions (PDFs) that describe the distribution of the association measure, treated as a random variable, in each of the two classes. The empirical PDFs were constructed by analyzing a labeled training sample. Determining the threshold value of the probability reduces to solving a one-dimensional optimization problem in which the ratio of the number of objects identified as 'concepts and notions' to the number of objects classified as 'other bigrams' is maximized. Determining the nature of the statistical distribution of most of the considered association measures is difficult (rejection of the null hypothesis for the main known distributions based on the results of the
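The probability-based filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Gaussian kernel with a normal-reference bandwidth, class priors taken from the class sizes in the training sample, and a plain grid search for the threshold. The abstract does not fully specify the optimization objective, so the F1 score is swapped in here as a stand-in. All function names and the toy association-measure values are hypothetical.

```python
import math

def rule_of_thumb_bandwidth(sample):
    """Normal-reference bandwidth h = 1.06 * std * n^(-1/5) (an assumed choice)."""
    n = len(sample)
    mean = sum(sample) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in sample) / (n - 1))
    return 1.06 * std * n ** (-0.2)

def gaussian_kde(sample, bandwidth):
    """Return an empirical PDF estimated with Gaussian kernels."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    def pdf(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample)
    return pdf

def make_posterior(concept_scores, other_scores):
    """Build P('concepts and notions' | association-measure value) from two class PDFs."""
    f_c = gaussian_kde(concept_scores, rule_of_thumb_bandwidth(concept_scores))
    f_o = gaussian_kde(other_scores, rule_of_thumb_bandwidth(other_scores))
    # Assumption: priors are the class shares in the labeled training sample.
    prior_c = len(concept_scores) / (len(concept_scores) + len(other_scores))
    def posterior(x):
        num = prior_c * f_c(x)
        den = num + (1.0 - prior_c) * f_o(x)
        return num / den if den > 0.0 else 0.0
    return posterior

def f1_at(threshold, concept_probs, other_probs):
    """F1 score of the rule 'probability >= threshold' (stand-in objective)."""
    tp = sum(p >= threshold for p in concept_probs)
    fp = sum(p >= threshold for p in other_probs)
    fn = len(concept_probs) - tp
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def best_threshold(posterior, concept_scores, other_scores, steps=99):
    """One-dimensional grid search over probability thresholds in (0, 1)."""
    concept_probs = [posterior(x) for x in concept_scores]
    other_probs = [posterior(x) for x in other_scores]
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda t: f1_at(t, concept_probs, other_probs))

# Hypothetical association-measure values for labeled training bigrams.
concept_scores = [4.1, 5.0, 5.6, 6.2, 6.8, 7.5]
other_scores = [0.5, 1.2, 1.9, 2.4, 3.0, 3.3, 4.4, 5.2]

posterior = make_posterior(concept_scores, other_scores)
threshold = best_threshold(posterior, concept_scores, other_scores)

# Classify a new bigram by its probability, not by its raw measure value.
is_concept = posterior(6.0) >= threshold
```

Filtering on the probability rather than on the raw measure means the decision reflects how the two classes overlap at a given value, which a single cut-off on the ranked list cannot express when the class densities are not monotone in the measure.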

Published

2023-08-10