NEURAL NETWORK METHOD FOR FORMING REPRESENTATIVE NON-DISCRIMINATORY TEXT DATASETS ACCORDING TO THE FATE FAIRNESS PRINCIPLE
DOI: https://doi.org/10.35546/kntu2078-4481.2024.4.45
Keywords: representativeness, ethical principles, non-discrimination, dataset, Sustainable Development Goals
Abstract
The paper presents a neural network method for generating representative, non-discriminatory text datasets according to the FATE fairness principle. The proposed method focuses on creating balanced datasets that accurately reflect demographic groups, taking into account ethical aspects such as gender, age, religion, and ethnicity. The method identifies and corrects imbalances in a dataset by solving an optimization problem that selects data for deletion or augmentation so that the final dataset remains representative and unbiased. To evaluate the effectiveness of this approach, software was developed that uses machine learning models, in particular an SVM for age classification, an LSTM for gender classification, and BERT for religion classification, all of which showed strong statistical results. After generation, the dataset became more representative with respect to fairness in age and gender, with minimal deviations (from 0.00% to 0.04%) from the ideal representative distribution. These results demonstrate that the proposed method can effectively analyze and generate datasets, ensuring their compliance with fairness standards across different ethical categories. In addition, the approach contributes to the achievement of the Sustainable Development Goals, in particular Goal 5 (Gender Equality), Goal 10 (Reduced Inequalities), and Goal 11 (Sustainable Cities and Communities). Ensuring that datasets reflect a diverse and inclusive representation of social groups supports the creation of ethical and fair AI systems, which helps reduce bias and discrimination in decision-making processes.
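The abstract describes the rebalancing step only at a high level. The sketch below is a minimal Python illustration of the underlying idea: measuring per-group deviations from an ideal distribution and selecting records for deletion or augmentation. The record structure, the greedy removal loop, the tolerance value, and the `group_deviations`/`rebalance` helpers are illustrative assumptions, not the optimization procedure or the software described in the paper.

```python
from collections import Counter
import random

def group_deviations(records, attr, ideal):
    """Deviation of each group's share in `records` from its ideal share."""
    counts = Counter(r[attr] for r in records)
    total = len(records)
    return {g: counts.get(g, 0) / total - share for g, share in ideal.items()}

def rebalance(records, attr, ideal, tol=0.0005, seed=0):
    """Greedy sketch: randomly drop samples from the most over-represented
    group until every deviation is within `tol`, then report how many
    samples each under-represented group would still need via augmentation."""
    rng = random.Random(seed)
    data = list(records)
    while data:
        dev = group_deviations(data, attr, ideal)
        worst = max(dev, key=dev.get)
        if dev[worst] <= tol:
            break
        # drop one random record belonging to the most over-represented group
        idx = rng.choice([i for i, r in enumerate(data) if r[attr] == worst])
        data.pop(idx)
    dev = group_deviations(data, attr, ideal)
    to_augment = {g: max(0, round(-d * len(data))) for g, d in dev.items()}
    return data, to_augment

# Toy usage: a 70/30 gender split rebalanced toward a 50/50 ideal distribution.
sample = ([{"text": f"t{i}", "gender": "male"} for i in range(70)]
          + [{"text": f"t{i}", "gender": "female"} for i in range(30)])
balanced, needs = rebalance(sample, "gender", {"male": 0.5, "female": 0.5})
print(len(balanced), needs)  # 60 {'male': 0, 'female': 0}
```

The reported 0.00% to 0.04% deviations from the ideal distribution are the kind of per-group gap that `group_deviations` measures here for a single attribute; the paper's method handles several ethical categories (gender, age, religion, ethnicity) jointly.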