RESILIENCE OF ARTIFICIAL INTELLIGENCE SYSTEMS TO ADVERSARIAL QUERIES AND JAILBREAK ATTACKS

Authors

DOI:

https://doi.org/10.35546/kntu2078-4481.2026.1.27

Keywords:

model robustness, information security, language models, security policy bypass, adversarial impact, security alignment, contextual manipulation, defense architectures, resilience evaluation, adaptive security mechanisms

Abstract

The relevance of this research stems from the rapid proliferation of AI systems in critical and regulated domains, which is accompanied by growing risks from adversarial queries and jailbreak attacks. Such attacks undermine the reliable, predictable, and safe operation of language and multimodal models, threatening information security, compliance with ethical and legal standards, and public trust in AI-generated results. The goal of this article is to provide a comprehensive scientific understanding of the mechanisms underlying the vulnerabilities of modern AI systems to adversarial queries and jailbreak attacks, and to substantiate scientific and technical approaches to enhancing their robustness within the constraints of current security alignment models. The research methods are based on a theoretical analysis of the current scientific literature in AI and information security, combined with systemic and structural-functional approaches, logical generalization, and a comparative analysis of the types of adversarial attacks and technical defense strategies for AI. The results of the study demonstrate that the effectiveness of jailbreak attacks is determined by the statistical nature of language understanding in AI models, their instruction-following orientation, and the strong contextual dependency of generation. The main types of adversarial attacks are systematized, the limitations of isolated defensive solutions are established, and the necessity of combining architectural, training, and procedural strategies to enhance AI robustness is substantiated. Key scientific and practical challenges in implementing protective measures are identified, particularly those related to scalability, preserving the functional utility of models, and the incomplete formalization of threat landscapes. The conclusions indicate that ensuring the resilience of AI systems to jailbreak attacks requires a shift from reactive blocking mechanisms to the systemic design of security as a fundamental property of intelligent systems. Prospects for future research involve the development of formalized threat models, agreed-upon metrics for evaluating robustness, and adaptive security mechanisms capable of evolving alongside AI usage practices.

Published

2026-04-30