METHODS OF VULNERABILITY INJECTION INTO SMART CONTRACTS FOR BALANCED DATASET GENERATION
DOI:
https://doi.org/10.35546/kntu2078-4481.2025.3.2.58
Keywords:
smart contracts, Solidity, vulnerabilities, vulnerability injection, reentrancy, integer overflow, balanced dataset, static analysis, large language models
Abstract
Smart contracts are widely used in financial and decentralized applications; however, their security remains a critical issue. Analysis of existing corpora shows a significant imbalance: common vulnerability classes (integer overflow/underflow) dominate, while the critically dangerous reentrancy vulnerability is underrepresented. This complicates both the training and the objective evaluation of vulnerability detection tools. The aim of the study is to improve the objectivity and quality of training and testing of methods for detecting vulnerabilities in smart contracts by creating a balanced and controlled dataset. Two complementary injection approaches are proposed. The deterministic method relies on static analysis and formal patterns to select and modify functions, ensuring reproducibility and syntactic correctness. The LLM-based approach performs context-aware modifications with minimal code differences, increasing the diversity of examples. Both approaches are integrated into a unified pipeline with normalization, deduplication, and multi-stage validation: successful solc compilation, static confirmation of target patterns, preservation of non-target logic, and minimization of code changes. The result is a balanced dataset of five classes (integer_overflow, integer_underflow, timestamp_dependency, reentrancy, and “safe” contracts) with an equal number of examples per class, a standardized storage format (full contract, vulnerable snippet, metadata), and a reproducible pipeline. Combining the deterministic and LLM-based methods balances controllability and realism, which improves the quality of experiments and the fairness of tool comparisons. The novelty lies in the unified formal specification of injection operators and in the practical pipeline for batch dataset generation; the limitations concern the stochastic nature of LLMs and the need for further dynamic PoC validation.
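The normalization, deduplication, and class-balancing steps mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the record fields (`contract`, `snippet`, `label`, `metadata`), the function names, and the choice of SHA-256 hashing with downsampling to the smallest class are all assumptions made for illustration. Only the five class labels are taken from the abstract.

```python
import hashlib
import random
import re
from collections import defaultdict

# Class labels as listed in the abstract.
CLASSES = (
    "integer_overflow",
    "integer_underflow",
    "timestamp_dependency",
    "reentrancy",
    "safe",
)


def normalize(source: str) -> str:
    """Strip comments and collapse whitespace so near-identical copies hash alike.

    Simplification: the regexes do not understand string literals, so a "//"
    inside a string is also cut. That is acceptable here because the result
    is used only as a deduplication key, never compiled.
    """
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)  # line comments
    return " ".join(source.split())


def deduplicate(records):
    """Keep the first record for each distinct normalized contract source."""
    seen = set()
    unique = []
    for rec in records:
        key = hashlib.sha256(normalize(rec["contract"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def balance(records, seed=0):
    """Downsample every class to the size of the smallest one.

    A fixed seed keeps the selection reproducible. If any class has no
    records, the result is empty.
    """
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    per_class = min(len(by_label[label]) for label in CLASSES)
    rng = random.Random(seed)
    balanced = []
    for label in CLASSES:
        balanced.extend(rng.sample(by_label[label], per_class))
    return balanced
```

Downsampling is only one way to equalize class sizes. In the pipeline described above, injection itself supplies additional examples for underrepresented classes such as reentrancy, so a step like `balance` would act as a final equalizer after validation rather than as the main balancing mechanism.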