HYBRID MODEL FOR GENE EXPRESSION ANALYSIS USING ENSEMBLE CLUSTERING AND CLASSIFICATION STRATEGIES FOR DIAGNOSING COMPLEX SYSTEMS

Authors

DOI:

https://doi.org/10.32782/mathematical-modelling/2025-8-2-33

Keywords:

gene expression data, similarity metrics, hybrid model, clustering, classification, personalized medicine, hybridization

Abstract

Gene expression data analysis is a key tool of modern bioinformatics and computer science: it enables the identification of biomarkers, the formation of molecular profiles, and the support of cancer diagnostics. The field needs methods that can adequately process large-scale transcriptomic data, which are characterized by high dimensionality, heterogeneity, and noise. These features significantly complicate the use of traditional clustering and classification methods and reduce both interpretability and accuracy. This article proposes a hybrid model that combines the Self-Organizing Tree Algorithm (SOTA) with consensus strategies based on agglomerative and spectral clustering. The proposed ensemble approach forms consistent and informative clusters of gene expression profiles, which reduces the impact of local anomalies and increases the reliability of the results. The resulting clusters were used as new features for a classification model based on the Random Forest algorithm, whose hyperparameters were optimized using Bayesian methods in combination with stacking. The modeling was carried out on a large-scale expression matrix of more than 6,000 biological samples and 18,000 genes, covering 14 sample classes. The results showed that the spectral consensus version of SOTA consistently provided the best values of the internal clustering quality indices and the highest classification accuracy. In particular, configurations with three to five clusters reached 100% accuracy and F1-score, confirming the diagnostic significance of the identified gene groups. The developed ensemble pipeline is a scalable and standardized tool for transcriptomic data analysis that can be integrated into decision-support systems for early diagnosis of complex systems.
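
The consensus step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that several base partitions of the genes are already available (the hard-coded label lists below are hypothetical stand-ins for repeated SOTA runs), combines them through a co-association matrix, and applies average-linkage agglomerative merging (an assumed linkage choice) to obtain consensus gene clusters. The per-cluster mean expression of a sample then serves as a new feature vector. All function names and data are illustrative, and the Random Forest classifier with Bayesian optimization and stacking is not shown.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions in which each pair of genes shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def agglomerative_consensus(co, k):
    """Average-linkage merging on the distance 1 - co-association until k clusters remain."""
    clusters = [[i] for i in range(len(co))]

    def dist(a, b):
        return sum(1.0 - co[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        # Find the closest pair of current clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


def cluster_features(sample, clusters):
    """Replace one sample's gene vector by the mean expression of each gene cluster."""
    return [sum(sample[g] for g in c) / len(c) for c in clusters]


# Hypothetical base partitions of six genes (stand-ins for separate SOTA runs).
base_partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 2],
]

co = co_association(base_partitions)
consensus = agglomerative_consensus(co, k=2)

# Hypothetical expression values of one sample across the six genes.
sample = [2.0, 4.0, 6.0, 1.0, 1.0, 4.0]
features = cluster_features(sample, consensus)
print(consensus, features)
```

In a fuller pipeline, the feature vectors produced this way for all samples would be passed to the classifier. The spectral variant mentioned in the abstract would treat the same co-association matrix as a precomputed affinity rather than merging clusters hierarchically.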

Published

2025-12-30