HYBRID MODEL FOR GENE EXPRESSION ANALYSIS USING ENSEMBLE CLUSTERING AND CLASSIFICATION STRATEGIES FOR DIAGNOSING COMPLEX SYSTEMS
DOI:
https://doi.org/10.32782/mathematical-modelling/2025-8-2-33
Keywords:
gene expression data, similarity metrics, hybrid model, clustering, classification, personalized medicine, hybridization
Abstract
Gene expression data analysis is a key tool of modern bioinformatics and computer science: it enables the identification of biomarkers, the construction of molecular profiles, and support for cancer diagnostics. The field's relevance stems from the need for methods that can adequately process large-scale transcriptomic data characterized by high dimensionality, heterogeneity, and noise. These features significantly complicate the use of traditional clustering and classification methods and reduce their interpretability and accuracy. This article proposes a hybrid model that combines the Self-Organizing Tree Algorithm (SOTA) with agglomerative and spectral consensus clustering strategies. The proposed ensemble approach forms consistent and informative clusters of gene expression profiles, which reduces the impact of local anomalies and increases the reliability of the results. The resulting clusters were used as new features for a classification model based on the Random Forest algorithm, whose hyperparameters were optimized using Bayesian methods in combination with stacking. The modeling was carried out on a large-scale expression matrix comprising more than 6,000 biological samples and 18,000 genes and covering 14 sample classes. The results showed that the spectral consensus version of SOTA consistently provided the best values of internal clustering quality indices and the highest classification accuracy. In particular, configurations with three to five clusters achieved 100% accuracy and F1-score, confirming the diagnostic significance of the identified gene groups. The developed ensemble pipeline is a scalable and standardized tool for transcriptomic data analysis that can be integrated into decision-support systems for early diagnosis of complex systems.
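The abstract does not specify how the consensus partition is built or how clusters become classifier features, so the following is only a minimal sketch under stated assumptions. It assumes that several base partitions of the genes are combined through a co-association matrix (the fraction of partitions in which two genes share a cluster), that the consensus is obtained by a simple single-linkage style merge above a threshold, and that each consensus cluster is turned into one feature per sample by averaging the expression of its genes. All function names, the threshold of 0.5, and the toy data are hypothetical; SOTA, spectral clustering, Random Forest, and Bayesian optimization are not implemented here.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of base partitions in which each pair of genes shares a cluster."""
    n = len(partitions[0])
    counts = [[0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    return [[c / len(partitions) for c in row] for row in counts]


def consensus_labels(co, threshold=0.5):
    """Merge genes whose co-association exceeds the threshold (assumed rule)."""
    n = len(co)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        if co[i][j] > threshold:
            parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


def cluster_mean_features(sample, labels):
    """Average one sample's expression values within each consensus cluster."""
    k = max(labels) + 1
    sums, sizes = [0.0] * k, [0] * k
    for value, label in zip(sample, labels):
        sums[label] += value
        sizes[label] += 1
    return [s / m for s, m in zip(sums, sizes)]


# Hypothetical base partitions of six genes, standing in for the outputs
# of different clustering runs.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 2],
]
co = co_association(partitions)
consensus = consensus_labels(co)

# Hypothetical expression values of one sample across the six genes.
sample = [2.0, 4.0, 6.0, 1.0, 1.0, 4.0]
features = cluster_mean_features(sample, consensus)
print(consensus, features)
```

In a fuller pipeline, a feature vector of this kind would be computed for every sample and passed to the classifier. The threshold rule stands in for the agglomerative and spectral consensus strategies named in the abstract, which would operate on the same co-association matrix.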







