Application of Confidence-Based Semi-Supervised Learning Algorithms in Viral Subtype Prediction
DOI: https://doi.org/10.63313/JCSFT.9031

Keywords: Semi-supervised learning; Confidence estimation; Consistency regularization; Feature selection; Viral variant prediction

Abstract
Rapid and accurate identification of viral variants is essential for disease surveillance and public health decision-making. However, large volumes of viral genomic data lack clear variant labels, and the high cost of expert annotation limits the effectiveness of supervised learning models. To address this, we propose a confidence-based semi-supervised learning framework for predicting SARS-CoV-2 variants, specifically Alpha, Beta, and Omicron. The approach begins with a small labeled dataset (48 Alpha and 50 Beta samples) for initial training. To counter class imbalance, we apply SMOTE oversampling, and we use Lasso-based feature selection to improve model efficiency. The key innovation is an ensemble model that combines Random Forest, Gradient Boosting Trees, and Logistic Regression with a consistency regularization mechanism. This mechanism iteratively assigns pseudo-labels to a large pool of unlabeled data, adding only samples whose prediction confidence exceeds 0.7 to the training set. In this way, the model refines its decision boundaries without additional manual annotation. Experiments show that the proposed method effectively exploits unlabeled data to improve model performance and yields reliable predictions for the Alpha, Beta, and Omicron variants. This confidence-based semi-supervised learning framework offers a practical solution for accurate pathogen subtyping when labeled data are scarce.
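To make the pseudo-labeling loop concrete, the following is a minimal sketch of how a pipeline like the one described in the abstract could be assembled with scikit-learn and imbalanced-learn. It is an illustration under stated assumptions, not the authors' released implementation: X_labeled, y_labeled, and X_unlabeled are hypothetical placeholders for encoded genomic feature matrices; the Lasso step is approximated with an L1-penalized logistic regression inside SelectFromModel; and the consistency mechanism is simplified to the soft-vote confidence of the three base learners. Only the 0.7 confidence threshold and the choice of base models come from the abstract.

```python
# Sketch of confidence-based self-training for variant classification.
# Assumptions: X_labeled, y_labeled, X_unlabeled are dense NumPy feature
# matrices/labels (hypothetical names); hyperparameters are illustrative.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression


def fit_confidence_self_training(X_labeled, y_labeled, X_unlabeled,
                                 threshold=0.7, max_rounds=10):
    # Oversample the minority class in the small labeled set (SMOTE).
    X_train, y_train = SMOTE(random_state=0).fit_resample(X_labeled, y_labeled)

    # Lasso-style (L1-penalized) feature selection to shrink dimensionality.
    selector = SelectFromModel(
        LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    ).fit(X_train, y_train)
    X_train = selector.transform(X_train)
    X_pool = selector.transform(X_unlabeled)

    # Soft-voting ensemble of the three base learners named in the abstract.
    ensemble = VotingClassifier(
        estimators=[
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
            ("gbt", GradientBoostingClassifier(random_state=0)),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",
    )

    for _ in range(max_rounds):
        ensemble.fit(X_train, y_train)
        if X_pool.shape[0] == 0:
            break
        proba = ensemble.predict_proba(X_pool)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # no high-confidence pseudo-labels left to absorb
        # Assign pseudo-labels and move confident samples into training set.
        pseudo_y = ensemble.classes_[proba[confident].argmax(axis=1)]
        X_train = np.vstack([X_train, X_pool[confident]])
        y_train = np.concatenate([y_train, pseudo_y])
        X_pool = X_pool[~confident]

    # Final fit on labeled data plus all absorbed pseudo-labeled samples.
    ensemble.fit(X_train, y_train)
    return ensemble, selector
```

Soft voting averages the predicted class probabilities of the three learners, so the max-probability confidence used here already reflects their agreement; a stricter consistency check could additionally require every base learner to predict the same class before a pseudo-label is accepted.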