A Method for the Identification and Intensity Prediction of DNA Enhancers Based on Feature Extraction Algorithms

XiFeng Li

doi:10.63313/JCSFT.9056

Authors

XiFeng Li, College of Physics and Electronic Engineering, Northwest Normal University, Lanzhou 730070, PR China

Keywords:

DNA Enhancer, Enhancer Identification, Strength Prediction, k-mer, CKSNAP, SVM, XGBoost

Abstract

DNA enhancers are important non-coding cis-regulatory elements that control gene transcription. To reduce reliance on costly experimental methods such as ChIP-seq and DNase-seq, this paper constructs a classification framework based on sequence features and supervised learning for two sequence-level tasks: enhancer identification (enhancer vs. non-enhancer) and enhancer strength prediction (strong vs. weak enhancer). Methodologically, k-mer frequency features and CKSNAP-based dinucleotide interval features were extracted separately and then concatenated to form a fused representation. Under a unified stratified 5-fold cross-validation framework, the performance of Support Vector Machine (SVM) and XGBoost models was compared. The results indicate that, for both tasks, XGBoost performed slightly better than or comparably to SVM on the same feature set, and that, with the same classifier, the fusion of k-mer and CKSNAP features outperformed either feature type alone. Specifically, XGBoost with the fused features achieved accuracies of 0.796 for enhancer identification and 0.681 for strength classification, with overall superior ROC/AUC performance. This indicates that local k-mer composition information and short-range distance-dependent information are complementary, thereby enhancing the model's discriminative power and stability.
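
The feature construction summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the choice of k = 3, the maximum gap of 2, and the normalization by the number of valid windows are all assumptions made here for illustration, since the abstract does not state these settings.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(seq, k=3):
    """Normalized frequency of every length-k word over ACGT (4**k values).

    k=3 is an illustrative assumption, not a setting taken from the paper.
    """
    seq = seq.upper()
    words = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other symbols
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


def cksnap(seq, max_gap=2):
    """Frequencies of nucleotide pairs separated by 0..max_gap positions.

    Yields 16 values per gap; max_gap=2 is an illustrative assumption.
    """
    seq = seq.upper()
    pairs = [a + b for a, b in product(ALPHABET, repeat=2)]
    features = []
    for gap in range(max_gap + 1):
        counts = dict.fromkeys(pairs, 0)
        total = 0
        for i in range(len(seq) - gap - 1):
            pair = seq[i] + seq[i + gap + 1]
            if pair in counts:
                counts[pair] += 1
                total += 1
        features.extend(counts[p] / total if total else 0.0 for p in pairs)
    return features


def fused_features(seq, k=3, max_gap=2):
    """Concatenate k-mer and CKSNAP vectors into one fused representation."""
    return kmer_frequencies(seq, k) + cksnap(seq, max_gap)
```

With these assumed settings each sequence maps to a vector of 4^3 + 3 × 16 = 112 values, which could then be passed to a classifier such as an SVM or XGBoost and evaluated under cross-validation.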
