Abstract
Selecting disease-causing genes from gene expression and methylation data with hundreds of thousands of loci is of great benefit for cancer diagnosis and treatment, but it faces substantial technical challenges because such data combine small sample sizes with ultrahigh-dimensional genetic markers. To speed up the search, this paper proposes a new gene selection algorithm based on the theory of membrane computing, called the Membrane Computing with Harmony Search Algorithm (MC-HSA), which quickly selects a subset of potential disease-causing genes. In the MC-HSA, an active membrane dissolving P system is designed to balance global exploration and local exploitation when detecting gene combinations that are strongly associated with disease status. The harmony search algorithm is embedded in the P system to comprehensively detect gene subsets in both gene expression and DNA methylation data. An enhanced classifier consisting of four general classifiers is employed to improve classification accuracy (CA) and avoid overfitting, while a penalty function is developed to screen out redundant genes. Experiments on six real datasets indicate that our method is very competitive with ten existing optimization algorithms (HybridGA, QSFS, RMA, WOA-CM, ME-BPSO, CDNC, ABCD, HAMS, mRMR, and ImRMR). Taking the gene expression and DNA methylation data of prostate cancer as an example, the experimental results show that our method selects fewer genes than four state-of-the-art algorithms while achieving high CA (> 99%) and maintaining stable performance. Finally, we analyze the representative genes and validate them in terms of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and Gene Ontology (GO) terms.
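To make the idea of a penalized gene-subset search concrete, the following is a minimal Python sketch of a binary harmony search with a fitness that subtracts a penalty for each selected gene. It is not the MC-HSA itself: the P system and the four-classifier ensemble are omitted, the accuracy function is a toy stand-in rather than a trained classifier, and every name and parameter value here (`harmony_search`, `penalized_fitness`, `hms`, `hmcr`, `par`, `alpha`, `toy_accuracy`) is an assumption chosen for illustration.

```python
import random


def penalized_fitness(mask, accuracy_fn, alpha=0.01):
    """Accuracy minus a penalty proportional to the fraction of selected genes.

    The linear penalty form and alpha are illustrative assumptions.
    """
    n_selected = sum(mask)
    if n_selected == 0:
        return 0.0  # an empty gene subset cannot classify anything
    return accuracy_fn(mask) - alpha * n_selected / len(mask)


def harmony_search(n_genes, accuracy_fn, hms=10, hmcr=0.9, par=0.3,
                   iterations=500, alpha=0.01, seed=0):
    """Binary harmony search over gene masks (1 = gene selected)."""
    rng = random.Random(seed)
    # Harmony memory: hms random candidate gene subsets and their scores.
    memory = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(hms)]
    scores = [penalized_fitness(m, accuracy_fn, alpha) for m in memory]

    for _ in range(iterations):
        new = []
        for j in range(n_genes):
            if rng.random() < hmcr:
                # Memory consideration: copy bit j from a random stored harmony.
                bit = memory[rng.randrange(hms)][j]
                if rng.random() < par:
                    bit = 1 - bit  # pitch adjustment, here a bit flip
            else:
                bit = rng.randint(0, 1)  # random exploration
            new.append(bit)

        new_score = penalized_fitness(new, accuracy_fn, alpha)
        worst = min(range(hms), key=scores.__getitem__)
        if new_score > scores[worst]:
            # Replace the worst stored harmony with the improved one.
            memory[worst], scores[worst] = new, new_score

    best = max(range(hms), key=scores.__getitem__)
    return memory[best], scores[best]


# Hypothetical stand-in for classifier accuracy: genes 0, 3 and 7 are
# treated as the only informative ones.
INFORMATIVE = {0, 3, 7}


def toy_accuracy(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return 0.5 + 0.5 * hits / len(INFORMATIVE)


best_mask, best_score = harmony_search(20, toy_accuracy)
```

In a realistic setting, `toy_accuracy` would be replaced by cross-validated accuracy of a classifier trained on the selected genes, and the penalty weight would be tuned so that accuracy gains are not outweighed by the preference for small subsets.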