Somatic mutations play a critical role in human carcinogenesis and serve as signatures of cancer genesis and progression. A somatic cell is any non-germline cell, such as those that make up the internal organs, skin, bones, blood, and connective tissues in mammals. Any alteration in the DNA sequence of a somatic cell, including replication errors, substitutions, and deletions, is defined as a somatic mutation. The accumulation of somatic mutations can drive malignant transformation from a normal cell to a cancerous cell. Advances in next-generation sequencing (NGS) technologies and computational tools allow parallel sequencing of cancer genomes, which provides substantial input for analyzing the DNA mutations that cause cancer. Several computational tools have been developed to address the somatic mutation identification challenge, such as VarScan2, VarDict, ISOWN, GATKcan, Strelka2, Cerebro, Mutect2, and NeuSomatic. Typically, these tools construct multiple alignments of both the tumor and normal reads and then identify tumor-specific mutations, using statistical algorithms to reduce false positives. However, in many cases only a tumor sample is available, with no paired normal sample. Consequently, there is a need for a method that can also precisely identify somatic mutations from tumor-only whole-exome sequencing (WES) data. A deep neural network (DNN) has great potential for this task because it accommodates large-scale data processing and complex feature extraction. Therefore, we propose a DNN model for somatic mutation identification from WES data.
Furthermore, we integrated the statistical variant features with functional prediction scores to acquire more information about the potential variants and to improve the discriminative power of our model. However, some variants had empty values in multiple features, either because the variant callers did not compute them or because the variants were absent from the variant annotation databases. We therefore applied feature selection in this research. Feature selection benefits the classification model by removing redundant information, eliminating noise, and improving both generalization and the interpretability of the data. Extreme Gradient Boosting (XGBoost) improves on earlier tree boosting algorithms. Its high performance in data mining and classification tasks has established it as one of the best-known state-of-the-art gradient boosting tree algorithms. Therefore, we implemented XGBoost as the feature selection method for our variant dataset. To the best of our knowledge, no previous work has combined a DNN classifier with XGBoost-based feature selection for somatic mutation identification.
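To make the idea of gain-based feature selection on incomplete variant features more concrete, the following is a minimal sketch and not the pipeline used in this work. It does not call XGBoost; it swaps in a small pure-Python booster built from decision stumps, which ranks features by accumulated split gain and sends empty values to whichever branch yields the higher gain (in the spirit of the default-direction handling of missing values in XGBoost). The feature names (`vaf`, `noise`, `func_score`), the synthetic data, and the 0.05 importance threshold are all assumptions made for illustration.

```python
import random

def split_gain(left, right):
    """Squared-error score of a split (higher is better); 0 if a side is empty."""
    if not left or not right:
        return 0.0
    return sum(left) ** 2 / len(left) + sum(right) ** 2 / len(right)

def best_stump(X, residuals):
    """Return (gain, feature, threshold, missing_goes_left) for the best split."""
    base = sum(residuals) ** 2 / len(residuals)
    best = (0.0, None, None, True)
    for j in range(len(X[0])):
        present = [(row[j], r) for row, r in zip(X, residuals) if row[j] is not None]
        missing = [r for row, r in zip(X, residuals) if row[j] is None]
        for threshold in sorted({v for v, _ in present}):
            lo = [r for v, r in present if v < threshold]
            hi = [r for v, r in present if v >= threshold]
            # Try both default directions for rows whose value is empty.
            for missing_left in (True, False):
                left = lo + missing if missing_left else lo
                right = hi if missing_left else hi + missing
                gain = split_gain(left, right) - base
                if gain > best[0]:
                    best = (gain, j, threshold, missing_left)
    return best

def rank_features(X, y, n_rounds=10, learning_rate=0.3):
    """Boost stumps on squared error; return gain-based importances summing to 1."""
    preds = [sum(y) / len(y)] * len(y)
    importance = [0.0] * len(X[0])
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        gain, j, threshold, missing_left = best_stump(X, residuals)
        if j is None:  # no split improves the fit any further
            break
        importance[j] += gain
        goes_left = [missing_left if row[j] is None else row[j] < threshold
                     for row in X]
        left = [r for r, g in zip(residuals, goes_left) if g]
        right = [r for r, g in zip(residuals, goes_left) if not g]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        preds = [p + learning_rate * (left_value if g else right_value)
                 for p, g in zip(preds, goes_left)]
    total = sum(importance) or 1.0
    return [g / total for g in importance]

# Synthetic, hypothetical variant features; None marks an empty value.
FEATURES = ["vaf", "noise", "func_score"]
random.seed(0)
X, y = [], []
for _ in range(120):
    label = random.randint(0, 1)  # 1 = somatic, 0 = non-somatic (toy labels)
    vaf = random.gauss(0.6 if label else 0.3, 0.1)
    func_score = random.gauss(0.8 if label else 0.2, 0.15)
    if random.random() < 0.3:  # about 30% of rows lack an annotation score
        func_score = None
    X.append([vaf, random.random(), func_score])
    y.append(label)

importance = rank_features(X, y)
# Keep features whose share of the total gain reaches an assumed cutoff.
selected = [name for name, w in zip(FEATURES, importance) if w >= 0.05]
print(dict(zip(FEATURES, importance)), selected)
```

In this sketch the selected feature subset would then be passed to the downstream classifier; in practice the ranking would come from a trained XGBoost model rather than from this simplified stand-in.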