Risk prediction and marker selection in nonsynonymous single nucleotide polymorphisms using whole genome sequencing data

ABSTRACT Despite the various existing studies about nonsynonymous single nucleotide polymorphisms (nsSNPs), genome-wide studies based on nsSNPs are rare. NsSNPs alter amino acid sequences, affect protein structure and function, and have deleterious effects. By predicting the deleterious effect of nsSNPs, we determined the total risk score per individual. Additionally, the machine learning technique was utilized to find an optimal nsSNP subset that best explains the complete nsSNP effect. A total of 16,100 nsSNPs were selected as the best representatives among 89,519 regressed nsSNPs. In the gene ontology analysis encompassing the 16,100 nsSNPs, DNA metabolic process, chemokine- and immune-related, and reproduction were the most enriched terms. We expect that our risk score prediction and nsSNP marker selection will contribute to future development of extant genome-wide association studies and breeding science more broadly.


Introduction
Genetic variants can be classified into several categories, including single nucleotide polymorphisms (SNPs), small insertions and deletions, and structural variants (Cooper and Shendure 2011). Among these variants, the majority are SNPs that occur in single bases of DNA sequences. Nonsynonymous SNPs (nsSNP) are an important type of SNP that alter the amino acid sequence as well as potentially affect the protein structure and function (Krawczak et al. 2000;Wu and Jiang 2013).
A number of methods have been proposed for the prediction of deleterious nsSNPs, including the Sorting Intolerant From Tolerant (SIFT) program (Ng and Henikoff 2001), PolyPhen (Galehdari et al. 2013), Poly-Phen-2 (Galehdari et al. 2013), variant effect predictor (VEP;(McLaren et al. 2016)), and SnpEff (Cingolani et al. 2012). Deleterious nsSNP prediction is formulated as a binary classification model using diverse genomic data as features to extract deleterious nsSNPs. The classification result can be determined by the aforementioned tools (Wu and Jiang 2013).
We used SIFT scores to estimate the individual risk by nsSNPs and select significant nsSNP markers. SIFT incorporates position-specific information using sequence alignment and is intended for predicting whether an amino acid substitution affects protein function. SIFT converts the alignment into a position-specific scoring matrix and calculates the probability of an amino acid appearing at a specified position. Using this positionspecific probability estimation, SIFT assigns a decision rule to make the classification (Ng and Henikoff 2001).
Machine learning (ML) is defined as 'a field of computer science that evolved from studying pattern recognition and computational learning theory in artificial intelligence and can make predictions on datasets (Taranov;Simon et al. 2016). ML is classified by learning style using the supervised, unsupervised, and semisupervised learning categories. In supervised data, the input data is referred to as training data. The prelabeled or categorized data can be applied to the problems of classification or regression algorithms (Taranov).
ML methods consist of computational algorithms to relate all or some of a set of predictors to an outcome. The algorithms attempt to balance two competing interests: bias and variance. In ML contexts, bias is the extent to which the predictions correspond to the true values. Variance represents the sensitivity of the predictions to perturbations in the input data. Even though it is impossible to quantify a model's bias and variance separately, the two values can be summarized by loss functions. The aim is to reduce both bias and variance simultaneously (Goldstein et al. 2017).
Regression analysis is a statistical tool that models the relationship between quantitative variables using measurements of error from the model (Taranov). There are various studies using regression. The multivariate linear regression model were used for understanding cow evaluations (Mrode and Coffey 2008). The machine learning algorithm using multiple regression were applied to the carcass traits and saleable meat cuts prediction in commercial lambs (Alves et al. 2019).
In this study, we used next-generation sequencing data from a number of pig breeds. One of our goals was to predict individual risk score using the deleterious effect of nsSNPs. Each individual's risk score was predicted by the nsSNP set and its associated effects. However, the respective nsSNP contributions to the individual risk score were not equivalent. Thus, extraction of the minimal subset of nsSNPs that best explains the entire nsSNP effect, referred to as 'nsSNP marker selection,' was critical.

Whole genome sequencing data
The whole genome sequencing data consisted of 106 pigs (Berkshire (BKS) pigs, Duroc (DUR) pigs, Jeju Native pigs (JNP), Jeju Native Black pigs (JNB), Korea Native pigs (KNP), Korea wild boars (KWP), Landrace pigs (LDR), Yorkshire (YKS) pigs, and Yucatan miniature pigs (YMP)). The procedure for producing the sequencing data was as follows: FastQC software was used to perform a quality check on the sequencing data (Brown et al. 2017). The Trimmomatic-0.32 tool was used to remove the potential adapter sequence before aligning the sequence (Bolger et al. 2014). Paired-end sequence reads were mapped to the reference genome (Sscrofa 11.1) from the Ensembl database using the Bowtie2 default setting (Langdon 2015). The following open-source software packages were used for downstream processing and variant calling: Picard tools, SAMtools, and the Genome Analysis Toolkit (GATK) (Li et al. 2009;do Valle et al. 2016). The Picard tools 'CreateSequenceDictionary' and 'MarkDuplicates' were used to read the reference sequence to write a bam file containing a sequence dictionary and filter potential PCR duplicates, respectively. Index files were created for the reference and bam files using SAMtools. We used the local alignment of sequence reads to correct misalignments using the GATK 'Realigner-Target-Creator' and 'IndelRealigner' arguments. Base quality score recalibration was utilized to obtain accurate quality scores. The GATK 'UnifiedGenotyper' and 'Select-Variants' arguments were used with several criteria for calling variants. All variants with 1) a Phread-scaled quality score of < 30, 2) a read depth < 5, 3) an MQ0 (total count across all samples of mapping quality zero reads) > 4, or 4) a Phred-scaled p-value > 200 using Fisher's exact test were filtered to reduce false positive calls due to strand bias. The total number of SNP after SNP calling quality control was 37,410,105. The vcfmerge tool in VCFtools was used to merge all variant calling formats. The number of extracted SNPs was 36,586,008.

Examination of population structures
To survey the genetic relatedness of the pig samples, we performed principal component analysis (PCA). PCA is a technique for reducing the dimensionality of datasets, increasing interpretability but simultaneously minimizing information loss (Jolliffe et al. 2016). For this purpose, we utilized the Genome-wide Complex Trait Analysis program (Yang et al. 2011). The eigenvector and eigenvalues were computed, and major principal components 1 and 2 (PC1 and PC2) were used to check the separateness of each pig subspecies. Through PC1 and PC2, we examined the genetic relatedness between individuals in the highly-dimensional genomic dataset.

Extraction and risk score prediction of nsSNPs
To predict whether SNPs were nonsynonymous, we used the SnpEff program with the reference genome version Sscrofa 11.1. SnpEff is a variant annotation and effect prediction tool that can be used to identify differences like amino acid changes (Cingolani et al. 2012). We surveyed the missense variants (nsSNPs) using SnpEff. Additionally, the Ensembl VEP program was used to predict the SIFT scores of the nsSNPs (McLaren et al. 2016). SIFT is a homology-based sequencing tool that does not permit non-resistant amino acid substitutions and predicts whether amino acid substitutions of proteins have phenotypic effects. SIFT is based on the premise that protein evolution correlates with protein function (www.incodom.kr/Interpretation_DB-_sift).
SIFT scores range from 0 to 1. Amino acid substitutions at a given coding sequence with normalized probabilities of < 0.05 are predicted to be damaging, whereas those with normalized probabilities of > 0.05 are predicted to be tolerated. A lower tolerance index indicates a higher functional impact on the translated amino acid residues (Raghav and Sharma 2013).
Total risk score and linear regression for preprocessing of nsSNP data Each individual's total risk score by nsSNPs can be defined by the following equation: Where i represents the individual, G ij represents the allele-coded matrix, and (1-SIFT score) is the nsSNP's predicted deleterious effect. We chose the linear additive model for allele coding, and alternative alleles with amino acid substitutions were coded additively.
For machine learning (ML), nsSNP features should be ordered and selected using some statistical criteria. Here, we set the criteria to be the p-values of linear regression. The linear regression model that we used for the preprocessing of nsSNP data and acquisition of regression pvalues was as follows: Where g i is the i-individual's coded alleles, b is the coefficient of the regression model, and e i is the residual error.

ML for nsSNP feature selection
NsSNPs represent the deleterious effects of translated proteins. These deleterious effects can influence individual survival or risk. Our goal was to select the nsSNPs that significantly affect an individual's risk among the hundreds of thousands of variants. Thus, we attempted to select nsSNP features through ML (scikit-learn package) (Pedregosa et al. 2011). The root mean squared error (RMSE) statistic was used to identify optimal nsSNPs.

Quantitative trait loci (QTLs) of the selected nsSNPs
Investigating the traits related to each nsSNP and its deleterious effect can be interesting. To survey the characteristics of the selected nsSNPs related to pig traits, the QTL regions encompassing the selected nsSNPs were inspected. The QTL data were retrieved from the pig QTL database (www.animalgenome.org). The QTL regions in which the ratio of the selected nsSNPs/total nsSNPs was greater than 200-fold were depicted.

NsSNP descriptions
The number of SNPs per chromosome ranged from  Table 2). The number of SNP markers and nsSNPs per chromosome tends to co-vary across the board. In this study, the correlation between SNP markers and nsSNPs per chromosome was 0.997 (Figure 1(a)).

PCA of pig species
PCA was performed to examine the population similarity among 106 pigs. The pig subspecies were BKS, DUR, JNB, JNP, KNP, KWB, LDR, YKS, and YMP. Aside from proximal aggregates between the Landrace and Yorkshire pigs as well as the Duroc and Yucatan miniature pigs, the overall distinctions between breeds were confirmed. PC1 and PC2 explained 18% and 11% of the total variance, respectively (Figure 1(b)).

VEP: SIFT score prediction
We filtered nsSNPs from the SNP marker data using the SnpEff program. SnpEff predicts variant effects, which are indicated as annotation impacts. These annotation impacts, such as frameshift, stop-gain, and missense variants, are presented in the ANN field of the vcf file (http:// snpeff.sourceforge.net). In the VEP program, the nsSNP effects are predicted to be 'deleterious,' 'deleterious with low confidence,' 'tolerated,' or 'tolerated with low confidence' based on SIFT score. The number of nsSNPs that were predicted to be 'deleterious' was 21,559, and the number that were predicted to be 'deleterious with low confidence' was 10,010 (Supplementary Table 3).

Preprocessing before ML using simple linear regression
The deleterious effect of each nsSNP was represented by a SIFT score. The risk score of each nsSNP was calculated by 1-SIFT score. The total risk score per individual was defined as Eq. 1. The linear regression served as preprocessing for the nsSNP set because the p-value in the simple linear regression was utilized as the criteria for sorting the nsSNPs by importance. Supplementary Table 4 displays the p-value table for preprocessing using simple linear regression. The number of nsSNPs with a p-value < 10 −12 , between 10 −10 and 10 −12 , and between 10 −4 and 10 −6 was 1,583, 1430, and 7,456, respectively, among the 89,519 total regressed nsSNPs. Figure 2(a and b) illustrate the nsSNP regression plot with the lowest and highest p-values. The discrepancy between the two cases was evident.

ML and multiple linear regression
We performed multiple linear regressions after preprocessing. The scikit-learn package was utilized for ML (Pedregosa et al. 2011), and the RMSE statistic was used for feature selection. The plot of the RMSE against the number of nsSNPs resembled the ML theoretic curve (Anderson et al. 2018). The RMSE across the number of nsSNPs showed a minimum value at 16,100 nsSNPs (Figure 2(c)). We determined that 16,100 nsSNPs comprised the optimal subset of markers that best explains the total risk. The risk score predicted against the total risk score displayed linearity (Figure 3  The discrepancy between the two regressions is clear. The response variable was total risk score (see Equation 1), and the deleterious allele was coded additively. (c) Root mean squared error (RMSE) statistic along with the number of ordered nsSNPs. Among all regressed nsSNPs, 16,100 nsSNPs comprised the best subset for explaining the total risk score. These 16,100 nsSNPs can be used as markers for future genomic analyses.
(a)). The nine breeds are clustered into each breed as seen in PCA analysis. The total risk score was clustered with respect to each breed. Figure 3(b) demonstrates the number of nsSNPs in each QTL. The 16,100 significant nsSNPs were abundant in the fat-related and body weight QTLs. The supplementary provides the fat and body weight QTLs encompassing nsSNPs.
NMI encodes a protein that interacts with N-MYC and C-MYC. These proteins are two members of the oncogene MYC family. NMI also interacts with all STATs, except STAT2, and augments STAT-mediated transcription in response to the cytokines interleukin-2 (IL-2) and interferon-gamma (IFN-γ). NMI is linked to the JAK-STAT cascade and negative regulation of type I IFN production GO terms. TLR3 encodes a member of the tolllike receptor family, which plays a key role in pathogen recognition and activation of innate immunity. They recognize pathogen-associated molecular patterns and mediate the production of cytokines that are necessary for immunity development. CIB1 encodes a member of the EF-hand domain-containing calcium-binding superfamily, which interacts with many other proteins. These include platelet integrin alpha-IIb-beta-3, DNA-dependent protein kinase, presenilin-2, focal adhesion kinase, protein kinase D, and p21-activated kinase. CIB1 is linked to the type II diabetes mellitus and leukocyte count GO terms. EPHA2 belongs to the ephrin receptor superfamily of the protein tyrosine kinase family, which has been implicated in mediating developmental events, particularly in the nervous system. EPHA2 is linked to the protein kinase activity and protein tyrosine kinase activity GO terms. CSF2 encodes a cytokine that controls the production, differentiation, and function of granulocytes and macrophages and is associated with eosinophil count and inflammatory bowel disease. CSF2 is linked to the cytokine activity and growth factor activity GO terms (www.genecards.org).

Characteristics of the method
Our method was based on the 'from genome to genome' concept rather than 'from phenotype to genome' like classic genome-wide association studies (GWASs) (Catchpole et al. 2008). All of the information in our approach originates from the genome because each nsSNP as well as its effect (SIFT score), total risk score, regression coefficient, linear regression p-value, and ML stemmed from a self-genome basis. In particular, the total risk score of an individual (the regression response variable) was based on the SIFT score of each nsSNP. Thus, the accuracy of our analysis was primarily dependent on SIFT score estimation.

ML and multiple linear regression
In the scikit-learn package, multiple linear regression can be accompanied by nsSNP marker selection. Before performing multiple linear regression, we attempted to preprocess nsSNP sorting by p-value from the simple linear regression. This process ensured that the nsSNPs were arranged according to the lowest p-value order rather than deleterious effect. Thus, the nsSNP disposition was identified using a significant order for the regression rather than by adverse effects. After ML, the number of markers that best reflected the data in the nsSNP set was 16,100, which included not only the deleterious effect markers but also a large number of tolerated effect nsSNPs (only 4,466 deleterious nsSNPs out of 16,100).

Conclusion
NsSNPs have deleterious effects and are represented by SIFT scores. Given this knowledge, we predicted the total risk score using the SIFT scores of nsSNPs in various pig breeds. Furthermore, nsSNP markers that best explained the total risk score were selected using ML. In addition to the utility of the total risk score, the selected nsSNPs can serve as SNP markers for future GWASs and breeding research.