Bayesian model combining linkage and linkage disequilibrium analysis for low density-based genomic selection in animal breeding

ABSTRACT We combined linkage (LA) and linkage disequilibrium (LDA) analyses (emerging the term ‘LALDA’) for genomic selection (GS) purposes. The models were fitted to a simulated dataset and to a real data of feed conversion ratio in pigs. Firstly, the significant QTLs (quantitative trait locus) were identified through LA-based mixed models considering the QTL-genotypes as random effects by means of genotypic identity by descent matrix. This matrix was calculated at the positions of significant QTLs (based on LA) allowing to include the QTL-genotype effects additionally to SNP (single nucleotide polymorphism) markers (based on LDA) and additive polygenic effects in several GS models (Bayesian Ridge Regression – BRR; Bayes A – BA; Bayes B – BB; Bayes C – BC and Bayesian LASSO – BL). These models combing all mentioned effects were denominated LALDA. Goodness-of-fit and predictive ability analyses were performed to evaluate the efficiency of these models. For the real data, although slightly, the superiority of the LALDA models was verified in comparison to traditional LDA models for GS. For the simulated dataset, the models presented similar results. For both LDA and LALDA frameworks, BA showed the best fitting through Deviance Information Criterion and higher predictive ability in the simulated and real datasets.


Introduction
Linkage disequilibrium analysis (LDA) is the basic concept behind genomic selection (GS), since the associations between markers and QTLs (quantitative trait locus) is an attribute of the population as a whole. For this reason, it is expected that these associations be shared by all individuals of the population and be preserved for several generations. On the other hand, linkage analysis (LA) takes into account only the linkage disequilibrium within families (or specific crosses), which is undone through recombination after few generations. Wientjes et al. (2013) reported that prediction accuracy in GS depends on linkage disequilibrium from recent familiar structures, signalling for the contribution of LA for GS. Furthermore, LA can be used to identify chromosomal regions that are inherited from a common ancestral by using identity by descent (IBD) matrices. According to Bercovici et al. (2010), these regions are better candidates for detection of causal mutations affecting the phenotypes, and therefore, can be exploited under a genome-enabled prediction viewpoint.
Recently, both sources of information (LA and LDA) have been combined for GS purposes (Boichard et al. 2012;Luan et al. 2012;Wientjes et al. 2013) and showed satisfactory results, even outperforming the LDA models (traditional GS models) in relation to predictive ability in simulated and real data. These cited studies used only Genomic best linear unbiased prediction (GBLUP)-based models. However, to the best of our knowledge, there are no reports in literature approaching the efficiency of the Bayesian GS models (Bayes A, B, Cpi, LASSO and Ridge) in the presence of QTL-genotype effects obtained from LA.
Towards this orientation, we aimed to propose and test a new class of Bayesian LALDA models (combining LA and LDA) for genomic selection implemented through free software (QXPAK e R). These models were fitted to a simulated dataset and to a real data of feed conversion ratio (FCR) in an experimental outbreed pig population.

Simulated data
We used a public simulated dataset (Usai et al. 2014). The base generation (GEN0) was composed by 1020 (20 males and 1000 females) unrelated individuals. Each one of the next four nonoverlapping generations (GEN1, GEN2, GEN3 and GEN4) consisted of 20 males and 1000 females from GEN0 by randomly mating each male with 50 females. The pedigree was composed by 4100 individuals (males only from GEN0 and all the individuals from GEN1 to GEN4). The simulated genome comprised 5 chromosomes, each one with a size of 99.95 Mb carrying 2000 equally distributed SNPs (single nucleotide polymorphism) (1 SNP per 0.05 Mb). Currently, this number of 10,000 SNPs can be considered as low-density for the majority of livestock species. A total of 50 SNPs were randomly sampled and treated as true QTLs. The sum of the QTLs' additive effects over each individual was defined as true breeding values (TBV). The simulated trait (Y1) was generated assuming heritability equal to 0.35. The phenotype of each individual was given by TBV plus the random residual effect sampled from a normal distribution with mean zero and variance that ensures the heritability equal to 0.35. Data of 3000 females from GEN1 to GEN3 were used in the training population, so that the rest of the population (1020 individuals from GEN4) were considered for validation. All used data are available in the following electronic address: http://qtl-mas-2012.kassiopeagroup. com/en/dataset.php.

Real data: experimental population and phenotypic data
The use of animals was approved by the Animal Care and Use Committee at Department of Animal Science (RG Number 181 444) of the Universidade Federal de Viçosa, Brazil.
The phenotypic data were obtained from an experiment carried out in Viçosa, Minas Gerais State. A three-generation resource population was created and managed as described by Hidalgo et al. (2013) and Verardo et al. (2014). Briefly, 2 naturalized Piau breed grandsires were crossed with 18 granddams from a commercial line composed of Large White, Landrace and Pietrain breeds in order to produce the F1 generation, from which 11 F1 sires and 54 F1 dams were selected. These F1 individuals were crossed to produce the F2 population, of which 341 animals were phenotyped for FCR. This trait was chosen due to its economic importance and to the difficulty to be measured, which justify the use of GS. The animals were weighted from birth and weaned at 21 days old. After weaning, they received commercial feed ad libitum. They were kept in individual pens from 77 to 105 days old, where the daily feed intake (kg/day), the daily weight gain (kg/day) and FCR (kg/kg) traits were measured. The phenotypic mean, median, standard deviation, maximum and minimum values for FCR were, respectively, 2.78, 2.71, 0.59, 1.53 and 5.25.

DNA extraction, genotyping and SNP quality control
Genomic DNA was extracted from the white cells of parental, F1 and F2 animals; more details can be found in Band et al. (2005). The low-density customized SNPChip with 384 markers was based on the Illumina Porcine SNP60 BeadChip (San Diego, CA, USA). These SNPs were selected according to QTL positions that were previously identified in this population by using meta-analyses (Silva et al. 2011) and fine mapping (Hidalgo et al. 2013;Verardo et al. 2014). Thus, although a small number of markers have been used, the customized SNPchip based on previously identified QTL positions ensures appropriate coverage of the relevant genome regions in this population.

Mixed model for QTL detection through LA
In order to implement the proposed LALDA model, first we have to fit the Fernando and Grossman (1989) model (1) aiming to identify significant QTLs under a LA approach. This model was fitted separately for each chromosome.
where y is the vector (with dimension equal to n, the number of phenotyped animals) of FCR records; X is the incidence matrix of fixed effects vector (β) (sex, lot and halotane gene genotypes); Z is the incidence matrix of polygenic (u) and QTL-genotype (w) effects, being w = [w 1 , w 2 , . . . , w n ] ′ and e represents the vector of residual terms. Under independence between u, w and e, the following distributions were assumed: where Q is the genotypic IBD (Nagamine 2005) matrix (a covariance matrix which elements are the probabilities that individuals are identical by descent conditional on pedigree and SNP marker information); and e N(0, s 2 e I n ). Model (1) was fitted through QXPAK.5.05 software (Pérez-Enciso and Misztal 2011). For the simulated dataset (with 10,000 SNPs) we need to use the options MEMORY_RAM = yes and TRANSPOSE, which are indicated for large scale genotyping data. The Q matrices are stored in files called 'zran.x', were x represents the SNP positions scanned every cM. Significant QTLs are identified at each tested position by using a likelihood ratio test (LRT). In this context, model (1) is the full model, whereas this model without w effect is the null model. The p-values for this test is calculated at each position x. Once the significant QTL have been identified at position x, the Q matrix stored at this position will be used to compose the Bayesian LALDA models introduced in the next section.

Bayesian LALDA models for genomic selection
It is important to emphasize that if a significant QTL is located exactly in the physical position of a given SNP marker, such marker will be excluded from the LDA component of the LALDA model, being considered only in the LA component. It avoids that the same marker be used simultaneously in both components (LA and LDA) of LALDA model, which can cause multicollinearity problems.
In order to implement LALDA models, the traditional genomic selection Bayesian regression models (so called Bayesian alphabet) have been adapted to bear the QTL-genotype (w) effect from LA. In addition to this effect, the proposed LALDA model (2) also contains the SNP (g i ) and additive polygenic (u) effects. The inclusion of u is due to the small number of SNP markers used here. Assuming only one significant QTL from LA, we have: where y* is the vector of FCR records pre-corrected for fixed effects (sex, lot and halotane gene genotypes); 1 is a vector of ones and µ is the overall mean; g i is the additive effect of SNP marker i; x i is the incidence vector of each marker (assuming the values 2, 1 or 0 that represents the SNP genotypes AA, Aa and aa, respectively) and m is the number of markers; Z is the incidence matrix of polygenic (u) and QTL-genotype (w) effects, assuming, respectively, u N(0, s 2 u A) and w N(0, s 2 w Q); and e is the residual vector, e N(0, s 2 e I n ). In summary, the terms Zw and m i=1 x i g i represent, respectively, the LA and the LDA components of the proposed LALDA model.
All models treated here assume the same prior distribution for the residual variance, a scaled inverted chi-squared distribution given by s 2 e |n e , S e n e S e x −2 n e . The differences between these models will be given by the prior distributions assumed for the marker effects.
The ridge regression (BRR) model is the Bayesian version of RR-BLUP proposed by Meuwissen et al. (2001), which assumes the same genetic variance for all markers; i.e.
The prior distribution for the marker genetic variance is the following scaled inverted chi-squared distribution, s 2 g |n g , S g n g S g x −2 n g . The Bayes A (BA) model, differently from BRR, assumes one specific variance (s 2 g i ) for each marker i, g i N(0, s 2 g i ), whose prior distribution is given by s 2 g i |n g , S g n g S g x −2 n g . Thus, the joint prior distribution for the genetic variance of all markers is given by m i=1 s 2 g i |n g , S g . Alternative approaches for BRR and BA models are Bayes B (BB), Bayes Cπ (BCπ) and Bayesian Lasso (BL), which assume variable selection given by the shortening of the regression coefficients (marker effects). BB and BCπ extend BA and BRR, respectively, by introducing an additional parameter π (probability of the marker effect be equal to zero), and this parameter is treated as unknown and it is assigned a Beta prior π ∼ Beta (p 0 , π 0 ), with p 0 > 0 and π 0 ɛ [0, 1] (Pérez and de los Campos 2014). The BB model assumes as prior for the marker effects a normal mixture distribution given by . Similarly to BA, s 2 g i denotes that each SNP has its own variance with prior distribution given by s 2 g i |n g , S g n g S g x −2 n g . In the BCπ model, the prior distribution for marker effects is also given by a normal mixture distribution, g i |p (1 − p)N(0, s 2 g ) + pN(0, s 2 g = 0), however, similarly to BRR, this model assumes the same genetic variance for all markers with prior distribution given by s 2 g |n g , S g n g S g x −2 ng . Differently from BB and BC, BL provides regression coefficients estimators that solve the following optimization problem: min the term m i |g i | is the sum of the absolute values of the regression coefficients and λ is the parameter that controls the shortening these coefficients.
The BGLR (Pérez and de los Campos 2014) package of R software was used to implement all considered models (see the codes at supplementary material). Analyses were carried out using Monte Carlo Markov Chain (MCMC) algorithms saving every 5th cycle from a total of 130,000 iterations, after 30,000 of burn-in. Global convergence was checked by Geweke's Z criterion and visual inspection of trace plots.

Comparing Bayesian LALDA models for genomic selection
All mentioned models (BRR, BA, BB, BC e BL) from (2) were compared in terms of goodness-of-fit considering the whole dataset through Deviance Information Creiterion (DIC) (Spiegelhalter et al. 2002). As previously mentioned in the subsection 2.1, for the simulated dataset the training and validation populations were composed by 3000 and 1020 individuals, respectively. For the real dataset, the predictive ability evaluation was implemented by means of five-fold cross-validation analysis. For this, of the 5 subsamples (4 subsamples with 68 animals and 1 subsample with 69 animals), a single subsample was retained as the validation data for testing the predictive ability, and the remaining 4 subsamples were used as training data. The cross-validation process was repeated five times, with each of the subsamples used exactly once as the validation data.
The predictive ability was reported as the correlation between the pre-corrected phenotype (y*) from validation dataset and the estimated total breeding value (EBV) denoted asŷ * . From LALDA models,ŷ * =

Heritability estimates
After identifying the best model ('Bayesian alphabet' under LALDA or LDA approaches), the variance components and heritability were estimated using the whole dataset. For LALDA model, the heritability estimate was obtained as follows: whereŝ 2 a is the estimate of the genetic variance explained by all SNP markers based on LDA,ŝ 2 w is the estimate of the genetic variance explained by the QTL based on LA,ŝ 2 u is the estimate of the additive polygenic variance, andŝ 2 e the estimate of the residual variance.
In (3), for the methods assuming the same variance for all markers (BRR and BCπ), the variance componentŝ 2 a was accessed asŝ 2 a = 2ŝ 2 The term p i is the MAF of marker i.

QTL detection through LA and model comparisons
For the simulated data, the 50 true QTLs postulated in the simulation were detected through LA. As the QTLs are assumed as specific SNPs (see subsection 2.1), these markers were excluded from the LDA component of the LALDA model, being considered only in the LA component. With respect to real data, Figure 1 shows the profile of the LRT test across SSC8, since there were no significant QTLs in other studied chromosomes. From this figure, we found a significant QTL at position 136 cM (maximum LRT statistics). Table 1 shows the DIC values for all fitted models (BRR, BA, BB, BC and BL under LDA and LALDA viewpoints) considering both simulated and real datasets.
For the simulated dataset, the LALDA and LDA frameworks showed exactly the same goodness-of-fit measures. On the other hand, for the real data, the LALDA model outperformed the LDA (lower DIC values) for all 'Bayesian alphabet' models (BRR, BA, BB, BC and BL). Additionally, the BA, followed by the BB and BL models presented lower DIC in both LALDA and LDA models in the two evaluated datasets. Thus, models assuming one specific variance per marker fitted better to the both datasets. Table 2 brings the results of cross-validation analysis for predictive ability. For the real dataset, we calculated the correlation between the observed and predicted values; whereas for the simulation dataset, the correlation between observed and TBV was used for this aim. Although the performance of LALDA and LDA models has been similar when considering the real dataset, the results followed the trend observed for goodness-of-fit analysis (Table 1). In summary, the LALDA approach was slightly superior to the LDA for all 'Bayesian alphabet' models (BRR, BA, BB, BC and BL). The BA model showed the best predictive ability, reinforcing the relevance to assume one specific variance per marker for the genomic analysis of FCR trait in the studied population.
With respect to the simulated data, since the true QTLs were assumed as specific SNP markers, the fact of exploiting the marker effect under a LA or LDA approach did not influence   the predictive ability. However, in terms of Bayesian models, the same pattern observed for real data was provided for simulated data (i.e. BA, followed by the BB and BL models presented higher predictive abilities).

Heritability estimates
For the real dataset, since the LALDA model via Bayes A outperformed the other models in terms of goodness-of-fit and predictive ability through cross-validation, this model was chosen to provide heritability estimates according to Equation (3). LALDA and LDA presented the same performance for the simulated dataset, and the Bayes A model outperformed the others, thus the Equation (3) was used to estimate the heritability in both real and simulated datasets.
Considering the real dataset, the heritability estimated for FCR was equal to 0.23, with posterior standard deviation equal to 0.038. This estimate is consistent with those reported in literature. When using the simulated dataset (assuming h 2 = 0.35), it is possible to infer that all models underestimated the heritability. The RR-BLUP, BC, BLASSO, BB and BA provided estimates equal to 0.18, 0.20, 0.25, 0.27 and 0.29, respectively.

QTL detection through LA and model comparisons
For the simulated dataset, although a relative 'high' number of QTLs (a total of 50) have been postulated in the simulation, the LA model was able to detect these ones. As mentioned earlier, since the true QTLs were assumed as being specific SNPs, the LA was applied for a kind of 'optimal situation' (the marker is the QTL), thus improving the chances to detect all QTLs.
In terms of the real dataset, in comparison with other usually traits in pig breeding, we note that there are relatively few studies approaching QTL detection for FCR in crossed pig populations. The reason is the difficulty of measurement and the high cost involved in feed intake phenotyping. Similar to the results presented here, Beeckmann et al. (2003) also reported significant QTL for FCR on SSC8 (between the 96.3 and 106 cM) using animals from crosses of pure Meishan, Pietrain and European wild pig breeds. Zhang et al. (2009) using Duroc White × Erhualian Chinese crossbred populations identified significant QTLs for FCR in SSC8 at position 96 cM that explained 3.27% of the phenotypic variance. The QTL position for FCR do not exactly match the positions of the SNPs markers on the physical map. The two most closest SNP markers, ALGA0049550 and ALGA0050287, are located, respectively, at positions 132.3 and 138.2 cM. (ALGA0049550) e 138.2 (ALGA0050287) on SSC8. Thus, the QTL-genotype effect at position 136 cM was included in the Bayesian models for genomic prediction as the LA component simultaneously with all SNP markers (which composed the LDA component). The IBD matrix stored by QXPAK5.05 at the mentioned position was considered as the covariance matrix associated to QTL-genotype effect.
The DIC is relevant only to investigate the models' properties and to infer about parameter estimates. However, the genomic selection depends on predictive ability concept, which measure the prediction efficiency of nonobserved phenotypes based on known parameter estimates (in this case the SNP marker effect estimates). Thus, models chosen by goodness-of-fit measures may not be the best models to make predictions of new observations. According to Bishop (2006), the main reason for this problem is the overfitting, which can evaluated through cross-validation analysis.
The best performance of LALDA framework for real dataset (Tables 1 and 2) can be due mainly to the small number of markers used here, since it enabled to exploit a relevant region (136 cM) not directly considered in the LDA because there is no SNP marker physically located at this position. This kind of extra information added to the model may has contributed to the improvement of the model performance. However, for the simulated dataset, this advantage of LALDA models was not observed because the true QTLs were assumed as the markers themselves. For this situation which the significant QTL positions from LA are coincident with the physical SNP map position, we adopted the LALDA considering these markers only in the LA component (QTL-genotype effect). In this context, the fitted model assumed that linkage disequilibrium occurs only within families (or specific crosses), and can be undone through recombination after few generations. Since the comparisons between the mentioned LALDA with the traditional LDA models resulted in similar results, we can to infer that LA modelling was not relevant for the simulated dataset.
The best performance (goodness-of-fit and predictive ability analysis) of BA model for both datasets, independently of the LD approach (LDA or LALDA), suggests that the assumption of one variance per marker is suitable (since BA outperformed BRR and BCπ) and that variable selection was not effective (since BA outperformed BB and BL).
Although there are few references comparing the LALDA with the LDA models, Boichard et al. (2012) studying the entire Holstein cattle population from France, reported that predictive ability for total milk production, total protein production, total fat production, milk protein percentage, milk fat percentage and fertility were 0.60 and 0.56, 0.57 and 0.55, 0.66 and 0.59, 0.73 and 0.73, 0.81 and 0.72, and 0.39 and 0.35 respectively for LALDA (denominated as QTL-BLUP) and LDA (GBLUP) models. In general, although the authors used a much larger number of animals and markers, the LALDA model proved to be satisfactory. Luan et al. (2012) studying populations of Italian dairy cattle reported the importance of identity-by-state information for the accuracy of genomic selection. Comparing the predictive ability of models by combining information from LA and LDA, the authors found values of 0.60 and 0.59, 0.60 and 0.58 and 0.63 and 0.61, respectively, for the LALDA and LDA models, considering the traits total milk production, total protein production and total fat production. Although the models presented similar performance, the authors concluded that LA component explains why the prediction equations derived for one breed may not predict accurate genome-wide breeding values when applied to other breeds, since family structures differ among breeds.

Heritability estimates
According to a general review by Rothschild and Ruvinsky (2010), the heritability for FCR in pigs is generally moderate (ranging from 0.12 to 0.58, with an average of 0.30) and suggest that selection would be successful. Under a genomic selection approach, Jiao et al. (2014) working with Duroc pigs reported the heritability estimated obtained through Bayes A equal to 0.32 (0.09). Lopez et al. (2016) reported heritability estimates for FCR closest to 0.30, and concluded that genomic selection for FCR is clearly superior to the conventional scheme in terms of monetary genetic gain and profit. According to Akanno et al. (2013), when using genomic selection for feed conversion in pigs assuming heritability equal to 0.32, the accuracy of selection index ranged from 0.38 to 0.66, whereas the conventional selection provides accuracies between 0.10 and 0.64.

Conclusions
Bayesian LALDA models proposed here enable to infer about the nature of LD at specific regions of interest, thus implying in new insights on whole genome prediction such as genetic architecture information and genomic breeding in crossed populations. The available theoretical support and the related computational features characterize these models as a new methodology for genomic selection.