An association analysis between a polymorphism in the SEC24A gene and lipid traits recorded in Duroc pigs

Abstract By regulating the expression of low-density lipoprotein receptors (LDLR), the SEC24A protein modulates the clearance of plasma cholesterol in mice. The whole-genome sequencing of five Duroc boars that sired a Duroc purebred experimental population (LIPGEN population, 350 individuals) revealed the segregation of one putative nonsense mutation located at the SEC24A (rs336401826, c.44G>A) gene. Genotyping of this polymorphic site in the LIPGEN population highlighted the existence of a significant departure from the Hardy Weinberg equilibrium (χ 2 = 5.05, p-value = 0.024527). Despite the fact that SEC24A inactivation should affect serum lipid concentrations, we did not observe an association between SEC24A genotype and such traits. A thorough revision of the annotation of the SEC24A rs336401826 SNP revealed that this polymorphism is probably located in the 5’UTR of the SEC24A gene, rather than at the beginning of the first exon, hence not introducing the predicted premature stop codon that could truncate the SEC24A protein. Given that the annotation of loss-of-function mutations in pigs is far from perfect, careful manual curation of such mutations is recommended before undertaking the analysis of their effects at the experimental level. HIGHLIGHTS One mutation (c.44 G>A) predicted to disrupt the SEC24A protein segregates in a Duroc pig population. This polymorphism is not associated with serum lipid levels despite its expected effect on such traits. Manual curation of the annotation of this putative nonsense polymorphism has revealed that it maps to the 5’UTR of the SEC24A gene.


Introduction
An important fraction of proteins synthesised in the endoplasmic reticulum need to enter the secretory pathway to reach other intracellular organelles, the cell surface or the extracellular space (Bonifacino and Glick 2004). The transport of these proteins to the Golgi complex, where they can undergo post-translational modifications, is mediated by vesicles coated with coat protein complex II (COPII), which are assembled through the recruitment of SEC24 and other proteins (Lee et al. 2004). Indeed, the SEC24 subunit of the COPII complex contributes to cargo selection by recognising specific molecular signals (Lee et al. 2004). In mammals, there are multiple forms of COPII coat structures specialised in different types of cargos and, in this context, four paralogous SEC24 genes (SEC24A to SEC24D) have been reported so far (Zanetti et al. 2011). In mice, the inactivation of the SEC24A gene is not lethal but it causes a reduction in plasma cholesterol because the secretion of proprotein convertase subtilisin/kexin type 9 (PCSK9), which is a key mediator in the lysosomal degradation of lowdensity lipoprotein (LDL) receptors (Lagace 2014), is shut off, thus leading to an hepatic upregulation of LDL receptors and, in consequence, to an increased cholesterol clearance (Chen et al. 2013).
In a previous experiment (M armol-S anchez et al. 2020, 2021), we sequenced the whole genomes of the five Duroc boars that sired a purebred population of 350 Duroc barrows (LIPGEN population), as reported by Gallardo et al. (2008Gallardo et al. ( , 2009. The analysis of these five whole-genome sequences (WGS) revealed the segregation of a putative nonsense mutation located in the SEC24 Homolog A, COPII Coat Complex Component (SEC24A) gene (our unpublished data). This mutation was recorded at Ensembl repositories (https://www.ensembl.org/index.html) with the rs336401826 (c.44 G>A, SEC24A) identifier. The goal of the current work was to investigate the segregation of this putative nonsense SNP in the LIPGEN population, as well as to determine its association with lipidrelated traits in order to verify whether SEC24A inactivation in pigs has the same consequences on fat metabolism than those previously reported in mice.

Sequencing the genomes of five Duroc boars
In brief, genomic DNA was purified from blood samples belonging to the five boars that sired a Duroc population of 350 offspring (LIPGEN population), as reported by Gallardo et al. (2008Gallardo et al. ( , 2009). Subsequently, these genomic DNA samples were sequenced at the Centre Nacional d'An alisi Gen omica (CNAG, Barcelona, Spain). Paired-end multiplex libraries were prepared according to the instructions of the manufacturer with the KAPA PE Library Preparation kit (Kapa Biosystems, Wilmington, MA) and processed to generate 100 bp paired-end reads on a HiSeq2000 Illumina instrument. Base calling and quality control analyses were performed with the Illumina RTA sequence analysis pipeline. Quality-checked filtered reads were then processed and any remaining sequence adapters were trimmed by making use of the Trimmomatic software (v.0.36) with default parameters (Bolger et al. 2014). Trimmed paired-end sequences were aligned against the pig Sscrofa11.1 reference genome (Warr et al. 2020) with the BWA-MEM algorithm (Li 2013) and default settings. Sequence alignment map (SAM) formatted files were then sorted and transformed into binary (BAM) formatted files and PCR duplicates were subsequently identified with the MarkDuplicates package from Picard tool (https://broadinstitute.github.io/ picard/) and removed to perform INDEL realignment with the IndelRealigner package from the Genome Analysis Toolkit (GATK v.3.8, McKenna et al. 2010). Base quality score recalibration (BQSR) was implemented with BaseRecalibrator from GATK v.3.8. Variant calling was performed using the HaplotypeCaller tool according to GATK best practices recommendations. Finally, SNP variants were retained with SelectVariants package and quality filtered using the FilterVariants tool from GATK v.3.8, according to the following parameters: QD < 2.0, FS > 60.0, MQ < 40.0, MQRankSum < À12.5, ReadPosRankSum < À8.0, SOR > 3.0.
Retrieval of the whole-genome sequences from 120 pigs available at public repositories A total of 120 whole-genome sequences from wild boars and domestic pigs (Sus scrofa) were retrieved from the NCBI Sequence Read Archive (SRA) database (https://www.ncbi.nlm.nih.gov/sra). The 120 selected genome sequences (Supplementary Table 1 Further details about breeds and sample origins can be found in M armol-S anchez et al. (2020, 2021). All SRA files from selected porcine individuals were downloaded from SRA public repositories and converted into fastq format by using the fastq-dump 2.8.2 tool available in the SRA-toolkit package (ncbi.github.io/ sra-tools/). Subsequent processing, SNP calling and filtering were performed analogously to what has been reported in the previous section for the 5 whole-genome sequenced Duroc sires.

Isolation of genomic DNA and genotyping
The manipulation of Duroc pigs followed Spanish national guidelines and it was approved by the Ethical Committee of Institut de Recerca i Tecnologia Agroaliment aries (IRTA). Genomic DNA was purified from blood samples collected from the offspring of the five sequenced Duroc boars by using a standard phenol-chloroform extraction protocol (Gallardo et al. 2008(Gallardo et al. , 2009). All Duroc pigs from the LIPGEN population were typed with the Porcine SNP60 BeadChip (Illumina) in a previous experiment (Gonz alez-Prendes et al. 2017). Moreover, a dedicated TaqMan assay was used by the Veterinary Molecular Genetics Service from the Universitat Aut onoma de Barcelona (http:// svgm.es/es/Home) to genotype the rs336401826 G > A SEC24A (N ¼ 320) mutation by employing a QuantStudio 12 K flex Real Time PCR System. The HWChisq module from the HardyWeinberg R package (https://CRAN.R-project.org/package=HardyWeinberg) was used to compute a Chi square test of independence to infer whether genotype frequencies deviate from the Hardy Weinberg equilibrium (Table 1).

Association analyses with lipid traits
A summary of the lipid-related phenotypes considered in our experiment is provided in Supplementary Table  2. The measurement of lipid traits has been previously described in detail (Gallardo et al. 2008(Gallardo et al. , 2009Quintanilla et al. 2011). Serum lipid traits were measured after weaning and before starting the fattening period ($45 days of age), as well as prior to slaughter ($190 days of age), according to specific experimental protocols for determining total serum cholesterol (Richmond 1992), low density lipoprotein (LDL, Friedewald et al. 1972), high density lipoprotein (HDL, Rafai and Warnick 1994) and triglyceride (Fossati and Prencipe 1982) concentrations. Intramuscular fat (IMF) was measured by Near Infra-red Transmittance (NIT, Infratec 1625, Tecator Hoganas, Sweden), while muscle cholesterol content was assessed following Cayuela et al. (2003). A chromatography analysis of methyl esters (Mach et al. 2006) was used to determine the fatty acid composition in the C:12 to C:22 range and they were subsequently expressed as a percentage of the IMF for the gluteus medius (GM) and longissimus dorsi (LD) skeletal muscles. Association analyses were performed by applying a univariate linear mixed model framework with the Genome-Wide Efficient Mixed-Model Association (GEMMA) software (Zhou and Stephens 2012) and the following statistical model: where y is the vector of phenotypic observations for every individual; a is a vector containing the intercept plus the included fixed effects, i.e. batch effect with four categories (all traits), farm origin effect with three categories (all traits) and data of extraction with two categories within batch (serum lipids traits). The a vector also contains the regression coefficients of the following covariates: (1) IMF content in the LD or GM skeletal muscles for LD and GM fatty acid composition, (2) backfat thickness (measured between 3 rd and 4 th ribs) and live weight at slaughterhouse for serum lipids measured at 45 and 190 days of age, and (3) backfat thickness (measured between 3 rd and 4 th ribs) for IMF content; W is the incidence matrix relating phenotypes with the corresponding effects; x is the vector of the genotypes corresponding to the set of selected polymorphisms; d is the allele substitution effect of the analysed polymorphism; u is a vector of random individual genetic effects with a n-dimensional multivariate normal distribution MVNn (0, k s À 1 K), where s À 1 is the variance of the residual errors, k is the ratio between the two variance components and K is a known relatedness matrix derived from the SNPs; and e is the vector of residual errors. A false discovery rate (FDR) approach (Benjamini and Hochberg 1995) was used for multiple testing correction.

Chromosome-wide association analyses
Genotyping data for the SEC24A rs336401826 G>A nonsense mutation were merged with the genotypes of 3684 SNPs mapping to chromosome 2 (SSC2) and generated with the Porcine SNP60 BeadChip (Illumina, San Diego, CA) as reported by Gonz alez-Prendes et al. (2017). The GEMMA software (Zhou and Stephens 2012) was used to perform this association analysis by considering the methods reported previously and chromosome-wide significance was assessed at the nominal level (p-value <0.05) as well as after correction for multiple testing using the Bonferroni approach.

Multiple sequence alignment of SEC24A transcripts from different species
We used the Clustal Omega tool of the EMBL-EBI sequence analysis toolkit (Madeira et al. 2019) to generate a multiple sequence alignment of the annotated coding sequence (CDS) of the 1 st exon of the SEC24A transcripts retrieved from eight available porcine assemblies (Sscrofa11.1, USMARC, Bamei, Meishan, Rongchan, Pietrain, Landrace and Large White), as well as from the human assembly (GCRh38.p13). Moreover, the MultAlin software (Corpet 1988) was employed to generate a multiple alignment of the porcine SEC24A protein with other homologous SEC24A proteins annotated in human (GCRh38.p13), cattle (ARS-UCD1.2), mouse (GRCm38.p6) and dog (CanFam3.1) assemblies.

Results and discussion
A significant departure from the Hardy Weinberg equilibrium was found (v 2 ¼ 5.05, p-value ¼ 0.024527, Table 1) for the rs336401826 G>A SEC24A variant, mainly because there was a slight deficit of the two homozygous genotypes. Genotyping errors are one of the most common causes of observing Hardy-Weinberg departures, but that does not seem to be the case because all three AA, GA and GG genotypes are clearly discernible (Figure 1). Another potential reason for the Hardy-Weinberg departure observed by us would be that the frequencies of the SNP might be altered by selection or stochastic factors directly acting on it or on a linked nearby mutation. Nevertheless, the statistical significance of the departure was so weak that no biological inferences should be made. More importantly, no missing genotypes were detected, thus evidencing that this putative nonsense mutation is not lethal. As previously said, this result was anticipated because SEC24A inactivation is expected to just reduce plasma cholesterol concentration via up-regulating LDL receptor membrane concentration due to a specific block in the secretion of PCSK9 (Chen et al. 2013). The rs336401826 (c.44G>A, Trp15 Ã ) variant is located at codon 15 of a gene encoding a protein of 1133 amino acid residues, so the premature stop codon should provoke the full inactivation of the protein and marked changes in blood lipid levels. Contrary to our expectations, statistical analyses did not show any significant association between SEC24A rs336401826 genotypes (GG, N ¼ 152; GA, N ¼ 149 and   Table 3). In contrast, significant associations were found between the nonsense rs336401826 mutation and C14:0, C16:0, C16:1 and C20:2 fatty acid contents in the GM skeletal muscle (Table 2), although only at the nominal level (p-value < 0.05). Data about average GM C14:0, C16:0, C16:1 and C20:2 contents in pigs with different rs336401826 genotypes are displayed in Table 3. A total of 2951 SNPs mapping to the SSC2 porcine chromosome were successfully analysed in chromosome-wide association analyses, which revealed that the rs336401826 SEC24A polymorphism was among the SNPs displaying the most significant associations with GM C16:1 (nominal p-value ¼ 1.577E-03) and GM C20:2 (nominal p-value ¼ 4.841E-03) phenotypes ( Figure 2). It should be noted that C14:0, C16:1 and C20:2 fatty acids are found in very low amounts in the intramuscular fat of the GM muscle, so it is challenging to measure them with accuracy. On account of this, the results obtained in the current association analysis should be interpreted with caution until replicated in other porcine populations.
When we assessed the segregation patterns of the SEC24A mutation in the 120 WGS from European and Asian pigs and wild boars, the frequency of the A Table 3. Average C20:2, C16:1, C14:0 and C16:0 fatty acids contents in the gluteus medius (GM) skeletal muscle of pigs with GG and AA genotypes for the rs336401826 (c.44G>A) mutation in the porcine SEC24A gene.  allele was higher than that of the G allele, despite its potential deleterious effect on gene function, and this was observed in European and Asian domestic breeds as well as in European and Asian wild boars (Supplementary Table 1). In the EDM group, the G allele was predominantly found in Duroc pigs (N ¼ 7; 2 GG, 4 GA and 1 AA). Aside the Duroc breed, only one GA Yucatan pig and three GG Yucatan, Yorkshire and Large White pigs carried the G-allele, while all the remaining EDM pigs harboured AA genotypes (Supplementary Table 1). Despite the high frequency of the A-allele in most porcine breeds, the G-allele is the reference allele in the Sscrofa11.1 porcine assembly (Warr et al. 2020). The most likely reason of this outcome is that this assembly was built using PacBio RS II deep long- Figure 3. Multiple sequence alignment showing the sequence of the protein encoded by the SEC24A gene in different species (human, mouse, cattle, dog and pig). The portion of the pig SEC24A protein that is not homologous to the SEC24A protein sequences from the remaining species is clearly visible, evidencing that the translation start codon of the porcine SEC24A protein sequence is incorrectly annotated in the Sscrofa11.1 assembly. read sequencing from one single female pig belonging to the Duroc breed (Warr et al. 2020), in which the G variant is frequent.
In the light of these results, that were not consistent with a deleterious effect of the nonsense rs336401826 G>A SEC24A mutation, we decided to further investigate the genomic context of this SNP. By doing so, we determined that the common predicted ATG translation start codon is located 141 nucleotides downstream from the predicted start codon of SEC24A exon 1 in the Sscrofa11.1 assembly (Supplementary Figure 1). Interestingly, in the Rongchang, Large White and Landrace assemblies the SEC24A exon 1 coding sequence was correctly annotated, whereas other available porcine assemblies like USMARC, Meishan, Bamei and Pietrain contained a wrong annotation, analogously to what was found for the Sscrofa11.1 assembly (Supplementary Figure 1). The alignment of the mature SEC24A protein sequences from human, mouse, pigs, cattle and dogs also evidenced that the porcine sequence contains an initial segment that is not present in the remaining species (Figure 3), a result that once again suggests that the starting codon of the porcine SEC24A sequence is incorrectly annotated. Altogether, these findings would imply that the rs336401826 G>A SEC24A polymorphism is located within the 5'UTR of the SEC24A transcripts, rather than at codon 15 of the coding region.
Although the current Sscrofa11.1 assembly (Warr et al. 2020) represents a considerable improvement with regard to the previous one (Sscrofa10.2, Groenen et al. 2012), there are still annotation problems that need to be corrected. The annotation of nonsense mutations can be particularly error-prone when gene and/or transcript structures are not well known or when the existence of nearby compensatory polymorphisms is ignored. Another important source of error can be pseudogenes, which are expected to contain a much higher rate of deleterious mutations than that observed in functional genes. As a whole, these findings reinforce the need of manually curating the annotation of loss-of-function mutations in pigs in order to be able to predict their functional consequences.

Conclusions
Rather than a nonsense mutation, the rs336401826 G>A SEC24A variant is a benign 5'UTR polymorphism with no harmful consequences. In general, the annotation of loss-of-function mutations is challenging and this is why manual curation becomes essential to prevent wrongful annotations. At this point, it is difficult to assess how many putative lossof-function mutations are incorrectly annotated in the current pig genome assembly, but caution is advised. Careful inspection and verification of the annotation of loss-of-function mutations is recommended before undertaking the analysis of their effects at the experimental level.