Nuclear DNA barcodes for cod identification in mildly-treated and processed food products

ABSTRACT Gadoids are a group of fish with historical importance in the fishing industry. The high demand for cod is one of the reasons why cod products are often mislabelled, and numerous observations have been made on the replacement of Atlantic cod (Gadus morhua) by cheaper species or its illegal capture in contravention of fish quotas. Fish species identification is traditionally based on morphological features, but this may be difficult in case of heat-treated or processed products, or where the species look similar, as in the Gadoid group. DNA-based approaches (using either nuclear or mitochondrial DNA) are most commonly used in this case, due to their high specificity and to the high resilience of the target molecules to food processing techniques. In this article, we identified, using an automated screening approach, novel barcode regions and their associated primers in the nuclear genome, to be used for the efficient identification of Gadoids. The barcode regions were tested on official and commercial samples, raw or mildly treated products, like frozen, or salted, as well as pre-cooked complex mixtures and processed samples, using next-generation sequencing (NGS) technique. The method proposed could complement existing fish identification strategies in establishing an efficient framework to detect and prevent frauds along the food chain.


Introduction
Seafood consumption has significantly increased over the last decades and so has the demand for fish, which is considered the most common protein source consumed worldwide. Owing to this increasing popularity, seafood has become one of the top categories associated with fraud issues. Seafood fraud involves several aspects of the industry, including economically motivated fraud (Everstine et al. 2013); consumer safety (Miller and Mariani 2010); and sustainability of fisheries (Jacquet and Pauly 2008;Triantafyllidis et al. 2010). In order to control and reduce economic fraud, the European Union (EU) has established legislation that regulates seafood labelling, requiring traceability information, such as commercial and scientific names, fishing and production methods, catch area and the fishing gear; for other processed foods, such as canned, composite products and breaded products, this information is voluntary (European Union 2011;European Union 2013). Despite these efforts, widespread seafood mislabelling has been reported in the United States (Khaksar et al. 2015), in Europe (Filonzi et al. 2010;Miller and Mariani 2010;Garcia-Vazquez et al. 2011;Nedunoori et al. 2017), in Asia (Xiong et al. 2016) and in South Africa (Cawthorn et al. 2011), indicating the presence of flaws in legislation on food traceability and the need for stringent control measures to guarantee efficient species identification.
Gadoids are a group of fish with historical importance in the fishing industry, representing approximately 18% of the world's total catch (FAO 2014). Various species of the Gadidae family belong to this group, including Atlantic cod (Gadus morhua), Pacific cod (Gadus macrocephalus), Alaska pollock (Gadus chalcogrammus), Pollock (Pollachius pollachius) and Saithe (Pollachius virens). Most of these species look similar, which makes their morphological identification very difficult or almost impossible. The high demand for cod is one of the reasons why cod products are often mislabelled, and numerous observations have been made on the replacement of Atlantic cod (G. morhua) by cheaper species (Helyar et al. 2014) or its illegal capture being hidden by the use of other species' names (Miller and Mariani 2010).
In the case of G. morhua, several research works and institutional publications have identified adulteration and misclassification prevalently with G. macrocephalus and G. chalcogrammus (Miller et al. 2012;Cutarelli et al. 2014;US FDA 2014;Mendes and Silva 2015).
Fish species identification is traditionally based on morphological features (Strauss and Bond 1990), but this may be difficult or even impossible in case of processed products. Food authenticity can be assessed using a broad variety of methods, such as those based on protein (Mazzeo and Siciliano 2016) or DNA analysis (Griffiths et al. 2014). DNA-based approaches are by far the most popular methods, mainly due to their high specificity but also to the relatively high resistance of target molecules to food processing techniques. Several methodologies have been carried out using either nuclear DNA (nDNA) or mitochondrial DNA (mtDNA). The most prominent ones are the Forensically Informative Nucleotide Sequencing (FINS), Restriction Fragment Length Polymorphism (RFLP); Single-Stranded Conformational Polymorphism (SSCP); Random Amplified Polymorphic DNA (RAPD); Amplified Fragment Length Polymorphism (AFLP); Loop-mediated isothermal amplification (LAMP) assay; High Resolution Melting (HRM) analysis; some of them already available as commercial applications (Asensio Gil 2007;Rasmussen and Morrissey 2008;Clark 2015;Saull et al. 2016;Fernandes et al. 2017Fernandes et al. , 2018Tomás et al. 2017).
Within the sphere of DNA-based approaches, much attention has been devoted to DNA barcoding, which relies on sequence variations within a short and standardised region of the genome, designated as a ''barcode'. DNA barcoding was shown to provide accurate species identification (Hebert et al. 2003). Currently, the mitochondrial genes coding for cytochrome c oxidase subunit I (COI) and cytochrome b (cytb) are considered reliable DNA barcodes for the discrimination of animal species (Hebert et al. 2003;Hellberg et al. 2014;Mueller et al. 2015). For the identification of fish species, mitochondrial loci have been preferred to nuclear genes because of their features: mitochondrial genes belong to a haploid genome, they are present in high copy numbers (particularly in fish tissues) and their mutation rate is greater than that of nuclear genes (Cline 2012;Rehbein 2013). The number of DNA barcodes deposited in databases is growing continuously (http://www.fishbol.org/). Despite the fact that the majority of the studies have used COI or cytb mtDNA barcoding to identify seafood products and investigate broad patterns in fish mislabelling (Rasmussen and Morrissey 2008;Miller and Mariani 2010;Cline 2012;Di Pinto et al. 2013), novel nuclear barcode regions have also been proposed for the identification for example, of flatfish species (Paracchini et al. 2017). The length of these nDNA barcodes is generally shorter than that of the mtDNA barcodes: this facilitates the amplification of the DNA even in the case of highly processed food products and the compatibility with the current next-generation sequencing next-generation sequencing (NGS) technologies, allowing the identification of species also in case of mixture (Paracchini et al. 2017).
Despite the efficacy of the sequencing methodology currently used for the identification of Gadoid species, the efficiency of barcoding can be improved by targeting additional genomic positions. In this report, we propose a set of novel nuclear barcoding targets for Gadoid species identification. The barcodes were identified using an automated screening approach, then tested on official and commercial samples, raw or mildly treated products, like frozen, or salted, as well as pre-cooked complex mixtures and processed samples.

Sample collection
Thirty three samples were analysed in total ( Table 1) and consisted of official and market samples. Official samples: samples for G. morhua (n = 4) available in the biological reference collection of the FishTrace project (https://fishtrace.jrc.ec.europa.eu/) were obtained from the Swedish Museum of Natural History. Biological samples for G. macrocephalus (n = 6) (Eastern Pacific cod from Alaska and Western Pacific cod from South Korea) were provided by the School of Aquatic and Fishery Sciences of the University of Washington (U.S.A.). The exact nomenclature of each sample was confirmed by COI barcoding. Other fish specimens (n = 23) were purchased from local markets and supermarkets. The tested specimens were sold as frozen Nordic cod (G. morhua), dried and salted cod (Gadus spp.), frozen fish sticks (Hake -Merluccius spp.), Whiting (Merlangius merlangus), Pollock (Pollachius pollachius), Saithe (Pollachius virens), frozen Alaska pollock burger and Alaska pollock fillet (G. chalcogrammus) and surimi (processed product prepared from mixed fish species).

DNA extraction
Total DNA was extracted from 2 to 200 mg tissue sample depending on the availability and abundance of the sample following manufacturer's recommendations (DNeasy® Blood & Tissue Kit, Qiagen). DNA concentration was quantified using Qubit dsDNA BR Assay Kit using the Qubit® 3.0 Fluorometer (Invitrogen). The Nanodrop spectrophotometer (Thermofisher) was used to evaluate the purity of the samples and ratio 260/280 nm was recorded and taken into consideration. DNA samples were diluted to a final concentration of 10 ng/μl unless lower amounts were obtained.

Primers design and selection of nuclear barcodes
Primer design of 10 candidate nuclear barcode regions was achieved through the strategy described by Paracchini et al. (2017). Compared to that study, the main difference is that an additional filter was applied in order to select only primer pairs predicted to amplify a single genomic locus. This modification would allow performing the analysis using Sanger sequencing, while relying on the use of NGS in case of mixtures or complex samples.
From this shorter list, the capability of the different barcodes to differentiate the Gadoid species of interest was predicted by calculating in silico the number of inter-species differences, comparing it, when relevant, to intra-species differences. This was determined as described in the following paragraphs. All calculations were done using R version 3.3.1. (http://www.r-project.org/). A) Intra-species differences: Per primer pair, the amplicon sequences were inspected for their origin (species). If multiple sequences were present for a single species, the sequences were aligned (using the multiple sequence alignment available through the package 'msa'). The Levenshtein distance (i.e. the difference between two sequences) was calculated across all pairwise alignments, using the 'StringDist' function from the package 'Biostrings'. A consensus sequence was also generated (using the 'consensus' function of the package 'seqinr' with 'method = majority') to be used as the sequence to be compared for this species in the evaluation of inter-species differences. B) Inter-species differences: A multiple alignment across all species was generated and the Levenshtein distance across all pairwise alignments was calculated.
The results can be graphically represented, as shown in Figure 1, as a distance matrix. The diagonal of this distance matrix was populated by inserting the maximal intra-species Levenshtein distance for this species (as found during step A). The complete set of Levenshtein matrices for all the genomic regions is presented in Annex 1. Based on the results, the genomic regions were ranked in order of suitability for distinguishing cod from other fish species. For each primer pair, 'suitability' was high when the inter-species differences were high compared to the intra-species differences. In a next step, the top performing regions were used to design primers that allow their amplification from all species within the analysis. The detailed primers sequences are reported in Table 2.

COI and nuclear barcodes sequencing
The identification of the species obtained from official samples and purchased specimens was confirmed in house using the traditional COI The species used for this analysis were those for which a genome was publicly available. Each column/row shows the number of differences found between the species numbered 1-33 (between species variability, the larger the more useful the sequence is in distinguishing species). The diagonal shows the differences found between the different sequence versions of that species (within species variability, the larger the more difficult it is to uniquely identify the species). The results are colour coded red to green in the main matrix body (green = high between species variability) and purple to green on the diagonal (green = low within species variability barcoding method, using published primers and protocols (Ward et al. 2005;Ivanova et al. 2007). Individual PCR amplifications for nuclear target barcodes were performed in 50 μl using 2.5 U/ reaction of AmpliTaq Gold DNA Polymerase, 1 x Buffer II, 2.5 mM of MgCl 2 (Applied Biosystems), 200 μM dNTPs and 200 nM of each primer. DNA samples were amplified in a GeneAmp PCR System 9700 (ABI, U.S.A.), with the following cycling parameters, according to the protocol of the AmpliTaq Gold PCR system: initial denaturation at 95°C for 10 min, followed by 45 cycles of denaturation at 95°C for 30 s, annealing at 55°C or 60°C for 30 s, extension at 72°C for 30 s and final extension at 72°C for 7 min. PCR products were analysed on an agarose gel electrophoresis to verify and confirm the expected size. PCR products were purified by Qiaquick PCR purification Kit (Qiagen GmbH, Hiden, Germany) and bi-directionally PCR sequenced with BigDye Terminator v3.1 Cycle Sequencing Kit (Applied Biosystems, Foster City, California) following manufacturer's recommendations. Sequences were dyeterminator removed by CentriSep Spin columns (Princeton Separations) and therefore run on 3730 Genetic Analyser (Applied Biosystems, Foster City, California). Electropherograms were analysed using Sequencing analysis v5.2 software (Applied Biosystems, Foster City, California).
For the analysis of mitochondrial COI barcodes, the identification of species was performed via the Boldsystem portal (http://www.boldsystems.org/).

Meta-barcoding sequencing and analysis
NGS was performed on a GS Junior System (GS Junior System, 454 Life Sciences, Roche Applied Sciences, Basel, Switzerland). Amplicon libraries were prepared using fusion primers for bidirectional sequencing as described in the Amplicon library preparation manual (Roche Applied Sciences, Basel, Switzerland). Multiplex Identifiers (MIDs) were added in order to allow inclusion of more than one sample in the same experiment.
After amplification, amplicons were purified using AMPure XP beads (Roche Applied Sciences, Basel, Switzerland), quantified fluorometrically (Quant-iT PicoGreen dsDNA Assay kit, Life Technologies, Molecular Probes, Eugene, Oregon, U.S.A.), diluted and pooled to a final concentration of 0.5 × 10 6 molecules μL −1 . Libraries were checked for their quality by performing a Quality Control PCR; they were subsequently visualised using Agilent DNA 1000 Chips (Agilent Bioanalyzer, Agilent Technologies, San

FOOD ADDITIVES & CONTAMINANTS: PART A
Diego, U.S.A.). Emulsion PCR containing between 0.6 and 0.75 copies per bead (cpd) was recovered using vacuum and the successive enrichment led to an enrichment rate between 10% and 20%; only 5% of the enriched beads were subsequently loaded on the chip and sequenced. All steps were in accordance with the manufacturers' instructions.
The output (FASTA format) was split using fas-tx_barcode_splitter from the Fastx-toolkit (v. 0.0.14) (Gordon and Hannon 2010) to isolate the reads from the different samples tested in the same run using the MID sequences. The primer sequences were then trimmed using Cutadapt (v. 1.7.1) (Martin 2011). Scripts were used to analyse each read against the reference files using Glsearch from the FASTA package (v. 36.3.7a) to identify matches between the entire length of each read and local regions of the reference barcode sequences. The number of mismatches allowed was set at 1%, i.e. 0 for barcodes less than 100 bp, 1 for those between 100 and 200 bp and 2 for those between 200 and 300 bp. In case of more than one hit, the most recent common ancestor for the identified species was determined using the API of the Open Tree of Life (Watanabe 2013), following the instructions at https://github.com/OpenTreeOfLife/ germinator/wiki/Open-Tree-of-Life-Web-APIs. If no hit matched the minimum criteria the read was assigned to an 'Unassigned' conclusion. The main reason for this would be errors in the sequences from the 454 pyrosequencing process. The occurrences of these remained limited (on average less than 5% of total reads).

Results and discussion
In the current study, we used a strategy similar to the one we recently published (Paracchini et al. 2017) in order to identify a set of novel barcode regions in nuclear genomes of gadoids. The aim of the work was also to analyse the feasibility of nuclear barcode identification of mildly-treated, processed and mixed cod samples.

Barcode selection
Among the candidate barcode primers, a set of 10 pairs were selected to be evaluated experimentally (Table 2), based on their capability to distinguish species (number of inter-species and intra-species differences). This evaluation has been performed by PCR in silico simulation on all available fish genomic sequences and selecting only those ones with a single predicted amplicon. Due to this specification, it was possible to amplify these fragments and analyse them via Sanger sequencing. Details, including their annotation in the genome, are presented in Table 2. The selection of the barcodes and the length range of the amplicon (spanning from 122 to 305 bp) was also based on the capacity of the NGS technology we used.

Sample analyses
A set of different samples was analysed, including official and market samples, raw, frozen, salted and processed (two samples were baked in the oven at 200°C for 30 min). A complete list of samples included in the study is presented in Table 1. DNA was extracted from all specimens, with different absorbance ratios at 260/280 nm (from 1.2 to 2.1) and different concentrations (1.85-118 ng/uL). Moreover, depending on the type of storage of the samples, and in particular in salted specimens, different degrees of degradation were observed in accordance with the findings described by (Dalmasso et al. 2013).
Extracted DNAs have been used as template for PCR and sequencing by Sanger. A summary is reported in Table 3. For G. chalcogrammus three samples were tested, five samples for G. macrocephalus, four samples for G. morhua, one sample for M. merluccius, eight samples for M. paradoxus, two samples for M. merlangus, one sample for P. pollachius and one sample for P. virens. In the case of official samples, only one of the available specimens was analysed, due to the fact that the specimens belonged to the same individual. In the case of M. paradoxus, only few barcode regions were tested, because they were considered to be the most informative for species discrimination. Moreover, due to the fact that the 7226-aab, -aad, -aae, -aaf primers bind on the same genetic region (7226), the sequencing analysis was performed on only one primer pair for all the samples, i.e. 7226-aad.
The surimi sample, being a mixture of different species, was not tested by Sanger sequencing but with a different approach (NGS), as detailed in section 'Mixture analyses'.
For each primer pair for which a sequence of the amplicon was available for all eight fish species (i.e. 2034-aac, 4049-aab and 7226-aad) corresponding to four different genera, the produced sequences from the various samples have been compared to each other with two different objectives: • Ability to easily distinguish between Gadus and other genera. • Ability to detect a genomic region that is specific to Gadus morhua.
We found that these three tested primer pairs could be used to distinguish, by sequencing the amplicon products, the Merluccius genus, as specific differences exist for each of the conserved genomic loci that have been analysed (see Figures 2-4). In addition to that, primer pair 2034-aac produces an amplicon in Gadus morhua that contains an insertion in its middle that is characteristic of such a species (Figure 3). This region (in blue in Figure 3) will be tested for designing a real-time PCR detection method for Gadus morhua.

Mixture analyses
Once the pure samples were analysed, the system was challenged for its detection capability on mixtures containing different species in complex samples. A commercially available sample containing mixtures of processed white fish labelled as 'Smoked salmon surimi' was tested ( Figure 5). The label listed the ingredients as 'fish flesh 43% (of which smoked salmon 8%)'. Amplification of the COI region, coupled with Sanger sequencing, produced a clean sequence that was analysed in the BOLD systems identification portal (http://www. boldsystems.org/) and identified the product as Table 3. Tested primers on different samples from different species. Each primer has been tested on DNA from different samples. In case of more than one sample per species, all of them have been tested and sequenced. In cases where no PCR product was obtained (T-X), the main reason was a low PCR yield. Legend: T = PCR tested; A = amplicon present; S = amplicon sequenced; X = amplicon not present/amplicon not sequenced; -= PCR not tested.  'Gadus chalcogrammus', i.e. Alaska Pollock or Pacific Pollock ( Figure 5, upper part) a species often found in surimi products (Pepe et al. 2007;Ferrito et al. 2016). The surimi sample was tested in NGS with the nuclear barcode primers designed for this study. A total of 72,108 reads were obtained for the tested barcodes, and the results indicated the presence of different components: 21% of the reads were assigned to Gadus genus or to closely related species (6.3% of the reads assigned to G. chalcogrammus, 1.4% to G. morhua), 19% were assigned to Euteleosteomorpha (a cohort comprehending the genus Gadus), 16% of the reads were assigned to the salmon (Salmo salar) component of the fish flesh which was mentioned on the label, and the rest of the reads (32%) were unassigned ( Figure 5, lower part).

Conclusions
The method suggested in this study is potentially able to detect and identify correctly the species even in difficult matrices like mildly treated or processed seafood specimens thanks to the high resilience of DNA as a marker. Moreover, the use of nuclear barcode targets coupled with NGS could be applied to mixtures, allowing the identification of different species even in complex samples. NGS is an extremely sensitive technique, so its performance criteria in application to regulatory purposes (specificity, sensitivity and reproducibility) require severe quality controls and cut-off values. NGS, in particular when coupled with PCR, is a powerful approach to detect different sources of DNA even in trace quantities, but it cannot be The alignment of sequences amplified by using primer pair 2034-aac is shown. Each sequence represents one of the following species: Gadus chalcogrammus (Gcha), G. macrocephalus (Gmac), G. morhua (Gmor), Merluccius merluccius (Mmer), M. paradoxus (Mpar), M. Merlangus (Mmel), P. virens (Pvir) and P. pollachius (Ppol). As highlighted by colours, there are seven different base positions on this locus that, taken all together into account, can be used to distinguish the Merluccius genus (in red) from the others (in green). In addition to that, Gadus morhua has a peculiar insertion of 12 bases (highlighted in blue). Bases in lowercase indicate those bases that, among the different samples tested in one species, are not always present. The alignment of sequences amplified by using primer pair 7226-aad is shown. Each sequence represents one of the following species: Gadus chalcogrammus (Gcha), G. macrocephalus (Gmac), G. morhua (Gmor), Merluccius merluccius (Mmer), M. paradoxus (Mpar), M. Merlangus (Mmel), P. virens (Pvir) and P. pollachius (Ppol). Primer pair 7226-aad produces genera-specific amplicons in terms of length. Moreover, as highlighted by colours, there are different single base positions on this locus that, taken all together into account, can be used to distinguish Merluccius genus (in red) from the others (in green); in particular, Gadus morhua has two peculiar variations (highlighted in blue).
considered as a quantitative analysis. This is because of the influence of the PCR step preceding the NGS, which may cause incorrect estimation of the relative species composition and misleading results. This could be an issue when trying to distinguish a fraud from an event of accidental cross contamination during production. Nevertheless, because of the fact that cross contamination is accidental, an appropriate sampling strategy, in the number of samples to be analysed, and replicates to be included in the experiment, should provide a more appropriate way to address this point; statistical estimation of the adequate sampling strategy would be crucial in this case.
Quantification was not the objective of the current article. In this case, whole genome sequencing (WGS) could provide the appropriate answers.
Instead, for qualitative questions where detection (presence or absence) of the analyte is crucial, the methodology described here seems promising, even when tested on mixtures. Moreover, the fact that amplicons are short enough allows for the applicability of the method to processed or mildly treated samples, where DNA is usually degraded and standard mtDNA barcoding methods are inappropriate.
An important issue to be taken into account is the need for harmonisation both in terms of analytical strategy and data analysis, i.e. benchmark in bioinformatics pipelines. To be able to properly analyse the results, the need of publicly available, curated databases are extremely important. The method proposed could complement existing fish identification strategies in establishing an efficient framework to detect and prevent frauds along the Figure 5. Analysis of complex, processed fish samplesA 'Smoked salmon surimi' product from the supermarket was analysed. Upper part: the traditional COI barcode, coupled to Sanger sequencing, produced a single sequence that was classified as 'Gadus chalcogrammus' by the BOLD systems identification portal (Ratnasingham and Hebert 2007). Lower part: the NGS analysis, in contrast, also detected the minority salmon fraction of the fish flesh. The picture represents a pool of the results obtained for all the nuclear barcode primers analysed (2034-aac, 4049-aab, 3726-aad, 3726-aab, 7226-aad, 7226-aaf, 10,029-aab, 10,029-aac). food chain, as well as in managing fisheries and conservation strategies.