Genetic diversity and phylogenetic analysis of Chinese Han and Li ethnic populations from Hainan Island by 30 autosomal insertion/deletion polymorphisms

Abstract With the characteristics of low mutation rate, length variation and short amplicon size, insertion/deletion polymorphisms (InDels) have the advantages of both short tandem repeats (STRs) and single nucleotide polymorphisms (SNPs). Herein, people of two ethnicities from Hainan Island were genotyped for the first time using the Investigator DIPplex kit. We investigated the forensic parameters of the 30 InDels and the phylogenetic relationships among different populations. The accumulated powers of discrimination and powers of exclusion were 0.999 999 999 9646 and 0.9897 in the Hainan Han population and 0.999 999 999 9292 and 0.9861 in the Hainan Li population, respectively. Additionally, population comparisons among geographically, ethnically and linguistically diverse populations via cluster heatmap, multidimensional scaling, principal component analysis, phylogenetic tree and STRUCTURE analyses demonstrated that the Hainan Han population had genetic similarities to the other Han, She and Tujia populations, while the Hainan Li population had close genetic relationships to the Zhuang and Miao groups; both populations had a high degree of genetic differentiation from most Turkic-speaking populations. These results suggest that the 30 autosomal InDels are highly polymorphic and informative and are suitable for human identification and population genetics.


Introduction
Insertion/deletion polymorphisms (InDels), also known as DIPs, are abundant in the human genome and are usually biallelic, and they have recently attracted the interest of forensic researchers and population geneticists [1][2][3]. With a low mutation rate, small amplicon size, length variation and the absence of stutter peaks, InDels combine the advantages of both short tandem repeat (STR) and single nucleotide polymorphism (SNP) markers and are gradually becoming a promising tool in forensic applications (especially for degraded DNA and mixed stain identification), population substructure analysis and biogeographic ancestry inference [2][4][5][6]. The Investigator DIPplex kit (Qiagen, Valencia, CA, USA), which contains 30 biallelic autosomal InDels and amelogenin, has been validated and studied in some populations to evaluate its efficacy in forensic applications [7][8][9][10][11][12]. However, the genetic polymorphisms and forensic parameters of the 30 InDels in the Hainan Han and Hainan Li populations remain unknown.
Hainan Island, the second largest island in China, is located at the southernmost tip of China, facing the Mainland to the north across the Qiongzhou Strait (https://en.wikipedia.org/wiki/Hainan). Based on the 2010 National Population Census (http://www.stats.gov.cn/tjsj/pcsj/rkpc/6rp/indexch.htm), Han (84%) and Li (15%) are the dominant ethnicities of Hainan Island and account for 99% of the island's population. At the time of the Song dynasty (A.D. 960-1279), large numbers of Han people from the Mainland arrived and settled on Hainan Island. However, according to the literature and archeological studies, scholars believe that the Li people are the original inhabitants of Hainan Island and represent descendants of the ancient Yue tribe, who settled on the island between 7 000 and 27 000 years ago [13,14]. They have their own language, Hlai, which is a subbranch of the Tai-Kadai language family [15]. The special geographic location, population characteristics and history of Hainan Island provide precious resources for genetic analyses.
In the present study, we genotyped the Han and Li ethnic groups from Hainan Island for the first time using the Investigator DIPplex kit and evaluated the forensic efficiency of this panel in these two populations. Then, we performed population comparisons and genetic structure analysis between the two investigated groups and other previously studied populations based on the 30 autosomal InDels. The details of the relevant populations are displayed in Supplementary Table S1.

Sample collection and DNA isolation
A total of 445 blood samples were collected from unrelated individuals (238 Han and 207 Li people) living on Hainan Island with written informed consent. The collection of blood samples was approved by the Ethics Committee of Sichuan University. All participants declared that their ancestors had lived on this island for at least three generations. Genomic DNA was isolated utilizing the PureLink Genomic DNA Mini Kit (Thermo Fisher Scientific, Carlsbad, CA, USA) and quantified using the Nanodrop-2000C (Thermo Fisher Scientific) following the manufacturer's recommendations. Then, DNA samples were normalized to 1.0 ng/µL and stored at −20 °C until amplification.

PCR amplification and InDel genotyping
The 30 InDels and amelogenin included in the Investigator DIPplex kit were coamplified on a ProFlex 96-well PCR System (Thermo Fisher Scientific) according to the manufacturer's protocol. Subsequently, PCR products were separated and detected using the Applied Biosystems 3130 Genetic Analyzer (Thermo Fisher Scientific). Allele allocation was carried out using GeneMapper ID-X v1.5 software. Control DNA 9948 (Qiagen) and ddH2O were used as positive and negative controls for each batch of genotyping. Our laboratory has been accredited by the China National Accreditation Service (CNAS) for Conformity Assessment and ISO 17025. The experimental methods and procedures of this study were conducted according to the approved guidelines of the Institute of Forensic Medicine, Sichuan University.

Statistical analysis
Allele frequencies and forensic parameters, including observed heterozygosity (Ho), expected heterozygosity (He), matching probability (PM), power of discrimination (PD), probability of exclusion (PE) and typical paternity index (TPI), as well as the Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD) tests, were assessed using the online tool STRAF [16]. Subsequently, the population relationships between the two studied groups and other previously studied populations were investigated. The heatmaps of allele frequencies and Nei's standard genetic distances (Rst) were produced with the online tool Morpheus (https://software.broadinstitute.org/morpheus/). Rst was calculated based on allele frequencies using the PHYLIP 3.695 package. The Rst matrix was then used to generate a multidimensional scaling (MDS) plot via the SPSS software (IBM SPSS, version 19.0; IBM Corp., Armonk, NY, USA) and to construct the neighbour-joining (NJ) tree via the MEGA v7.0 software [17]. Allele frequency-based principal component analysis (PCA) was carried out using MVSP v3.22 software [18]. Population structure analysis was performed using STRUCTURE v.2.3.4 software with K values ranging from 2 to 7 [19]. Structure Harvester was used to infer the optimal K value, and CLUMPP v1.1.2 (Rosenberg Lab, Dallas, TX, USA) and Distruct v.1.1 (Rosenberg Lab) were employed to visualize the population genetic structures.
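The per-locus forensic parameters listed above can be illustrated with a minimal sketch. It assumes the widely used PowerStats-style formulas (PM as the sum of squared genotype frequencies, PD = 1 − PM, PE = h²(1 − 2hH²) and TPI = 1/(2H), with h the observed heterozygosity and H = 1 − h); whether STRAF applies exactly these forms, for example with or without a sample-size correction for expected heterozygosity, is an assumption here. The function name `locus_parameters` and the "+"/"-" allele labels are hypothetical.

```python
from collections import Counter


def locus_parameters(genotypes):
    """Forensic parameters for one locus.

    genotypes: list of 2-tuples of allele labels, one tuple per individual,
    e.g. ("+", "-") for an insertion/deletion heterozygote.
    """
    n = len(genotypes)
    # Treat ("+", "-") and ("-", "+") as the same unordered genotype.
    geno_counts = Counter(tuple(sorted(g)) for g in genotypes)
    allele_counts = Counter(a for g in genotypes for a in g)
    allele_freqs = {a: c / (2 * n) for a, c in allele_counts.items()}

    # Observed heterozygosity: fraction of individuals with two different alleles.
    h_obs = sum(c for g, c in geno_counts.items() if g[0] != g[1]) / n
    # Expected heterozygosity under HWE (no small-sample correction applied).
    h_exp = 1 - sum(p ** 2 for p in allele_freqs.values())
    # Matching probability: chance that two random individuals share a genotype.
    pm = sum((c / n) ** 2 for c in geno_counts.values())
    hom = 1 - h_obs
    return {
        "allele_freqs": allele_freqs,
        "Ho": h_obs,
        "He": h_exp,
        "PM": pm,
        "PD": 1 - pm,
        "PE": h_obs ** 2 * (1 - 2 * h_obs * hom ** 2),
        "TPI": 1 / (2 * hom) if hom > 0 else float("inf"),
    }


# Illustrative (made-up) sample of 100 individuals for one InDel.
sample = [("-", "-")] * 25 + [("-", "+")] * 50 + [("+", "+")] * 25
params = locus_parameters(sample)
```

Because a biallelic marker has at most three genotypes, PM cannot fall below 1/3 and the per-locus PD cannot exceed 2/3, which is why a panel of many InDels is needed to reach a high combined discrimination power.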

Results and discussion
Allele frequency and forensic parameter analysis
Supplementary Tables S4-S5). The observed significance may represent substructure association, which may be caused by the sample size or genetic admixture [20]. The allele frequencies and forensic parameters of the 30 InDels are shown in Supplementary Table S3. These results suggested that this panel was effective for personal identification but not sufficiently powerful to perform a paternity test in the two groups.
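The contrast between identification and paternity testing follows from how per-locus values accumulate across the panel. A minimal sketch, assuming the loci are independent so that per-locus probabilities multiply: the combined power of discrimination is 1 − ∏ PM_i and the combined power of exclusion is 1 − ∏ (1 − PE_i). The function names and the per-locus values below are hypothetical and are not the study's data.

```python
import math


def combined_match_probability(pm_values):
    """Product of per-locus matching probabilities (independent loci assumed)."""
    return math.prod(pm_values)


def combined_power(per_locus_values):
    """Combine per-locus powers x_i as 1 - prod(1 - x_i).

    Applies to PD (where 1 - PD_i = PM_i) and to PE alike.
    """
    return 1 - math.prod(1 - x for x in per_locus_values)


# Hypothetical panel: 30 loci with identical, made-up per-locus values.
pd_values = [0.60] * 30
pe_values = [0.14] * 30

cpd = combined_power(pd_values)   # combined power of discrimination
cpe = combined_power(pe_values)   # combined power of exclusion
cpm = combined_match_probability([1 - x for x in pd_values])
```

With biallelic markers the per-locus PE is small, so the product of (1 − PE_i) shrinks slowly; the combined power of exclusion therefore lags well behind the combined power of discrimination for the same number of loci.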
As shown in Supplementary Table S6 and Supplementary Figure S2 Figure S3(B)) showed that the population distribution pattern was similar to the MDS, and the top two components could explain 74.01% of the total variance (PC1: 55.85% and PC2: 18.16%). Our studied populations and the East Asian populations were separated from other continental populations and had the greatest genetic differences from the three African populations (Xhosa, Zulu and Nigeria groups). Furthermore, the six Xinjiang minorities were situated between the other East Asian and European populations, in accordance with our previous study [47]. As shown in the NJ cladogram (Supplementary Figure S3

Genetic differentiation along ethnic and linguistic divisions
For the purpose of obtaining a clearer insight into Chinese population relationships, we performed various analyses among our studied populations and 28 Chinese populations [8][9][10][11][12]21,22,[24][25][26][27][28][29][30][31][32][33]. Supplementary Figure S4 presents the cluster heatmap of deletion allele frequencies among the 30 populations, and three primary clusters (I, II and III) were easily distinguished. Cluster I was divided into two subclusters (IA and IB); the IA subcluster included five Tibetan groups, and the IB subcluster As shown in Supplementary Table S6 and Supplementary Figure S5, the Hainan Han population had relatively small Rst with the other Han, Hui and Tujia populations, while the Hainan Li population had genetic affinities with the Zhuang, Dong and Gaoshan populations. Both had distant genetic relationships with the six Xinjiang groups. In Figure 2, different coloured fonts represent different language families. The MDS plot (Figure 2A) showed that all of the Turkic-speaking populations were distributed in the second quadrant, and all of the Tibeto-Burman-speaking populations (except the Bai group) were scattered below the x-axis. The Hainan Han population clustered with the Tujia, Bai, She (Hmong-Mien-speaking population) and Sinitic-speaking populations, while the Hainan Li population had close genetic relationships with the Zhuang (Tai-Kadai-speaking population) and Miao (Hmong-Mien-speaking population) groups. In our PCA plot (Figure 2B), the first two components accounted for 61.98% of the total variance. Most Turkic-speaking populations were distributed in the fourth quadrant, and most Tibeto-Burman-speaking populations were situated above the x-axis. The Hainan Han population had genetic similarities to the She, Tujia, Bai, Yunnan Yi and Sinitic-speaking populations, while the Hainan Li population had genetic affinities with two Tai-Kadai-speaking populations (Zhuang and Dong groups) and the Miao group.
In the NJ tree (Figure 2C), three clades were observed: a Turkic-speaking and Mongolic-speaking clade, a Tibeto-Burman-speaking clade (excluding the Yi and Tujia groups), and a clade containing our studied populations and other linguistic groups. The Hainan Han population clustered with the She people and then with three additional Han populations. The Hainan Li population first grouped with the Yunnan Yi, Miao and Dong groups and then with the Zhuang people. In short, we verified the genetic similarities among the same language families through multiple analyses and observed close genetic relationships between the Tai-Kadai-speaking and Hmong-Mien-speaking populations.
Our studied populations shared a resemblance to the East Asian groups, apart from the three Turkic-speaking populations. At K > 3, no further substructure was distinguished with the exception of the Xinjiang Hui population, which may be caused by the complex history of that group.
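The MDS plot and NJ tree both start from a pairwise distance matrix computed from allele frequencies. As a minimal sketch of Nei's standard genetic distance named in the Statistical analysis section, D = −ln(Jxy / √(Jx·Jy)), where Jx, Jy and Jxy are the locus-averaged sums of squared and cross-multiplied allele frequencies. It takes deletion-allele frequencies for biallelic loci as input; the function names and the frequencies are hypothetical, and the PHYLIP implementation may differ in detail (for example in sample-size corrections).

```python
import math


def nei_standard_distance(del_freqs_x, del_freqs_y):
    """Nei's standard genetic distance between two populations.

    del_freqs_x, del_freqs_y: deletion-allele frequencies per locus, in the
    same locus order; the insertion allele frequency is taken as 1 - d.
    """
    n_loci = len(del_freqs_x)
    jx = jy = jxy = 0.0
    for dx, dy in zip(del_freqs_x, del_freqs_y):
        px, py = (dx, 1 - dx), (dy, 1 - dy)
        jx += sum(p * p for p in px)
        jy += sum(q * q for q in py)
        jxy += sum(p * q for p, q in zip(px, py))
    # Normalised genetic identity, averaged over loci.
    identity = (jxy / n_loci) / math.sqrt((jx / n_loci) * (jy / n_loci))
    return -math.log(identity)


def distance_matrix(populations):
    """Symmetric matrix of pairwise distances; populations maps name -> freqs."""
    names = list(populations)
    return names, [
        [nei_standard_distance(populations[a], populations[b]) for b in names]
        for a in names
    ]


# Made-up deletion-allele frequencies at three loci for three populations.
pops = {
    "PopA": [0.40, 0.55, 0.30],
    "PopB": [0.42, 0.50, 0.33],
    "PopC": [0.70, 0.20, 0.65],
}
names, matrix = distance_matrix(pops)
```

A matrix of this form is what an MDS routine or a neighbour-joining implementation would take as input; populations with similar allele frequencies (PopA and PopB in this toy example) end up with the smallest distances.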

Conclusion
We report, for the first time, genotype data for the 30 InDels included in the Investigator DIPplex kit in the Hainan Han and Hainan Li groups and evaluate the forensic applicability of this panel in the two populations. Furthermore, population comparisons along geographical, ethnic and linguistic divisions through multiple analyses showed that the 30-InDel panel had some ability to differentiate intercontinental populations (especially East Asian from African populations) and could also distinguish populations of different language families to a certain extent (especially most Turkic-speaking populations). For our investigated populations, the Hainan Han population had close relationships to the other Han, She and Tujia populations, and the Hainan Li population had genetic affinities with the Zhuang and Miao minorities.

Compliance with ethical standards
This study was approved by the Ethics Committee of the Sichuan University (Approval Number: K2015008). Blood samples were collected from participants with written informed consent.

Disclosure statement
No conflict of interest was declared by the authors.