Genetic diversity of 23 STR loci of the Guizhou Tujia ethnic minority and the phylogenetic relationships with 22 other populations

Abstract
Background: Short tandem repeats (STRs) are highly polymorphic DNA markers used in forensic personal identification and human population genetic research. The Guizhou Tujia are one of the ancient minority groups in southwest China; however, the population has not been studied using the highly discriminating 23-STR Huaxia Platinum Kit.
Aim: To obtain genetic data from 23 autosomal STRs in the Guizhou Tujia and examine the population's relationships with others.
Subjects and methods: A total of 480 individuals from the Guizhou Tujia population were analysed using the 23 STR loci of the Huaxia Platinum Kit. Allele frequencies and forensic parameters were estimated. Population genetic relationships were assessed using Nei's genetic distances and visualised using a variety of biostatistical methods.
Results: A total of 264 alleles were found, with allelic frequencies ranging from 0.0010 to 0.5104. The combined power of discrimination (CPD) and the combined power of exclusion (CPE) of the 23 STR loci were 0.9999999999999999999999999996 and 0.999999999710422, respectively. The Guizhou Tujia showed closer genetic relationships with the Hubei Tujia, Guizhou Gelao, and Guizhou Miao than with other populations.
Conclusion: We obtained, for the first time, population genetic data for the Guizhou Tujia using the 23 STR system and demonstrated its value in forensic applications. Comprehensive population comparisons showed an evident genetic affinity pattern between populations that are geographically, ethnically and linguistically related.


Background
Short tandem repeats (STRs) are repetitive nucleotide sequences with repeat units of 2 to 6 base pairs that are widely distributed throughout the human genome (Edwards et al. 1991; Ellegren 2004; Jobling and Gill 2004; Chen et al. 2018). Owing to their high polymorphism, STR markers have become the gold standard in forensic DNA analysis, especially in human identification and paternity testing, surpassing other DNA markers such as single nucleotide polymorphisms (SNPs) (Zhang 2015; Yao and Wang 2016; He et al. 2018; Wu et al. 2020). In recent years, new commercial multiplex STR detection systems containing more autosomal STRs have been developed to further improve discriminating ability as forensic DNA databases continue to grow. Additionally, STRs are frequently used in human population genetics to infer the genetic structure of geographically and ethno-linguistically diverse subpopulations via genetic affinity analysis (Gao et al. 2021; Kumar et al. 2021; Tran et al. 2021; Wang et al. 2021; Chandra et al. 2022).
Guizhou province, located in southwest China, is demographically one of the most ethnically diverse provinces in the country, home to 56 ethnic groups, 17 of which are ancient indigenous ethnic minorities. Among these groups are the Tujia, an ancient Chinese ethnic group residing mainly on the border of Guizhou, Hunan, Hubei, and Chongqing. The Guizhou Tujia population accounts for approximately 4.1% of all Tujia people in China and is the fifth largest ethnic group in Guizhou province (https://en.wikipedia.org/wiki/Guizhou). The ancestry of the Tujia can be traced to the ancient Ba people who lived in southwest China during the Xia and Shang dynasties. Furthermore, their unique language is categorised into the Tibeto-Burman branch of the Sino-Tibetan language family.
Currently, many geographically, ethnically, and linguistically distinct populations in China are being studied using the Huaxia Platinum Kit, which was developed as a highly discriminating system to characterise Chinese population genetics (Wang et al. 2015; Ludeman et al. 2018). However, there is limited population data available for the Guizhou Tujia, hampering their use in forensic DNA cases and population genetic studies. To address this issue, we genotyped 23 autosomal STRs in 480 Guizhou Tujia individuals using the Huaxia Platinum Kit to provide a more precise population genetic dataset for forensic practice and human population genetics.

Sample
Blood samples were collected on FTA cards from 480 unrelated healthy Tujia individuals, comprising 254 males and 226 females, in the Tongren district of northeast Guizhou (Figure S1). All volunteers were self-identified indigenous, had no history of immigration or interracial marriage within three generations, and provided signed informed consent before sample collection. The project was approved by the Biomedical Research Ethics Committee of the Affiliated Hospital of Zunyi Medical University (ID: KLLY-2021-110).

Data collection
Genomic DNA was extracted from the blood samples according to the Chelex-100 protocol (Walsh et al. 1991). Multiplex PCR was performed using the Huaxia Platinum Kit (Thermo Fisher Scientific) following the manufacturer's guidelines on the GeneAmp® 9700 PCR System (Thermo Fisher Scientific). PCR products were detected by capillary electrophoresis on an Applied Biosystems® 3500 Genetic Analyser (Thermo Fisher Scientific). Alleles were assigned by comparing sample PCR fragments with the allelic ladders provided with the kit using the GeneMapper ID-X V1.4 software (Thermo Fisher Scientific). A negative control (H2O) and a positive control (007 DNA) were included in each batch to ensure reliable analytical results and the absence of reagent contamination.

Data management and statistical analysis
Allelic frequencies and forensic descriptive parameters, including the observed heterozygosity (Ho), polymorphism information content (PIC), discrimination power (PD), probability of exclusion (PE), match probability (PM) and typical paternity index (TPI), were calculated using the online software STR Analysis for Forensics (STRAF) (Gouy and Zieger 2017). Hardy-Weinberg equilibrium (HWE) and linkage disequilibrium (LD) tests were performed with the Arlequin V3.5 software (Excoffier and Lischer 2010). The overall population genetic relationships between the Guizhou Tujia and 22 reference populations (17 Chinese groups and 5 non-Chinese groups), selected based on the availability of allele frequency data for the 23 overlapping STRs, were examined using biostatistical methods. Pairwise Nei's genetic distances based on allele frequency distributions were computed using the gendist program of the PHYLIP (Phylogeny Inference Package) software V3.695 (Ropelewski et al. 2010). Based on Nei's genetic distances, the phylogenetic relationships among these 23 populations were visualised as heat maps using the TBtools V1.0 software (Chen et al. 2020). Additionally, principal component analysis (PCA) and multidimensional scaling (MDS) were performed using SPSS V26.0 (Hansen 2005), and the neighbor-joining tree (N-J tree) was constructed using the MEGA V7.0 software (Kumar et al. 2016).
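The per-locus parameters listed above can be sketched in a few lines of Python. This is a minimal illustration that assumes the commonly used definitions (PM as the sum of squared observed genotype frequencies, PD = 1 − PM, PE = h²(1 − 2hH²) and TPI = 1/(2H), where h and H are the observed heterozygote and homozygote proportions); it is not the STRAF source code, and the function names and toy genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations


def exclusion_power(ho):
    """PE = h^2 * (1 - 2*h*H^2), with h = heterozygosity and H = 1 - h."""
    hom = 1 - ho
    return ho ** 2 * (1 - 2 * ho * hom ** 2)


def typical_paternity_index(ho):
    """TPI = 1 / (2H); infinite if no homozygotes were observed."""
    hom = 1 - ho
    return float("inf") if hom == 0 else 1 / (2 * hom)


def locus_parameters(genotypes):
    """Forensic parameters for one locus.

    genotypes: list of (allele_a, allele_b) tuples, one per individual.
    """
    n = len(genotypes)

    # Allele frequencies by gene counting (2n alleles in n diploid individuals).
    allele_counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [count / (2 * n) for count in allele_counts.values()]

    # Observed genotype frequencies, ignoring allele order within a genotype.
    genotype_counts = Counter(tuple(sorted(pair)) for pair in genotypes)
    pm = sum((count / n) ** 2 for count in genotype_counts.values())

    ho = sum(1 for a, b in genotypes if a != b) / n
    pic = (1 - sum(p ** 2 for p in freqs)
           - sum(2 * p ** 2 * q ** 2 for p, q in combinations(freqs, 2)))

    return {"Ho": ho, "PIC": pic, "PM": pm, "PD": 1 - pm,
            "PE": exclusion_power(ho), "TPI": typical_paternity_index(ho)}


# Hypothetical toy data: four individuals typed at one locus.
toy_genotypes = [(10, 11), (10, 10), (11, 12), (10, 12)]
toy_result = locus_parameters(toy_genotypes)
```

As a consistency check under these assumed formulas, an observed heterozygosity of 0.6250 gives PE ≈ 0.3220 and TPI ≈ 1.3333, and a heterozygosity of 445/480 (a count inferred here from the rounded value 0.9271) gives PE ≈ 0.8510 and TPI ≈ 6.8571.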

Results and discussions
The allele frequency distributions and related forensic parameters for the 23 STR loci are summarised in Table S1. In total, 264 alleles were observed, with allelic frequencies ranging from 0.0010 to 0.5104. After Bonferroni correction, no locus deviated from HWE (p > 0.05/23 ≈ 0.002174) (Table S1), and no significant LD was found in any locus pair (p > 0.05/253 ≈ 0.000198) (Table S2). Ho and PIC varied from 0.6250 to 0.9271 and from 0.5666 to 0.9071, respectively. Penta E had the highest genetic polymorphism, while TPOX had the lowest. PD, PE, TPI, and PM ranged from 0.8043 to 0.9837, 0.3220 to 0.8510, 1.3333 to 6.8571, and 0.0163 to 0.1957, respectively. The combined power of discrimination (CPD) and the combined power of exclusion (CPE) of the 23 STR system were 0.9999999999999999999999999996 and 0.999999999710422, respectively. As these data show, the 23 STRs in the Huaxia Platinum Kit exhibited high genetic polymorphism and discriminating power in the investigated Guizhou Tujia minority, demonstrating potential use in forensic cases such as human identity testing and paternity testing. The resulting population genetic data of the Tujia minority can enrich the forensic database of Chinese ethnicities and be used in forensic and anthropological population genetic research.
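The Bonferroni thresholds and combined parameters reported above follow from simple arithmetic, sketched below. The sketch assumes the usual definitions CPD = 1 − Π(1 − PD_i) and CPE = 1 − Π(1 − PE_i) over independent loci; the function names are hypothetical, and the per-locus inputs in the example are illustrative only, since the actual per-locus values are in Table S1. Decimal arithmetic is used because a value that differs from 1 by roughly 4 × 10⁻²⁸ cannot be represented in double-precision floating point.

```python
from decimal import Decimal, getcontext
from math import comb

getcontext().prec = 50  # enough digits to resolve values extremely close to 1


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


def combined_power(per_locus_values):
    """Return 1 - prod(1 - x_i), assuming the loci are independent."""
    remainder = Decimal(1)
    for value in per_locus_values:
        remainder *= Decimal(1) - Decimal(str(value))
    return Decimal(1) - remainder


n_loci = 23
n_locus_pairs = comb(n_loci, 2)                            # 253 pairwise LD tests
hwe_threshold = bonferroni_threshold(0.05, n_loci)         # about 0.002174
ld_threshold = bonferroni_threshold(0.05, n_locus_pairs)   # about 0.000198

# Illustrative only: three loci using the reported minimum and maximum PD
# plus one made-up intermediate value.
example_cpd = combined_power(["0.8043", "0.9837", "0.95"])

# Why Decimal: in double precision, 1 - 4e-28 is indistinguishable from 1.
float_loses_precision = (1.0 - 4e-28) == 1.0
```

The count of 253 locus pairs is simply the number of unordered pairs among 23 loci, 23 × 22 / 2.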
To obtain a comprehensive and clear understanding of the genetic differences and correlations between the Guizhou Tujia and other populations, we first analysed their genetic relationships based on the allele frequency distributions of the 23 overlapping STRs. A total of 22 reference populations from three language families (Sino-Tibetan, Altaic and Indo-European) were selected for comparison through various analyses. The 14 populations of the Sino-Tibetan language family were divided into four language branches: 4 Tibeto-Burman (Hubei Tujia [...] Chen, Wu, et al. 2019a, 2019b; Liu et al. 2019a, 2019b; Zhang et al. 2019; Li et al. 2020; Liu et al. 2020; Zhang et al. 2020). Meanwhile, the 3 Altaic language family populations in this study all belonged to the Turkic language branch (Xinjiang Kyrgyz [XJK], Xinjiang Kazakh [XJKa], and Xinjiang Uyghur [XJU]) (Jin et al. 2017; Chen, Zou, et al. 2019a, 2019b; Liu et al. 2019a, 2019b). The 5 reference populations of the Indo-European language family consisted of US African American [UAA], US Hispanic [UH], White British [WB], Northeast Colombian [NC] and Indian [ID] (Hill et al. 2013; Castilloa et al. 2019; Krzeminska-Ahmadzai et al. 2021; Shrivastava et al. 2022).
Nei's genetic distances among the 23 investigated populations are presented in Table S3 and illustrated as a heatmap in Figure 1. The smallest genetic distance was found between the Guizhou Tujia and Hubei Tujia populations (0.0283). In contrast, the Guizhou Tujia showed the largest genetic distance from the Turkic-speaking Xinjiang Kyrgyz population (0.0765) among the 17 Chinese reference populations, and from the African American population (0.1669) among the 5 non-Chinese populations.
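For readers who want to see how such a pairwise distance is obtained from allele frequencies, the following is a minimal sketch of Nei's (1972) standard genetic distance, D = −ln(J_xy / √(J_x J_y)), where the J terms are sums over all loci and alleles of the frequency products. The text does not state which distance option was used in the PHYLIP run, so the choice of the 1972 standard distance is an assumption; the function names and toy frequencies are hypothetical and are not data from this study.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's (1972) standard genetic distance between two populations.

    pop_x, pop_y: dict mapping locus name -> dict of allele -> frequency.
    Both populations are assumed to be typed at the same loci.
    """
    jx = jy = jxy = 0.0
    for locus in pop_x:
        fx, fy = pop_x[locus], pop_y[locus]
        # Alleles missing from one population have frequency 0 there.
        for allele in set(fx) | set(fy):
            x, y = fx.get(allele, 0.0), fy.get(allele, 0.0)
            jx += x * x
            jy += y * y
            jxy += x * y
    if jxy == 0.0:
        return math.inf  # no shared alleles at any locus
    return -math.log(jxy / math.sqrt(jx * jy))


def distance_matrix(populations):
    """All pairwise distances, keyed by (name_a, name_b)."""
    names = sorted(populations)
    return {(a, b): nei_standard_distance(populations[a], populations[b])
            for a in names for b in names}


# Hypothetical toy frequencies for two populations at a single locus.
toy_populations = {
    "A": {"L1": {10: 0.5, 11: 0.5}},
    "B": {"L1": {10: 1.0}},
}
toy_matrix = distance_matrix(toy_populations)
```

A matrix of this form is the kind of input that a heatmap, a neighbor-joining tree or an MDS plot would then be built from.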
The phylogenetic tree presented in Figure 2 shows that the 23 populations analysed could be divided into two main branches: one containing the five non-Chinese groups and the other including the 18 Chinese groups. Within the branch of 18 Chinese groups, several distinct clusters were identified. The Guizhou Tujia population was first grouped with the Hubei Tujia from the adjacent province and then merged with the Tai-Kadai-speaking Guizhou Gelao. All Sinitic-speaking Han groups from different provinces were grouped into one independent cluster. Furthermore, the Guizhou Han and the Hmong-Mien-speaking Guizhou Miao from the same region also clustered closely. Populations that are geographically close and belong to the same language branch, such as the Tibeto-Burman-speaking Sichuan Yi and Sichuan Tibetan populations, as well as the Turkic-speaking Xinjiang Uyghurs, Xinjiang Kyrgyz and Xinjiang Kazakhs, clustered together. The phylogenetic tree demonstrated typical genetic affinity patterns based on geographical, ethnic, and linguistic origins.
As shown by the PCA (Figure 3), the first two components explained 90.87% of the total genetic variance (PC1: 71.64%; PC2: 19.23%) and separated the 23 populations into three main clusters: the five Indo-European-speaking populations; the Turkic-speaking populations; and the remaining 15 Sino-Tibetan populations, which belong to the Sinitic, Tibeto-Burman, Tai-Kadai, and Hmong-Mien language branches. Within the third cluster, the Guizhou Tujia and Guizhou Gelao were very close to each other, and the Sinitic Han populations were tightly grouped.
To further illustrate the genetic similarities and differences of the examined populations, a two-dimensional MDS plot was constructed (Figure 4). We observed that the three Guizhou minorities (Guizhou Tujia, Guizhou Miao, and Guizhou Gelao) and the Hubei Tujia were gathered in the first quadrant. Except for the Sichuan Han, all Han populations from diverse administrative divisions were tightly aggregated in the upper left of the fourth quadrant. The 4 populations in Sichuan province (Sichuan Han, Sichuan Yi, Sichuan Chengdu Tibetan, and Sichuan Liangshan Tibetan) were grouped in the middle part of the fourth quadrant. Further, the Indo-European-speaking populations were scattered on the left side of the MDS plot, while the Turkic-speaking populations were gathered in the centre.
Comprehensive population genetic analysis using multiple analytical methods revealed an overall consistent genetic affinity pattern. The Guizhou Tujia demonstrated similarity with the Guizhou Miao, Guizhou Gelao and Hubei Tujia populations, which can be attributed to extensive gene exchange due to long-term historical migration and intermarriage between the four populations residing in adjacent geographic areas. Notably, the Guizhou Tujia and Hubei Tujia populations share not only geographic proximity but also a common ancestral ethnic and linguistic background, indicating that they share a higher degree of ancestral components with each other than with other populations. The correlations observed for the Guizhou Tujia in this study were similar to previous reports from analyses of other autosomal STRs, X-STRs and Y-STRs, further supporting our results (Chen, Wu, et al. 2019; Luo et al. 2020; 2021). Similar genetic affinity patterns shaped by geographic, ethnic and linguistic factors were also evident in other population groups, such as the aforementioned Sinitic-speaking Chinese Han populations, the Sinitic and Tibeto-Burman-speaking populations of Sichuan province, and the Turkic-speaking populations of the Xinjiang region.

Conclusion
In this study, we employed the Huaxia Platinum Kit for the first time to investigate genetic polymorphism in the Tujia population of Guizhou, and our results indicate that the 23 loci could be valuable for human identification and paternity testing. Additionally, our population comparisons using diverse analytical methods revealed an evident genetic affinity pattern between geographically, ethnically and linguistically related populations. The Guizhou Tujia have closer genetic relationships with the Hubei Tujia, Guizhou Gelao, and Guizhou Miao than with the other populations. These findings add to the growing body of knowledge on genetic variation and population history in China and provide insights into the complex interplay between genetic, geographic, ethnic, and linguistic factors that shape population diversity.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Heatmap of pairwise Nei's genetic distances between the Guizhou Tujia population and the 22 reference populations. Population names are displayed in abbreviated form (as in the other figures and tables), and the genetic affinities and differences between the Guizhou Tujia and the other populations are indicated using distinct colours.

Figure 2. Neighbor-joining tree based on Nei's distances of the Guizhou Tujia population and 22 reference populations. The language families and branches to which the populations belong are denoted using circles of different colours.

Figure 3. Principal component analysis (PCA) based on Nei's distances of the Guizhou Tujia population and 22 reference populations. The 23 populations are separated into three distinct main language family clusters: the Sino-Tibetan, the Altaic, and the Indo-European.

Funding
This work was supported by the National Natural Science Foundation [No. 81401562]; Shanghai Key Lab of Forensic Medicine, Key Lab of Forensic Science, Ministry of Justice, China (Academy of Forensic Science), Open Project [No. KF202109]; Science and Technology Project of Guizhou Provincial Health Commission [No. gzwkj2021-533]; Science and Technology Cooperation Project of Zunyi Science and Technology and Big Data Bureau [No. HZ-2022-237]; Science and Technology Foundation of Guizhou Province of China [No. QKHJC-ZK[2023]582].