Geolocation prediction from STR genotyping: a pilot study in five geographically distinct global populations

Abstract
Background: Traditional CE-based STR profiles are highly useful for the purpose of individualisation. However, they do not give any additional information when no reference sample is available for comparison.
Aim: To assess the usability of STR-based genotypes for the prediction of an individual’s geolocation.
Subjects and Methods: Genotype data from five geographically distinct populations, i.e. Caucasian, Hispanic, Asian, Estonian, and Bahrainian, were collected from the published literature.
Results: A significant difference (p < 0.05) in the observed genotypes was found between these populations. D1S1656 and SE33 showed substantial differences in their genotype frequencies across the tested populations. SE33, D12S391, D21S11, D19S433, D18S51, and D1S1656 were found to have the highest occurrence of “unique genotypes” in different populations. In addition, D12S391 and D13S317 exhibited distinct population-specific “most frequent genotypes.”
Conclusions: Three different prediction models have been proposed for genotype to geolocation prediction, i.e. (i) use of unique genotypes of a population, (ii) use of the most frequent genotype, and (iii) a combinatorial approach of unique and most frequent genotypes. These models could aid investigating agencies in cases where no reference sample is available for comparison of the profile.


Introduction
The field of forensic DNA analysis is constantly evolving, with rapid technological advancements. Most laboratories use the capillary electrophoresis (CE) technique to explore repetitions in Short Tandem Repeat (STR) markers to obtain a unique DNA profile (Butler 2007). However, the use of STR markers has been limited to the purpose of individualisation and it does not provide any additional information regarding investigative leads. In many forensic DNA investigations, sample-limiting conditions can arise, where reference samples are not available for matching (Butler 2015). In such scenarios, the present-day DNA analysis becomes irrelevant, as it fails to provide conclusive identification results. Though database searching for such unknown DNA profiles is an alternative, many developing countries do not have their own databases (Machado and Granja 2020). Some of the potential DNA databases in the world include CODIS, the ENFSI STR population database, EMPOP, Family Tree DNA, ENFSI STRbase, and many more (Ruitberg et al. 2001; Congiu et al. 2012). Hence, it has become imperative to explore the horizon of STR-based DNA profiles to provide investigative leads in cases where reference samples are not available.
The advancements in technology have enabled the prediction of an individual's phenotype and ancestry using Next Generation Sequencing (NGS) (Yang et al. 2014). Several Single Nucleotide Polymorphism (SNP) markers have been explored for this purpose. As SNP markers are not as polymorphic as STR markers, a far larger number of SNP markers needs to be analysed than in conventional STR analysis (Butler et al. 2007). It is not possible to accommodate such a large number of SNP markers in the conventional CE technique. Thus, NGS technology is required to predict the phenotype and ancestry of an individual using the SNP approach. Considering the cost and expertise required for NGS analysis compared to the CE-based approach, a complete shift of workflow from the CE-based approach to the NGS technique is not foreseen in the near future.
STR markers are highly versatile and have been extensively studied for their individualisation capabilities in various populations. Autosomal STR markers, mini-STRs (Nieuwerburgh et al. 2014), Y-STR haplotype prediction (Kayser 2017), X-STR analysis in sample-limiting conditions (Yang et al. 2017), and rapidly mutating (RM) STRs for individualisation (Ballantyne et al. 2014) have been widely used for forensic DNA analysis purposes. STR analysis is being performed to solve cases such as paternity disputes, identification, murder, sexual assault, etc. without giving any investigative leads. The use of such markers in applications other than individualisation has been explored only a little to date, for example in monitoring haematopoietic chimerism in patients after allogeneic stem cell transplantation (Tilanus 2006), matching between organ donors and recipients (Mishra et al. 2020), and cell line identification (Reid et al. 2017).
Population-specific useful markers have been explored in many studies (Dash, Rawat et al. 2021). Large volumes of population data are being published in various databases and journals. Critical observation of such population data reveals unique alleles, as well as the least common and most common alleles, in each population. The genotype characteristics of each marker also reveal the usefulness of some particular genetic markers in the population (Dash, Rawat et al. 2021). In this regard, the genotype information of a profile is postulated to provide its population-related information. However, no study has explored this possibility to date. Thus, an attempt has been made to correlate the marker-specific genotype information with the population to evaluate their possible relatedness. Prediction of the population origin of an individual from their STR DNA profile, in the absence of any reference samples, will be highly useful in providing valuable investigative leads.

Populations and genotype data
A total of five geographically distinct populations were selected randomly from the published literature. The selected population data included self-declared Caucasian (361), Hispanic (236), and Asian (97) samples (Hill et al. 2013), 303 samples from Estonia (Sadam et al. 2015), and 543 samples from Bahrain (Al-Snan et al. 2019). Though the sample size for the Asian population is smaller than that of the other populations included in this study, it truly represents the population and its allelic and genotypic characteristics. The populations chosen in this study are geographically isolated from one another and include representatives of Americans, Asians, Europeans, and natives of the Middle East. Genotype data from the said populations were compiled and analysed further. Genotype data were analysed on 21 consistent genetic markers, i.e. CSF1PO, D12S391, D13S317, D16S539, D18S51, D10S1248, D19S433, D1S1656, D21S11, D22S1045, D2S1338, D2S441, D3S1358, D5S818, D7S820, D8S1179, FGA, SE33, TH01, vWA, and TPOX.

Statistical analysis
Various statistical analyses, including the frequency of genotypes, the correlation between inter-population genotype frequencies, 2-way ANOVA with replications, prediction of population-specific unique genotypes, and prediction of most-frequent genotypes, were performed using Microsoft Excel 2019. The phylogenetic tree was prepared from the allelic data of the five studied populations using POPTREE2 (Takezaki et al. 2010).
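The genotype frequency calculation named above can be illustrated with a minimal Python sketch. The study itself used Microsoft Excel, so this is only an equivalent illustration; the function names and the five sample genotypes are hypothetical and are not study data.

```python
from collections import Counter


def normalise_genotype(allele_a, allele_b):
    """Return an order-independent genotype, e.g. (12, 11) -> (11, 12)."""
    return tuple(sorted((allele_a, allele_b)))


def genotype_frequencies(genotypes):
    """Map each distinct genotype to its relative frequency in the sample."""
    counts = Counter(normalise_genotype(a, b) for a, b in genotypes)
    total = sum(counts.values())
    return {genotype: n / total for genotype, n in counts.items()}


# Hypothetical genotypes at one marker for five individuals (not study data).
sample = [(11, 12), (12, 11), (10, 12), (12, 12), (11, 12)]
freqs = genotype_frequencies(sample)
print(freqs)
```

Sorting the two alleles before counting ensures that (11, 12) and (12, 11) are treated as the same genotype.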

Occurrence of genotypes and their inter-population variation
Out of 21 STR markers analysed, SE33 was found to generate the highest numbers of total genotypes in the Caucasian (195), Hispanic (141), Asian (64), and Bahrainian (195) populations (Figure 1). D5S818 showed the least number of total genotypes in the Caucasian, Hispanic, and Bahrainian populations, whereas TPOX yielded the lowest number of total genotypes in the Asian and Estonian populations. The number of possible genotypes in a population indicates the discriminating capability of an STR marker (Dash, Rawat et al. 2021). In this regard, the occurrence of the highest number of genotypes at the SE33 marker in all the populations included in this study showed its utility irrespective of the population tested. The usefulness of the SE33 marker has also been established in other global populations, including the Indian population. Besides the number of observed genotypes, the occurrence of heterozygous genotypes plays a crucial role in establishing the utility of an STR marker. Any STR marker having higher observed heterozygosity proves beneficial as it provides higher informativeness in a population (Dash, Rawat et al. 2021). In this regard, the observed heterozygous genotypes followed the same pattern as the total number of genotypes, with SE33 having the highest number of heterozygous genotypes irrespective of the population tested. The heterozygosity of the STR markers varied widely in the populations tested, ranging from 0.66 (D10S1248 in the Estonian population) to 0.95 (SE33 in the Bahrainian population) (Figure 2). The lowest heterozygosity was observed at TH01 (in the Caucasian population), TPOX (in the Hispanic population), D16S539 (in the Asian population), D10S1248 (in the Estonian population), and D5S818 (in the Bahrainian population). Though the highest value of genotype heterozygosity did not differ at the inter-population level, the significant difference in the lowest occurrence of heterozygosity gives a clear indication regarding their distinguishability among the populations studied.
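For illustration, the two quantities discussed above (observed heterozygosity and the number of distinct and heterozygous genotypes at a marker) could be computed as in the following sketch. The function names and sample data are hypothetical, and a homozygous genotype is assumed to be written with the allele repeated, e.g. (12, 12).

```python
def observed_heterozygosity(genotypes):
    """Fraction of individuals whose two alleles differ at the marker."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygous = sum(1 for a, b in genotypes if a != b)
    return heterozygous / len(genotypes)


def distinct_genotype_counts(genotypes):
    """Return (number of distinct genotypes, number of distinct heterozygous ones)."""
    distinct = {tuple(sorted(g)) for g in genotypes}
    heterozygous = {g for g in distinct if g[0] != g[1]}
    return len(distinct), len(heterozygous)


# Hypothetical genotypes at one marker for five individuals (not study data).
sample = [(11, 12), (12, 11), (10, 12), (12, 12), (11, 12)]
print(observed_heterozygosity(sample))
print(distinct_genotype_counts(sample))
```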
When the inter-population variation of the total number of genotypes (Table 1a) and the total number of heterozygous genotypes (Table 1b) was evaluated by 2-way ANOVA with replications, we found a statistically significant difference (p < 0.05) in both scenarios. Both parameters varied significantly among the populations tested and among the STR markers analysed, which encouraged the authors to evaluate more parameters to study the inter-population variation of the STR-specific genotypes to distinguish different geographically distinct populations. Phylogenetic analysis revealed that the Estonian population and Bahrainian population form a phylogenetic cluster, whereas the Caucasian, Asian, Hispanic, and African American populations form a separate cluster (Figure 3). This suggests that genetic variation exists among the studied populations, and their allelic and genotype data can be explored to attribute the origin of an unknown individual.

Genotype frequencies
The observed genotype frequencies showed a wide variation among the different populations included in this study. However, none of the populations showed a statistically significant variation (p < 0.05) in the calculated genotype frequency at any of the STR markers except D1S1656 and SE33 (Supplementary Appendix). These two markers are deemed to be useful in distinguishing among the geographical populations based on their observed genotypes and their frequency. The usefulness of the SE33 marker has been well established in different geographic populations and has also been described in the previous section. The occurrence of microvariant alleles at D1S1656 contributing to its usefulness has been described by Dash, Vajpayee et al. (2021). Similarly, the usefulness of the D1S1656 STR marker has been established in the population of central Poland (Jacewicz et al. 2016), the Maghreb region (Cortellini et al. 2011), and the population of Maharashtra, India (Badiye et al. 2021). Though the D1S1656 marker showed its usefulness across populations, it has also been examined for its intrusion into neighbouring markers, which can mislead the reading of a DNA profile (Marcucci et al. 2017). Thus, the D1S1656 marker needs to be explored more in a population-specific manner before reaching any conclusions regarding its usefulness. The Pearson correlation coefficient showed a surprising result in this study (Table 2). A positive correlation was observed among the inter-population genotype frequencies, and it was statistically significant at p < 0.1 between all pairs of populations studied except Asian-Caucasian and Asian-Estonian. This opens a new window to attribute an individual to be of Asian origin by distinguishing it from the Caucasian and the Estonian populations.
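The inter-population correlation described above can be sketched in Python as follows. The source does not state how genotypes absent from one population were handled, so assigning them a frequency of zero is an assumption made here; the function names and the frequency values are hypothetical.

```python
import math


def aligned_frequencies(freq_a, freq_b):
    """Align two genotype->frequency maps on the union of their genotypes.

    Assumption: a genotype not observed in a population has frequency 0.
    """
    keys = sorted(set(freq_a) | set(freq_b))
    return [freq_a.get(k, 0.0) for k in keys], [freq_b.get(k, 0.0) for k in keys]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical genotype frequencies at one marker in two populations.
pop_a = {(11, 12): 0.5, (10, 12): 0.3, (12, 12): 0.2}
pop_b = {(11, 12): 0.4, (10, 12): 0.4, (12, 12): 0.2}
xs, ys = aligned_frequencies(pop_a, pop_b)
r = pearson_r(xs, ys)
print(round(r, 3))
```

Testing the significance of r (for example at p < 0.1, as in Table 2) is a further step that this sketch does not cover.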

Unique genotypes assessment
The occurrence of population-specific unique genotypes of an STR marker can play a decisive role in attributing a DNA profile to a geographic population. In this regard, many unique genotypes were observed at the different STR markers tested in the populations. As expected, the highest numbers of unique genotypes were observed at the SE33 marker, irrespective of the populations tested. Besides SE33, other useful markers generating unique genotypes in the populations tested include D12S391 (4.8), D21S11 (4.6), D19S433 (3.4), D18S51 (3.2), and D1S1656 (3.2). The average number of unique genotypes was highest in the Bahrainian population (5.528), followed by the Caucasian population (4.571), the Hispanic population (3.142), the Asian population (2.857), and the Estonian population (1.095). A list of marker-wise population-specific unique genotypes is given in Table 3. The highest occurrence of unique genotypes in the Bahrainian population can be linked to the high rate of traditional consanguineous marriage in the population (Al-Arrayed and Hamamy 2012). Consanguineous marriages have been reported to increase the levels of homozygosity, less diverse genes, deviation from the Hardy-Weinberg (H-W) equation, and loss of function mutations (Erzurumluoglu et al. 2016). The occurrence of less diverse genotypes can work as a blessing in disguise, as these genotypes can be exploited to predict the geolocation of an individual. Most population genetics studies analyse the unique alleles of the population (Aalbers et al. 2020). The number of possible unique genotypes increases with an increase in the number of possible unique alleles. Hence, it is more useful to analyse unique genotypes rather than alleles while attributing an individual to a geographical location or distinct population. This highlights that these abundant population-specific unique genotypes can be used to attribute an unknown individual to a possible geographical location.
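As a minimal sketch of the idea, a genotype can be treated as unique to a population when it is observed in that population and in none of the other populations compared. The function name, the population labels, and the genotypes below are hypothetical and are not taken from Table 3.

```python
def unique_genotypes(observed):
    """Return, per population, the genotypes seen in no other population.

    `observed` maps a population name to the set of genotypes found at one marker.
    """
    unique = {}
    for population, genotypes in observed.items():
        others = set().union(
            *(g for name, g in observed.items() if name != population)
        )
        unique[population] = genotypes - others
    return unique


# Hypothetical observations at a single marker (not study data).
observed = {
    "A": {(18, 20), (19, 21), (17.3, 19.3)},
    "B": {(18, 20), (19, 20)},
    "C": {(19, 21), (18, 22)},
}
print(unique_genotypes(observed))
```

Uniqueness in this sense is relative to the populations and sample sizes compared, so a genotype that appears unique here could still occur in a population that was not sampled.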

Assessment of most frequent genotypes
Frequent genotypes found at an STR marker for a specific population can also be targeted for attributing an unknown individual to a population. A variety of population-specific most frequent genotypes were observed at different STR markers. Some populations shared their most frequent genotypes; for example, (11, 12) was found to be the most frequent genotype at CSF1PO in the Caucasian, Hispanic, and Bahrainian populations. As this genotype was the most common and was shared among three distinct populations, it may not provide any useful information for attributing a DNA profile to a specific geographical population. However, the occurrence of the (12) and (10, 12) genotypes as the most common genotype in the Asian and the Estonian populations, respectively, was found to be useful in geolocation attribution. Marker-wise, population-specific observed "most frequent genotypes" are listed in Table 4. Out of all the STR markers tested, D12S391 and D13S317 were found to be the most useful markers as they produced distinct "most frequent genotypes" in the populations analysed, i.e. the Caucasian population [(19, 21) and (8, 12)], the Hispanic population [(19, 20) and (12)], the Asian population [(18, 20) and (10, 11)], the Estonian population [(18, 22) and (12, 13)], and the Bahrainian population [(18) and (11, 13)]. Besides, the Asian population showed the highest number of distinct "most-frequent genotypes", followed by the Estonian population (8 genotypes), the Bahrainian population (7 genotypes), and the Caucasian and Hispanic populations (6 genotypes each). However, many STR markers generated a common "most frequent genotype," irrespective of the populations tested. This does not add any information to distinguish between the tested populations. Hence, the second most frequent genotypes were assessed to reach a sufficient level of discrimination. When the second most frequent genotype was considered, D1S1656, D2S1338, and D7S820 emerged as the most discriminatory markers (Table 4). At these markers, though some of the populations shared the most-frequent genotype, the second most frequent genotype differed from one population to another. In forensic population genetics studies, the most common alleles are analysed for the STR markers tested (Projić et al. 2007). However, as forensic DNA analysis aims to distinguish between two individuals, the commonly occurring alleles with low discrimination power are considered least useful (Schneider 2012). On the contrary, considering the population-specific most frequent alleles is highly useful in generating investigative leads such as attributing the biogeographical location of an individual (Battey et al. 2020).
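The selection of the first and second most frequent genotypes, and the check for whether a marker gives a different most frequent genotype in every population, can be sketched as follows. The function names are hypothetical. The D12S391 and CSF1PO values are the most frequent genotypes reported in the text above, with the single-allele notation, e.g. (18), read as a homozygous genotype (18, 18); that reading is an assumption.

```python
from collections import Counter


def top_genotypes(genotypes, k=2):
    """Return the k most frequent genotypes (ties keep first-seen order)."""
    counts = Counter(tuple(sorted(g)) for g in genotypes)
    return [genotype for genotype, _ in counts.most_common(k)]


def modal_genotypes_are_distinct(modal_by_population):
    """True when no two populations share the same most frequent genotype."""
    modal = list(modal_by_population.values())
    return len(set(modal)) == len(modal)


# Most frequent genotypes as reported in the text for two markers.
d12s391 = {"Caucasian": (19, 21), "Hispanic": (19, 20), "Asian": (18, 20),
           "Estonian": (18, 22), "Bahrainian": (18, 18)}
csf1po = {"Caucasian": (11, 12), "Hispanic": (11, 12), "Asian": (12, 12),
          "Estonian": (10, 12), "Bahrainian": (11, 12)}

print(modal_genotypes_are_distinct(d12s391))
print(modal_genotypes_are_distinct(csf1po))
```

Calling top_genotypes with k=2 on raw genotype lists would give the first and second most frequent genotypes needed for the second-tier comparison described above.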

Implications of this study
This pilot study, including five geographically distinct populations, showed considerable promise in attributing the geolocation of an individual based on their STR data. In general, the genotype data of an STR marker show higher diversity than the allelic data in a population. Hence, exploring the genotype information is considered to be more useful in geolocation prediction. The analysis of STR genotype data in five geographically distinct populations suggested that a few STR markers are more useful than others in estimating the geolocation of an individual. Looking at the observed genotypes, and their STR-specific and population-specific nature, two parameters are deemed to be highly useful in predicting the geolocation of an individual from their genotype data. Based upon this pilot study, we propose three models to predict the geolocation or population of an individual using STR genotype results, i.e. (i) use of unique genotypes of a population, (ii) use of the most frequent genotype, and (iii) a combinatorial approach of unique and most frequent genotypes (Figure 4). Most studies to date have explored SNP-based ancestry informative markers to estimate the biogeography of an individual. However, SNP analysis requires sophisticated instruments such as a Next Generation Sequencer (NGS). With the availability of a large database of STR markers across the global population, geolocation prediction will be easy and economical, and can be performed using the existing infrastructure of forensic science laboratories. In addition, STR markers other than the conventionally used markers, with high ancestry informativeness, can also be explored at a population-specific level for their sustainable use in the prediction of geolocation from genotype.
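One way to picture the three proposed models is a simple match count, sketched below. This is not the authors' implementation: the source does not specify a scoring rule, so giving one point per matching marker and weighting unique and most frequent genotypes equally are assumptions, as are the function name and the table layout. The most frequent genotypes are those reported in the text for D12S391 and D13S317; the SE33 entry is a hypothetical placeholder, not a value from Table 3.

```python
def predict_population(profile, unique_table, modal_table, mode="combined"):
    """Rank candidate populations by the number of matching markers.

    mode: "unique" (model i), "frequent" (model ii) or "combined" (model iii).
    """
    if mode not in ("unique", "frequent", "combined"):
        raise ValueError("unknown mode: " + mode)
    scores = {}
    for marker, genotype in profile.items():
        g = tuple(sorted(genotype))
        if mode in ("unique", "combined"):
            for population, genotypes in unique_table.get(marker, {}).items():
                if g in genotypes:
                    scores[population] = scores.get(population, 0) + 1
        if mode in ("frequent", "combined"):
            for population, modal in modal_table.get(marker, {}).items():
                if g == modal:
                    scores[population] = scores.get(population, 0) + 1
    # Highest score first; ties broken alphabetically for reproducibility.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


modal_table = {
    "D12S391": {"Caucasian": (19, 21), "Hispanic": (19, 20), "Asian": (18, 20),
                "Estonian": (18, 22), "Bahrainian": (18, 18)},
    "D13S317": {"Caucasian": (8, 12), "Hispanic": (12, 12), "Asian": (10, 11),
                "Estonian": (12, 13), "Bahrainian": (11, 13)},
}
# Hypothetical unique-genotype entry used only to exercise the code.
unique_table = {"SE33": {"Asian": {(21.2, 30.2)}}}

profile = {"D12S391": (20, 18), "D13S317": (10, 11), "SE33": (30.2, 21.2)}
print(predict_population(profile, unique_table, modal_table))
```

A ranking of this kind would serve only as an investigative lead; a profile that matches no table entry yields an empty list rather than a forced prediction.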

Disclosure statement
No potential conflict of interest was reported by the author(s).

Figure 1. Number of genotypes observed at 21 consistent STR markers in five geographically distinct populations, i.e. Caucasian, Hispanic, Asian, Estonian, and Bahrainian.

Figure 2. Observed heterozygosity of the 21 consistent STR markers tested in five geographically distinct populations, i.e. Caucasian, Hispanic, Asian, Estonian, and Bahrainian.

Figure 3. Phylogenetic tree for relatedness analysis among the studied populations.

Figure 4. Implication of this work: prediction of the geolocation of an unknown individual from genotype data. In this process, (1) useful STR markers are identified, (2) an STR profile is generated, (3) unique and/or most-frequent genotypes are identified in the unknown DNA profile, followed by (4) prediction of the possible geolocation of the individual.

Table 1. 2-way ANOVA with replications of the number of observed genotypes and heterozygous genotypes among different populations.

Table 2. Correlation coefficients of the geographically distinct populations included in this study, based on the observed genotype frequency at 21 STR markers.

Table 3. List of unique genotypes observed in five geographically distinct populations.

Table 4. Marker-wise useful most-frequent genotypes with the capability of distinguishing inter-population individuals.