Genetic diversity and population structure of sorghum [Sorghum bicolor (L.) Moench] in Ethiopia as revealed by microsatellite markers

ABSTRACT Ethiopia is the center of origin of sorghum (Sorghum bicolor L.). Understanding the genetic diversity of the species is fundamental to designing appropriate conservation and management strategies. The present study addressed the genetic diversity of sorghum accessions collected from major growing regions of Ethiopia. Eighty sorghum accessions representing five populations, namely Amhara, Dire Dawa, Oromia, Southern Nations, Nationalities and Peoples (SNNP) and Tigray, were analyzed with 11 simple sequence repeat (SSR) markers. Analysis of molecular variance (AMOVA) was conducted to evaluate genetic variation within and among populations. Unweighted neighbour-joining-based cluster analysis, principal coordinate analysis (PCoA) and structure analysis were performed to elucidate clustering of populations. The polymorphic information content (PIC) ranged from 0.50 to 0.86. High within-population genetic diversity was confirmed, with gene diversity values ranging from 0.51 to 0.77. AMOVA attributed 93.26% of the total genetic variation to differences within populations and 6.74% to differences among populations. Cluster analyses did not show clear grouping of accessions according to their geographical origins, confirming gene flow (Nm = 6.65) among populations. In conclusion, the SSR markers used were polymorphic and highly informative. The Oromia and Amhara populations displayed genetic diversity greater than the mean value of 0.67, suggesting that they are possible target populations for breeding and conservation.


Introduction
Sorghum (Sorghum bicolor (L.) Moench) belongs to the family Poaceae (De Wet 1978). Sorghum originated in north-eastern Africa and then spread to other regions around the world (De Wet and Harlan 1971). As reported by Vavilov (1951), Ethiopia is the center of origin of sorghum. S. bicolor includes all cultivated sorghums as well as semi-wild and wild plants (Mutegi et al. 2011). Both the cultivated and wild varieties are diploid (2n = 2x = 20) with a genome size of ∼730 Mbp (Paterson et al. 2009). Cultivated sorghum has been classified into five races: Durra, Bicolor, Caudatum, Kaffir and Guinea (Harlan and De Wet 1972). All except Kaffir are found in Ethiopia (Teshome et al. 1997; Girma et al. 2020).
Sorghum ranks fifth among cereal crops in global production (Cuevas et al. 2014). According to FAOSTAT (2017), the United States was the world's largest producer of sorghum, while Africa leads all regions in total production. Ethiopia is the third largest producer of sorghum in Africa, behind Nigeria and Sudan (Wani et al. 2011). In Ethiopian agriculture, sorghum holds the third largest share of total cereal production after teff and maize. It is a very important crop for food and feed (Tariq et al. 2014) and is also widely used for biofuel and fibre (Murray et al. 2008; Elangovan et al. 2012; Disasa et al. 2017).
Several production constraints, including drought, Striga, insect pests, diseases and low-yielding local cultivars, affect the productivity of sorghum (Amelework et al. 2016). To address these problems, knowledge of the genetic variability of the crop is important for an efficient selection process (Yaqoob et al. 2015). Molecular markers are more effective for assessing genetic diversity and for selecting traits of interest in breeding programmes (Smith and Smith 1992). Molecular markers such as simple sequence repeats (Tautz 1989) have great potential for detecting genetic diversity and relationships in crops. Microsatellites are preferred for many sorghum genomics and molecular breeding applications because of their technical simplicity, high throughput and potential for automation (Missihoun et al. 2015).
In sorghum, a limited number of morphology-based genetic diversity analyses have been carried out on sweet sorghum and wild sorghum germplasm (Adugna et al. 2013). Protein (allozyme) studies have examined geographic and altitudinal variation among landraces collected from Ethiopia and Eritrea (Ayana et al. 2001). Some genetic diversity studies have been done on selected sorghum genotypes at the molecular level using DNA markers such as RFLP (Menz et al. 2004), RAPD (Ayana et al. 2000), AFLP (Geleta et al. 2006), ISSR (Tadesse and Feyissa 2013), SSR, expressed sequence tags (EST) (Ramu et al. 2009) and single nucleotide polymorphism (SNP) markers (Khangura 2019; Nida et al. 2019; Mengistu et al. 2020; Enyew et al. 2022; Lee et al. 2022). Some of the genomic diversity studies of sorghum were on a global scale (Hu et al. 2019; Lasky et al. 2015; Morris et al. 2013). Others were at a regional scale, such as those on Ethiopian sorghum landraces (Menamo et al. 2021; Wondimu et al. 2021) and a West African sorghum panel (Faye et al. 2021). In addition, sorghum nested association mapping studies were carried out using an Ethiopian landrace as the recurrent parent (Dong et al. 2022). Recent studies reported that Ethiopian sorghum germplasm is genetically diverse with high levels of admixture (Girma et al. 2020; Menamo et al. 2021).
However, these studies differ in their power of genetic resolution, quality of information content and extent of polymorphism. In this study, we used SSR markers, which are polymorphic DNA loci consisting of repeating units of one to six base pairs. SSR markers are known for their hyper-variability, reproducibility, co-dominant nature, locus-specificity and random genome-wide distribution (Jonah et al. 2011; Jiang 2013). Therefore, the aim of this study was to assess the genetic diversity of sorghum collected from major growing areas of Ethiopia using SSR markers.

Plant materials
Eighty sorghum accessions (additional file 1), initially collected from major sorghum-growing regions of Ethiopia, were obtained from the Ethiopian Biodiversity Institute (EBI). Based on their region of origin, they were categorised into five populations: Amhara, Dire Dawa, Oromia, Southern Nations, Nationalities and Peoples (SNNP) and Tigray. Seeds representing each accession were planted at Melkassa Agricultural Research Center (MARC). From each accession, healthy young leaves were collected from two-week-old plants and placed in plastic bags containing silica gel to dry them for genomic DNA extraction.

DNA extraction
Genomic DNA was extracted from young leaves using the plant DNA extraction protocol described by Diversity Arrays Technology (DArT, www.diversityarrays.com) with some modifications. DNA quality was checked by loading 3 µl of DNA mixed with 2 µl of 6x loading dye with gel red on a 1% agarose gel and running it at 100 V for 40 min. The extracted genomic DNA was quantified using a NanoDrop spectrophotometer (ND-8000, 8-sample spectrophotometer) and stored at −20°C until used for analyses.

PCR and gel electrophoresis
Fifteen SSR primer pairs, developed by Billot et al. (2012), were initially screened for amplification, polymorphism and specificity to target loci. Following the manufacturer's instructions, the primers were dissolved in nuclease-free water to a final concentration of 100 pmol/µl. Of the 15 primer pairs screened, 11 amplified the genomic DNA and showed polymorphism across the tested sorghum accessions.
DNA amplification was performed in a 12.5 µl reaction volume containing 6.25 µl OneTaq® 2X Master Mix with Standard Buffer, 2 µl of 50 ng/µl template DNA, 2 µl forward and reverse primers, 0.25 µl DMSO (dimethyl sulfoxide) and 2 µl nuclease-free water. Touch-down PCR amplification was carried out using a BIO-RAD T100 thermal cycler. The PCR programme for the primer pairs (Table 1) consisted of 15 min at 94°C for initial denaturation, followed by 12 cycles of 94°C for 30 s, 60°C for 45 s (ramp of 1°C per cycle) and 72°C for 1 min, then 31 cycles of 94°C for 30 s, 38°C for 45 s and 72°C for 1 min, a final extension of 20 min at 72°C and a hold at 4°C. The PCR products were separated on a 3% agarose gel in 1× TAE buffer at 100 V for three hours, loading 5 µl of each PCR product mixed with 2 µl of gel red. The products were visualised under a gel documentation system (BioDoc-It™ imaging system) and photographed. A 100 bp molecular size marker was used to estimate the size of the amplified products.
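The reaction composition above can be checked and scaled with a short script. This is a minimal sketch, assuming a master mix is prepared for all components except template DNA; the function name `master_mix` and the 10% pipetting overage are illustrative assumptions and are not taken from the study.

```python
# Per-reaction volumes (µl) as listed in the text above.
PER_REACTION_UL = {
    "OneTaq 2X Master Mix": 6.25,
    "template DNA (50 ng/µl)": 2.0,
    "forward and reverse primers": 2.0,
    "DMSO": 0.25,
    "nuclease-free water": 2.0,
}


def master_mix(n_reactions, overage=0.10):
    """Scale every component except template DNA for n reactions.

    `overage` is a hypothetical pipetting allowance (10% by default),
    not a value reported in the study.
    """
    factor = n_reactions * (1.0 + overage)
    return {
        name: round(volume * factor, 2)
        for name, volume in PER_REACTION_UL.items()
        if not name.startswith("template")
    }


# The listed components should add up to the stated 12.5 µl reaction volume.
total_ul = sum(PER_REACTION_UL.values())
print(f"total per reaction: {total_ul} µl")

# Volumes for one run covering the 80 accessions.
for name, volume in master_mix(80).items():
    print(f"{name}: {volume} µl")
```

Template DNA is left out of the scaled mix because it differs for each accession and is added to each tube separately.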

Data scoring and analysis
The PCR products (bands) were scored using PyElph 1.4 software (Pavel and Vasile 2012). Genetic diversity and population structure analyses were carried out on the basis of the scored bands. Locus-based diversity indices, including major allele frequency (MAF), number of alleles (Na), gene diversity (GD) and polymorphic information content (PIC), were computed using PowerMarker version 3.25 (Liu and Muse 2005). Effective number of alleles (Ne), Shannon's information index (I), population differentiation (Fst) and gene flow (Nm) were determined using POPGENE version 1.32 (Yeh et al. 1999).
Population-level genetic parameters, namely number of alleles (Na), number of private alleles (NPA), expected heterozygosity (He), percentage of polymorphic loci (PPL) and deviation from Hardy-Weinberg equilibrium (HWE), were computed using GenAlEx version 6.5 (Peakall and Smouse 2012). The same software was used to compute pairwise population genetic distances and gene flow, and to perform the genetic differentiation test (Fst) over 999 bootstrap replications.
Analysis of molecular variance (AMOVA) and estimation of the variance components were conducted using Arlequin version 3.5.2.2 (Excoffier and Lischer 2010). Gene flow (Nm) among populations was estimated using the formula Nm = 0.25(1 − Fst)/Fst, where Fst is the variance among populations divided by the total genetic variance.
A genetic dissimilarity matrix was used to construct a tree by the Neighbour Joining (NJ) method in DARwin version 6.0 (Perrier and Jacquemoud-Collet 2006), and an Unweighted Pair Group Method with Arithmetic Mean (UPGMA) tree based on Nei's standard genetic distance (DST, corrected) (Nei 1972) was generated using PowerMarker version 3.25 (Liu and Muse 2005).
Principal Coordinate Analysis (PCoA) was performed using GenAlEx version 6.5 (Peakall and Smouse 2012). Population structure and admixture patterns were determined using the Bayesian model-based clustering algorithm implemented in STRUCTURE version 2.3.3 (Pritchard et al. 2000). The admixture model with correlated allele frequencies was used, assuming that the genome of each individual resulted from a mixture of K ancestral populations. To estimate the true number of population clusters (K), a burn-in period of 100,000 was used in each run, and data were collected over 250,000 Markov Chain Monte Carlo (MCMC) replications for K = 1 to K = 10, with 20 iterations for each K. The optimum K value was predicted following the simulation method of Evanno et al. (2005), using the web-based STRUCTURE HARVESTER version 0.6.92 (Earl 2012). A bar plot for the optimum K was generated using Clumpak beta version (Kopelman et al. 2015).
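For readers unfamiliar with the locus-based indices named above, the following sketch computes them from a list of allele calls at one locus using the textbook definitions. It is an illustration only: the function name and the example allele sizes are hypothetical, and the uncorrected formulas used here may differ slightly from the sample-size-corrected estimators that dedicated packages report.

```python
import math
from collections import Counter


def locus_indices(alleles):
    """Return textbook diversity indices for one locus.

    `alleles` is a list of allele calls (for example band sizes in bp).
    """
    counts = Counter(alleles)
    n = sum(counts.values())
    freqs = [c / n for c in counts.values()]
    sum_sq = sum(p * p for p in freqs)
    # PIC after Botstein et al. (1980):
    # 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return {
        "Na": len(freqs),                # number of alleles
        "MAF": max(freqs),               # major allele frequency
        "GD": 1 - sum_sq,                # gene diversity (uncorrected)
        "PIC": 1 - sum_sq - cross,       # polymorphic information content
        "Ne": 1 / sum_sq,                # effective number of alleles
        "I": -sum(p * math.log(p) for p in freqs),  # Shannon's index
    }


# Hypothetical allele sizes (bp) for a single SSR locus.
example = locus_indices([180, 180, 184, 184])
for name, value in example.items():
    print(name, round(value, 4))
```

With two equally frequent alleles the sketch gives GD = 0.5 and PIC = 0.375, which illustrates why PIC is always somewhat lower than gene diversity at the same locus.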
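The relation between gene flow and differentiation stated in the table legend of this paper, Nm = 0.25(1 − Fst)/Fst, can be written as a small helper. This is a sketch of that formula only; the function name is an assumption, and values computed from rounded Fst figures will not necessarily match those reported by the software.

```python
def gene_flow(fst):
    """Estimate Nm from Fst as Nm = 0.25 * (1 - Fst) / Fst."""
    if not 0.0 < fst < 1.0:
        raise ValueError("Fst must lie strictly between 0 and 1")
    return 0.25 * (1.0 - fst) / fst


# Nm falls steeply as differentiation increases.
for fst in (0.05, 0.10, 0.20):
    print(f"Fst = {fst:.2f} -> Nm = {gene_flow(fst):.2f}")
```

Because Fst appears in the denominator, small rounding differences in low Fst values produce large changes in Nm.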
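As an illustration of the distance underlying the UPGMA tree, the sketch below computes Nei's (1972) standard genetic distance, D = −ln(Jxy / sqrt(Jx·Jy)), from allele frequencies at several loci. It shows the uncorrected form only, not the sample-size correction mentioned in the text; the function name, data layout and example frequencies are assumptions made for this example.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's (1972) standard genetic distance between two populations.

    Each population is a list with one dict per locus, mapping
    allele -> frequency. Both lists must cover the same loci in the
    same order.
    """
    jx = jy = jxy = 0.0
    for fx, fy in zip(pop_x, pop_y):
        jx += sum(p * p for p in fx.values())
        jy += sum(p * p for p in fy.values())
        jxy += sum(p * fy.get(allele, 0.0) for allele, p in fx.items())
    if jxy == 0.0:
        return math.inf  # no shared alleles at any locus
    # Averaging over loci cancels out, so sums can be used directly.
    return -math.log(jxy / math.sqrt(jx * jy))


# Hypothetical allele frequencies at two loci in two populations.
pop_a = [{"180": 0.5, "184": 0.5}, {"200": 1.0}]
pop_b = [{"180": 0.2, "184": 0.8}, {"200": 0.6, "204": 0.4}]
print(round(nei_standard_distance(pop_a, pop_b), 4))
```

Two populations with identical allele frequencies have a distance of zero, and the distance grows as shared alleles become rarer.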
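The delta K statistic of Evanno et al. (2005) can be sketched as follows. This example uses the mean log-likelihood across replicate runs to form the second-order difference and divides by the standard deviation at K, which is a common simplification. The function name and all log-likelihood values are invented for illustration and are not results from this study.

```python
import statistics


def evanno_delta_k(lnp):
    """Compute delta K from replicate ln P(D) values.

    `lnp` maps K -> list of ln P(D) values, one per replicate run.
    Returns K -> delta K for every K that has both neighbours
    (K - 1 and K + 1) and a non-zero standard deviation.
    """
    mean = {k: statistics.mean(v) for k, v in lnp.items()}
    sd = {k: statistics.stdev(v) for k, v in lnp.items()}
    delta = {}
    for k in sorted(lnp):
        if k - 1 in mean and k + 1 in mean and sd[k] > 0:
            second_diff = abs(mean[k + 1] - 2 * mean[k] + mean[k - 1])
            delta[k] = second_diff / sd[k]
    return delta


# Hypothetical ln P(D) values from three replicate runs per K.
runs = {
    1: [-3000, -3002, -2998],
    2: [-2800, -2801, -2799],
    3: [-2780, -2790, -2770],
    4: [-2775, -2765, -2785],
}
delta_k = evanno_delta_k(runs)
best_k = max(delta_k, key=delta_k.get)
print(delta_k, "best K:", best_k)
```

Delta K cannot be computed for the smallest or largest K tested, since each needs a neighbour on both sides; this is why the method cannot by itself identify K = 1 as the optimum.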

SSR markers and allelic diversity
The 11 SSR markers generated 70 alleles across all 80 sorghum accessions, with an average of 6.36 alleles per marker. The MAF ranged from 0.20 (Xtxp217) to 0.54 (Xcup53), with a mean of 0.31 per locus (Table 2). The lowest MAF (0.20) and the highest number of alleles (9), effective number of alleles (7.80), Shannon's information index (2.13), gene diversity (0.87) and expected heterozygosity (0.77) were obtained for marker Xtxp217 (Table 2). Conversely, the highest MAF (0.54) and the lowest number of alleles (3), effective number of alleles (2.39), Shannon's information index (0.96), gene diversity (0.58) and expected heterozygosity (0.56) were obtained for marker Xcup53 (Table 2). The highest genetic differentiation (0.21) and the lowest gene flow (0.91) were obtained for marker Msbcir276, whereas the lowest genetic differentiation (0.06) and the highest gene flow (3.70) were obtained for marker Xcup02 (Table 2).
The PIC value for the SSR loci ranged from 0.50 for marker Xcup53 to 0.86 for marker Xtxp217, with a mean value of 0.75. In the test for HWE, all 11 SSR markers showed highly significant (p < 0.0001) deviation from HWE (Table 2).

Genetic diversity within and among populations
Within populations, the number of alleles ranged from 2.64 to 6.18 with a mean value of 4.47 (Table 3). The Oromia population showed the highest number of alleles (6.18), effective number of alleles (4.89), Shannon's information index (1.63) and expected heterozygosity (0.77). In contrast, the SNNP population showed the lowest number of alleles (2.64), effective number of alleles (2.32), Shannon's information index (0.84) and expected heterozygosity (0.51). Private alleles were detected only in the Amhara population, where the number of private alleles was 0.09; no other population had a private allele. The percentage of polymorphic loci per population ranged from 90.91% (SNNP) to 100% (Oromia, Amhara, Tigray and Dire Dawa) with a mean of 98.18% (Table 3).

Genetic differentiation, distance and gene flow between populations
The pairwise genetic differentiation between the populations ranged from 0.02 to 0.16 (Table 4). The highest population differentiation was observed between Tigray and SNNP (Fst = 0.16), followed by Dire Dawa and SNNP (Fst = 0.15). The lowest population differentiation (Fst = 0.02) was observed between Oromia and Amhara (Table 4). The pairwise Nei's genetic distance and gene flow (Nm) between populations ranged from 0.11 to 0.74 and from 0.94 to 16.68, respectively (Table 5). The highest genetic distance (0.74) and the lowest gene flow (0.94) were observed between the populations of Tigray and SNNP (Table 5). The second-highest genetic distance (0.62) was observed between the populations of Amhara and SNNP. The lowest genetic distance (0.11) and the highest gene flow (16.68) were observed between the populations of Oromia and Amhara (Table 5). The second-lowest gene flow (1.27) was observed between the populations of Dire Dawa and SNNP.
Abbreviations: MAF = major allele frequency, Na = number of alleles, Ne = effective number of alleles, I = Shannon's information index, GD = gene diversity, He = expected heterozygosity, Fst = inbreeding coefficient within subpopulations relative to the total (genetic differentiation among subpopulations), Nm = gene flow estimated from Fst as Nm = 0.25(1 − Fst)/Fst, PIC = polymorphic information content, P HWE = P-value for deviation from HWE, ns = not significant, * = P < 0.0001 and hence highly significant.

Analysis of molecular variance
AMOVA showed that variability among populations and within populations accounted for 6.74% and 93.26% of the total genetic variation, respectively. The overall fixation index, used as a measure of population differentiation, was moderate (Fst = 0.07), with high gene flow (Nm = 6.65) (Table 6).
The population structure analysis revealed a sharp peak of delta K at K = 2 (Figure 4(a)), suggesting two clusters. At this value, the Clumpak bar plot showed genetic admixture, and hence there was no clear structuring of populations by geographic origin (Figure 4(b)).

Discussion
In this study, 11 SSR markers amplified 70 alleles. The number of alleles per locus ranged from three for Xcup53 to nine for Xtxp057 and Xtxp217, with a mean of 6.36 alleles per locus (Table 2). Mofokeng et al. (2014) reported a total of 306 alleles, with a mean of 6.4 alleles per locus and a range of 2-15 alleles, using 30 SSR markers in 103 sorghum genotypes. However, the study by Cuevas and Prom (2013) on the population structure and diversity of 137 Ethiopian sorghum germplasm accessions conserved at the USDA-ARS National Plant Germplasm System showed 14 alleles per locus. The observed differences in number of alleles could be attributed to the size of the study population, the number of accessions and the types of SSR markers used. The mean gene diversity of the markers observed in this study (0.78) was similar to that previously reported by Thudi and Fakrudin (2011) (0.8) among rabi sorghum genotypes. Ng'uni et al. (2011) reported a mean gene diversity value of 0.53, which is lower than the value obtained in the current study.
The high level of gene diversity of the SSR markers observed in this study was probably associated with the genetic diversity of the sorghum accessions, which represented different geographic origins. Polymorphic information content (PIC) values are usually calculated to assess the level of polymorphism of a marker. As described by Botstein et al. (1980), a marker with a PIC value below 0.25 is slightly informative, a marker with a PIC value between 0.25 and 0.5 is reasonably informative, and a marker with a PIC value greater than 0.5 is highly informative. In the present study, all 11 SSR loci were highly informative, with PIC values of at least 0.5. The PIC of the loci ranged from 0.5 to 0.86 with a mean of 0.75. This result is fairly similar to the mean PIC value of 0.78 reported in the sorghum genetic diversity study by Cuevas and Prom (2013). However, the observed PIC value is higher than that of Muui et al. (2016), who reported 0.49. High PIC values indicate the high discriminating ability of the selected SSR markers for the studied sorghum genotypes. As indicated by the mean Shannon's information index (1.28) and expected heterozygosity (0.67) across populations (Table 3), nearly all of the populations under study showed high genetic diversity. Among the studied populations, those of Oromia and Amhara displayed gene diversity greater than the grand mean value of 0.67, suggesting that these populations should be targeted to generate baseline information for breeding and conservation. The study also indicated that only the Amhara population had private alleles, with a number of private alleles of 0.09 (Table 3), suggesting a certain level of independent evolution of its gene pool that allowed private alleles to be maintained at the population level.
AMOVA (Table 6) revealed low genetic differentiation among populations (6.74% of the total genetic variation) and a high proportion of variation within populations (93.26%). Similar results were observed in previous studies (Ng'uni et al. 2011; Muui et al. 2016; Tirfessa et al. 2020), which reported large genetic variation within populations (82%, 91.61% and 99.62%) of sorghum genotypes from Zambia, Kenya and Ethiopia, respectively.
The fixation index (Fst) ranges from 0 (indicating no differentiation between the overall population and its subpopulations) to a theoretical maximum of 1, and it can be considered small (0-0.05), moderate (0.05-0.15), large (0.15-0.25) or very high (>0.25) (Wright 1965). The present study (Table 6) revealed moderate genetic differentiation among populations of sorghum (Fst = 0.07). This moderate genetic differentiation may result from gene flow through continuous exchange of genes, as the adjacent areas where the populations were collected share common markets. The high variability within individual populations observed in this study could be due to germplasm exchange in breeding programmes. The overall gene flow (Nm), or gene migration, value observed in this study was 6.65, which approximates the number of migrants exchanged among populations per generation. According to Slatkin (1985) and Waples (1987), Nm values fall into three categories: low (0.000-0.249), intermediate (0.25-0.99) and high (Nm > 1.00). Seed exchange among neighbouring sorghum-growing regions and the high gene flow (Nm = 6.65) could therefore also contribute to the observed low variation among populations. This indicates that the geographical locations of the populations have no significant effect on the genetic variation of the sorghum accessions under study.
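The Fst and Nm categories cited in this paragraph can be expressed as two small lookup functions. This is a sketch under stated assumptions: the published ranges share boundary values, so boundary values are assigned to the lower Fst class here, and any Nm of 1.0 or more is treated as high. The function names are illustrative.

```python
def classify_fst(fst):
    """Classify Fst using the ranges attributed to Wright (1965).

    Assumption: a value on a boundary (0.05, 0.15, 0.25) is placed
    in the lower class.
    """
    if not 0.0 <= fst <= 1.0:
        raise ValueError("Fst must lie between 0 and 1")
    if fst <= 0.05:
        return "small"
    if fst <= 0.15:
        return "moderate"
    if fst <= 0.25:
        return "large"
    return "very high"


def classify_nm(nm):
    """Classify Nm as low, intermediate or high.

    Assumption: Nm >= 1.0 is treated as high, closing the gap
    between the 0.25-0.99 and > 1.00 ranges.
    """
    if nm < 0.0:
        raise ValueError("Nm cannot be negative")
    if nm < 0.25:
        return "low"
    if nm < 1.0:
        return "intermediate"
    return "high"


# Overall values reported in this study.
print(classify_fst(0.07), classify_nm(6.65))
# Highest pairwise Fst and lowest pairwise Nm reported in this study.
print(classify_fst(0.16), classify_nm(0.94))
```

Applied to the reported values, the overall estimates fall in the moderate Fst and high Nm classes, while the most differentiated population pair falls in the large Fst class.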
The unweighted neighbour-joining-based cluster analysis (Figure 2) and PCoA (Figure 3) did not reveal clear grouping of accessions according to their geographical origins, possibly confirming the existence of high gene flow among populations. The clustering pattern is too weak to support the concept of isolation by distance. This is probably due to gene flow during seed exchange by farmers and to the fact that the geographical regions in Ethiopia are not completely different in terms of climate and soil variables; rather, they share similar agro-ecologies. Similarly, a study by Tirfessa et al. (2020) showed that neither hierarchical clustering nor PCoA revealed clear grouping of Ethiopian sorghum accessions based on either altitude or geographical origin. In addition, Motlhaodi et al. (2014) reported no geographical origin or racial clustering pattern in sorghum genotypes. In contrast, Missihoun et al. (2015) reported that SSR analysis of 61 sorghum samples from Benin showed structuring according to botanical race and morpho-physiological characteristics. In the present study (Figure 4), population genetic structure analysis also revealed weak sub-grouping (K = 2) of the five populations of sorghum with a high potential of admixture. All populations possessed genetic background (alleles)

Figure 3. Principal coordinate analysis of 80 sorghum accessions using 11 SSR markers. Samples coded with the same symbol and colour belong to the same population. The first three axes (1, 2 and 3) explained 12.13%, 9.07% and 8.49% of the variation, respectively, for a total of 29.69%.

Conclusion
Sorghum is one of the most important multi-purpose crops grown globally. It is one of the major staple crops grown in the poorest and most food-insecure regions of Ethiopia. Knowledge of the genetic diversity of sorghum is the foundation for the improvement and sustainable development of new varieties.
In the present study, 80 sorghum accessions collected from five sorghum-growing regions of Ethiopia, categorised into five populations, were analyzed for allelic diversity with 11 SSR markers. All the markers were highly polymorphic and informative for describing the genetic diversity and population structure of the genotypes.
The study also revealed high genetic diversity within sorghum populations of Ethiopia, where 93.26% of the total genetic diversity resides within populations. Among the populations studied, gene diversity was high, ranging from 0.51 to 0.77. Sorghum populations of Oromia and Amhara displayed gene diversity greater than the grand mean value of 0.67, suggesting that these locations should be targeted to generate baseline information for breeding and conservation of the crop. The unweighted neighbour-joining-based cluster analysis showed weak genetic differentiation, and all the sorghum populations shared a genetic background originating from two subpopulations, confirming the presence of high genetic exchange or gene flow among populations. This information is crucial for the development of pure sorghum breeding lines. The results of this study suggest that future germplasm collection and utilisation strategies for sorghum in Ethiopia should take into consideration the magnitude and pattern of genetic diversity established at the genotypic level.
The present study has contributed substantial knowledge about the genetic architecture of some Ethiopian sorghum germplasm collections for the conservation, improvement and breeding of the crop. To obtain better resolution of the genetic diversity of sorghum in Ethiopia, we recommend further investigation employing a larger sample size, wider geographic coverage and more molecular markers.