Genetic diversity and population structure analysis – a prerequisite for constructing a mini core collection of Balkan Capsicum annuum germplasm

Abstract
Pepper (Capsicum annuum L.) is one of the most important vegetable crops worldwide. The Balkans, including Bulgaria, are considered a secondary centre of pepper diversity, especially for C. annuum, where local forms with diverse phenotypes and qualities have formed due to the specific agro-climatic conditions and breeding traditions. Evaluation of the genetic diversity and structure of a pepper collection is an important tool for further development of new varieties and the maintenance of sustainable agriculture. In this study, a set of 179 C. annuum accessions collected from different locations in the Balkan Peninsula was genotyped with 21 simple sequence repeat (SSR; microsatellite) markers. In total, 146 alleles were amplified, the majority of which had low frequencies (<5%). The mean He, Ho and PIC for the 21 SSR loci in the whole set were 0.531, 0.249 and 0.483, respectively. Model-based structure analysis divided the collection into 3 main clusters (K = 3) that grouped accessions with distinct fruit traits such as shape, size and pungency. Further genetic structure analysis at increasing Ks suggested the presence of sub-clustering within the three main clusters. A Balkan C. annuum mini core collection was constructed based on the allelic diversity and the inferred genetic structure. Because the mini core collection captured a substantial part of the allele richness and of the genetic and phenotypic diversity of the 176 non-redundant accessions analysed, while maintaining good representativeness, we believe it will be of high interest to pepper breeders and germplasm conservation specialists. Supplemental data for this article is available online at https://doi.org/10.1080/13102818.2021.1946428 .


Introduction
Pepper (Capsicum spp.) is one of the most broadly cultivated and consumed vegetables worldwide with harvested area and production in the last five years reaching 3.7 million ha and over 40 million tonnes, respectively (www.fao.org). Its production continues to progressively grow because of the high nutritional value of pepper fruits. Pepper fruits have various applications in the human diet like fresh and processed vegetables, flavourings in food products, spice, but also in pharmaceutics, cosmetics and even as an ornamental plant [1,2]. The high content of bioactive compounds, such as capsaicinoids, carotenoids (capsanthin, capsorubin, β-carotene, β-cryptoxanthin, lutein, zeaxanthin, etc.), vitamins (C, E and provitamin A), dietary fibres and some essential mineral oils have made this vegetable crop an excellent source for protection against various chronic degenerative diseases and human health protection [3,4].
The genus Capsicum comprises as many as 36 species [5], of which five (C. annuum L., C. frutescens L., C. chinense Jacq., C. baccatum L. and C. pubescens Ruiz et Pav) are cultivated [6,7]. Among the domesticated Capsicum species, pungent (chilli or hot pepper) and non-pungent (sweet pepper) forms of Capsicum annuum L. are most popular and have a worldwide commercial distribution [8]. Because Capsicum is of economic and nutritional importance, breeders have improved some agronomic traits, such as pungency, fruit shape, abiotic and biotic stress resistance. However, this leads to reduced genetic diversity of breeding lines, so some useful genes in the landraces (local forms) are lost due to the breeding activities [3]. Conservation and sustainable utilization of genetic resources are keys to the continuous improvement of peppers in order to respond to climate change and increasing global food demand in the coming decades. Conservation of genetic diversity is an essential prerequisite to enhance plant breeding programmes, develop new varieties with desirable agronomic traits and broaden the genetic basis of this economically important crop. There are a number of pepper germplasm collections around the world (the USA, South America, Asia and Europe). However, many of them are difficult to maintain due to their large size and a lack of adequate information about population structure and genetic diversity at the interspecific and intraspecific level. Selecting a representative core collection is a proven and effective tool for overcoming the expenses and difficulties of managing the huge genetic resources in the gene banks [9]. The core collection is a subset of the germplasm collection that represents the genetic diversity of the entire collection, has no redundant accessions and is small enough to be easily managed [10,11].
Different types of descriptors like passport data, geographic origin, morphological, agronomical, biochemical and DNA markers can be used for phenotypic and genetic diversity evaluation and construction of core collections [12][13][14]. There are already some core collections of Capsicum using phenotypic data [15], genotypic data [16,17] as well as combined phenotypic and genotypic data [18][19][20]. Core collections for disease resistance against northern root-knot nematode and Potato virus Y (PVY) have been constructed [21,22]. Hanson et al. [23] developed a Capsicum core collection to analyse the antioxidant (AO) content and antioxidant activity (AOA) in accessions from AVRDC-the World Vegetable Center.
The cultivated Capsicum species capture a broad diversity generated by evolution and natural selection, as well as domestication in different primary and secondary centres of diversity, and artificial selection in distinct agricultural environments [24,25]. The Balkans, including Bulgaria, are considered as the secondary centre of pepper diversity especially for C. annuum, where local forms (landraces) with diverse phenotypes and qualities have formed due to the specific agro-climatic conditions and breeding traditions [26,27]. Locally maintained and well adapted pepper landraces can still be found in small farms and villagers' yards of Bulgaria and other Balkan countries. They have been maintained for centuries by passing down from generation to generation and preserving most important features including orientation, shape, size, colour and taste of the fruit, productivity and content of valuable bioactive components. Some of them in addition show good tolerance to biotic and abiotic factors and have been used in various breeding programmes to develop nutritionally improved and high yielding cultivars [28,29]. The high biodiversity of pepper represents a unique resource that could be used in future breeding programmes to increase the diversity and identify forms with increased resistance to a number of abiotic and biotic stress factors. Given the new priorities, it is relevant to study the biodiversity of Balkan pepper in the above-described aspects. Its in-depth characterization is of fundamental importance in order to avoid genetic erosion and to ensure maximum use of genetic variability in breeding programmes.
The Capsicum collection was created, developed and maintained as an initial step in the pepper breeding work at the Maritsa Vegetable Crops Research Institute (MVCRI), Plovdiv, Bulgaria many decades ago [28]. It includes over 1500 pepper accessions which were collected in various ways: personal contacts, different expeditions, national and international exchange, etc. In recent years, participation in different national and international projects enabled its enlargement with additional Balkan varieties, local forms (landraces) and breeding lines, most of which have been collected from a large number of locations in six countries on the Balkan Peninsula (Bulgaria, Serbia, North Macedonia, Albania, Romania and Greece) under the SEE-ERA.NET PLUS project ERA 226. An important task of the research is the study and conservation of Balkan Capsicum resources. The large number of accessions in this collection greatly impedes the research process and the choice of genetic material to be included in the breeding programmes. This requires the creation of a core collection that includes a limited number of genetically diverse accessions to represent the genetic diversity within the entire collection. This will ensure a more effective study of available pepper germplasm and the assessment of other important traits like productivity, the content of biologically active substances, minerals and trace elements, resistance to diseases and pests and a number of other biological characteristics that are relevant to the breeding process.
Up to now, part of the Capsicum annuum accessions with diverse phenotypes have been characterized by conventional phenotyping according to various Capsicum descriptors [30,31]; by high-throughput fruit phenotyping using Tomato Analyzer in combination with conventional analysis [32]; by a combination of conventional and high-throughput phenotypic, biochemical and virus resistance analyses [33]; and by testing for fungal [34] and some important pest infestations [35,36].
However, a large part of this subset of accessions has never been analysed at the DNA level. To date only a small number of Bulgarian pungent small fruited red pepper landraces have been analysed by simple sequence repeat (SSR; microsatellite) markers [37] and some commercial pepper varieties were recently analysed with inter simple sequence repeat (ISSR) markers [29]. Therefore, more precise analysis using codominant microsatellite (SSR) markers is necessary to determine the genetic diversity and infer the population structure of Balkan peppers. This is a very important prerequisite for the next step aimed at development of a core collection with application in future breeding programmes and association mapping of important nutritional and environmental adaptation traits. Owing to the above considerations and the importance of more detailed pepper germplasm characterization, the aim of the present study was to evaluate the genetic diversity and to determine the genetic structure of 179 Balkan C. annuum accessions using 21 SSR markers and to construct a mini core collection on the basis of these data.

Plant material
The plant material consisted of 179 diverse Balkan pepper accessions, including local forms/landraces, varieties and 4 breeding lines (Capsicum annuum L.). The passport data for all 179 Balkan pepper accessions were described in previous studies [32,33]. They were maintained and phenotypically evaluated in Maritsa Vegetable Crops Research Institute (MVCRI), Plovdiv located in South-Central Bulgaria (42°10′35.3″N 24°45′50.5″E). Accessions were collected from Bulgaria, Serbia, North Macedonia, Albania, Romania and Greece. However, over 63% of them originated from Bulgaria. Some accessions collected from different regions of Bulgaria, Serbia and North Macedonia are known locally by the same name but appear to differ in their morphological characteristics and have been labelled differently (Supplemental Table S1). In addition, accessions showing segregation for some of the analysed phenotype traits were separated as distinct biotypes that were labelled with capital letters (A and B) in order to be genetically analysed and distinguished. Each accession was represented by 30 plants in three replications (10 plants/replication) in field trials at MVCRI in Plovdiv. The tested pepper accessions were divided into seven varietal groups according to their fruit shape, with conical, elongate, pumpkin shape, bell or blocky, conical to blocky, elongate to bell or blocky and round fruit types [33].

DNA extraction
All DNA samples were extracted from freeze-dried leaf tissue (10-15 mg) of field-grown plants (10 individual plants/genotype) using a DNeasy PowerPlant ProHTP 96 Kit (Qiagen). The concentration and the purity of DNA samples were determined with a NanoDrop spectrophotometer (Thermo Fisher Scientific, Wilmington, DE, USA).

Primer synthesis
Published primers for pepper SSRs [38][39][40], http://solgenomics.net (Supplemental Table S2), referred to as locus-specific primers (LSPs), were extended with the generic non-complementary nucleotide sequences tagF 5′-ACGACGTTGTAAAA-3′ and tagR 5′-CATTAAGTTCCCATTA-3′, respectively, at their 5′ ends as described in Hayden et al. [41]. Primer aliquots (50 pmol/µL) were prepared by mixing equimolar amounts of forward and reverse primers in Milli-Q H2O and were referred to as stock primer sets for each locus. In addition, two generic primers complementary to the LSP extension sequences, tagF' 5′-ACGACGTTGTAAAA-3′ and tagR' 5′-CATTAAGTTCCCATTA-3′, were also synthesized. The tagF' primer was labelled at its 5′ end with one of the following fluorescent dyes: FAM, ATTO565 (PET) or ATTO550 (NED) to allow direct detection of alleles on the automated capillary sequencer (ABI3130, Thermo Fisher Scientific, Wilmington, DE, USA). All primers were synthesized by Microsynth (Microsynth AG, Balgach, Switzerland).
PCR was performed on a Veriti 96 Thermal Cycler (Thermo Fisher Scientific, Wilmington, DE, USA) using the following PCR conditions, depending on the melting temperature of the locus-specific primers. These were performed according to [41][42][43] with some minor modifications: 1. The PCR programme 50°C included a denaturing step at 95°C for 3 min, followed by 7 cycles of amplification, each including a denaturing step at 92°C for 30 s, an annealing step at 50°C for 1.30 min, and a synthesis step at 72°C for 1 min;

SSR analysis
Electrophoresis and visualization of SSR alleles were performed on an ABI3130 DNA analyser. A standardized multi-pooling procedure was used to prepare SSR products for electrophoresis. After PCR, a 3-fold initial dilution of the PCR products and subsequent mixing (1:1:0.5), respectively, FAM: PET: NED, was performed up to a final dilution of 1/25× to 1/75×. Seven mixed pools were developed according to the length of the PCR products (CAMS142_FAM, CAMS398_VIC and EPMS331_NED; HPMS2-24_FAM, GPMS29_PET and HPMS1-5_NED; HPMS1-6_FAM, CAMS405_PET and CAMS811_NED; CAMS153_FAM, EPMS376_PET and EPMS397_NED; HPMS2-13_FAM, EPMS335_PET and CAMS606_NED; HPMS1-143_FAM, CAMS234_PET and CAMS199_NED; CAMS864_FAM, EPMS418_PET and HPMS1-1_NED). In cases where the intensity of the amplified products was strong, the mixing ratio of the labelled products was changed. Subsequently, the diluted products were mixed with the labelled internal standard GeneScan™ 500 LIZ™ dye Size Standard (Thermo Fisher Scientific, Wilmington, DE, USA) and formamide, denatured and electrophoresed on an ABI3130 DNA analyser. SSR allele sizing was performed with the GeneMapper v4 software (Thermo Fisher Scientific, Wilmington, DE, USA).

Data analysis
Allele number, frequency of alleles, observed heterozygosity (Ho), expected heterozygosity (He) and Polymorphic Information Content (PIC) were calculated with PowerMarker v3.25 software [44]. The chord distance matrix [45] was used to construct a phylogenetic tree with the unweighted pair group method with arithmetic mean (UPGMA) module of PowerMarker v3.25. The resulting tree was visualized and annotated by using the Evolview v3 webserver [46]. The clades in the phylogenetic tree were visually identified as monophyletic groups that could be separated from the main tree with a single cut according to the clade definition provided in [47].
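The per-locus statistics named above can be illustrated with a minimal Python sketch. It uses the plain textbook definitions (He = 1 - Σp², Ho as the share of heterozygous genotypes, and the Botstein-type PIC); the function name `locus_stats` and the toy genotypes are hypothetical, and the estimators implemented in PowerMarker may apply corrections that are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def locus_stats(genotypes):
    """Return (allele frequencies, He, Ho, PIC) for one diploid SSR locus.

    genotypes: list of (allele_a, allele_b) tuples, one per accession.
    Uncorrected textbook formulas; an illustration only.
    """
    alleles = [a for pair in genotypes for a in pair]
    n = len(alleles)
    freqs = {a: c / n for a, c in Counter(alleles).items()}
    # Expected heterozygosity (gene diversity): 1 - sum of squared frequencies
    he = 1.0 - sum(p * p for p in freqs.values())
    # Observed heterozygosity: share of accessions carrying two different alleles
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    # PIC: He minus the term 2 * p_i^2 * p_j^2 summed over all allele pairs
    pic = he - sum(2 * p * p * q * q for p, q in combinations(freqs.values(), 2))
    return freqs, he, ho, pic


# Hypothetical toy locus with two alleles (sizes in bp) in four accessions
toy = [(100, 100), (100, 104), (104, 104), (100, 104)]
freqs, he, ho, pic = locus_stats(toy)
```

For this toy locus both alleles have a frequency of 0.5, so He and Ho are both 0.5 and PIC is 0.375.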

Genetic structure analysis
A model-based population structure analysis was performed in STRUCTURE 2.3 [48] using an admixture model with correlated allele frequencies. The tested number of possible K was set from 1 to 15 with 15 runs for each K. Each run had a burn-in period of 100,000 and 1,000,000 MCMC iterations. The most probable value of K was determined using the ΔK [49] implemented in Structure Harvester v. 6.93 [50]. The genotypes were assigned to a respective group if showing membership coefficient ≥70%. The graphical representation of the cluster analysis was done using the pophelper package in R version 4.0.4 [51]. The degree of differentiation between the resulting genetic clusters was evaluated using pairwise Jost's D [52] and analysis of molecular variance (AMOVA) in R version 4.0.4 [51].
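As an illustration of the ΔK criterion, the sketch below computes ΔK from replicate log-likelihood values as the absolute second-order difference of the mean ln P(D) divided by the standard deviation at each K. The function name `delta_k` and the toy likelihoods are hypothetical, and the exact handling of replicates in Structure Harvester is assumed rather than reproduced.

```python
from statistics import mean, stdev


def delta_k(lnp_by_k):
    """Compute a delta-K value for every K that has both neighbours.

    lnp_by_k: dict mapping K to a list of ln P(D) values from replicate runs.
    Assumption: delta-K = |mean L(K+1) - 2 * mean L(K) + mean L(K-1)| / sd(L(K)).
    """
    means = {k: mean(v) for k, v in lnp_by_k.items()}
    result = {}
    for k in sorted(lnp_by_k):
        if k - 1 in means and k + 1 in means:
            sd = stdev(lnp_by_k[k])
            if sd > 0:  # delta-K is undefined when replicates do not vary
                second_diff = means[k + 1] - 2 * means[k] + means[k - 1]
                result[k] = abs(second_diff) / sd
    return result


# Hypothetical toy likelihoods: a sharp change of slope at K = 3
toy_runs = {
    1: [-1000.0, -1002.0],
    2: [-900.0, -902.0],
    3: [-800.0, -801.0],
    4: [-790.0, -794.0],
    5: [-785.0, -789.0],
}
dk = delta_k(toy_runs)
best_k = max(dk, key=dk.get)
```

The first and last tested K never receive a ΔK value, because the second-order difference needs both neighbours.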
In order to analyse the presence of patterns manifested by quantitative phenotypic traits in the groups established by the clustering, we performed a principal component analysis (PCA). The results from the PCA were visualized using colour codes for the structure clusters and geometric form codes for the fruit shape. For the analysis we used data for plant height, stem height, fruit length, fruit weight, fruit width and fruit wall thickness. The phenotype data used for the PCA are published in [33]. The analysis was done and visualised in R version 4.0.4 [51].
The SSR marker data were used for construction of a mini core collection with the Core Hunter 3 package in R [53]. For this purpose, we tested two optimization strategies. The first one aimed to minimize the distance between each accession and the nearest entry from the core collection (A-NE). The second was a simultaneous optimisation of A-NE and maximization of allele richness, expressed as the percentage of alleles from the whole collection retained in the core collection. Both approaches were done with the Cavalli-Sforza distance matrix [45]. The proportion of retained alleles was used as a direct measure for the preserved genetic diversity by the core collections. We also compared this metric of the collections to the mean of 1000 randomly selected samples of respective sizes.
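The two evaluation criteria can be sketched in a few lines of Python. This is not the Core Hunter 3 implementation: the function names, the toy distance matrix and the allele labels are hypothetical, and it is assumed here that A-NE is averaged over all accessions, with core entries contributing a distance of zero to themselves.

```python
import random


def a_ne(dist, core):
    """Mean distance from every accession to its nearest core entry."""
    n = len(dist)
    return sum(min(dist[i][j] for j in core) for i in range(n)) / n


def allele_coverage(alleles, core):
    """Fraction of all observed alleles carried by at least one core entry."""
    all_alleles = set().union(*alleles)
    kept = set().union(*(alleles[j] for j in core))
    return len(kept) / len(all_alleles)


def random_coverage(alleles, size, n_draws=1000, seed=0):
    """Mean allele coverage of randomly drawn cores of the given size."""
    rng = random.Random(seed)
    idx = range(len(alleles))
    draws = (allele_coverage(alleles, rng.sample(idx, size)) for _ in range(n_draws))
    return sum(draws) / n_draws


# Hypothetical toy data: 4 accessions, symmetric distances, alleles as "locus:size"
dist = [
    [0.0, 0.2, 0.6, 0.8],
    [0.2, 0.0, 0.5, 0.7],
    [0.6, 0.5, 0.0, 0.3],
    [0.8, 0.7, 0.3, 0.0],
]
alleles = [
    {"L1:100", "L2:200"},
    {"L1:100", "L2:204"},
    {"L1:104", "L2:200"},
    {"L1:108", "L2:204"},
]
core = [0, 3]
ane_value = a_ne(dist, core)
coverage = allele_coverage(alleles, core)
baseline = random_coverage(alleles, size=2)
```

In this toy example the cores [0, 2] and [0, 3] have the same A-NE, but [0, 3] retains more alleles, which is the kind of trade-off the combined optimisation is meant to resolve.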

Genetic diversity of Balkan Capsicum annuum accessions using SSR markers
To explore the genetic diversity and population structure, we investigated the patterns of molecular diversity with 21 SSR markers in the Balkan C. annuum germplasm collection consisting of 179 accessions, phenotypically evaluated using various Capsicum descriptors.
The markers were selected to cover 8 of the 12 chromosomes of pepper with a minimum of 1 marker per chromosome. In this study, standardized PCR conditions at 4 different temperatures of annealing (50, 61, 62 and 63 °C) were optimized (Supplemental Table S2), which enabled amplification of markers in multiplexed PCR reactions. Standardized PCR conditions were achieved by performing an initial optimization step with five different LSP primer concentrations (30, 50, 60, 80 and 100 nmol/L). The adjusted LSP primer concentration enabled the PCR specificity and yield to be controlled and helped to prevent non-specific annealing of LSP during the first few PCR cycles [41]. As previously observed [42,43], this optimization step was a prerequisite for correct and efficient amplification of microsatellite alleles in all tested genotypes and eliminated the need to use complex touchdown PCR or to adjust the annealing temperatures for each locus. The standardized procedure is cost-effective because it enables multiplexing of several loci in one PCR reaction.
Just a few accessions displayed indistinguishable multilocus genotypes at all 21 analysed SSR loci. These genetically redundant accessions (CAPS-021, CAPS-133 and CAPS-143) were removed from the subsequent analyses, leading to a final panel of 176 genotypes.
The diversity pattern of the 21 SSR loci across the whole set of accessions revealed a total of 146 distinct alleles ranging from 2 in locus HPMS1-6 to 18 in locus CAMS864, with a mean of 6.95 alleles per locus (Table 1). Several genotypes showed null alleles in 7 out of the 21 loci tested. Out of the 146 detected SSR alleles, 63 were 'common alleles' (frequency of >5% in the analysed set), 43 were 'less common alleles' (frequencies between 1% and 5%) and 40 were very rare alleles (frequency of <1% in the analysed set). The fact that about half of the alleles had frequencies of less than 5% is evidence for the presence of a broad level of genetic diversity in the studied set of 176 C. annuum genotypes. The mean PIC for all 21 SSR markers was 0.483, with values ranging from 0.022 for marker HPMS2-13 on chromosome P1 to 0.873 for marker CAMS864 on chromosome 7.
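The frequency classes used above can be written as a small Python helper, together with a consistency check of the reported counts. The function name `allele_class` and the example frequencies are hypothetical, and the treatment of alleles lying exactly on the 1% and 5% boundaries is an assumption.

```python
from collections import Counter


def allele_class(freq):
    """Classify an allele by its frequency in the analysed set.

    Assumption: alleles at exactly 1% or 5% count as 'less common'.
    """
    if freq > 0.05:
        return "common"
    if freq >= 0.01:
        return "less common"
    return "very rare"


# Hypothetical frequencies of five alleles
example = [0.42, 0.051, 0.03, 0.012, 0.003]
counts = Counter(allele_class(f) for f in example)

# Consistency check of the counts reported in the text
reported = {"common": 63, "less common": 43, "very rare": 40}
total_alleles = sum(reported.values())
mean_per_locus = round(total_alleles / 21, 2)
below_5_percent = reported["less common"] + reported["very rare"]
```

The reported classes sum to 146 alleles, or 6.95 per locus, and 83 of them (about 57%) fall below the 5% threshold.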
The genetic diversity index (He or GD) ranged from 0.022 (HPMS2-13) to 0.879 (CAMS864) with a mean of 0.531. The average observed heterozygosity (Ho) across all 21 loci was 0.249, with the highest values of 0.994 in loci HPMS1-1 and CAMS234, and 0.972 in locus CAMS153. Absence of heterozygous accessions was observed only in the locus HPMS2-13, which showed the lowest level of He.

Genetic structure of the Balkan Capsicum annuum collection
The SSR genotyping results were used to perform population structure analysis for 176 accessions under an admixture model using the STRUCTURE software [48]. The optimal number of clusters by the Evanno method [49] was defined as K = 3 (Figure 1(A)). Two additional peaks were detected in the Evanno graph, one at K = 4 and a lower one at K = 11. A genotype was considered a member of a particular cluster if the probability for the membership was at least 70%. For K = 3 (Figure 1(B) Figure 1(C)). This cluster was very diverse in terms of fruit tastes, as it included accessions with non-pungent fruits and ones showing different levels of pungency. The majority of genotypes (almost 63%), however, had pungent fruit taste. The accessions that were not assigned to any of the three clusters formed a group of 24 Bulgarian, 3 Macedonian, 2 Romanian, 6 Serbian accessions and two accessions of unknown origin (CAPS-151 and CAPS-151B). The group consisted of genotypes with various fruit shapes and sizes. Fifty-nine percent of the genotypes in this admixed group had conical fruits, 27% elongate, about 11% pumpkin and one genotype (CAPS-136) bell or blocky shaped fruits. Most of them were non-pungent but 12 (34.3%) showed different levels of pungency. The presence of secondary and tertiary smaller peaks at increasing Ks pointed to underlying sub-clustering within the 3 main clusters. In order to further explore this hypothesis, we also examined the population structure at K = 4 and K = 11. Clear separation of Cluster 3 into two subclusters, Cluster 3.1 and Cluster 3.2, was evident at K = 4 (Figures 1(B) and 2). Cluster 3.1 consisted of 11 accessions having relatively smaller fruits in comparison to Cluster 3.2, which comprised 41 genotypes with predominantly elongate and conical fruits. Additional sub-clustering in the other main clusters was also observed at K = 11 (Figure 2).
Figure 2. Phylogenetic tree of the Balkan C. annuum germplasm collection. The tree was produced using the UPGMA method based on chord distance [45] computed from the allele frequencies at 21 SSR loci. The tree is drawn and annotated using Evolview v3 [46] with the following additional data, from right to left: barplots of the group membership from the model-based genetic structure analysis at K = 3, K = 4 and K = 11; fruit shapes of the accessions annotated with text and different colours; cl-1 to cl-17B represent the main clades in the tree. Different fruit shapes are additionally indicated by different shapes and colours of the tree leaves.
The two main clusters, Cluster 1 and Cluster 2, which remained intact at K = 4, were also each split into three distinct subclusters at K = 11. The two sub-clusters at K = 4 derived from the most diverse main Cluster 3 were further subdivided at K = 11. Cluster 3.1 was separated into distinct subclusters, Cluster 3.1.1 and a group of admixed accessions with more than 50% membership to Cluster 3.1.2. The larger sub-cluster (3.2) was also split into two distinct sub-clusters (3.2.1 and 3.2.2) and a group of admixed genotypes sharing between 20 and 30% membership with both Cluster 3.2.2 and Cluster 3.2.3 (Figure 2). Notably, Cluster 3.2.3 did not form a distinct group at K = 11 but only contributed to admixed accessions.
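The 70% membership rule used to separate cluster members from admixed accessions can be sketched as follows. The function name `assign_cluster` and the membership vectors are hypothetical; clusters are numbered from 1 and `None` stands for an admixed accession.

```python
def assign_cluster(q, threshold=0.70):
    """Return the 1-based cluster whose membership coefficient reaches the threshold.

    q: list of membership coefficients for one accession (summing to about 1).
    Returns None when no coefficient reaches the threshold (admixed accession).
    """
    best = max(range(len(q)), key=lambda k: q[k])
    return best + 1 if q[best] >= threshold else None


# Hypothetical membership vectors at K = 3
examples = {
    "accession_a": [0.85, 0.10, 0.05],  # clear member of Cluster 1
    "accession_b": [0.50, 0.45, 0.05],  # admixed
    "accession_c": [0.10, 0.20, 0.70],  # exactly at the threshold: Cluster 3
}
assignments = {name: assign_cluster(q) for name, q in examples.items()}
```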

Genetic structure and quantitative phenotypes
To analyse the distribution of the genotypes in the clusters inferred by the model-based approach, a biplot analysis of several quantitative traits was performed. Fruit width and fruit wall thickness contributed most to the first composite axis, while stem height, plant height and fruit length contributed to the second axis. The biplot graph (Supplemental Figure S1) clearly illustrated the characteristics of the genotypes grouped together, showing three consecutive zones, each dominated by members of one of the inferred genetic clusters. The majority of the accessions from Cluster 1 were situated at the right end, mainly in quadrant 4, but some of them fell into quadrant 1. They had wide fruits of average or below-average length, a thick fruit wall and a low fruit length-to-width ratio, which in most of the genotypes was close to 1 or lower. The accessions from Cluster 2 were situated to the left of the genotypes from Cluster 1, in both quadrants 1 and 4. This group had a greater length-to-width ratio and a lower average fruit wall thickness compared to the genotypes from Cluster 1, but still higher than the whole collection mean. The genotypes from Cluster 3, a highly diverse group with small round or elongate fruits, had three features in common: small fruit width, low fruit weight and a thin fruit wall. The genotypes with elongate fruits also had low fruit length-to-width ratio. The members of the two subclusters at K = 4 (Cluster 3.1 and Cluster 3.2, Figure 2) were interspersed in quadrants 2 and 3 of the biplot, with no obvious separation between them. However, the accessions from Cluster 3.1 had a fruit length-to-width ratio less than half that of Cluster 3.2, and a much lower fruit weight. The admixed accessions were scattered across all four quadrants.

Genetic structure and genetic diversity
The analysis of genetic diversity in the clusters derived using the model-based approach at K = 3 revealed that Cluster 3 was the most diverse group (Table 2). It had the highest number of alleles, allele richness and gene diversity, followed by the group of admixed accessions, Cluster 1 and Cluster 2. All clusters had numerous private alleles: 21 in Cluster 3, 11 in both Cluster 1 and the admixed group, and 6 in Cluster 2. Most of these private alleles had very low frequencies and just eight of them exceeded 5% frequency in the respective cluster: four in Cluster 2 and four in Cluster 3 (Supplemental Table S3). To analyse the level of genetic differentiation between the clusters, we evaluated Jost's D index and AMOVA. These analyses were done using only the accessions with more than 70% membership to one of the three main clusters, excluding the admixed genotypes. The observed values of D index ranged between 0.15, for Clusters 1 and 2, and 0.186, between Clusters 1 and 3 (Supplemental Table S4). The results from the AMOVA showed that 15.4% of the genetic variance was partitioned among the populations, 41.1% among the samples within populations and 43.5% within the samples (Supplemental Table S5). When considering K = 4, the differentiation between the clusters was more pronounced. We estimated higher values for D, ranging between 0.15 for the cluster consisting predominantly of bell or blocky fruits and the one with conical ones (Clusters 1 and 2), and 0.29 between the new subcluster 3.1 and the one with predominantly bell and blocky fruits. According to the AMOVA results, the differentiation between the clusters also increased at K = 4, as the percentage of the between-cluster variation was estimated to be 17.6% (Supplemental Table S5). It is worth noting that the smallest group, Cluster 3.1, had 72 alleles.
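For orientation, Jost's D for a pair of clusters at a single locus can be sketched as below, using D = ((HT - HS) / (1 - HS)) × n / (n - 1) with n = 2 populations. This is the uncorrected form applied to hypothetical allele frequencies; the function names are illustrative, and the estimators used in R packages typically add sample-size corrections and average over loci, which is not reproduced here.

```python
def heterozygosity(freqs):
    """Expected heterozygosity 1 - sum(p^2) for a dict of allele frequencies."""
    return 1.0 - sum(p * p for p in freqs.values())


def josts_d(pop_a, pop_b):
    """Uncorrected single-locus Jost's D between two populations.

    pop_a, pop_b: dicts mapping allele -> frequency (each summing to 1).
    """
    n_pops = 2
    h_s = (heterozygosity(pop_a) + heterozygosity(pop_b)) / n_pops
    # Pooled frequencies, weighting both populations equally
    pooled = {a: (pop_a.get(a, 0.0) + pop_b.get(a, 0.0)) / n_pops
              for a in set(pop_a) | set(pop_b)}
    h_t = heterozygosity(pooled)
    return ((h_t - h_s) / (1.0 - h_s)) * n_pops / (n_pops - 1)


# Hypothetical frequency profiles
d_identical = josts_d({"A": 0.5, "B": 0.5}, {"A": 0.5, "B": 0.5})
d_partial = josts_d({"A": 0.5, "B": 0.5}, {"A": 0.5, "C": 0.5})
d_disjoint = josts_d({"A": 1.0}, {"B": 1.0})
```

D runs from 0 for identical allele frequencies to 1 when the two populations share no alleles, so the reported values of 0.15 to 0.29 correspond to moderate differentiation on this scale.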

Phylogenetic analysis
A phylogenetic tree based on chord genetic pairwise distances [45] was constructed using the UPGMA procedure. The tree grouped the accessions into 17 Clades including two subclades, Clade 17a and Clade 17b. It shows generally good congruence with the three main clusters inferred by the model-based approach at K = 3 and the substructures at K = 4 and K = 11 (Figure 2), providing finer resolution of the population structure. This is best visible in Clade 17, where most of the conical fruit accessions belonging to the main Cluster 2 of the model-based clustering were grouped into two large subclades. These were further separated into smaller subclades, grouping together accessions belonging to substructure clusters at K = 11. Some of the admixed accessions that cannot be associated with a single substructure cluster were also grouped in distinct subclades in the phylogenetic tree. It is worth noting that four accessions (CAPS-009, CAPS-049, CAPS-114 and CAPS-018) were separated into single accession clades (cl.-3, cl.-7, cl.-8 and cl.-15, Figure 2).
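The pairwise distance underlying such a tree can be illustrated with one common form of the Cavalli-Sforza and Edwards chord distance, averaged over loci: D = (2 / (π m)) Σ sqrt(2 (1 - Σ sqrt(p q))). The sketch below assumes this form and treats each accession as a per-locus allele frequency profile (1.0 for a homozygote, 0.5 and 0.5 for a heterozygote); whether PowerMarker uses exactly this scaling is an assumption, and the function name and toy genotypes are hypothetical.

```python
import math


def chord_distance(p, q):
    """Chord distance between two multilocus allele frequency profiles.

    p, q: lists with one dict per locus, mapping allele -> frequency.
    """
    total = 0.0
    for p_locus, q_locus in zip(p, q):
        # Sum of sqrt(p_i * q_i) over alleles present in both profiles
        shared = sum(math.sqrt(p_locus[a] * q_locus[a])
                     for a in p_locus if a in q_locus)
        # Guard against tiny negative values caused by rounding
        total += math.sqrt(2.0 * max(0.0, 1.0 - shared))
    return 2.0 * total / (math.pi * len(p))


# Hypothetical single-locus genotypes (allele sizes in bp)
homozygote_100 = [{100: 1.0}]
heterozygote = [{100: 0.5, 104: 0.5}]
homozygote_108 = [{108: 1.0}]

d_same = chord_distance(heterozygote, heterozygote)
d_partial = chord_distance(homozygote_100, heterozygote)
d_max = chord_distance(homozygote_100, homozygote_108)
```

A matrix of such pairwise values is the input that an average-linkage (UPGMA) clustering routine would take.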

Construction of a mini core collection
In order to construct a 'multi-purpose' or CC-I type of core collection, according to the classification described by Odong et al. (2013) [11], we constructed two series of core collections. Each of the series consisted of six core collections with an increasing number of individuals from 5% to 30% of the whole collection with increments of five percent. The series differed by the optimization method used for the selection of the entries in the core collections. One of them was constructed using minimization of accession-to-nearest-entry distance (A-NE), and the other was optimized by combining the A-NE with the maximization of allele coverage (CV). The metrics from both approaches were compared to the means of 1000 randomly selected groups of the same sizes. The selection resulted in six mini core collections consisting of 9, 18, 26, 35, 44 and 53 individuals. The genetic diversity was preserved at significantly higher levels in the collections constructed combining A-NE and allele richness as construction criteria. The approach using just A-NE minimisation as an optimization method resulted in selection of collections which retained much lower proportions of the alleles discovered in the whole collection; their numbers were commensurable with the means of the randomly selected collections of the same sizes (Figure 3(A)). On the other hand, A-NE as a sole optimization criterion provided a better representativeness of the accessions from the whole panel, as manifested by the lower average A-NE distances (Figure 3(B)).
The analysis showed that the method based on the combination of the two criteria was superior for preserving allele richness while keeping the representativeness at an acceptable level. Therefore, we chose the A-NE/CV method for final sampling of the core collections. By using the combined method, samples of 9, 18, 26, 35, 44 and 53 individuals captured 62, 75, 86, 91, 95 and 99% of the alleles from the whole collection (Table 3). The final mini core collection of 44 accessions captured nearly 95% of the SSR alleles, including all common alleles, all less common alleles and 32 out of 40 very rare alleles (Supplemental Figure S2(A)). Out of the 44 entries, 9 were from Cluster 1, 8 from Cluster 2, 18 from Cluster 3 and 9 from the admixed accessions (Table 3, Supplemental Table S6) and they were relatively evenly distributed among different clades of the phylogenetic tree (Supplemental Figure S2(B)).
Table 3 note: For each core collection, sample size, mean number of alleles per SSR locus (allele richness), gene diversity (He), observed heterozygosity (Ho), polymorphic information content (PIC), the % of retained SSR alleles (alleles %) and the distribution of selected entries among the clusters inferred from the model-based analysis are indicated.
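The core sizes and the allele capture of the final mini core collection follow directly from the numbers reported in the text, as the short Python check below shows. Rounding to the nearest integer is an assumption that happens to reproduce the reported sizes; the variable and function names are illustrative.

```python
N_ACCESSIONS = 176  # non-redundant accessions
N_ALLELES = 146     # SSR alleles detected in the whole collection


def core_sizes(n_accessions, percents):
    """Number of entries for each sampling percentage, rounded to the nearest integer."""
    return [round(n_accessions * p / 100) for p in percents]


# Sampling fractions from 5% to 30% in steps of 5%
sizes = core_sizes(N_ACCESSIONS, range(5, 35, 5))

# Alleles retained by the 44-entry core: all common, all less common, 32 very rare
retained = 63 + 43 + 32
retained_percent = 100 * retained / N_ALLELES
```

The six fractions give 9, 18, 26, 35, 44 and 53 entries, and the 138 retained alleles correspond to about 94.5% of the 146 alleles, in line with the "nearly 95%" stated above.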

Discussion
Genetic resources are an important tool for overcoming the current challenges posed by climate change and the need to provide food security for the growing human population. However, the large size of germplasm collections is a serious drawback for their proper use in conservation and management practices. Establishment of a core collection is a primary objective of many gene banks around the world because it reduces the cost and effort of conservation. One of the most important issues is to develop a subset of accessions in which the gene diversity of the whole collection is preserved.
The Balkan Capsicum collection analysed in this study is part of a large collection developed through many years of exchange of genetic material among gene banks, breeding efforts and expeditions in Bulgaria and neighbouring countries on the Balkan Peninsula. In this study, 179 accessions of C. annuum L., most of which represent local forms adapted to the specific agro-climatic conditions and selected through many years of traditional cultivation on small private farms on the Balkan Peninsula, together with cultivars and breeding lines, were subjected to molecular and phenotypic evaluation as a step towards the development of a mini core collection. The number of accessions in our study is much smaller than those reported in other studies. However, it includes only C. annuum genotypes. For example, Gu et al. [17], Nicolaï et al. [18], Lee et al. [19] and Carvalho et al. [20] used a higher number of accessions collected from wider geographic areas and included different Capsicum species such as C. annuum L., C. frutescens L., C. chinense Jacq., C. eximium, C. praetermissum, C. baccatum L., C. pubescens, C. cardenasii, C. galapagoense and C. tovarii. Although the genus Capsicum has a broad genetic base among species [18], that of C. annuum is much narrower. Searching for C. annuum accessions with high levels of genetic and phenotypic diversity is a prerequisite for the development of core collections with broad applicability. The studied collection consists of a large number of C. annuum landraces, which constitute about 2/3 of the accessions and represent a valuable reservoir for improvement of the present pepper germplasm.

Genetic diversity in the studied C. annuum collection
To assess the genetic diversity and structure of the selected accessions, we used 21 SSR markers distributed on 8 pepper chromosomes. Most markers showed high PIC values, with the highest for CAMS864 and CAMS606 on chromosome 7 and CAMS234 on chromosome 6, which warrants their further use in pepper diversity studies in Bulgaria. The overall genetic diversity (He) of the studied C. annuum accessions is 0.531, with a mean number of alleles (MNA) of 6.952. These values are higher than the ones reported for 3338 C. annuum accessions by Lee et al. [19] using SNP markers (He = 0.44), for 222 accessions of C. annuum genotyped at 32950 SNP loci by Taranto et al. [54] (He = 0.048) and for 1904 Capsicum spp. accessions by Gu et al. [17] using 29 SSR markers (He = 0.486), but lower than those identified by Nicolaï et al. [18] for 908 C. annuum accessions with 28 SSR markers (He = 0.59, MNA = 12.57) and by Zhang et al. [55] for 372 accessions, 369 of which were C. annuum, using 28 SSR markers (He = 0.63, MNA = 13.79).
The mean Ho value (0.249) in our study is higher than those observed by Nicolaï et al. [18], Lee et al. [19], Taranto et al. [54] and Gu et al. [17] (<0.085, 0.12, 0.023 and 0.119, respectively). The higher level of Ho observed in the Bulgarian C. annuum collection is due to the higher natural outcrossing rate of pepper, especially in landraces, which constitute about 2/3 of the studied C. annuum collection. The lower Ho values observed by other authors could be explained by the maintenance and multiplication of accessions through selfing [18] or by a long-term inbreeding process including artificial selection, non-random mating between individuals, population structure and size, as well as the Wahlund effect (mixing of individuals from different genetic sources) [56,57].
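For readers less familiar with the diversity statistics compared above, the following Python sketch shows how He, Ho and PIC are commonly computed for a single codominant locus. It uses the standard textbook formulas (expected heterozygosity without small-sample correction, and PIC in the form of Botstein et al.), which may differ in detail from the software used in this study; the genotypes are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple

Genotype = Tuple[str, str]  # two alleles of a diploid individual at one locus


def locus_stats(genotypes: List[Genotype]) -> Dict[str, float]:
    """Return He, Ho and PIC for one locus from diploid genotypes."""
    counts = Counter(allele for g in genotypes for allele in g)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]

    # Expected heterozygosity (gene diversity): 1 - sum(p_i^2).
    he = 1.0 - sum(p * p for p in freqs)
    # Observed heterozygosity: share of heterozygous individuals.
    ho = sum(1 for a, b in genotypes if a != b) / len(genotypes)
    # PIC: He minus the sum over allele pairs of 2 * p_i^2 * p_j^2.
    pic = he - sum(2 * p * p * q * q for p, q in combinations(freqs, 2))
    return {"He": he, "Ho": ho, "PIC": pic}


# Hypothetical genotypes at one SSR locus with two alleles, A and B.
toy = [("A", "A"), ("A", "A"), ("B", "B"), ("B", "B"), ("A", "B")]
stats = locus_stats(toy)
print(stats)
```

In the toy data both alleles have a frequency of 0.5, so He is 0.5, while only one of five individuals is heterozygous (Ho = 0.2). A gap of this kind between He and Ho is what is expected in a predominantly selfing crop.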

Genetic structure of the studied C. annuum collection
The model-based analysis of the genetic structure of the collection established that the most probable number of clusters given the marker data is three. The genotypes were combined in the resulting groups irrespective of the country of origin. Although the clusters were dominated by accessions of one or two fruit shape types, there were no clusters consisting entirely of a particular fruit type. Further analysis of the phenotypic diversity within and across clusters, using several quantitative traits, revealed that there are phenotypic features common to the genotypes that grouped together. The quantitative traits of the fruits were the most pronounced, as the patterns of diversity were mainly determined by the thickness of the pericarp and the width and weight of the fruits. This was very well demonstrated by the biplot chart (Supplemental Figure S1), where the genotypes from the different clusters prevailed in certain areas depending on the values of these traits. The number of detected clusters in our study is in good agreement with those reported by Oh et al. [37] in a collection of 61 Bulgarian, mostly pungent, small-fruited red pepper landraces, and by other authors studying the genetic diversity in Capsicum annuum collections of larger size [55] and wider geographic origin [18]. It should be noted that, in contrast to our results, Oh et al. [37] found no relationship between the STRUCTURE-derived clusters and the morphological traits of the fruits. The lack of such a relationship is probably due to the similarity in fruit shape and size of most of the accessions they studied. With regard to the relationship between the distribution of genotypes by clusters and the fruit morphology, Zhang et al. [55] and Nicolaï et al. [18] observed very similar patterns of clustering. Several authors reported genetic structures in C. annuum collections consisting of different numbers of underlying clusters, ranging from 2 [25,58] to 6 or 8 [59]. However, González-Pérez et al. [25] found that the division into two clusters was mainly geographical, and when the analysis was confined to the local Spanish accessions they also established the presence of three clusters in which the genotypes were distributed according to fruit and plant traits.
We also observed additional peaks of ΔK at K = 4 and 11. In our model-based analysis, when K was set to 4, Cluster 3 was divided into two new sub-clusters of accessions with distinct fruit sizes. The division into these sub-clusters led to an increase in the values of the pairwise metric for population differentiation (Jost's D), defining sub-structure Cluster 3.1 as the most divergent group. The AMOVA also showed an increase in the between-cluster variation from 15.36% at K = 3 to 17.64% at K = 4. Although much lower than the between-cluster variation reported by Rivera et al. [59], it is close to the value between wild and domesticated populations in northwestern Mexico reported by Oyama et al. [60]. These results show that the clusters could potentially be subdivided into more genetic sub-clusters, as suggested by the presence of the additional peak of ΔK at K = 11.
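The ΔK statistic referred to above is usually computed from the log-likelihoods of replicate STRUCTURE runs at successive values of K. The sketch below assumes the commonly used form ΔK = |mean L(K+1) - 2 mean L(K) + mean L(K-1)| / sd L(K); implementations differ in whether the absolute value is taken before or after averaging over runs, and the log-likelihood values are hypothetical, not data from this study.

```python
from statistics import mean, stdev
from typing import Dict, List


def delta_k(lnp: Dict[int, List[float]]) -> Dict[int, float]:
    """Delta K for every K that has both K-1 and K+1 available.

    lnp maps K to the ln P(D|K) values of replicate runs at that K.
    """
    result = {}
    for k in sorted(lnp):
        if k - 1 not in lnp or k + 1 not in lnp:
            continue  # the second-order difference needs both neighbours
        second_diff = abs(mean(lnp[k + 1]) - 2 * mean(lnp[k]) + mean(lnp[k - 1]))
        result[k] = second_diff / stdev(lnp[k])
    return result


# Hypothetical ln P(D|K) values from three replicate runs per K.
toy_lnp = {
    1: [-5000.0, -5002.0, -4998.0],
    2: [-4600.0, -4605.0, -4595.0],
    3: [-4300.0, -4302.0, -4298.0],
    4: [-4250.0, -4260.0, -4240.0],
    5: [-4230.0, -4250.0, -4210.0],
}
dk = delta_k(toy_lnp)
best_k = max(dk, key=dk.get)
print(dk, best_k)
```

In the toy data the likelihood rises steeply up to K = 3 and then levels off, so ΔK peaks at K = 3. Secondary peaks at higher K, like those reported here at K = 4 and K = 11, would appear as smaller local maxima in the same output.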

Construction of a Balkan mini core collection
The Balkan Peninsula is an early centre of adaptation of the species from the so-called Mesoamerican food complex in Europe [61]. According to Andrews [61], these species found their way from the markets of the Ottoman Turkish Empire through the Balkans to Central Europe. His hypothesis was supported by the results of Nicolaï et al. [18], who defined the accessions from these parts of Europe as a 'distinct genetic pool'. To facilitate the conservation of the C. annuum genetic diversity represented in the current collection from the Balkans, we aimed at constructing a multipurpose mini core collection. Of the two tested approaches, the one combining minimization of accession-to-nearest-entry distance and maximization of allele richness performed significantly better in encompassing the genetic diversity present in the whole collection, in terms of both the percentage of retained alleles and the number of rare alleles captured in the mini core collection. Several C. annuum core collections have been constructed for different purposes. Nicolaï et al. [18] established a core collection of 332 entries that captured 97% of the genetic and phenotypic diversity for further association studies. Recently, Gu et al. [17] constructed a core collection based on SSR data consisting of 248 entries, which covered 75.6% of the SSR alleles, for further genotyping and gene-mining research. The final mini core collection of 44 entries proposed in the present study was sampled solely on the basis of genetic data, including the allele diversity of 176 C. annuum accessions at 21 SSR loci and the genetic structure inferred from these data. Still, it captured nearly 95% of all alleles and 80% of the very rare alleles while maintaining good representativeness of the accessions, indicating that it could be very useful both in C. annuum improvement by breeding and in germplasm conservation.
However, to construct and optimize the most suitable mini core collections for specific purposes, important phenotypic traits must also be considered. Therefore, further research is required, focused on optimizing and testing core collections for different purposes, based on different combinations of genotypic and phenotypic data and comparing various sampling criteria.

Conclusions
In the present study, a Balkan mini core collection of Capsicum annuum was constructed based on the allele information at 21 SSR loci and the genetic structure inferred from these data. The mini core collection is composed of 44 C. annuum accessions from several geographic locations on the Balkan Peninsula and retains a large proportion of the gene diversity present in the germplasm collection of 176 non-redundant accessions analysed in this study. The proposed mini core collection was also found to have good representativeness of the studied germplasm and therefore could be very useful in both C. annuum improvement by breeding and germplasm conservation. Further studies, including a larger sample of the more than 1500 Capsicum germplasm accessions currently maintained at the Maritsa Vegetable Crops Research Institute, could allow selection of larger diverse core collections that will be suitable for gene mining and genome-wide association studies.

Author contributions
Elena G. Todorovska (EGT) conceptualized and designed the experiment, analysed data, prepared tables, prepared original draft and reviewed drafts, approved the final draft. Nikolai K. Christov (NKC) performed experiments, analysed data, prepared figures and tables, prepared and reviewed drafts of the paper, and approved the final draft. Stefan Tsonev (ST) performed experiments, analysed data, prepared figures and tables, prepared and reviewed drafts, and approved the final draft. Velichka Todorova (VT) maintained, evaluated phenotypically, described and contributed the material, reviewed drafts of the paper, and approved the final draft.

Disclosure statement
No potential conflict of interest was reported by the authors.