Digital soil map of Quintana Roo, Mexico

ABSTRACT A digital soil map of Quintana Roo was compiled at a 50 m pixel resolution using a geomorphopedological approach to produce a map that reflects a synoptic view of the geomorphology, environmental conditions and associated soils. Initially, it was developed using a geopedological approach and then converted to a digital map. The map was derived from soil-forming factors using mathematical methods to infer information in places where data were not available. Its compilation included three stages; the first two follow the geopedological approach that consists of a synthesis of data from the characterization of the geomorphological landscapes (vertical dissection, karst geomorphometrics, failures, geology) and soils, and the third stage incorporating environmental components (climate and vegetation) and related variables through various methods of statistical analysis (cluster, principal components and classification analysis) to obtain the pattern of soil distribution and to develop a model for the digital soil map of the study area.


Introduction
According to Food and Agriculture Organization (FAO), the plan for priority action in the field of soil at a global level is the sustainable management of soils. This can be achieved by increasing or maintaining stocks of organic matter, stabilizing or reducing the use of fertilizers of N and P and improving our knowledge on the status and trends of soil conditions (Montanarella et al., 2015). The challenge is to develop maps that show the spatial and temporal variation of the soil physicochemical conditions in ecosystems (Grunwald, Thompson, & Boettinger, 2011;McBratney, Mendonça Santos, & Minasny, 2003).
A soil map is a graphical representation that is used to transmit information about the spatial distribution and attributes of soils. Early soil maps were derived from topographic maps and were primarily used for agricultural purposes. However, their compilation was slow and expensive (Kempen, Brus, & de Vries, 2015).
At the end of the twentieth century, several approaches to the study and mapping of soils emerged, among them the geopedological approach proposed by Zinck (1988). This approach focuses on the study of the relationships between geomorphological and edaphic variables, taking into account the physical environment as an open system that occupies the interface between lithosphere and atmosphere. The approach depends on the stability of this interface, an understanding of the physical environment both in its structure and its dynamics (Zinck, 1988(Zinck, , 2012. The emergence of new technologies has generated a high demand for soil information in environmental monitoring and modeling for a wide range of users (including farmers, developers, politicians, decisionmakers, managers of natural resources, educational institutions, planners, researchers and agronomists) who manage many of the new projects on land use. New generations of scientists are also attracted to the spatial analysis of soils (Behrens & Scholten, 2006;Hartemink et al., 2010).
Digital maps are an alternative to traditional maps in terms of accuracy and cost and increase the interaction and communication between various users (Hartemink & Minasny, 2014;Minasny & McBratney, 2016).
A digital map is a database of soil properties based on field and laboratory observations and quantitative numerical models that allow inference of the spatial and temporal variations of soil types and their properties from environmental variables such as the climate, biota, relief, parental material, age and spatial position (Model Scorpan) (McBratney et al., 2003) and describes the uncertainty associated with the predictions based on time-series data, providing information about the dynamic properties of the soil (Carré, McBratney, Mayr, & Montanarella, 2007;Lagacherie, 2008).
The factors that led to the development of Digital Soil Mapping (DSM) in recent years are: the availability of spatial digital data (such as digital elevation models (DEM), remote sensing images), new methods for analyzing data (statistical techniques, multivariate geostatistics), spatial modeling in a geographic information system (GIS) and the availability and capacity of the computers to process data and access online information (Grunwald et al., 2011;McBratney et al., 2003;Miller & Schaetzl, 2016;Sanchez et al., 2009).
The DSM depends on an understanding of geopedological processes, and paleoenvironmental reconstruction, and is used to develop land use plans, which may help to resolve some of the challenges of our time such as food security, energy security, climate change, environmental degradation, shortages of water and threats to biodiversity and human health. They can be quickly updated at low cost as new and better data are generated (Brevik et al., 2016;Calzolari & Filippi, 2016).
The aim of the present investigation was to carry out spatial analysis of soils using the first stage of the geopedological approach as a basis, and to compile a digital soil map of Quintana Roo, Mexico.
The area has a predominantly low relief, the climate is warm and humid with summer rains, with an average annual temperature of 27.6°C and an average annual rainfall of 1263.3 mm (CNA, 2016). It lies on a structure of tertiary sediments with some quaternary deposits, composed mainly of calcite, dolomite and small amounts of gypsum (Ordoñez & Garcia, 2010). There are few surface rivers, because most of the water moves underground, and there is an abundance of karstic depressions such as sinkholes, uvalas and cenotes. The most common soil groups are Leptosols, Phaeozems, Vertisols and Gleysols, and the vegetation is mainly medium and low tropical rainforest.
There are 23 natural protected areas that represent 25.3% of the surface in the study area (SINAP, 2016). The main economic activities are tourism and trade. Agricultural activity is focused in the south and occupies less than 20% of the surface of the State. Corn, sugar cane and timber are the main crops.

Methodology
The development of the digital soil map consisted of three stages. The first two followed the geopedological approach, and the third incorporated other variables into the model to build and apply pedotransfer functions that relate soil characteristics to other variables ( Figure 2).
The first stage consisted of the geomorphometric and spatial analysis of field data at a scale of 1:50,000 to identify landforms and to map the distribution of limestone karst depressions and their flood types in the state (Fragoso-Servón, Bautista, Frausto, & Pereira, 2014).
The second stage entailed the development of a spatial distribution model of soils associated with previously identified landforms. This was done by adding georeferenced soil data to their respective landforms.
The soil database is a compilation of field and laboratory data from 412 field sampling points and derived from four different sources (Table 1). Using physical and chemical properties, soils were classified in accordance with the World Reference Base system (IUSS, 2007) and allocated to 14 major soil groups (MSG) ( Table 1). The resulting attribute table was completed using the full data matrix built in the GIS spatially joining climate and vegetation data reported for the study area by INEGI (2005) and CONAFOR-SEMARNAT (2011), respectively, to previously identified landforms. This produced two data subsets, one of them formed by all the 412 landforms fully qualified, including MSG, and the other one without the MSG information.
For the third stage, the second subset was fully quantified to assign MSG information to each landform as if we were using a pedotransference function to link MSG to the other variables. To achieve this goal, all landforms were classified, with all variables were used to form classes, except for the MSG information.
The classification algorithms were established using data subsets consisting of landforms for which soil information existed in addition to other variables. Trained algorithms were applied to the full data matrix. The full variable set and its domains are shown in Table 2.
Landforms were clustered based on the similarity of the first five variables and their respective domains. A cluster analysis (CA), using the Gamma coefficient of Goodman-Kruskal as the metric algorithm was performed. This estimator is known as one of maximum similarity that is useful to handle large volumes of hierarchical data that match the order and the value (Nelson, 1986) (Equation (1)): where D is distance or similarity between the pairs of objects; N s is the number of matching objects in attributes and sequence and N d is the number of different objects on attributes and sequence. The linkage algorithm used was the weighted mean. This clustering analysis was validated by three tests (Pseudo F, Pseudo-t test and consistency or distortion of Dunn) to verify that the results are that of greater likelihood: a) A test of Pseudo F, provides the resulting tree and has a probability value for the node formed with regard to the probability of all nodes that make up the group; hence, it appears frequently as the null distribution in the variance analysis (Equation (2)).
where U 1 and U 2 follow a Chi-square distribution with d 1 and d 2 degrees of freedom and U 1 and U 2 which are statistically independent. b) Pseudo-t test, which consists of the comparison of average distances and variances within and between groups, representing the dispersion of the nodes or density of the tree (Equation (3)).
where: X 1 and X 2 are intra and intergroup average distances and s X 1 −X 2 are differences of the variance with respect to the sizes of the groups. c) Consistency or distortion of Dunn for the validity of the Grouping (Halkidi, Batistakis, & Vazirgiannis, 2002;Havens, Bezdek, Keller, & Popescu, 2008;Omran, Engelbrecht, & Salman, 2007) (Equation (4)).  where d min is the minimum distance intergroup and d max is the maximum distance intragroup.
The result of this analysis is a consistent model of the spatial distribution of soil, where data from a soil group were used to assign polygons to the corresponding group (MacMillan, 2004). The next step is related to map validation and verification. The classes were verified by two types of statistical analysis: principal components analysis (PCA) and classification analysis (CA). The PCA identified the sources of data set variability and sorted them by importance (Jongman, Ter Braak, & van Tongeren, 1995).
The hierarchical model obtained from the PCA and the data matrix including soil data was used as input to estimate the uncertainty of the classification with the program WEKA (Hall et al., 2009). Four classifications using three different algorithms were performed: . Classifications 1 and 2 were classifications using decision tables with simple and exhaustive search (Kohavi, 1995;Mukerjee, 2012); . Classification 3 was a classification by construction of exceptions to the initial rule (RIDOR -Ripple-Down-Rules) (Gaines & Compton, 1995); and . Classification 4 was a classification by partition rules (PART) (Frank & Witten, 1998).
The final map shows the spatial distribution of soils. This map was verified for consistency and accuracy against soil and environmental field data.

Results
The first clustering identified 869 entities of identical units by their attributes. To keep uncertainty as low as possible, 85% of similarity was chosen as the threshold value, resulting in 188 groups that depicted various environmental conditions. This clustering was validated by three tests: Pseudo F, Pseudo-t and the distortion coefficient of Dunn. Pedological information was spatially joined to the groups formed by the CA to predict what soil types are likely to be found in each of the polygons in relation to the rest of the attributes, thus establishing a pedotransfer function that depicts the relationship between soil and the other attributes for each polygon. Where more than one soil group is assigned to a polygon, only the first three most probable soil groups are depicted on the map.
Sixty-five groups with no soil data are found in 103 polygons, representing 0.6% of the total number of polygons and occupying only 0.2% of the state area.
To identify and sort the sources of variability in the data set, a principal component analysis (PCA) was performed, defining the variables that have greater weight in the relationship between soils and attributes depending on the set of similarities of the classes formed (Table 3).
A chi-square test of the PCA results gave a value of zero, indicating that the eigenvectors of the variables in the data matrix are not equal or independent, thus demonstrating the relationship that exists between the soil-forming factors that were considered in the analysis and the allocation of the soil types to the map polygons.
The PCA shows seven variables that have the greatest influence on the distribution of soils. This can be seen in Figure 3, where the inflection point separates the most important factors from the other variables in the dataset.
Considering the individual variances and their contribution to the total variance of the system, the analysis shows that vertical dissection (VD) and karstic forms contribute the most to the total system variance (19.0% and 15.3%, respectively) and explain 34.3% of the variation observed in the distribution of soils (Table 3). Following these two variables, karst and fault densities, and the flooding regime for karstic depressions, form a second variable group which along with the first group explain 51% of the variation.  The next group of components is formed by climate followed by the presence of either temporary or permanent bodies of water, and finally geology (age of parental materials). These three variables account for 14% of the variability of the distribution of soils.
Together, these seven factors explain approximately 65% of the observed distribution of soils in the state.
Results from the analysis of classification (Table 4) show that more than 83% of soil type assignments are consistent and that polygons were correctly classified with the selected set of attributes once assigned to its corresponding soil type.
The differences between the three algorithms in terms of uncertainty (incorrectly classified polygons) are insignificant (2.43%). However, analysis of the respective confusion matrices shows that the PART algorithm has the maximum reduction in uncertainty.
Comparing the relative weight of the seven variables provided, and using PCA and the results of two nonsupervised assessments of subsets of the seven variables used, it was found that three complex factors have greater weight in the distribution of soils: landforms (the VD and karstic forms), climate (rainfall and flooding) and geohydrology (lithology and surface hydrology).
The relationships between the variables used in the analyses define the relative importance of various factors on soil formation and allow the prediction of the type or types of soil that can be found at any point within the state (pedotransfer function). Current data allow the compilation of the map with about 14% to 16% uncertainty, which is probably a result of data precision and suggests the need for more control points to refine the model.
In the 14 identified soil groups (Table 5), the group that occupies the largest part of the territory is Leptosols (48.5%) followed by Gleysols and Phaeozems. These three together occupy 75.6% of the State's surface ( Figure 4). Soil groups that together account for   less than 1% of the surface are Calcisols, Kastanozems, Regosols and Fluvisols. Each polygon is dominated by a group of soils that occupy the greatest area, with other groups of soils make up smaller proportions (Table 6). This situation is very frequent in Quintana Roo, where there are different groups of soils in small areas. To qualify the polygons and to build the map and its legend, only the three soils that occupy the largest area of the polygon were considered, resulting in 112 possible combinations of soils. The representation of these combinations and the database that accompanies it constitute the digital map of soils of Quintana Roo, at 50 m pixel resolution (Main Map).

Conclusions
A digital soil map of Quintana Roo was compiled at 50 m pixel resolution using a geomorphopedological approach to produce a map that reflects a synoptic view of the geomorphology, environmental conditions and associated soils.
This approach allows a thorough synthesis of environmental information of the components (climate, vegetation, soil) and geomorphological landscapes (including VD, geomorphometrics of the karst, faults and geology). Using various statistical models (cluster, principal component and classification analysis), the distribution pattern of soils in the territory was obtained.
The map shows a very high heterogeneity of soils in the studied area linked to geomorphometric heterogeneity not described in studies conducted before 2010 and earlier.
The methods used enabled the production of a map with a relatively low uncertainty. These methods are more useful for large-scale land management and decision-making than those currently in use in México.
The map was compiled with data from soil-forming factors and uses mathematical methods to infer information for the places where there are no data thus defining a pedotransfer function.
This research provides a new methodological framework that can be applied in other places and at different scales.
Due to its digital form, the whole database and the corresponding map are easily updatable at reduced costs at any scale equal or lower to the 50 m pixel resolution.
The methodology used allows the attainment of relatively low uncertainties for regional planning.

Software
The map was produced using Esri ArcGIS ® to build and manage the databases. Statistical analysis (clustering and PCA) were performed using SYSTAT ® v13 and classification analysis was carried out using WEKA ® . The final map was produced in ArcGIS ® and exported to PDF format.