A global land cover map produced through integrating multi-source datasets

ABSTRACT In the past decades, global land cover datasets have been produced but also been criticized for their low accuracies, which have been affecting the applications of these datasets. Producing a new global dataset requires a tremendous amount of efforts; however, it is also possible to improve the accuracy of global land cover mapping by fusing the existing datasets. A decision-fuse method was developed based on fuzzy logic to quantify the consistencies and uncertainties of the existing datasets and then aggregated to provide the most certain estimation. The method was applied to produce a 1-km global land cover map (SYNLCover) by integrating five global land cover datasets and three global datasets of tree cover and croplands. Efforts were carried out to assess the quality: 1) inter-comparison of the datasets revealed that the SYNLCover dataset had higher consistency than these input global land cover datasets, suggesting that the data fusion method reduced the disagreement among the input datasets; 2) quality assessment using the human-interpreted reference dataset reported the highest accuracy in the fused SYNLCover dataset, which had an overall accuracy of 71.1%, in contrast to the overall accuracy between 48.6% and 68.9% for the other global land cover datasets.

However, it is always expensive to improve the quality of the global land cover datasets . Another option is to quantify the uncertainty associated with individual land cover dataset and develop a harmonized land cover from these existing datasets. Initiatives like Global Observation of Forest and Land Cover Dynamics (GOFC-GOLD) in conjunction with the Food and Agricultural Organizations (FAO) and the Global Terrestrial Observing Systems (GTOS) have fostered harmonization and strategies for interoperability and synergy of all existing and upcoming global land cover maps . See and Steffen (2006) present a methodology based on fuzzy logic to generate an improved hybrid land cover map for Northern Europe by taking individual similarities and differences of only two global land cover maps into account. Jung et al. (2006) defined a new land cover classification scheme and used a fuzzy lookup table to integrate existing maps to generate a new global land cover map (SYNMAP). Iwao et al. (2011) created a global land cover map by integrating three land cover maps based on the principle that the majority view prevails, but its accuracy was not significantly higher than that of original maps. Ran et al. (2010) developed a new land cover map using multi-source information based on the Dempster-Shafer evidence theory, but it was limited to China. Schepaschenko et al. (2015) develop a global forest mask through the synergy of remote sensing, crowdsourcing and FAO statistics.
Here we propose a method of fusing multi-source land cover information for land cover datasets. The approach integrates not only land cover datasets but also datasets representing the quantitative attributes of specific land cover types. A new global land cover map was produced by applying the method to fuse existing land cover datasets and global tree-cover and crop-cover datasets. Quality of the new datasets were assessed by examining the consistency among the land cover datasets and the accuracy evaluated using human-interpreted points in China.

Data
Eight widely used and publicly available global datasets (Table 1) were selected to produce the fused land cover dataset. These datasets were all produced using coarse resolution (250 m~1 km) satellite imagery, e.g. AVHRR, MODIS, SPOT-4 and MERIS.

Global land cover maps
The majority (5 out 8) of the selected datasets are land cover maps, describing the distribution of cover types over the global land surface. The datasets and their classifications are described below: (1) Global Land Cover Characterization (GLCC), produced by United States Geological Survey (USGS) to provide land cover in several land cover classification, and GLCC in International Geosphere-Biosphere Programme (IGBP) classification (17 classes) is adopted in the analysis to better matching the other selected land cover datasets (Loveland et al., 1999).
(2) University of Maryland land cover product (UMD) is one of the earliest global land cover datasets, which provides land global cover with a simplified version of IGBP classification system, which has 14 classes . (3) Moderate Resolution Imaging Spectro-radiometer annual land cover product (MODIS LC) provides global maps of land cover at annual time steps and 500-m spatial resolution for 2001-present (Friedl et al., 2010;Sulla-Menashe & Friedl, 2018). It is one of the standard MODIS data products (Justice et al., 2002), and it supports multiple land cover classification systems, and the dataset with IGBP classification was selected in the analysis. (4) Global Land Cover 2000 (GLC2000) was produced by European Commission's Joint Research Center (EC-JRC) to provide regional land cover maps for each continent with a flexible classification system based the Land Cover Classification System (LCCS) developed by FAO and UNEP (Di Gregorio & Jansen, 2005). The global map was created by combining those regional land cover maps with converted LCCS code to a less thematically detailed classification LCCS (Bartholomé & Belward, 2005). (5) GLOBCOVER Land Cover Product (GlobCover) is a global land cover dataset produced by European Space Agency (ESA) (Arino et al., 2007;Bicheron et al., 2011). It was produced using the ENVISAT satellite mission's MERIS (Medium Resolution Image Spectrometer) sensor Level 1B data with a spatial resolution of 300 m. The GlobCover products include a map produced for global land cover in 2005-2006 and another map for 2009. The two maps adopted the FAO LCCS classification system, which has 22 classes, allowing change analysis between the two representing periods. Efforts have been carried out for validating these global land cover maps. Most of the datasets (i.e. GLCC, UMD, GLC2000 and GlobCover) were validated using sample collected from a designed sampling method and visually interpreted after examining higher resolution corresponding satellite images, i.e. Landsat TM (Thematic Mapper), SPOT (Systeme Probatoire d'Observation dela Tarre), MERIS (Medium Resolution Imaging Spectrometer Instrument) and Google Maps (DeFries, Hansen, Townshend, & Sohlberg, 1998;Friedl et al., 2010;Mayaux et al., 2006;Scepan, Menz, & Hansen, 1999). The MODIS LC dataset was validated based on a cross-validation using subsets of the training data that not been used for the training (Friedl et al., 2010). The reported overall areaweighted accuracies were 66.9% for GLCC (Scepan et al., 1999), 69% for UMD (DeFries et al., 1998), 75% MODIS LC (Friedl et al., 2010), 68.6 ± 5% for GLC2000 (Mayaux et al., 2006) and 67.1% for GlobCover (Mayaux et al., 2006). However, since different approaches and reference databases were used in the evaluation, the reported accuracies are not comparable and should not be considered as truly robust quantitative estimate (Jung et al., 2006).

Other global datasets
In addition to the land cover datasets, 3 global datasets were selected in the analysis: (1) MODIS Vegetation Continuous Fields (VCF) (MOD44B), which is a collection of annual estimates of several continuous vegetation measurements at 250 m resolution. It provides a global representation of the Earth's surface as gradations of three components, i.e. tree cover, non-tree vegetation and bare land (Hansen et al., 2003(Hansen et al., , 2011Townshend et al., 2011). The percent tree cover estimate is used in the analysis. The MODIS VCF Collection 5 was downloaded from the Global Land Cover Facility (GLCF) at the University of Maryland (http://glcf.umd.edu/data/vcf/). (2) MODIS cropland extent dataset provides estimates of probability of cropland at each 250 m resolution, and it includes two layers, i.e. cropland probability and cropland/non-cropland mask (Pittman, Hansen, Becker-Reshef, Potapov, & Justice, 2010). The MODIS cropland probability layer was derived from a set of multi-year MODIS metrics with incorporating 4 MODIS land bands, NDVI and thermal data, as well as a set of training data like FAO AfriCover and United States National Land Cover Database (NLCD), using classification tree method provided by S-Plus statistical package (Pittman et al., 2010). The probability product was then thresholded to create a discrete cropland/non-cropland indicator map (MODIS Cropland/Non-Cropland), using the data from US Department of Agriculture-Foreign Agricultural Service (USDA-FAS) Production, Supply and Distribution (PSD) database describing per-country acreage of production field crops. MODIS cropland extent probability/mask products over the period 2000-2008 were downloaded from the South Dakota State University (SDSU) (http://globalmonitor ing.sdstate.edu/projects/croplands/globalindex.html).
(3) AVHRR CFTC (Continuous Fields of Tree Cover) product, which provides estimates of leaf type/leaf longevity for tree classes at 1 km resolution. It was derived from monthly Advanced Very High Resolution Radiometer (AVHRR) NDVI composites over 1992-1993 period using spectral unmixing method . The dataset provides fractional estimate of leaf attributes for each pixel, including two layers representing leaf type (broadleaf/needleleaf) and leaf longevity (evergreen/ deciduous) separately; while each pair sums up to percentage tree cover. AVHRR CFTC products at 1 km resolution were downloaded from GLCF (http://glcf.umd. edu/data/treecover/).

Methodology
The global datasets were released in different formats, structures, spatiotemporal reference systems and semantical classifications. To facilitate data fusion and comparison, these selected global datasets are processed to a uniform geospatial reference system (section 3.1) and translated to comparable semantical variables (section 3.2). Another set of methods are applied to the integration of multi-source datasets (section 3.3), and the evaluation of fused dataset (section 3.4).

Geographic reference system
MODIS Sinusoidal projection (Seong, Mulcahy, & Usery, 2002) was chosen as the base geographic reference system. The spatial extent is between 180°W~180°E and 55°S~90°N , with a spatial resolution of 1 km. Firstly, the datasets were reprojected to the base spatial reference system with a resolution of 250 m using nearest neighbor resampling. The finer 250 m resolution was selected to reduce precision loss during the re-projecting. The reprojected data are then aggregated to 1 km resolution by selecting the most dominated land cover type within the extent of each 1 km pixel. GeoTIFF format was adopted for storing the output layers.

Translating semantical definition
3.2.1. Classification scheme A land cover classification scheme is defined as the target for translating the classes of the input land cover datasets. In order to coordinate the existing land cover types, the target classification scheme was defined upon parameters of classification standard for Plant Functional Types (PFTs) (Diaz & Cabido, 1997;Milchunas & Lauenroth, 1993), i.e. the occurrence of life forms and leaf attributes (leaf type/leaf longevity), which are common classifiers for land cover classification (Neumann et al., 2007). Twelve major classes were defined in the target classification scheme (Table 2), including 8 life forms categories, and 9 classes with leaf type and longevity associated with "Trees".

Affinity scores
Affinity score measures the fuzzy relationship between a land cover class in the input dataset and its corresponding classes in the target land cover classification scheme. The scores are assigned to the metrics of life form, leaf type and leaf longevity separately. In addition to the land cover datasets, the MODIS VCF and Cropland Probability layers also contribute additional information on trees and cropland classes. The affinity scores for "Trees" and "Cropland" were assigned using different rules individually compared with the other 6 life forms and leaf attributes.
Taking "Tree" as an example, the affinity score for a land cover class in the input datasets is assigned a score between 0 and 100 in terms of the percent canopy cover and sematic definition of the class (see Table 3). For example, assuming C is a class in an input land cover dataset: (1) If the class C matches "Trees" semantically, the score is assigned as the median of canopy cover for C, otherwise the score is assigned 0 if C and "Trees" are independent from each other. For example, the percent tree cover for 'evergreen needle leaf forest' in GLCC is >60%, the affinity score of tree cover for this class is set to 80.
(2) If the class C is defined as a mosaic type of forest and other vegetation types, and its percent canopy cover is >15%, then the affinity score between C and "Trees" is assigned to the value between the minimum and the median of canopy cover flexibly using expert knowledge, according to forest percent of mosaic class and its semantic relation with "Trees". Otherwise, if the defined percent canopy cover of C is <10%, then 0 is given to the affinity score between C and "Trees". For instance, the percent canopy cover for "mosaic: cropland/tree cover/other natural vegetation" in GLC2000 is 15-100%, the affinity score between this class and "Trees" is assigned to 35.
According to the five semantic rules shown in Tables 4 and 5, the affinity score for each input class is assigned a score between 0 and 100 to represent its likelihood to cropland or other classes.
All affinity scores between the source input class and target class are shown in Appendix 1.

Data integration
The fused land cover dataset is processed by integrating the input global datasets following four key steps ( Figure 1): (1) Fused dataset of Trees and Non-Trees classes are created by combing the tree cover layer from MODIS VCF and tree cover scores produced by applying the affinity scores of "Trees" to the five original global land cover datasets.
(2) If a location is identified as high probably of "Trees" at the previous step, the final forest class with leaf type and left longevity in SYNLCover is estimated by combining the MODIS CFTC information.
(3) Otherwise, the location is considered as Non-Trees, its likeness of "Cropland" is investigated by combining the MODIS Cropland/Non-Cropland layer and crop scores estimated by applying the affinity scores of "Cropland" to the five global land cover datasets. (4) For location with low likeness of "Cropland", they are further investigated by examining the affinity scores of the other six life forms calculated from the input global land cover datasets.
Two life form classes including "Trees" and "Cropland" in the fused dataset are determined according to following Equation (1) that calculates the mean score for each life form (Lf) for grid cell with coordinates i and j of SYNLCover: where: S Lf Mean ði; jÞ is the mean score for "Trees" or "Cropland" of SYNLCover; S Lf M ði; jÞ is affinity score for "Trees" or "Cropland" in the pixel (i, j) of the input global dataset M (Appendix 1.1);  Table 5. Definition example of affinity scores for input classes and target class than are not "Trees" nor "Cropland".  The choice of other six life forms and leaf attributes is made according to Equations (2) and (3), respectively, which calculates total score for other life form (OLf) and leaf attributes (LA) for grid cell with coordinates i and j of SYNLCover: where: S OLf Total ði; jÞ is the total score for life form except "Trees" and "Cropland" (OLf) of SYNLCover; S LA Total ði; jÞ is the total score for leaf attributes of "Trees" in SYNLCover; S OLf M ði; jÞis affinity score for OLf in the pixel (i, j) of input global land cover dataset M (Appendix 1.1); S LA N ði;jÞ is affinity score for leaf attributes in the pixel (i, j) of input dataset N (Appendix 1.2); OLf is each life form of SYNLCover except "Trees" and "Cropland" (Table 2); LA is leaf attributes including leaf type and leaf longevity of SYNLCover; M is the five global land cover datasets, N is five input land cover datasets and MODIS CFTC; i and j is current row and column of pixels, respectively. The maximum total score of S OLf M ði; jÞ and S LA N ði;jÞ is chosen as the best estimate of the life form OLf and leaf attributes LA in pixel (i, j) of SYNLCover, respectively. The calculation example for estimate of other life forms is illustrated in Table 6, and the life form class with the highest score wins, here "Grassland".
In case two or more life forms except "Trees" and "Cropland" get the same maximum total score, the decision which life form class wins is made by a random choice. If more than one leaf attributes receive the same maximum score, a decision matrix shown in Table 7 defines the winning leaf attributes. However, if the maximum score for leaf attributes is 0, both leaf type and leaf longevity are set to "Mixed". This compromise  introduces uncertainty, which is fortunately small since this case is very rare and applies only to "Trees" class, so that only part of the leaf attributes of "Trees" is biased.

Quality assessment
Quality of the fused land cover dataset and the global land cover datasets were assessed using two methods: 1) inter-comparison to evaluate their consistency, and 2) validating the datasets using human-interpreted points in China.

Consistency analysis
The five global land cover datasets are translated to life form and the target class scheme to allow comparison (Appendix 2). The fused SYNLCover is compared to the global land cover datasets by calculating the pixel-based confusion matrices to evaluate its consistency with these global datasets. From the confusion matrices the mean overall consistency is estimated by averaging overall accuracies from comparing datasets to provide a general consistency between SYNLCover and the input datasets: MeanC a ¼ ðC ab þC ac þC ad þC ae þC af Þ=5 Where: C a* separately denotes the overall consistency between pairs of dataset a and another dataset *; Indices a-e are SYNLCover, GLCC, UMD, GLC2000, MODIS LC and GlobCover, respectively.

Accuracy assessment
In addition to inter-comparison between these datasets, independently reference dataset is collected by human interpreting land cover types at randomly collected points to provide a comprehensive accuracy evaluation of the fused datasets.
A total of 3000 points were randomly collected in China ( Figure 2). Because of the complex of land cover spatial distribution in China, the MODIS land cover dataset was aggregate to six classes (trees, grassland, cropland, water, urban and others) and then used as the stratification for collecting the points to increase the efficiency on representing the various of land covers in China.
The collected points were visually interpreted by experts in a Web-based tool ( Figure 3) (Feng et al., 2012). To help the interpreters on identifying the land cover types, the tool presents maps and charts created from various sources, including: 1) Landsat images from the four epochs (1970s, 1990, 2000 and 2005) provided by the Global Land Survey (GLS) (Gutman et al., 2008;Gutman, Huang, Chander, Noojipady, & Masek, 2013, p. 2) NDVI profile derived from the 8-day composited MODIS Surface Reflectance products (MOD09A1) after cloud and shadow masking; 3) geo-tagged ground photos provided by Google Maps. Eighteen image analysts who have experience in land cover participated in the interpretation task.
Confusion matrix, including overall accuracy (OA), user's accuracy (UA) and producer's accuracy (PA), is then calculated between SYNLCover and each of input global land cover datasets using interpreted samples.

Synlcover fused dataset
The global SYNLCover life form (Figure 4(a)) and SYNLCover target classification scheme (Figure 4(b)) datasets were produced from the multi-source datasets using the proposed data fused method. They provide distribution of land cover types globally at 1 km resolution. Most of the input datasets presented the global land cover in circa-2000. Although the GLCC and UMD land cover datasets were produced using satellite data in the early 1990s and GlobCover was produced for circa-2005, which are less than a decade away from 2000. Considering their temporal closeness and the insensitivity to temporal changes at the coarse resolution (Fritz et al., 2011), the produced SYNLCover datasets are considered to delineate the global land cover in 2000. Besides the fused global land cover dataset, the affinity scores for each class is outputted, which represent the probability of the class at each pixel. These layers make it possible for applications to explore the mixture of multiple classes within the extent of a pixel extent.

Consistency comparison between the fused dataset and input datasets
After comparing the SYNLCover and the land cover datasets (Figure 5), the SYNLCover had the highest average overall accuracy for both life form (69.16%) and land cover (61.93%) classes, suggesting improved consistency in the fused dataset over the input land cover datasets. The life form datasets had higher consistency than the land cover datasets, likely due to the higher disagreements among the datasets in the detailed classes introduced in the land cover classes (Jung et al., 2006). Relatively lower consistency was found in MODIS LC, GLCC, UMD and GLC2000. GlobCover get the lowest average overall accuracy for both life forms and land cover dataset.

Accuracy validation using interpreted points in china
After comparing the fused SYNLCover and the other land cover datasets to the human interpreted dataset in China, it reported an overall accuracy of 71.1% for the SYNLCover-Life Form, which is higher than 68.9% for MODIS LC, 65.2% for GLC20000, and significantly higher than the other three global land cover datasets (57.7% for GlobCover, 57.2% for GLCC and 48.6% for UMD). Also, there were obvious differences across both UA and PA of each life form in the new and the original land cover maps ( Figure 6). The UA and PA were between 33.3% and 98.4% for the major land covers except "urban and built-up", which had lower UA and PA for the three 1-km native resolution (i.e. GLCC, Figure 3. Interpreting land cover types at samples collected in a given area (Feng et al., 2012).
UMD and GLC2000). Preliminary checking suggested that the low accuracy of "urban and built-up" class was mainly due to the poor capacity of delineating the small and fractional urban and built-up by the kilometer resolution coarse spatial scale datasets. Compared with the five original land cover maps, the UA of "Trees", "Cropland" and "Urban and built-up", as well as the PA of "Grassland" and "Water" in SYNLCover-Life Form are improved significantly. Figure 6 also presented a general pattern of class accuracy of SYNLCover-Life Form and five input land cover maps. "Trees", "Grassland", "Cropland" and "Others" are described with higher accuracy, whereas "Water" and "Urban and built-up" with lower accuracy. In addition to accuracy comparisons between individual maps, we also compare the accuracy of SYNLCover with the averaged accuracy of five input land cover maps (Figure 7). Result shows that the OA, UA and PA of six life forms of SYNLCover-Life Form, especially for "Trees" and "Grassland", are higher than the respective average OA, UA and PA of corresponding classes of the five input maps. Obviously, the SYNLCover synthesizes information about the basic appearance of vegetation (forest, shrubland, cropland, herbaceous vegetation), the leaf types (broadleaf and needleleaf), the leaf longevities (evergreen and deciduous), and also other types from the original five land cover maps, VCF, Cropland Probability and CFTC data sets.

Conclusions
The existing global land cover datasets provide great value to the land cover user communities, but the low accuracy and inconsistency among the datasets have been affecting the applications of these datasets, especially in land surface process modeling research. We proposed an integration method to produce a global land cover dataset with improved accuracy by synthesizing multi-source global land cover data products using fuzzy logic method. A global 1 km land cover dataset, SYNLCover, was produced using the method with two sets of classification systems to address the need for land cover data regarding delineation of both life forms and land covers. Although these datasets are overlapping between the two classification systems, the life form classes are more generalized than the land cover classes, which further delineated the tree class into forests with different leaf attributes.
The fused SYNLCover was produced by integrating eight global datasets, including five global land cover datasets and three datasets that representing quantitative attributes of specific land cover types. To our knowledge, this effort has been the most comprehensive integration of global land cover datasets. The quality of the fused land cover datasets was evaluated by inter-comparing with the land cover datasets and a reference data produced by human interpretation of 3000 points collected in China. The validation is limited in China, but it was a rigorous assessment of the quality of the datasets because China is considered as one of difficult area for land cover mapping due to its vast geographical extent, highly diverse and fragmental geography (Ran et al., 2010). The validation could also be considered as a representation of the quality of these datasets in larger extent. Both intercomparison and accuracy assessment suggested that the fused land cover dataset had higher accuracy and consistency than the input global land cover datasets. The life form dataset had higher consistency than the land cover classes, mainly because its classes are more general and the reduced the disagreements between the subclasses of forests in the land cover classification system. Higher consistencies were found in most of the classes, except in cropland, wetland and urban, which are more difficult to delineate at 1 km resolution. It will likely require land cover mapping at finer spatial scale to be able to capture the fragmentations of these classes. Although eight datasets were used to produce the SYNLCover dataset, more global land cover or related datasets have become available recently (Tuanmu & Jetz, 2014) or will be produced in future, and the presented method could be applied to integrate these datasets to further improve the quality of global land cover datasets.

Data availability statement
The data that support the findings of this study area available from the corresponding author upon request.

Disclosure statement
No potential conflict of interest was reported by the authors.