Improving global land cover characterization through data fusion

Abstract Global-scale land cover characterization has advanced from a spatial resolution of 1 × 1° in the mid-1990s to 30 × 30 m resolution to date. However, some mapping challenges exist persistently regardless of the increasing spatial resolution. Data fusion has been proved as an effective way of improving land cover characterization. Here we applied a machine learning-based data integration approach for improving global-scale forest cover characterization. The approach employed six coarse-resolution (250–1000 m) global land cover maps as input and various regional, higher-resolution land cover data-sets as reference to build regression tree models per continent. The average error of 10-fold cross validation of the regression tree models varied between 7.70 and 15.68% forest cover and the r2 varied between 0.76 and 0.94, indicating the robustness of the trained models. As a result of data fusion, the synthesized global forest cover map was more accurate than any input global product. We also showed that other major vegetative land cover types such as cropland, woodland, grassland, and wetland all exhibit similar magnitude of discrepancies as forest among existing land cover maps. Our developed method, because of its type- and scale-invariant feature, can be implemented for other land cover types for improving their global characterization. The ensemble approach can also be internalized for improving data quality when generating a global land cover product, where multiple versions can be produced and subsequently integrated.


Introduction
Changes in land cover and land use significantly affect earth system functions and processes including the carbon cycle, the hydrological cycle, biodiversity, and socioeconomic development (Bonan 2008;Foley et al. 2005;Pan et al. 2011;Pimm et al. 2014;Song et al. 2015). Optical satellite imagery is the primary data source for characterizing land cover and monitoring land cover change at the global scale (Townshend et al. 2012). Research on characterizing global patterns of land cover using remotely sensed data has been conducted since the mid-1990s. Early efforts were focused on generating coarse-resolution (250 m−1 km) land cover maps for the main purpose of providing land surface boundary conditions to biogeochemical modeling. These products include Global Land Cover Characterization (GLCC) (Loveland et al. 2000), the University of Maryland land cover (UMD LC) , Global Land Cover 2000 (GLC2000) (Bartholomé and Belward 2005), the Moderate Resolution Imaging Spectroradiometer land cover (MODIS LC) (Friedl et al. 2002), the MODIS Vegetation Continuous Fields (MODIS VCF) (Hansen et al. 2003), and the GlobCover land cover product (Bicheron et al. 2008). Taking the forest class as an example, substantial disagreement in their representation of global forest was observed due to their inconsistent definitions of forest, their coarse spatial resolutions and the diversity of satellite data and methods used in deriving these map products (Figure 1(a)).
Global-scale land cover mapping at 30 m resolution has only become feasible in recent years, owing much to the United States Geological Survey (USGS)'s open Landsat data policy (Wulder et al. 2012) and the reduced cost of data storage and computation. So far, five 30 m resolution global land/forest/tree cover maps have been generated by Chen et al. (2014), Gong et al. (2013), Hansen et al. (2013), Kim et al. (2014), and Sexton et al. (2013) and more data-sets are in production (Giri et al. 2013;Townshend et al. 2012). The spatial details revealed by these Landsat-based map products are 100-1000 times finer than those coarse-resolution maps and the cross-map consistency has improved considerably because of the common data source. Characterizing continuous tree cover Sexton et al. 2013) instead of forest cover also eliminates the definitional ambiguity of forest (Sexton et al. 2016). However, the overall global disagreement pattern of the two Landsatbased tree cover data-sets resembles to a noticeable degree those coarse-resolution forest cover data-sets (Figure 1(b)), highlighting the persistent challenge

OPEN ACCESS
of global-scale land cover characterization in specific regions, such as the dry tropical woodlands in South America and Africa and the transition ecotones in boreal North America and Eurasia.
Increasing the spatial resolution of mapping represents a continued advancement in land cover research. With the open access to standardized, pre-processed Landsat and Sentinel data and the emerging cloudbased computing platforms (e.g. Google Earth Engine, Amazon Web Services), the amount of global-scale land cover maps at medium spatial resolution (10−30 m) can only be expected to proliferate in the coming years. These forthcoming data-sets would likely have discrepancies as has been witnessed with the coarse-resolution data, perhaps concentrated in similar geographic regions. Therefore, exploratory questions can always be asked, for example, what are the implications of the observed agreement and disagreement patterns among different data-sets for future land cover mapping? Can these diverse data-sets, each of which has its unique advantage and as a collection has a reasonable level of agreement, be synthesized to generate a more accurate land cover map, regardless of their spatial resolutions?
Several studies have attempted to adopt a data harmonization approach for improving land cover mapping. Jung et al. (2006) integrated multiple versions of GLC2000, GLCC, and MODIS land cover maps by cross-walking different land cover legends and created notes: (a) agreement map of six coarse-resolution land cover products on forest cover. the six maps are Glcc, UmD lc, Glc2000, moDis lc, moDis Vcf, and Globcover. Values in the legend represent the total amount of products that classify a given pixel as forest, regardless of leaf type or leaf longevity. (b) Difference of two landsat-based global tree cover maps. the two maps are the Global land cover facility (Glcf) tree cover produced by sexton et al. (2013) and the Global land analysis and Discovery (GlaD) tree cover developed by Hansen et al. (2013), both of which estimate percent tree cover at 30-m spatial resolution. per-pixel absolute difference of the two tree cover values were calculated and divided by the Glcf tree cover to derive the normalized difference, which was overlaying on Glcf tree cover to create the map. only data over forested regions with Glcf tree cover ≥10% and normalized difference ≥50% were shown to highlight the maximum difference of the two products. a clear resemblance was noticed between (a) and (b). a joint 1-km land cover map for carbon cycle modeling research. Fritz et al. (2011) developed a synergy approach and created a hybrid cropland map for Africa. The method was recently expanded to the global scale Waldner et al. 2016). Schepaschenko et al. (2011) created a hybrid land cover data-set over Russia by combing satellite-based land cover maps, national statistics as well as in situ information. Recently, Schepaschenko et al. (2015) employed geographically weighted regression and generated a 1-km resolution global hybrid forest mask by integrating remote sensing, crowdsourcing, and national statistics from the United Nations Food and Agricultural Organization (FAO). Latham et al. (2014) produced the FAO Global Land Cover-SHARE database, a 1-km land cover map, by harmonizing various global and regional land cover data-sets following rules defined in FAO's Land Cover Classification System (LCCS) (Di Gregorio 2005). Tsendbazar, de Bruin, and Herold (2016) applied a regression kriging method to integrate four global land cover products and publicly available reference data for deriving user-specific maps.  developed a machine learning-based data fusion approach and created an improved continuous forest cover map over North America. By extending our previous efforts ), here in this paper we applied the well-tested method to other continents and provided a demonstration of global-scale applications.

Data and methods
Our overall data integration methodology follows . As an extension to the previous work, here we first provide a detailed review of the various global and regional land cover products used in this research. Key steps of the data integrated approach are then briefly summarized.

Global land cover characterization
The GLCC database was developed at a continent-by-continent basis (Loveland et al. 2000). Advanced Very High Resolution Radiometer (AVHRR) 10-day Normalized Difference Vegetation Index (NDVI) composites for the period of April 1992 to March 1993 were aggregated into monthly maximum NDVI composites to minimize cloud contamination. Non-vegetated land covers such as barren land, snow, and ice were identified using thresholds of the maximum NDVI and were masked prior to classification. Water bodies and urban land cover were not classified from AVHRR, but imported from the hydrography layer and populated places data layer from the Defense Mapping Agency's Digital Chart of the World (DCW) (Danko 1992). An unsupervised algorithm was applied to cluster the masked AVHRR monthly composites and each cluster was labeled as one of the 961 seasonal land cover regions. Each seasonal land cover region was firstly translated into Olson's Global Ecosystems Legend (Olson 1994) and then crosswalked into six different land cover legends, including the International Geosphere-Biosphere Programme (IGBP) scheme and the USGS Anderson scheme.

University of Maryland land cover
The UMD LC was also generated from AVHRR data but with a supervised classification procedure . The 1-km AVHRR 10-day NDVI composites and the five optical bands were used to create a total of 41 annual metrics including the maximum, minimum, and mean NDVI as well as pixel values of channels 1-5 associated with the eight greenest and the four warmest months. Training data were derived from a total of 156 60 m Landsat Multispectral Scanner images. A decision tree model was trained and applied to the annual metrics to create vegetation classifications. The urban and built-up class was taken from GLCC, which was in turn taken from DCW. The UMD product employed a classification scheme modified from IGBP for use with the Simple Biosphere (SiB) general circulation model (Sellers et al. 1986). Since the SiB scheme did not include agricultural mosaics, wetlands, snow, and ice classes, these classes were not explicitly characterized. Instead, the snow and ice class was contained in the bare ground class.

Global land cover 2000
The GLC2000 project was initialized by the European Union's Joint Research Center with the objective of producing a land cover map for the International Conventions on Climate Change, the Convention to Combat Desertification, the Ramsar Convention, and the Kyoto Protocol (Bartholomé and Belward 2005). Input data were based on SPOT-4 VEGETATION VEGA2000 acquired from November 1999 to December 2000. The whole globe was divided into 19 regions. Classification for each region was carried out by local experts using independent methods and legends that were most appropriate for the respective region. Subsequently, regional land cover classes were translated to a global legend according to FAO's LCCS (Di Gregorio 2005). Some parts of global classification were further improved using ancillary data such as digital elevation and nighttime lights data.

MODIS land cover
One of the standard MODIS products is the global land cover map (MCD12Q1) generated at Boston University. Multiple versions of this product have been generated, all of which employed a supervised decision tree algorithm (Friedl et al. 2002(Friedl et al. , 2010. Input data and features included the nadir Bidirectional Reflectance Distribution Function (BRDF)-adjusted reflectance data, land surface temperature, enhanced vegetation were acquired for the period between December 2004 and June 2006. Cloud-free surface reflectance mosaics were generated through a series of pre-processing steps including geometric correction, cloud and shadow screening, land/water classification, atmospheric correction, BRDF correction, and temporal compositing. The mosaics were stratified into bioclimatically homogenous regions across the world and subsequently converted to land cover maps region by region. The overall classification procedure consisted of a supervised step and an unsupervised step. Land cover classes that were not well represented such as urban and wetland, were classified with a supervised algorithm and the remaining pixels were clustered with an unsupervised algorithm. Clusters with similar temporal features were grouped into a manageable number of spectral-temporal classes and then labeled to land cover types following the LCCS. Some local land cover products were used as reference to finetune the global map as a post-classification step. Flooded forests, which were largely underestimated, were directly imported from the regional data. Delineation of water bodies was improved by incorporating the Shuttle Radar Topography Mission Water Body Data.

Regional high-resolution land cover data-sets
The overall data fusion strategy was to leverage as much as possible existing, regional, high-resolution land cover products. Those regional data-sets were most likely of higher quality than global data-sets over the corresponding regions because they were usually produced by local experts or experts who specialized in the study area. Therefore, regional products could be used as training data in supervised machine learning models, as has been shown in North America ). However, existing data-sets often do not have a representative spatial distribution over a continent. Thus, we classified a total of 28 Landsat images as forest/non-forest maps to augment the training data-set, especially in underrepresented ecoregions. The developed methodology was applied to the global scale at a continent-by-continent basis except Antarctica.
We collected various regional land cover maps at sub-100 meter resolution around the globe (Figure 2). The National Land Cover Database (NLCD) over the Conterminous United States (Homer et al. 2004) and the Earth Observation for Sustainable Development of Forests (EOSD) forest cover map over Canada (Wulder et al. 2008) were used in North America. The PRODES deforestation maps over the Brazilian Amazon (http:// www.obt.inpe.br/prodes/index.php) and the Paraguay forest cover change (FCC) data-set (Huang et al. 2009) were used in South America. To achieve a representative geographical distribution of the reference data, we also added Landsat-based forest/non-forest maps, which were derived through a clustering-labeling approach. These augmented training sites were located in a variety index, and annual metrics (minimum, maximum, and mean) of each band. Training data were derived from Landsat images taken from the System for Terrestrial Ecosystem Parameterization database, which included 1860 sites distributed across the globe. The final map was a result of an iterative boosting procedure -an ensemble classification method in which multiple classifications were carried out based on resampled training data and the final classification was determined by an accuracy-weighted vote. The product is available in multiple land cover legends, including the IGBP scheme, the UMD scheme, the MODIS leaf area index/fraction of photosynthetically active radiation class system (Myneni et al. 2002), an 8-biome classification system (Running et al. 1995), and a 12-class plant functional type system (Bonan et al. 2002). The map in IGBP legend was used in this study.

MODIS vegetation continuous fields
The MODIS vegetation continuous fields (VCF) product is also a standard MODIS land product (MOD44B). It estimates fractional vegetation cover at sub-pixel level, representing a theoretically advanced characterization of land cover over categorical maps. The latest version was generated at a spatial resolution of 250 m annually from 2000 to 2010 (DiMiceli et al. 2011). Following an established method described in Hansen et al. (2003), bagged regression tree models were trained using a large Landsat-based reference sample and annual phenological metrics composited from the 16-day surface reflectance. The models were applied to annual metrics to predict percent tree, herbaceous or bareground cover per MODIS pixel per year. Poor pixels which were either cloud, cloud shadow, high aerosol or had a view zenith angle >45° were reduced through the composition process and the remnant were flagged in the quality assurance layer. White, Shaw, and Ramsey (2005) validated the Collection 1 MODIS VCF using independent field data across the arid southwestern United States. Montesano et al. (2009) evaluated the Collection 4 product using reference data derived from high-resolution images across the boreal-taiga ecotone. Sexton et al. (2013) estimated the error of Collection 5 VCF against measurements of tree cover from small-footprint Lidar data in four sites across three different forest biomes and found that the root-mean-square-error of Collection 5 VCF ranged from 7 to 21% tree cover. Recently,  showed that the annual MODIS VCF product had high temporal precision, which allowed to map global forest cover change at an annual resolution.

GlobCover land cover
The GlobCover project, conducted by the European Space Agency, was designed to generate a land cover map of the world using 300 m data from the Medium Resolution Imaging Spectrometer Instrument on-board the ENVISAT satellite (Bicheron et al. 2008). Input data higher-resolution land cover data-sets were aggregated to 5-km resolution to derive reference percent forest cover; (5) supervised regression tree models were built to relate reference forest cover with forest cover of each product as well as the cross-product agreement information; (6) trained models were applied to generate an integrated percent forest cover map at 5-km resolution. The training-and-prediction procedure was conducted at a continent-by-continent basis and the performance of the model was evaluated based on a 10-fold cross validation for each continent.

Results
It has been demonstrated that the general data fusion approach effectively improved forest cover characterization over North America . Compared with , who used NLCD over the conterminous US as reference for North America and reported an average error of 11.75% and an r 2 of 0.87, augmenting the training data by including EOSD over Canada in this study has reduced the average error to 10.54% and increased the r 2 to 0.89 (Table 1), highlighting the importance of having a sufficient training distribution for the data integration methodology. Across all continents, the trained regression tree models performed better in South America and Africa than in North America, Eurasia, and Oceania. The training accuracies for every continent were all comparable but marginally higher than the 10-fold cross validation accuracies, indicating robustness of the trained predictive models. However, it should also be noted that Eurasia had the worst accuracy and also the largest difference between the training and the cross validation accuracy with an r 2 decreased from of ecosystems including the Brazilian Cerrado, tropical montane forests in Colombia, agricultural areas in Uruguay, temperate forests in southern Chile, and desert in northern Chile. Similarly, the vector-based Africover database (http://www.glcn.org/activities/afri-cover_en.jsp), the 60 m resolution FCC product over Democratic Republic of the Congo (DRC) (Potapov et al. 2012), the South African National Land Cover product (Fairbanks et al. 2000), and augmented training sites were used in Africa. The CORINE Land Cover database over Europe (Bossard, Feranec, and Otahel 2000), the 60 m FCC product over European Russia (Potapov, Turubanova, and Hansen 2011), China's National Land Cover (Liu et al. 2002), and the FCC data-sets generated by the Northern Eurasia Earth Science Partnership Initiative were employed in Eurasia (Pflugmacher et al. 2011). The Australian National Vegetation Information System Major Vegetation Groups map (Thackway et al. 2007) and augmented Landsat-based training data were used in Oceania.

Machine learning-based data fusion
Details of the data integration method and demonstration of its effectiveness have been reported in ). Here we provide a concise summary of the six key steps comprising the method: (1) all global data-sets were standardized to fractional forest/non-forest canopy cover layers based on their original definition of forest; (2) standardized forest cover layers were aggregated in an equal-area projection to a coarser 5-km spatial resolution to derive percent forest cover layers; (3) standardized forest cover layers were also stacked to create a series of cross-product agreement metrics; (4) regional Figure 2. regional land cover data-sets collected or generated for the integration of global land cover products. Black-and-white layer shows the location of each regional data-set, whereas the green background layer is the moDis Vcf tree cover.
notes: for north america, two national-scale wall-to-wall products eosD and nlcD were shown on the map but only a subsample of the data-sets was used as training in the integration model song et al. (2014), the amount of which was on the similar order of magnitude as other continents.
the border between China and Russia, where forests are subject to frequent logging activities and wildfires.

Discussion
Extending a previous study, this paper has applied a data fusion approach for improving global forest cover mapping. However, forests are one of the most easily distinguished vegetation types using remotely sensed data (Hansen and Loveland 2012). Other forms of vegetative land cover such as woodland, grassland, cropland, and wetland show similar, if not necessarily larger, magnitude of discrepancies among different global land cover products ( Figure 5). Existing studies on land cover data integration mainly focused on either forest (Schepaschenko et al. 2011(Schepaschenko et al. , 2015 or cropland Waldner et al. 2016), whereas studies rarely exist for improving global-scale woodland, grassland or wetland mapping. Accurately characterizing woodland and grassland is important not only for understanding the global carbon cycle (Poulter et al. 2014), but also for understanding agricultural production systems, as the world's livestock are mainly produced on pasture land. Wetlands arguably have the largest monetary value in terms of the ecosystem goods and services that they provide to human society (Costanza et al. 2014). However, wetlands appear to be the most poorly mapped land cover at the global scale. The type-and scale-invariant feature of our developed data integration approach may be applied to these major vegetative land cover types. The first step is to understand where and why discrepancies occur ( Figure  5). Creating spatially explicit presentation of agreement data using independent products not only reveals the persistent challenges of land cover mapping (that is, disagreement regions) but also provides a cost-effective way of selecting reliable training (agreement regions) for new 0.81 to 0.76. This was largely due to the vastly diverse forest landscapes across the continental land mass and the considerable discrepancies among the various input data products (see Figure 1(a) for more details).
In addition to the above quantitative evaluation, visual inspection revealed that the integrated forest cover product depicted well-known patterns of forest cover across the globe (Figure 3). At 5-km resolution, a resolution that is too coarse to derive accurate area estimate (Hansen, Stehman, and Potapov 2010;Hansen et al. 2008;Mayaux and Lambin 1995), broad-scale forest cover patterns are clearly delineated, even in complex landscapes (Figure 4). Figure 4(a) focused on central Mato Grosso, Brazil, a transition region from humid Amazonian forests in the west to semi-arid Cerrado biome in the east. Located in the middle of the arc of deforestation, fragmented forest patches as a result of extensive human clearing are also clearly shown. Figure  4(b) showed the decreasing gradient of forest cover density from closed-canopy Congolese rainforests to woody savannas in southern Democratic Republic of the Congo. Figure 4(c) highlighted the difference of forest cover between primary forests (dark green patches) and oil palm plantations (light green patches) that replaced natural forests in Sumatra, Indonesia. Figure 4(d) shows the moderate-to-high density temperate/boreal forests along  Figure 3. the integrated global forest cover map at 5-km spatial resolution.   land cover product generation (Zhang et al. in review). Potential issues of implementation include to reconcile various definitions of land cover and to collect sufficient regional high-quality data for reference. Evaluating the feasibility of method is a subject of our future research.

Conclusions
Global-scale land cover characterization features persistent challenges regardless of the increasing spatial resolution. Data fusion has been proved to be a viable means of improving land cover mapping. The machine learning-based data integration method we have developed provides a generic framework for synthesizing the knowledge of experts who focus on global-scale land cover mapping and the knowledge of experts who specialize in a specific region, while the latter can be used as reference to calibrate the former. The effectiveness of the approach was fully demonstrated in  for North America and the global-scale application was shown in this paper. We expect the type-and scale-invariant nature of the method to be useful for improving global-scale characterization of other major land cover types such as cropland, woodland, grassland, and wetland in future research. The ensemble approach can also be internalized for improving data quality when generating a global land cover product, where multiple versions can be produced and subsequently integrated.

Funding
This study is a contribution to the Global Forest Cover Change project (

Notes on contributors
Xiao-Peng Song is a postdoctoral research associate at the Department of Geographical Sciences, University of Maryland, USA. He received his PhD in Geographical Sciences from University of Maryland, BSc in Geographic Information System, and BSc in Economics (double major) from Peking University, China. His research interests focus on mapping and monitoring land cover and land use change at local to global scales. He is also a recipient of the NASA Earth and Space Science Fellowship.
Chengquan Huang is a research professor at the Department of Geographical Sciences, University of Maryland, USA. He received his PhD in Geography from University of Maryland, MSc in Environmental Sciences, and BSc in Geology from Peking University, China. He is leading efforts to quantify forest change at national to global scales using Landsat data, and is developing approaches for mapping forest structure and biomass change by integrating field inventory data, airborne or space borne lidar, and Landsat derived forest disturbance history.