Big Earth data analytics: a survey

ABSTRACT Big Earth data are produced from satellite observations, the Internet of Things, model simulations, and other sources. The data embed unprecedented insights, with spatiotemporal stamps, into relevant Earth phenomena for improving our understanding of, response to, and solutions for the challenges of Earth sciences and applications. In the past years, new technologies (such as cloud computing, big data, and artificial intelligence) have gained momentum in addressing challenges of using big Earth data for scientific studies and geospatial applications that were historically intractable. This paper reviews big Earth data analytics from several aspects to capture the latest advancements in this fast-growing domain. We first introduce the concepts of big Earth data. The architecture, various functionalities, and supporting modules are then reviewed from a generic methodology aspect. Analytical methods supporting the functionalities are surveyed and analyzed in the context of different tools. The driving questions are exemplified through cutting-edge Earth science research and applications. A list of challenges and opportunities is proposed for different stakeholders to collaboratively advance big Earth data analytics in the near future.


Introduction
The 21st century has witnessed the increasing availability of information about the Earth surface, atmosphere, ocean, solid Earth, and beyond through cutting-edge technologies, including new satellite observations, in-situ sensors, the Internet of Things, and social sensing contributed by human beings (NASA, 2017a). The massive amount of spatiotemporal data obtained can be used to understand our Earth system as a whole for addressing scientific challenges, such as climate change and global warming (Faghmous & Kumar, 2014), the increasing intensity and frequency of disasters and better preparedness for dealing with them (Yu, Huang, Qin, Scheele, & Yang, 2019; Yu, Yang, & Li, 2018), and ecological change and the reduction of its impacts (e.g. invasive species, Jeltsch et al., 2013). The data have also been used to tackle application challenges such as urban and land planning for sustainable development (Yu, Liu, Wu, Hu, & Zhang, 2010) and urban heat island analysis, supported by computing infrastructure and analytical tools from both commercial and academic domains.
A key to these efforts is the strategies to analyze the big Earth data to facilitate answering those scientific and application questions.
To better capture the recent advancements of big Earth data analytics and illuminate future research directions, we conducted this survey of the landscape of big Earth data analytics.

Architecture
Each Earth data system has its own architecture and system approach. A generic multidimensional architecture supporting the lifecycle of transforming data to knowledge is illustrated in Figure 1. This generic architecture contains the technological aspect on the front face, science and application domains on the top face (detailed in section 4), and stakeholder participation on the right face (detailed in section 5). The technological aspect (detailed in section 3) includes infrastructural support for processing, data stores for archiving and access, data analytical methods for information extraction and knowledge generation, and interfaces for user interaction.

Infrastructural support
Most big Earth data analytical systems have been or are being migrated to a cloud computing environment for rapid prototyping, result sharing, and reproducible research (Peng, 2011). Some choose the private cloud as it allows for full control (Doelitzscher, Sulistio, Reich, Kuijs, & Wolf, 2011), but most adopt the public cloud, where a third-party cloud provider performs the updates and maintenance of computing resources (Varia & Mathew, 2014). For example, Mapbox uses Landsat on Amazon Web Services to power Landsat-live, a browser-based map that is constantly refreshed with the latest imagery from the Landsat 8 satellite (Yang, Yu, Hu, Jiang, & Li, 2017a). An emerging trend is to use a hybrid cloud, a combination of these two paradigms that inherits the advantages of both by keeping sensitive data/systems in a private cloud while supplying services to the public through a public cloud (Jin et al., 2017).

Figure 1. A generic system architecture of big Earth data analytics.
Cloud computing can support a sustainable archive, access to different computing node types, virtual desktops, and collaboration on data analytics. But for large-scale, tightly coupled big data analytics or modeling, high-performance computing remains the solution for modeling, colocation of computing and data, data assimilation, and inverse problems (Huang et al., 2013). For example, NASA has been planning to support up to 1.6 exabytes of climate data with a 0.75 km resolution and global coverage (Lee, 2018). This entails integrating datasets from the global Goddard Earth Observing System Model (GEOS), the Global Modeling and Assimilation Office (GMAO), and other sources with sufficient computing and storage capacity to a) provide data/analytical/knowledge services, b) support artificial intelligence/machine learning/deep learning for inference, and c) engage PB-level data to support comprehensive analytics and data fusion.
Graphics processing unit (GPU) computing has boosted the simulation and analytics of Earth and space phenomena, demonstrating significant speedups over conventional central processing units (CPUs) (Madhukar, 2019). For example, the calculation of aerosol optical depth from Moderate Resolution Imaging Spectroradiometer (MODIS) satellite data using GPUs can be 43 times faster than using CPUs. Numerical simulation can also be accelerated using GPU computing: a large-scale simulation of seismic wave propagation on GPUs was 45-fold faster than on CPUs while maintaining precise accuracy (Okamoto, Takenaka, Nakamura, & Aoki, 2013).
Recent computing advancements also distribute some computing tasks to the edge of the infrastructure, for example, the smart things at the edge of the Internet of Things and the mobile devices of the mobile Internet. These paradigms, termed mobile computing and edge computing, conduct early processing or preprocessing of data collected at the sensor side, provide end-user visualization, and facilitate user interaction.
While the computing infrastructure powers big data analytics, network and security infrastructure as well as monitoring, scheduling, managing, and integration infrastructure enables the computing and analytics to be operated in a smooth, dynamic, safe, and easy-to-use fashion.

Data sources, ingestion, and store
Another important module of the system architecture is the data store, which is responsible for the archiving of and access to Earth data. Traditionally, Earth science data are categorized into atmosphere, ocean, land, hydrology, and socio-economic data according to their disciplines (Acker & Leptoukh, 2007). New data sources in the big data era have expanded to real-time location tracking, observations of the urban environment, and social media data from citizens (Mayer-Schönberger & Cukier, 2013).
Depending on the nature and usage of Earth data, they are traditionally stored in a file system, a relational database, or a NoSQL database. For example, real-time location tracking data are usually stored in a Relational Database Management System (RDBMS) (Tian, Jiang, Chen, Li, & Mu, 2014). Several efforts have been made to store geospatial coverages structured as arrays in array-based databases, as such coverages are not well suited to traditional RDBMSs (Baumann, 2014). Indices, such as recent spatiotemporal indices, are built on top of the data to accelerate data access. Some have attempted publishing data in the form of "Linked Data" with the linking technologies provided by the Web to enable the virtual integration of a globally distributed database (Goodwin, Dolbear, & Hart, 2008).
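The idea behind such spatial indices can be illustrated with a minimal sketch: a grid-based index buckets points into fixed-size latitude/longitude cells so that range queries only scan nearby cells rather than the whole archive. The class, station names, and coordinates below are hypothetical, and real spatiotemporal indices (e.g. in array databases) are far more sophisticated.

```python
from collections import defaultdict

class GridIndex:
    """A minimal grid-based spatial index: points are bucketed into
    fixed-size lat/lon cells so range queries only scan nearby cells."""

    def __init__(self, cell_deg=1.0):
        self.cell_deg = cell_deg
        self.cells = defaultdict(list)  # (row, col) -> list of (lat, lon, payload)

    def _key(self, lat, lon):
        return (int(lat // self.cell_deg), int(lon // self.cell_deg))

    def insert(self, lat, lon, payload):
        self.cells[self._key(lat, lon)].append((lat, lon, payload))

    def query(self, lat_min, lat_max, lon_min, lon_max):
        """Return payloads of points inside the bounding box."""
        r0, c0 = self._key(lat_min, lon_min)
        r1, c1 = self._key(lat_max, lon_max)
        hits = []
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                for lat, lon, payload in self.cells.get((r, c), []):
                    if lat_min <= lat <= lat_max and lon_min <= lon <= lon_max:
                        hits.append(payload)
        return hits

index = GridIndex(cell_deg=1.0)
index.insert(38.9, -77.0, "station-A")
index.insert(40.7, -74.0, "station-B")
index.insert(51.5, -0.1, "station-C")
print(index.query(38.0, 41.0, -78.0, -73.0))  # -> ['station-A', 'station-B']
```

A time dimension can be added analogously by bucketing timestamps, which is essentially what spatiotemporal indices do at scale.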
Access to the data store generally follows two methods: real-time import and batch import. Real-time processing requires continual input, constant processing, and steady output of data, while batch import is less time-sensitive (Marz & Warren, 2015). A good example is an Earth data search engine in which a Web crawler constantly collects data over the Internet and stores them in a pre-defined location (Bambacus et al., 2017). The data store supports access to different formats through interoperable APIs (Application Programming Interfaces) and conducts preprocessing, including reprojection, upscaling/downscaling, and data fusion.
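The contrast between the two import styles can be sketched in a few lines: a streaming ingest emits results as each observation arrives, while a batch ingest waits for the full dataset. The temperature values and window size are purely illustrative.

```python
def stream_ingest(records, window=3):
    """Real-time style: process records one at a time, emitting a running
    mean over a sliding window as each observation arrives."""
    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) > window:
            buffer.pop(0)
        yield sum(buffer) / len(buffer)

def batch_ingest(records):
    """Batch style: wait for the full dataset, then compute once."""
    return sum(records) / len(records)

temps = [20.0, 21.0, 23.0, 22.0]
print(list(stream_ingest(temps, window=2)))  # running means as data arrive
print(batch_ingest(temps))                   # single result after the fact
```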

Data discovery and analytics
As a step prior to performing any data analytical tasks, traditional data discovery relies on open-source technologies such as Solr and Elasticsearch (Nogueras-Iso, Zarazaga-Soria, Béjar, Álvarez, & Muro-Medrano, 2005). Metadata of these data are often stored in a full-text search engine (e.g. Apache Lucene) (Jiang, Yang, Xia, & Liu, 2016), which can be searched much like a Google search. Recent endeavors have started to integrate smart capabilities, e.g. query understanding, ranking, and recommendation, based on artificial intelligence advancements (Jiang et al., 2018; Li, Goodchild, & Raskin, 2014; Wiegand & García, 2007). Common Earth data analytical functions range in complexity from simple numerical functions to raster and vector operations, visualization and exploration, and machine learning. More details of analytical functions are reviewed in the next section.
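The core data structure behind full-text engines such as Lucene is the inverted index, which maps each term to the records containing it. A minimal sketch follows; the catalog entries and dataset IDs are hypothetical, and real engines add tokenization, ranking, and much more.

```python
from collections import defaultdict

def build_inverted_index(metadata):
    """Map each term to the IDs of metadata records containing it,
    the core structure behind full-text search engines."""
    index = defaultdict(set)
    for doc_id, text in metadata.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return record IDs matching all query terms (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())
    return results

catalog = {
    "ds1": "MODIS aerosol optical depth daily global",
    "ds2": "Landsat 8 surface reflectance global",
    "ds3": "MODIS land surface temperature daily",
}
idx = build_inverted_index(catalog)
print(sorted(search(idx, "MODIS daily")))  # -> ['ds1', 'ds3']
```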
Distributed computing technologies are widely adopted across different existing systems (Agrawal, Das, & El Abbadi, 2011) for big data analytics. Apache Spark and Hadoop MapReduce are two typical open-source distributed solutions for big data analytics. The former is usually much faster, as the latter reads and writes from disk more often (Zaharia et al., 2012). For example, one study proposed a workflow to accelerate Weblog mining using Spark.
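The map-shuffle-reduce pattern that both frameworks implement can be sketched in pure Python as a single-process word count; Spark and Hadoop add distributed execution, fault tolerance, and (in Spark's case) in-memory caching on top of exactly this pattern. The sample records are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, 1) pairs for each term in each record."""
    for record in records:
        for term in record.split():
            yield term, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values."""
    return {key: sum(values) for key, values in groups.items()}

logs = ["flood flood drought", "drought fire", "flood"]
counts = reduce_phase(shuffle(map_phase(logs)))
print(counts["flood"], counts["drought"], counts["fire"])  # -> 3 2 1
```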

Interfaces
The top layer is the interface of the big Earth data analytical systems to support various types of end users: scientists, students, engineers, and decision makers (Nativi et al., 2011). Considering their different backgrounds, multiple interfaces (e.g. client libraries, code editors, Web portals) have been designed to allow various users to interact with a variety of system functions such as search, access, analytics, and visualization. For example, Google Earth Engine's services can be accessed and controlled through an Internet-accessible API and an associated web-based interactive development environment that enables rapid prototyping and visualization of results (Gorelick et al., 2017). The ArcGIS API for Python, in contrast to arcpy, serves as an analytical library for working with geospatial data and analysis algorithms, powered by web GIS. A new form of service is to provide the climate modeling process as a computational service (Li et al., 2017b). Web-based programmable interaction is also becoming the norm.
To present a comprehensive survey, the data analytics is detailed in Section 3. The science and application aspect is detailed in Section 4. Section 5 describes the stakeholders and their engagement in the challenges and opportunities of advancing big Earth data analytics.

Big Earth data analytics
Big Earth data analytics include the analytical lifecycle of preparing, reducing, analyzing, mining, and visualizing large amounts of spatiotemporal and spectral data, encompassing a variety of data types (Kempler & Mathews, 2017). The volume, velocity, variety, and veracity of the acquired data pose grand challenges in processing the data for value. The analytical process enables the discovery of patterns, correlations, principles, knowledge, and other information for better understanding our Earth system and responding to problems induced by global and regional changes (Bhattacharyya & Ivanova, 2017). The following sections summarize the literature from different aspects of big Earth data analytics.

Data preprocessing
From the massive volume of rapidly streaming big Earth data, relevant data are examined and cleaned by preprocessing the raw data, which may be redundant, inconsistent, and noisy (Bhattacharyya & Ivanova, 2017). As an important early phase, preprocessing consumes around 50-80% of the entire time for data analytics (Kempler & Mathews, 2017). Data preprocessing mainly focuses on extraction, transformation, quality evaluation, reduction, and augmentation (Theodorou, Jovanovic, Abelló, & Nakuçi, 2017; Wang, Hu, Sha, & Han, 2017). Table 1 summarizes the reviewed preprocessing methods.
Quality evaluation helps inform the accuracy of the analytical results, either by improving the data quality (e.g. bias correction) or by choosing better data resources (e.g. sensitivity analysis and the Taylor diagram) (Chun & Guldmann, 2014; Taylor, 2001).
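The simplest form of bias correction, mean-bias removal, can be sketched as follows; the modeled and observed series are hypothetical, and operational schemes (e.g. quantile mapping) are considerably more elaborate.

```python
def bias_correct(modeled, observed):
    """Mean-bias correction: shift modeled values by the mean
    model-minus-observation difference over a calibration period."""
    bias = sum(m - o for m, o in zip(modeled, observed)) / len(observed)
    return [m - bias for m in modeled], bias

model = [22.0, 24.0, 26.0, 28.0]  # e.g. modeled temperatures (hypothetical)
obs   = [21.0, 23.0, 25.0, 27.0]  # e.g. station observations (hypothetical)
corrected, bias = bias_correct(model, obs)
print(bias)        # -> 1.0
print(corrected)   # -> [21.0, 23.0, 25.0, 27.0]
```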
Data reduction seeks to reduce data redundancy and duplication while optimizing data storage. The reduction in big data at the early stage enhances the data management and data quality. Therefore, it improves the indexing, storage, analysis, and visualization operations of big data systems (Rehman et al., 2016). The methods for data reduction can be categorized into network theory, redundancy elimination, dimension reduction, and data mining. Methods chosen for reduction depend on the objective of specific Earth data analytics. For example, for the climate dynamics study, Donges, Zou, Marwan, and Kurths (2009) constructed climate networks from the global climatological data set using the linear Pearson correlation coefficient and the nonlinear mutual information as a measure of dynamical similarity between regions. Stateczny and Wlodarczyk-Sielicka (2014) utilized artificial neural networks to reduce big hydrographic data acquired from the deep seas. Lidar data can be processed by using clustering, noise detection, vertex decimation approaches, and feature selections to reduce the data size (Rehman et al., 2016).
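The correlation-based linking step used in climate network construction can be sketched as follows; the regional time series and the 0.5 threshold are purely illustrative, and Donges et al. also use nonlinear mutual information, which this sketch omits.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two time series,
    a linear similarity measure for linking regions in a climate network."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

region_a = [0.1, 0.4, 0.3, 0.8, 0.6]  # hypothetical anomaly series
region_b = [0.2, 0.5, 0.4, 0.9, 0.7]  # tracks region_a closely
region_c = [0.9, 0.2, 0.7, 0.1, 0.3]

# Link regions whose correlation exceeds a chosen threshold (e.g. 0.5)
print(pearson(region_a, region_b) > 0.5)  # -> True
print(pearson(region_a, region_c) > 0.5)  # -> False
```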
Data augmentation is a common technique used in data mining. It refers to the creation of altered copies of each instance within a training dataset, as there is usually a lack of data to train a model, especially to fine-tune the huge number of parameters in a deep neural network. In the context of satellite imagery processing, additional training samples can be obtained by rotating the original image, changing lighting conditions, cropping differently, etc.
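Two of these geometric augmentations, rotation and mirroring, can be sketched on a tiny raster tile; the pixel values are hypothetical, and libraries typically also apply random crops, brightness shifts, and other transforms.

```python
def rotate90(image):
    """Rotate a 2-D raster 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2-D raster left-to-right."""
    return [row[::-1] for row in image]

tile = [[1, 2],
        [3, 4]]

# Each altered copy is a new training sample carrying the same label.
augmented = [tile, rotate90(tile), flip_horizontal(tile)]
print(rotate90(tile))         # -> [[3, 1], [4, 2]]
print(flip_horizontal(tile))  # -> [[2, 1], [4, 3]]
```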

Data analytical methods
After preprocessing, the main focus of data analytics is to reveal hidden patterns, unknown correlations, and other useful information from a large volume of heterogeneous data to facilitate Earth science studies. Big Earth data analytics support all aspects of Earth science research, such as hypothesis- and data-discovery-driven methods, dynamical models, and goal-driven decisions (Kempler & Mathews, 2017). The involved methods can be categorized into model simulation and prediction, statistics, machine learning, and deep learning (Table 2).

Model simulation and prediction
Numerical models have long been used to simulate and predict the status of the Earth, including the solid Earth, land surface, biosphere, atmosphere, and oceans, within a certain period of time. With the increasing availability of Earth observation and sensing data, the predictive capabilities of numerical models have been improved by assimilating observations with simulations, which significantly improves simulation accuracy (Courtier et al., 1994). For example, air quality forecasting models predict the levels of pollutant concentration in the atmosphere that are harmful to human health (Binkowski & Roselle, 2003). General circulation models predict the global circulation of a planetary atmosphere and ocean to better understand the climate and project climate change. Extreme weather events, such as tropical cyclones and wildfires, can also be predicted using numerical simulation (Shuman, 1989). Beyond physical processes, social processes occurring on the Earth can also be simulated and predicted using agent-based modeling or cellular automata leveraging big data collected through IoT or social media. For example, transportation problems, such as travel demand and traffic congestion, can be predicted and resolved through agent-based modeling (Bernhardt, 2007).

Traditional statistical methods
Traditional statistical methods, usually based on certain assumptions, are widely used to discover the loose and complex relationships between variables and improve our understanding of the geographical distribution and frequency distribution of big Earth data (Borradaile, 2013). They use quantitative methods to sample geospatial datasets, handle orientation data and regionalized variables, study multivariate systems, and identify cyclicity and patterns (Cressie, 2015). Simple statistics (e.g. mean and variance) can characterize Earth data. The confidence range quantifies the confidence of the estimation/analytical results. Hypothesis testing provides ways to compare the distributions of two sample datasets. Various correlation, interpolation, and extrapolation methods enable prediction from known values (Kalkhan, 2011). Regression methods (e.g. logistic regression, linear/nonlinear regression, hierarchical regression) detect underlying relationships in the Earth systems (Zhou et al., 2017). Clustering methods (e.g. DBSCAN, density-based clustering, WaveCluster) group similar spatial objects into classes, benefitting the identification of areas of similar land use in Earth observation data (Han et al., 2001). Classification methods (e.g. regression trees, autocorrelation, spectral schemes) assign pixel values from Earth observation to a particular category (e.g. land cover) (Getis, 1999). Statistical model reduction methods (e.g. Bayesian methods, approximation theory, error estimation, stochastic modeling, and Monte Carlo methods) reduce the computational complexity of mathematical models in numerical simulations (Lieberman et al., 2010). However, statisticians need to understand how the data are collected, the statistical properties of the estimator (e.g. p-value), and the underlying data distribution (Alpaydin, 2014) to effectively apply statistics to Earth data.

Machine learning methods
Evolving from artificial intelligence, machine learning methods develop models based on characteristics and features learned from empirical data and can infer unknown problems and discover unknown patterns (Sellars et al., 2013). Machine learning methods generally have an advantage over traditional statistical methods in understanding non-linear relationships, and this advantage can be leveraged to model high-dimensional and nonlinear data with complex interactions and missing values, which is particularly the case for big Earth data (Thessen, 2016). Derived from statistical methods, regression, classification, and clustering can also be used as machine learning methods, so the exact division between machine learning and statistical methods is not always clear. For example, artificial neural networks can produce regressions approximating and predicting ecological conditions (Franceschini et al., 2019). Machine learning classifiers, including Random Forests, Support Vector Machines, and Bayesian classifiers, can produce the probability of an observation belonging to a specific class of Earth process, such as landslide (Hong et al., 2016).
Clustering can group observations based on similarity, which is useful in detecting rare events such as fire (Chakraborty & Paul, 2010; Khatami et al., 2017). Fuzzy inference and some tree-based machine learning methods (e.g. decision trees) can extract a set of rules from observations to make predictions, such as of forest cover and change (Sexton et al., 2016).
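How clustering separates a rare event from the background can be sketched with a minimal one-dimensional k-means; the brightness temperatures and initialization are hypothetical, and operational fire detection uses multi-band tests rather than this toy grouping.

```python
def kmeans_1d(values, k=2, iters=10):
    """Minimal 1-D k-means: group observations by similarity, e.g. to
    separate anomalously hot pixels from the background."""
    centers = [min(values), max(values)]  # simple initialization for k=2
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centers[c]))
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

temps = [300, 302, 301, 390, 385, 299]  # hypothetical brightness temperatures (K)
centers, clusters = kmeans_1d(temps)
print(sorted(clusters[1]))  # the hot (candidate fire) cluster -> [385, 390]
```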

Deep learning methods
Deep learning methods, evolving from machine learning, offer unique capabilities in extracting and presenting features at different and detailed levels from Earth data (Manning, 2015; LeCun, Bengio, & Hinton, 2015). These features and characteristics are extremely important in Earth data classification and segmentation tasks. Due to its greater expressive power and parameter optimization capability, deep learning has achieved great performance in computer vision, natural language processing, recommendation systems, and other fields (Collobert & Weston, 2008; Krizhevsky et al., 2012; Schmidhuber, 2015). For example, deep convolutional neural networks (CNNs), e.g. AlexNet (Krizhevsky et al., 2012), VGGNet (Chatfield et al., 2014), and PlacesNet (Zhou et al., 2014), can achieve satisfactory results in classifying scenes from high-resolution remote sensing imagery into categories such as airport, bridge, desert, forest, and so on. Beyond image classification, objects can be detected and segmented from Earth datasets using deep learning techniques (Cimpoi et al., 2015; Girshick et al., 2014). Deep learning methods can also help increase the computational efficiency of numerical simulations (e.g. weather prediction) while maintaining reasonable accuracy (Wang et al., 2018).
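At the heart of these CNNs is the convolution operation. A minimal pure-Python sketch follows, with a toy image and a hand-crafted edge kernel rather than a trained network; in practice the kernels are learned and stacked over many layers.

```python
def conv2d(image, kernel):
    """Valid-mode 2-D convolution (no padding), the core operation of
    convolutional layers in networks such as AlexNet and VGGNet."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge kernel responds where pixel values change left to right.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge = [[1, -1],
        [1, -1]]
print(conv2d(image, edge))  # -> [[0, -2, 0], [0, -2, 0]]
```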
We selected popular tools to analyze how they support different big Earth data analytics and compared them (Table 3) from the aspects of scalability, analytical methods, programming languages, and graphical user interface (GUI).

Driving sciences and applications
Big Earth data analytics are critical for answering many scientific and application questions. These questions assist us in better understanding and determining the future directions of the geosciences. The analytical functions (described in Section 3) provide the capability of extracting knowledge from raw Earth data, and that knowledge can be applied to the driving questions discussed in the following subsections.

Global warming & climate change
Fundamentally, climate science is a field focused on studying large-scale changes in the land, atmosphere, oceans, and cryosphere over long temporal periods (years, decades, centuries) (Faghmous & Kumar, 2014). A variety of Earth science data is needed to investigate the response of those features to climate change (Guo et al., 2015). With the development of Earth observation technology, a large amount of scientific big data has been generated through various aspects of Earth observation, geophysics, geochemistry, geological surveying, and ground sensor networks (Guo et al., 2017). Utilizing multiple Earth observation platforms, comprehensive, precise, continuous, and varied information can be provided to simultaneously and dynamically monitor the Earth surface. The information is obtained from multiscale assimilations (De Lannoy et al., 2012), ground-based observation (Dorigo et al., 2015), and satellite observation (Liang et al., 2013). The multi-source technology offers higher precision and spatiotemporal stability, and extends the data dimension to monitor the dynamics of the Earth surface (Asner et al., 2012), including drought (Zhang & Jia, 2013), water vapor (Liu et al., 2013), land surface temperature (Maimaitiyiming et al., 2014), and vegetation (Guay et al., 2014). These data can be used to analyze climate change factors and their spatiotemporal patterns (Faghmous & Kumar, 2014). Climate science focuses more on understanding natural phenomena systematically than on predicting a certain event, thus emphasizing the explainability of the data analytical methods (Caldwell et al., 2014). Therefore, multi-source, high-dimensional climate datasets are critically needed to explain the spatiotemporal patterns of and significant correlations between climate-inducing factors. Table 4 summarizes the domains supported by big Earth data analytics with exemplary methods and best practices.

Natural resources & environment
Natural resources have been over-exploited by humankind, causing loss and degradation of habitats and depleting biological diversity (Smil, 2013). Human beings, especially marginalized and vulnerable communities, need to adapt to the rapidly changing environment and its corresponding adverse circumstances, drawing attention to natural resource conservation and the sustainable use of biological diversity (Collen et al., 2013). The capability to monitor the impact of biological diversity and global environmental change is crucial to designing effective adaptation and mitigation strategies to prevent further loss of natural resources (Pettorelli et al., 2014). This requires the scientific community to obtain datasets and assess the spatiotemporal changes in the distribution of atmospheric, ocean, and land surface conditions, and in the distribution and function of natural resources. Big Earth data are the source for mapping the distribution of natural resources, especially over large areas, including forest cover change (Hansen et al., 2013), vegetation cover (Karnieli et al., 2013), and biodiversity dynamics (Jeltsch et al., 2013; Kuenzer et al., 2014). Monitoring and assessing environmental pollution over the long term also requires big Earth data. Satellite observations, for example, have been used to analyze European nighttime lights over 15 years, showing complex patterns of light pollution (Bennie et al., 2014), and provide insight into global long-term changes in air, water, and soil pollution (Fingas & Brown, 2014; Lehmann et al., 2015; Lin et al., 2015; Schmidt et al., 2015; Van Donkelaar et al., 2015).

Precision agriculture & land evaluation
Precision agriculture, defined as "a management strategy that uses information technology to bring data from multiple sources to bear on decisions associated with crop production" (National Research Council [NRC], 1997), requires crop information spanning different spectral and spatiotemporal resolutions to gather information on crop area, type, condition, calendar, and yield (Whitcraft et al., 2015). Crop information with high spatiotemporal resolutions is required for in-field monitoring (Kross et al., 2015). Unmanned Aerial Vehicle (UAV)-based remote sensing offers a fast and easy way to acquire field data for precision agriculture applications (Candiago et al., 2015). UAV platforms (multi-rotors, swinglets, model helicopters, etc.), coupled with imaging, ranging, and positioning sensors, are able to collect multispectral imagery at cm-level resolution and offer great possibilities in the precision farming domain (Bendig et al., 2012; Guo et al., 2012; Primicerio et al., 2012). Land evaluation, defined as "the assessment of land performance considering the economics of the enterprises, the social consequences for the people of the area and the country concerned, and the consequences, beneficial or adverse, for the environment" (George, 2005), serves as an essential part of land use planning. Earth observation (EO) has enabled the monitoring of land evaluation in a spatially and temporally continuous way with global coverage by providing vegetation productivity and/or loss as proxies (de Jong, de Bruin, Schaepman, & Dent, 2011). With big Earth data increasingly available on public cloud platforms such as Google Earth Engine, land evaluation, such as agricultural suitability assessment, can be conducted globally online (Yalew, Van Griensven, & van der Zaag, 2016).
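A standard vegetation-productivity proxy derived from such multispectral imagery is the Normalized Difference Vegetation Index, NDVI = (NIR - Red) / (NIR + Red), computed per pixel. A minimal sketch follows; the band values are hypothetical digital numbers, not reflectances from any real sensor.

```python
def ndvi(nir, red):
    """Per-pixel Normalized Difference Vegetation Index:
    (NIR - Red) / (NIR + Red), a standard proxy for vegetation vigor."""
    return [[(n - r) / (n + r) if (n + r) else 0.0
             for n, r in zip(nrow, rrow)]
            for nrow, rrow in zip(nir, red)]

# Hypothetical 2x2 tiles of near-infrared and red digital numbers
nir_band = [[80, 60],
            [20, 50]]
red_band = [[20, 20],
            [20, 50]]
print(ndvi(nir_band, red_band))  # -> [[0.6, 0.5], [0.0, 0.0]]
```

High values indicate dense, healthy vegetation; values near zero indicate bare soil or senescent crops, which is how in-field variability is mapped.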

Hazards & risk
Satellite remote sensing technology provides a quantitative opportunity for pre-disaster detection and post-disaster damage assessment to assist response operations (Plank, 2014; Skakun et al., 2014; Yamazaki & Liu, 2016). Visual damage assessment can be utilized for qualitatively confirming damaged areas (Mas et al., 2015). Change detection using time-series satellite imagery is widely used for post-disaster damage assessment (Skakun et al., 2014). Recent access to a wide range of software, very high-resolution satellite imagery, and active and passive sensors facilitates the collection of data and the analysis and mapping of disaster events within a few hours. The application of remote sensing in disaster management follows the trend towards higher-resolution (Ehrlich, Kemper, Blaes, & Soille, 2013), multidimensional (Kostyuchenko, 2015), and multi-technique (Chini et al., 2013; Pradhan et al., 2016) approaches. UAV networks provide highly efficient situational awareness, with higher resolution and faster capture and processing times (Ezequiel et al., 2014; Kruijff et al., 2012; Murphy & Stover, 2008; Robinson & Lauf, 2013).
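The simplest form of such change detection is pixel-wise differencing of pre- and post-event images against a threshold. A minimal sketch follows; the single-band tiles and threshold are hypothetical, and operational workflows add co-registration, radiometric normalization, and more robust change metrics.

```python
def change_mask(before, after, threshold=50):
    """Pixel-wise change detection: flag pixels whose absolute difference
    between pre- and post-event images exceeds a threshold."""
    return [[1 if abs(a - b) > threshold else 0
             for a, b in zip(arow, brow)]
            for arow, brow in zip(after, before)]

# Hypothetical single-band tiles before and after a flood event
before = [[120, 118, 121],
          [119, 122, 120]]
after  = [[121, 119,  40],
          [120,  35,  38]]
print(change_mask(before, after))  # -> [[0, 0, 1], [0, 1, 1]]
```

The flagged pixels form a candidate damage map that analysts can then confirm visually.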
Passive crowdsourcing via social media has emerged as a tool to communicate information in times of emergency (Kaplan & Haenlein, 2010). Social media are increasingly used by both NGOs and government emergency management agencies to determine public sentiment and reaction to an event (Kavanaugh et al., 2012). It is evident that the multidirectional flows of communication and information facilitated by crisis crowdsourcing online platforms can make response and recovery efforts more efficient and effective (Roche et al., 2013).

Security & defense
Big Earth data have traditionally been used for security and defense intelligence applications through crisis management analysis for enhanced surveillance and proactive decision-making. Surveillance satellites, including IKONOS and GeoEye-1, provide high-resolution imagery for crisis monitoring, change detection, and critical location identification (NRC, 2013). The usage of Unmanned Aerial Systems (UAS) has demonstrated its value for emergency management by providing real-time, data-rich descriptions of the location and movement of a certain crisis or accident (McMullen et al., 2016). Interweaving remote sensing with social sensing constitutes a key advancement in the space and security domain, where useful information can be derived not only from EO products but also from their combination with news articles and user-generated content from social media. Moreover, surveillance videos provide the opportunity to discover valuable information and predict crime from high-velocity datasets, and intelligent systems that facilitate efficient management of surveillance videos have been developed. Distributed storage and computing, along with data retrieval over huge and heterogeneous datasets, provide multiple optimized strategies to enhance the utilization of resources and the efficiency of tasks.

Public interests
Big Earth data connect tightly with government administration and people's daily lives. For example, NASA has established a dedicated platform for civil Earth observations, the Earth Observing System Data and Information System (EOSDIS), which aims to make federal civil Earth observing data more discoverable, usable, and accessible (NASA, 2017a). Small satellites are increasingly developed to monitor specific regions with high spatiotemporal resolutions for the purposes of daily life support, education, and communication of news and media (Helvajian & Janson, 2009).

Challenges and opportunities
Through this holistic review of big Earth data analytics, it is found that there is broad and deep research on utilizing big Earth data analytics to drive understanding and utilization of the Earth system towards a Digital Earth vision for the next decades. Collaboration is key to facilitating the move from data to actionable knowledge by leveraging existing assets with incentives that encourage sharing among big Earth data stakeholders. Existing challenges and opportunities include, but are not limited to, data intensity for data scientists, analytical complexity for methodologists, regulation and cultural complexities for policymakers, system engineering for industry and engineers, and scientific challenges for scientists.

Data scientists
Big data handling methods have been developed from different aspects and within individual domains. How to integrate these methods to handle practical big Earth data at the volume of hundreds of petabytes remains a grand challenge for computing infrastructure, analytical methods, and user manipulation of the data and systems.
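One recurring pattern behind such petabyte-scale handling is to decompose a large raster into independent tiles and distribute the work. The sketch below is a minimal, generic illustration of this idea, not any specific system's API; the raster dimensions and the placeholder per-tile analytic are assumptions:

```python
import concurrent.futures

def tiles(width, height, tile_size):
    """Yield (x0, y0, x1, y1) tile bounds covering a width x height raster."""
    for y in range(0, height, tile_size):
        for x in range(0, width, tile_size):
            yield (x, y, min(x + tile_size, width), min(y + tile_size, height))

def process_tile(bounds):
    # Placeholder analytic: report the pixel count of the tile. A real
    # workflow would read and analyze the pixels within these bounds.
    x0, y0, x1, y1 = bounds
    return (x1 - x0) * (y1 - y0)

# Tiles are independent, so they can be mapped across threads, processes,
# or cluster nodes without coordination.
with concurrent.futures.ThreadPoolExecutor() as pool:
    pixel_counts = list(pool.map(process_tile, tiles(1024, 512, 256)))

total = sum(pixel_counts)  # equals 1024 * 512 when the tiling is exhaustive
```

The same decomposition underlies distributed frameworks used for Earth observation archives, where the tile grid also serves as the storage layout.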
Knowledge-based smart analytics is becoming a requirement in integrated big Earth data analytics, with knowledge derived from Earth science theories, the data collected, and patterns of data and information usage. Feeding discovered knowledge back into discovery, access, analytics, and presentation also drives the adoption of extracted information and knowledge for decision support. Investigations in past years have matured this process and demonstrated early success in using knowledge bases and artificial intelligence methodologies to facilitate big Earth data analytics. This should be a focus of the coming years, as stressed by NASA, the Defense Advanced Research Projects Agency (DARPA), and other agencies and initiatives.
New analytical methods should be developed from three different aspects: newly acquired data, emerging computing infrastructure, and, most importantly, the driving needs of science and applications. Innovations from relevant domains should be leveraged with minimized concerns. Methods can be identified and analyzed across descriptive, diagnostic, predictive, and prescriptive analytics.
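The four analytics levels can be illustrated on a toy time series; the synthetic temperature values, the anomaly threshold, the linear trend model, and the decision rule below are all arbitrary assumptions chosen only to make each level concrete:

```python
# Hypothetical annual mean temperatures (degrees C).
temps = [14.1, 14.3, 14.2, 14.6, 14.8, 15.1]

# Descriptive analytics: what happened? Summarize the record.
mean_temp = sum(temps) / len(temps)

# Diagnostic analytics: why? Flag years deviating from the mean by > 0.4 degC.
anomalies = [i for i, t in enumerate(temps) if abs(t - mean_temp) > 0.4]

# Predictive analytics: what next? Extrapolate the average year-over-year change.
trend = (temps[-1] - temps[0]) / (len(temps) - 1)
forecast = temps[-1] + trend

# Prescriptive analytics: what should be done? A toy decision rule on the forecast.
action = "mitigate" if forecast > 15.0 else "monitor"
```

Real big Earth data pipelines replace each step with far richer methods (statistical summaries, causal attribution, physics-informed or learned forecast models, and optimization), but the progression from description to prescription is the same.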

Geoscientists
Driving science quests and application challenges are key to advancing big Earth data analytics, as the breadth of Earth science and the integration of Earth data with data from other domains can always produce new value for answering new science questions and addressing new application challenges. This aspect includes engaging stakeholders with a curated list of funded, planned, and existing tools and projects. It is also worth revisiting which new questions could be answered, or new challenges addressed, given newly available capabilities.
Spatiotemporal understanding establishes the principle and foundation for integrating big Earth data across sensors, domains, applications, and usages. While it is relatively mature in model and simulation analytics, it needs to be expanded to serve as a core for integrative big Earth data analytics from other aspects in support of integrated sciences. Spatiotemporal understanding also helps us appreciate the health and cost of the integrated systems supporting big Earth data analytics.
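A minimal sketch of what such cross-sensor integration looks like in practice is spatiotemporal binning: observations from different sources are aligned onto a common grid in space and time before they are compared or fused. The observation format `(lat, lon, day, value)`, the 1-degree cell size, and the mean aggregate are simplifying assumptions:

```python
from collections import defaultdict

def spatiotemporal_bins(observations, cell_deg=1.0):
    """Group (lat, lon, day, value) observations into grid-cell/day bins
    and aggregate each bin to its mean value."""
    bins = defaultdict(list)
    for lat, lon, day, value in observations:
        # Floor-divide coordinates into integer cell indices.
        key = (int(lat // cell_deg), int(lon // cell_deg), day)
        bins[key].append(value)
    return {k: sum(v) / len(v) for k, v in bins.items()}

obs = [
    (38.2, -77.5, 1, 10.0),  # same 1-degree cell and day ...
    (38.7, -77.1, 1, 12.0),  # ... so these two observations are averaged
    (38.2, -77.5, 2, 20.0),  # different day -> separate bin
]
binned = spatiotemporal_bins(obs)
```

Once two datasets share the same (cell, time) keys, comparison and fusion reduce to joins on those keys, which is why a shared spatiotemporal frame is the foundation for integrative analytics.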
New ideas, observations, simulations, needs, technologies, analytics, and presentations are interwoven, each driving the advancement of the others, to support the overall advancement of big Earth data analytics for answering grand scientific questions and addressing bold geoscience engineering challenges.

Engineers
While a large number of advanced research projects have been funded to develop new tools and algorithms for big Earth data analytics, the engineering process of fusing the developed technology into operations involves substantial investigation of adopting cloud capabilities, open source, machine learning, advanced search, cloud optimization, and data transformation. This is where industry can be best leveraged.
Grand system architecting is still in its infancy for integrated Earth system understanding and application development at the regional, national, and global scales, while remaining applicable to daily life and practical problems. New architectural methods should be adopted that leverage existing assets, respect the copyright of data, tools, and other intellectual property, and curate the lifecycle of data, analytics, and systems. Workflow management and on-demand analytics have the potential to leverage such assets with reusable components, e.g. compatible cloud containers, workflows, and container/image repositories.
The pricing of big Earth data analytics should be built into the architecture systematically, considering the cost of moving big data into and out of the cloud, price variations with archival (e.g. Glacier) storage and spot instances, running analytics in the cloud, data-proximal archiving, moving compute next to big data, maximizing analytics capability at minimum cost, and improving existing analytics while leveraging external analytics capabilities from commercial and other providers. This aspect also requires monitoring the system and understanding the backbone of the complete architecture.
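The trade-off between moving data out of the cloud and moving compute next to the data can be framed as simple arithmetic. The sketch below uses hypothetical placeholder prices, not actual cloud provider rates, to show why the egress term tends to dominate at scale:

```python
# All prices are hypothetical placeholders for illustration only.
EGRESS_PER_GB = 0.09            # assumed $/GB to move data out of the cloud
IN_CLOUD_COMPUTE_PER_HR = 0.50  # assumed $/hr for an in-cloud instance
LOCAL_COMPUTE_PER_HR = 0.20     # assumed $/hr on local hardware

def egress_and_local(data_gb, hours):
    """Cost of downloading the data, then analyzing it locally."""
    return data_gb * EGRESS_PER_GB + hours * LOCAL_COMPUTE_PER_HR

def compute_in_cloud(hours):
    """Cost of running the analysis next to the data in the cloud."""
    return hours * IN_CLOUD_COMPUTE_PER_HR

# A hypothetical 10 TB analysis running for 100 hours.
data_gb, hours = 10_000, 100
cheaper = min(("egress+local", egress_and_local(data_gb, hours)),
              ("in-cloud", compute_in_cloud(hours)),
              key=lambda pair: pair[1])
```

Even at this modest scale the egress term swamps the compute term, and the gap widens at petabyte volumes, which is the systematic argument for data-proximal analytics in the architecture.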
Everything-as-a-service should be considered for better architectures that can easily leverage assets in a plug-and-play fashion for integrating data, algorithms, and interoperable apps in the cloud. Ensuring the compliance of big Earth data systems with relevant security measures is a challenge; for example, it took Amazon Web Services more than five years to reach FedRAMP Moderate, and handling high security levels in an operational public cloud remains difficult. The interplay between compliance and the openness and usability of big Earth data complicates the systematic process. For example, a large data system under strict security measures may experience uneven service and performance, significant interface coordination, limited on-demand capability, fragmentation, discipline-specific support and tools, varied user support, and optimization only for archiving, search, and distribution.
Computing intensity is still in demand for processing intensive data, especially when complex processes such as modeling and comprehensive analytics are involved. This may take the form of supporting a diverse user community with analyses that do not egress data, facilitating multi-data comparison and fusion, supporting batch, interactive, and streaming processing, providing cost constraints and cost sharing, processing next to data, optimizing for multidisciplinary research, integrating data access with disciplinary support, and conveniently bridging commercial capabilities with multi-agency needs.

Policymakers
Big Earth data analytics are often of great practical importance to human society and policy change. The United Nations identified 17 sustainable development goals as society's most urgent needs, and many of them relate to Earth science studies, including goals focused on water, energy, resilient infrastructure and sustainable industrialization, safe and resilient cities and settlements, climate change, the ocean, and the land (United Nations [UN], 2015). To achieve these goals, different policy and management choices should be discussed, and their consequences studied and predicted. With new technologies and the increasing availability of new sensing data, policy solutions can be proposed and verified through standardization, progress monitoring, and the integration of new policy indicators to better understand the human-natural system and the probable consequences of different policy choices.
Another challenge for policymakers is the understandability and accessibility of scientific research outcomes. Open data, open source, open systems, open domains, and other forms of openness would help apply Earth data to challenges that were intractable in the past. This requires technological, policy and legal, cultural, and other adaptations. On the technical side, standardized APIs, well-documented systems and source code, and interoperable applications are critical. Policy decisions should be made to facilitate the openness of data across regions. Culture is a long-term challenge that needs to be addressed in a sustainable, smooth, yet concretely advancing fashion. This challenge urges Earth scientists to disseminate science results and to provide actionable, sustainability-relevant knowledge to policymakers.
In conclusion, big Earth data analytics is a fast-evolving domain that contributes directly to the implementation of many initiatives, such as Digital Earth, Digital China, smart cities, and the United Nations' 17 sustainable development goals. A comprehensive understanding of big Earth data analytics helps us better position solutions and research directions. The concrete advancement of this domain requires various stakeholders to collaborate, take direction from driving science questions, leverage existing infrastructures, ingest existing and new datasets, and develop new analytical methods.