A unified representation method for interdisciplinary spatial earth data

ABSTRACT Unified representation of spatial earth data is an essential scientific issue. The analysis and mining of interdisciplinary spatial earth data resources can help discover hidden scientific knowledge, and even reveal the intrinsic relationship among different disciplines. However, the different description methods and inner structures among interdisciplinary spatial earth data bring significant challenges to unified data management and collaborative analysis in earth environment research. To address this issue, this paper proposes a unified representation method for interdisciplinary spatial earth data. First, this paper establishes a general metadata model and realizes the unified description of interdisciplinary data. Second, an entity data organization model is presented, which can realize the unified organization of entity data with different inner structures. Finally, we introduce the Spatial Earth Data Format (SEDF), a data format based on HDF5 for implementing the data organization model of interdisciplinary spatial earth data. Data representation experiments and validation are conducted to verify the availability and practicability of the proposed data representation method. The results suggest the powerful ability to represent spatial earth data efficiently and ensure data integrity, which is convenient for data management and application.


Introduction
With the rapid development of computer and ground- and space-based observation technologies, significant discoveries and breakthroughs have been achieved (Li, Wang, Wei, & Lin, 2019; Ye et al., 2015). As an essential means of carrying out scientific research, deepening space exploration activities bring unprecedented opportunities and challenges to the field of geoscience (Guo, 2017; Qiu, Wang, & Ma, 2020). A series of space exploration missions have been set up at home and abroad, and massive spatial earth data resources, including remote sensing satellite images and space environment monitoring data, have been produced and acquired (Wu, Liu, Qiao, & Jie, 2012; Yao et al., 2020). Furthermore, the earth system can be vertically divided into the surface, near space, and near-earth space; this study refers to the monitoring data in these spheres as spatial earth data. These data resources have different representation methods and cover various disciplines, such as geography, atmospheric science, space physics, and astronomy, and they provide fundamental data support for scientific research on the earth and space environment, especially research on resource investigation and space physical processes (Camporeale, 2019; Wang, Jia, Yin, & Tian, 2019).
The massive amount of multidisciplinary spatial earth data contains abundant information and hides undiscovered scientific knowledge. Therefore, further mining this knowledge and discovering the scientific laws of the various vertical spheres are current research hotspots, and the first step is to organize and manage spatial earth data in a unified spatial and temporal framework (Sudmanns et al., 2020). Spatial earth data consist of metadata and entity data. On the one hand, the methods and guidelines for describing data differ from each other, resulting in metadata files with diverse structure and content, which hinders the interoperability among multidisciplinary data. On the other hand, the organization structure and data format of multidisciplinary data also differ, which leads to complicated data processing and analysis (Yan, Chen, Chen, & Liang, 2020). These differences in data representation hinder the unified management, collaborative application, and sharing of spatial earth data. Consequently, realizing the unified representation of heterogeneous data is of great significance.
The primary focus of this study is to solve the unified representation of heterogeneous spatial earth data in interdisciplinary earth environment research, so as to facilitate the data collaborative application and analysis; therefore, a unified representation framework is introduced. First, through an in-depth investigation and research of common metadata standards and specifications in different disciplines, a general metadata model is established using Unified Modelling Language (UML), which is applicable to multiple types of spatial earth data and realizes the unified description of data. Then, a data organization model that is suitable for various spatial earth data and realizes the unified organization of data is built. Furthermore, based on the investigation of the existing scientific data formats, the Spatial Earth Data Format (SEDF) is proposed and designed based on the HDF5 data format. We also implement the conversion between SEDF and other data formats. The experimental results prove that the proposed data representation method can realize the unified representation of spatial earth data, including metadata and entity data, which provides a theoretical basis for unified management and analysis and is significant for interdisciplinary research.
The remainder of this paper is organized as follows: Section 2 describes the background of the research, including the metadata model, data organization model, and the corresponding research works. Section 3 elaborates the main contents of the proposed unified data representation method. Section 4 presents the experiments conducted and the results. Section 5 concludes and discusses the paper.

Metadata model
Metadata is descriptive information about data that can help to obtain a better understanding of data (Duval, Hodgins, Sutton, & Weibel, 2002; Green & Bossomaier, 2003). Establishing metadata models is one of the focuses in the field of data science, and it is also the premise and guarantee of data standardization (Chan & Zeng, 2006). To promote data application, different disciplines often establish their own metadata models with various structures and contents, which brings inconvenience to data exchange, integration, and unified management (Li & Huang, 2017). In the field of geoscience, building geospatial metadata standards has been a research hot spot at home and abroad. National and federal standard organizations, such as the International Organization for Standardization (ISO), Federal Geographic Data Committee (FGDC) and National Aeronautics and Space Administration (NASA), have set up working groups to discuss the formulation of standards from different aspects. Currently, the main geospatial metadata standards include the geographical information metadata standard (ISO 19115) (ISO/TC211, 2019) and the Content Standard for Digital Geospatial Metadata (CSDGM) (NASA, 2002). In space physics and astronomy, the data mainly follow Space Physics Archive Search and Extract (SPASE) (NASA, 2020).
(1) ISO 19115 ISO 19115 is developed by ISO/TC211 and defines how to describe geographical information and service. It uses metadata entities and elements based on UML. There are two parts: ISO 19115-1 (2014) and ISO 19115-2 (2019). The former is the fundamental part for describing geographic information resources and defines a series of metadata elements, properties and their relationships, while the latter is documented to augment the former to provide data acquisition and processing information for geographical resources.
(2) CSDGM The CSDGM was developed by the FGDC in 1998, with the objective of providing a common set of terminology and definitions for digital geospatial data. It is organized in a hierarchical structure with sections, data elements and compound elements. This standard consists of seven main sections (Identification, Data Quality, Spatial Data Organization, Spatial Reference, Entity and Attribute, Distribution, and Metadata Reference) and three auxiliary sections (Citation, Time Period and Contact).
(3) SPASE SPASE is developed by space physics data holding organizations funded by NASA. This model provides a unified description of Heliophysics resources based on the Extensible Markup Language (XML) to help researchers retrieve data of interest. Specifically, it defines a set of terms to describe data, including scientific context, source provenance, content and location. This model is widely used in current space physics exploration missions.
Based on these common metadata standards, researchers have carried out studies under different application scenarios. Fan et al. integrated and managed remote sensing metadata in a distributed data center spatial infrastructure based on ISO 19115 (Fan, Yan, Ma, & Wang, 2017). Takahashi et al. introduced a conceptual model for earth observation data to better manage metadata (Takahashi, Tatedoko, Kinutani, & Yoshikawa, 2009). Gebhardt et al. managed and disseminated spatial data in a web-based information system based on ISO 19115 (Gebhardt et al., 2010). However, the above metadata standards do not take attribute information such as data category and data copyright into account, and their data quality information and spatial information are incomplete. This study concentrates on building a general metadata model that is suitable for spatial earth data from different disciplines to realize the unified description of spatial earth data. In addition to defining the basic attribute information of spatial earth data, the model supplements the missing descriptions. The unified and complete description of spatial earth data can be realized through the general metadata model, which is the premise of unified data management.

Data storage format
The data storage format is the carrier of scientific data that can store and distribute data. The selection of data format depends on the requirements of different disciplines. Several research institutions and organizations construct scientific data formats that mainly include Geo-Tag Image File Format (GeoTIFF), Hierarchical Data Format (HDF), Network Common Data Format (NetCDF) and IONosphere map Exchange format (IONEX).
(1) TIFF and GeoTIFF Before introducing GeoTIFF, it is necessary to understand the structure of TIFF (Murray & VanRyper, 1996). TIFF is developed by Aldus Corporation and Microsoft for the purpose of providing a public image file format standard. It consists of three parts: file header, file directory and image data. The latest version is TIFF 6.0. Due to the flexible storage pattern and supporting various image modes, TIFF has been increasingly used to store and distribute raster geographic data. However, there is no fixed structure in the format to store geospatial information, and when the raster data is transferred from one system to another, the user should confirm the geographic location information in advance; thus, limitations still exist in applications such as cartography and mapping.
Faced with the deficiencies mentioned above, GeoTIFF came into being (Ritter & Ruth, 1997). Dr. Niles Ritter from NASA's Jet Propulsion Laboratory encoded geographical information using a series of keys in TIFF format and named the format the GeoTIFF 1.0 standard. The GeoTIFF file structure inherits the TIFF 6.0 standard, so strictly speaking, GeoTIFF is a special type of TIFF. Currently, GeoTIFF is adopted by NASA to store earth science data.
(2) HDF HDF is created by the NCSA (National Center for Supercomputing Application), and stores and distributes scientific data in a hierarchical structure to meet the demand of exchanging data among computing platforms (Habermann & Folk, 2014). This format is independent of computer architecture and provides multiple compression algorithms, such as GZIP, LZF and SZIP; thus, the data storage efficiency and transmission speed can be improved. The newest data format version is HDF5 (Koranne, 2011). Compared with HDF4, HDF5 overcomes limitations and supports larger files and more data types; this is the largest difference between HDF5 and other image data formats. Currently, HDF is adopted by NASA and NOAA as their standard data storage format.
HDF5 contains two basic objects, group and dataset; the other objects include the data space, data type, property and attribute. The data space defines the rank and dimensions of scientific data. The data type is the expression of the data, such as integer. The property defines the parameters of data chunking and compression. The attribute is an additional description of the scientific dataset.
(3) NetCDF NetCDF is an array-oriented data format proposed by the scientists of UCAR (University Corporation for Atmospheric Research) (Rew & Davis, 1990). It provides an interface between the application and real-time meteorological data. Due to its flexibility, NetCDF is currently widely used to store and distribute scientific data in various disciplines including atmosphere, ocean and space physics. The newest version is NetCDF 4.0 based on the HDF5 library.
From a mathematical standpoint, the data stored in NetCDF is a single-valued function with multiple variables: $f(x, y, z) = \mathrm{value}$. Here x, y and z represent dimensions, and value is the observation value of the sensor. A dimension can be used to represent elements that have actual physical meaning, such as time, elevation, longitude, and latitude. The observation value is used to represent the physical phenomenon, and generally, it is a multidimensional dataset.
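The function view above can be illustrated with a minimal sketch in plain Python. It does not use the NetCDF library; the dimension names (time, lat, lon), the coordinate values and the helper `make_lookup` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List


def make_lookup(dimensions: Dict[str, List[float]], values: list) -> Callable[..., float]:
    """Build f(**coords) -> value for a nested list indexed in dimension order."""

    def f(**coords: float) -> float:
        node = values
        # Walk one nesting level per dimension, in the order the dimensions were declared.
        for name, axis in dimensions.items():
            node = node[axis.index(coords[name])]
        return node

    return f


# Hypothetical coordinates: 2 time steps x 2 latitudes x 3 longitudes.
dims = {"time": [0, 6], "lat": [10.0, 20.0], "lon": [100.0, 110.0, 120.0]}
grid = [
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]],
]
f = make_lookup(dims, grid)
```

Each combination of coordinate values selects exactly one observation value, which is the single-valued property described in the text.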
(4) IONEX IONEX is the data exchange format for Total Electron Content (TEC) maps in the ionosphere and is provided by the International GPS Service for Geodynamics (IGS) (Schaer, Gurtner, & Feltens, 1998). It supports the exchange of 2- and 3-dimensional TEC maps given in a geographic grid. The IONEX file consists of two parts: header and data. The former contains the basic information of TEC data, such as the time range, interval, and station identifier. The latter contains practical values of each TEC map in each geographical coordinate.
Based on the existing scientific data format, researchers have tried to build new data formats to meet the requirements of various disciplines. Some researchers organize data from the conceptual level. Sun et al. proposed a unified geospatial data ontology model to facilitate data integration and sharing (Sun et al., 2019). Zhu et al. introduced a unified representation method for 3-dimensional city models to realize the description of complicated objects (Zhu, Li, & Zhang, 2007). However, this method pays more attention to the theoretical realization of data unified representation, and obstacles stand in the way of practical application. Others study from the data itself. For instance, Wang provided a general data representation method for scientific data based on XML (Wang et al., 2008). Krischer et al., Könnecke et al. and de Buyl et al. established new data exchange and archival formats based on HDF5, which achieve the efficient storage, organization and application of seismic data, neutron, X-ray and muon experiments, and neutron, X-ray and molecular data, respectively (de Buyl, Colberg, & Höfling, 2014;Krischer, Smith, Lei, Lefebvre, & Tromp, 2017;Könnecke et al., 2015). Faced with the deficiencies of the Flexible Image Transport System in the process of storing astronomical data, Greenfield et al. proposed a new data format based on Yet Another Markup Language, and proved its applicability through experiments (Greenfield, Droettboom, & Bray, 2015). In addition, as a data organization framework for building Digital Earth, the Discrete Global Grid System (DGGS) takes the whole earth as the research object and subdivides the geospace evenly into discrete grids (Sahr, White, & Kimerling, 2003). The DGGS model realizes the unified organization of entity data and its existing works focus on the grid encoding, coordinate transformation, precision evaluation system, etc. (Lin, Zhou, Xu, Zhu, & Lu, 2018;Ma et al., 2021;Wang, Ben, Zhou, & Zheng, 2021).
The above methods start from metadata or entity data and achieve the unified description or organization of scientific data. However, the spatial earth data in this paper come from various spheres of the earth system and take different forms, such as image, text, and multidimensional array; thus, current methods cannot realize a genuine and complete unified representation of spatial earth data. Combining data science theory with the requirements of practical application, this paper proposes a unified data representation method that establishes a general metadata model and an entity data organization model, which can solve the unified representation problem in the process of interdisciplinary data management and application.

Unified representation method for spatial earth data
The core of the proposed unified representation method in this paper consists of the general metadata model, entity data organization model, and data storage format. By establishing the general metadata model and entity data organization model, this study realizes unified metadata description and entity data organization, respectively. Finally, SEDF is presented to store interdisciplinary spatial earth data.

General metadata model
To realize the unified description of various spatial earth data, including surface, near-earth space, and near space monitoring data, the first step is to establish the general metadata model (GMM). Spatial earth metadata play an important role during data organization, integration, management, and distribution (Li, Zhang, Zhang, Wang, & Tian, 2016). In this study, metadata include the attribute information created for the data file, such as the identifier, category, time range, and spatial range (Nogueras-Iso, Zarazaga-Soria, & Muro-Medrano, 2005).
Through the investigation of ISO 19115-2, CSDGM, and SPASE and according to the characteristics of interdisciplinary spatial earth data, this paper designs a description rule based on the Unified Modeling Language, which describes spatial earth data from nine aspects: identification, platform, representation, time, space, quality, copyright, distribution, and extension.
GMM=[Identification, Platform, Representation, Time, Space, Quality, Copyright, Distribution, Extension]. The general metadata model can be established using the above rule, as Figure 1 shows.
As can be observed from Figure 1, there are nine UML classes in the model: GMM_Identification, GMM_Platform, GMM_Representation, GMM_Time, GMM_Space, GMM_Quality, GMM_Distribution, GMM_Copyright, and GMM_Extension. Each class in the GMM contains subclasses or elements. Detailed information for each element in the GMM is shown in Table 1.
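The nine-class structure can be sketched as a simple container type. This is only an illustrative sketch, not the authors' implementation: the nine section names follow the GMM rule above, while the class name `GeneralMetadata` and the element keys and values inside each section are hypothetical, since the element list of Table 1 is not reproduced here.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Dict, List


@dataclass
class GeneralMetadata:
    """One record with the nine GMM sections; each section maps element name -> value."""

    identification: Dict[str, Any] = field(default_factory=dict)
    platform: Dict[str, Any] = field(default_factory=dict)
    representation: Dict[str, Any] = field(default_factory=dict)
    time: Dict[str, Any] = field(default_factory=dict)
    space: Dict[str, Any] = field(default_factory=dict)
    quality: Dict[str, Any] = field(default_factory=dict)
    copyright: Dict[str, Any] = field(default_factory=dict)
    distribution: Dict[str, Any] = field(default_factory=dict)
    extension: Dict[str, Any] = field(default_factory=dict)


def section_names() -> List[str]:
    """Return the section names in declaration order."""
    return [f.name for f in fields(GeneralMetadata)]


# Hypothetical element names and values, for illustration only.
record = GeneralMetadata(
    identification={"identifier": "example-0001", "category": "remote sensing"},
    copyright={"owner": "example owner", "provider": "example provider"},
)
```

Keeping every section optional reflects the idea that data from different disciplines fill in different subsets of the same framework.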
The general metadata model provides many elements for describing spatial earth data. Compared with other metadata models, in addition to basic information, the proposed model takes spatial information, data quality, and copyright into account and then redefines them. The key distinctions are as follows.
• From the perspective of data type, interdisciplinary heterogeneous spatial earth data with different formats can be described in the same metadata framework;
• From the perspective of spatial information, we consider both the geographical boundary and elevation. Because the geographic boundary of spatial earth data is mostly an irregular geometric polygon, the longitude and latitude values cannot be managed structurally. Therefore, while saving all the latitude and longitude values of the data boundaries, this study calculates the minimum enclosing rectangle of the data vector boundary and stores the longitude and latitude values of the four vertices, which ensures the data availability and facilitates data application. In addition, the elevation of spatial environment data is usually a range, while that of surface environment data is a definite value. The metadata model also saves the maximum elevation, minimum elevation and average elevation;
• From the perspective of data quality, in addition to the cloud cover information, the data citation information, such as the citation type and description, is also included. In this study, whether the data are cited by a paper, patent, monograph, etc., is taken as a way to evaluate the quality of spatial earth data;
• From the perspective of data copyright, we define the data owner and the data provider.
Figure 1. Structure of the general metadata model.
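The spatial part of the description can be sketched as follows. This is a minimal sketch under stated assumptions: the rectangle is taken to be axis-aligned in longitude and latitude, the boundary is assumed not to cross the antimeridian, and the average elevation is assumed to be the arithmetic mean of sampled elevations. The function names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]  # (longitude, latitude)


def minimum_enclosing_rectangle(boundary: Sequence[Point]) -> List[Point]:
    """Return the four vertices (NW, NE, SE, SW) of the axis-aligned bounding rectangle."""
    if not boundary:
        raise ValueError("boundary must contain at least one point")
    lons = [p[0] for p in boundary]
    lats = [p[1] for p in boundary]
    west, east = min(lons), max(lons)
    south, north = min(lats), max(lats)
    return [(west, north), (east, north), (east, south), (west, south)]


def elevation_summary(elevations: Sequence[float]) -> Dict[str, float]:
    """Return maximum, minimum and (assumed arithmetic-mean) average elevation."""
    if not elevations:
        raise ValueError("elevations must not be empty")
    return {
        "max": max(elevations),
        "min": min(elevations),
        "average": sum(elevations) / len(elevations),
    }


# Hypothetical irregular boundary polygon.
polygon = [(100.0, 30.0), (102.5, 31.0), (101.0, 33.5), (99.5, 32.0)]
rectangle = minimum_enclosing_rectangle(polygon)
```

Storing the four rectangle vertices next to the full boundary gives a fixed-size field that can be indexed, while the full polygon remains available for precise queries.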

Entity data organization model
Faced with large-scale spatial earth data, the core of data collaborative application is to process and analyze data quickly and efficiently. Data processing and analysis are influenced by the data organization method, and the ultimate goal of realizing the unified organization of data is to help scientists focus on scientific research rather than on cumbersome data processing. Depending on the application scenario, the collaborative analysis of spatial earth data often requires the combination of different types of observation data.
There are independent data management systems and methods for data acquisition and processing in various disciplines. Due to the complicated data acquisition and processing process, collaborative analysis for multiple disciplines is complicated, resulting in the low utilization efficiency of data resources. Managing different data types in a unified form is better for collaborative application and analysis (Wu, Shen, Wang, & Wu, 2020). This research establishes a data organization model for spatial earth data to realize unified organization. Figure 2 shows the data organization model.
In Figure 2, two components are included in the organization model: dimension and observation data. The dimension is the basis for grouping spatial earth data; usually, it is a physical quantity, such as the time dimension, data type dimension, and elevation dimension. Once the dimension is determined, spatial earth data can be organized and understood in the corresponding framework. Observation data are the monitoring data in various spheres of the earth system that need to be stored, such as remote sensing data (optical image, radar image, etc.), atmospheric data (wind field, wind temperature, etc.), and ionospheric data (total electron content, electron density, etc.). Moreover, according to the practical application scenario, the stored data can be a dataset or a single data item. Specifically, interdisciplinary research on the earth environment often needs data from different disciplines; in this case, the observation data component is a dataset composed of multiple data items. For some particular studies, such as the extraction of regional feature information or the analysis of atmospheric environment and total electron content, a single data item can meet the demand; in this case, the observation data component is a single data item. The logical structure derived from the data organization model is shown in Figure 3.
The logical structure is organized into two layers: dimension layer and observation data layer. In the dimension layer, spatial earth data can be classified and organized according to dimension information, such as time (time point or time range), data category and elevation. The dimension information is determined by the requirement of the application scenario, which is not limited to the types listed above. In the observation layer, the observation data are stored. This organization method can be understood as a tree structure, and the function of the dimension is similar to the index. After the observation data are stored, they can be located by searching the dimension.
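The two-layer tree can be sketched with nested dictionaries, where the dimension layer acts as the index and the leaves hold the observation data. This is an illustrative sketch only; the class name `DimensionIndex`, the chosen dimensions and the sample values are hypothetical.

```python
from typing import Any, Dict, Sequence, Tuple


class DimensionIndex:
    """Dimension layer (nested dicts) on top of an observation layer (leaf dicts)."""

    def __init__(self, dimensions: Sequence[str]) -> None:
        self.dimensions = tuple(dimensions)
        self._tree: Dict[Any, Any] = {}

    def store(self, keys: Dict[str, Any], name: str, data: Any) -> None:
        """Store one observation under a value for every dimension."""
        node = self._tree
        for dim in self.dimensions:
            node = node.setdefault(keys[dim], {})  # KeyError if a dimension is missing
        node[name] = data

    def locate(self, **query: Any) -> Dict[Tuple[Any, ...], Any]:
        """Return observations whose dimension values match the query.

        Dimensions left out of the query match every value.
        """
        unknown = set(query) - set(self.dimensions)
        if unknown:
            raise ValueError(f"unknown dimensions: {sorted(unknown)}")
        results: Dict[Tuple[Any, ...], Any] = {}

        def walk(node: Dict[Any, Any], depth: int, path: Tuple[Any, ...]) -> None:
            if depth == len(self.dimensions):
                # Observation layer reached: collect every stored item.
                for name, data in node.items():
                    results[path + (name,)] = data
                return
            dim = self.dimensions[depth]
            for value, child in node.items():
                if dim not in query or query[dim] == value:
                    walk(child, depth + 1, path + (value,))

        walk(self._tree, 0, ())
        return results


index = DimensionIndex(["category", "time"])
index.store({"category": "Atmosphere", "time": "2020-01"}, "wind_field", [1.0, 2.0])
index.store({"category": "Ionosphere", "time": "2020-01"}, "tec", [3.0])
```

Because the set of dimensions is passed in at construction time, the same structure works whether the application groups by time, data category, elevation, or another quantity.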

Data storage format
There are two ways to implement the entity data organization model in a data format. One is to use an existing data format that provides mature interfaces. This method can ensure data readability and realize data processing and analysis quickly. The other is to develop a new data format and the corresponding basic function library. This method requires considerable human, material and time resources. Furthermore, it will take time for users to become familiar with the new data format, which will influence the data processing and scientific research progress. Therefore, this study chooses the former method to implement the entity data organization model. The spatial earth data organization model is implemented on top of a selected existing data format, and the result is named SEDF (Spatial Earth Data Format). SEDF organizes data with a hierarchical structure, and the contents are organized into three sections:
• The description of interdisciplinary spatial earth data is stored in the metadata group; this description is initially an XML file;
• Observation data, also called entity data, including image, multidimensional array, and text record, are stored in the entity data group; these data originally come in different data formats;
• Other descriptions of spatial earth data that are not contained in the metadata file are stored in the other information group.
The storage structure is shown in Figure 4.
(1) HDF5 data structure Through the analysis of existing scientific data formats, it can be found that space environmental data mainly adopt NetCDF and IONEX format, and surface environment data mainly adopt HDF, GeoTIFF, and NetCDF. HDF is a data format that can store different data types, including multi-dimensional array, image, and text (Poinot, 2010). Because of its features, this paper organizes the spatial earth data using HDF5 and redefines its rule of the group so that the subdirectory of the root group can no longer store entity data directly.
(2) Spatial earth metadata Although the HDF5 data format can realize the self-description of spatial earth data by defining its attributes, including data type and data space, viewing its metadata information always requires professional software, such as HDFView and HDF Explorer, which is complicated. When data are acquired and applied, the metadata information generally must be analyzed first. Thus, this study produces a metadata file based on the XML specification and encapsulates it in the HDF5 data format.
There are three parts in the metadata file: file header, XML declaration, and metadata content. The file header indicates that the file is a metadata file of spatial earth data; the XML declaration indicates that the file follows the XML specification and defines the version and encoding; the metadata content includes attributes, metadata elements, and corresponding values.
(3) Spatial earth entity data (set) Different types of spatial earth entity data (set), such as remote sensing image data, atmospheric data, and ionospheric data, are stored in this part in the form of multidimensional array, image, and text data. Entity data (set) are the key to the data organization model and can be classified and stored according to grouping rules. The unified storage of various spatial earth data provides convenience for collaborative data analysis in different applications.
(4) Other information Information that is not included in the metadata file but is necessary for data processing, analysis and application is explained in this part, for example, whether a single file or multiple files are stored, the data application scenario, or other information. The other information group is optional. If there is no additional information used to describe the spatial earth data, this part will not be included in the SEDF.
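The three-part metadata file described under item (2) can be sketched with the Python standard library. The paper does not give the literal header text or element names, so `HEADER`, the root tag `GMM` and the sample elements below are assumptions, as is the choice to model the header as a plain text line that is stripped before XML parsing. Encapsulation of the file in HDF5 is not shown.

```python
import xml.etree.ElementTree as ET
from typing import Dict

HEADER = "# Spatial Earth Metadata File"  # hypothetical header text
XML_DECLARATION = '<?xml version="1.0" encoding="UTF-8"?>'


def build_metadata_file(sections: Dict[str, Dict[str, str]]) -> str:
    """Serialize sections as: header line, XML declaration line, metadata content."""
    root = ET.Element("GMM")  # hypothetical root tag
    for section, elements in sections.items():
        node = ET.SubElement(root, section)
        for name, value in elements.items():
            ET.SubElement(node, name).text = str(value)
    content = ET.tostring(root, encoding="unicode")
    return "\n".join([HEADER, XML_DECLARATION, content])


def parse_metadata_file(text: str) -> Dict[str, Dict[str, str]]:
    """Check the header and declaration, then read the metadata content back."""
    header, declaration, content = text.split("\n", 2)
    if header != HEADER:
        raise ValueError("not a spatial earth metadata file")
    if not declaration.startswith("<?xml"):
        raise ValueError("missing XML declaration")
    root = ET.fromstring(content)
    return {section.tag: {el.tag: el.text for el in section} for section in root}


# Hypothetical element names and values.
sample = {"Identification": {"Name": "exper", "Category": "Atmosphere"}}
text = build_metadata_file(sample)
```

Checking the header before parsing lets a reader reject unrelated XML files early, which matches the stated purpose of the file header.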

Experimental data
The experimental data consist of various spatial earth data that include the data in the surface environment, near-earth space, and near space. This study chooses optical remote sensing image data, surface reflectance data, atmospheric wind field data, and ionospheric total electron content data in different formats as typical representatives to conduct experiments. Table 2 presents the details.

Metadata description
To verify the unified description capability of the general metadata model for spatial earth data, this study utilizes the proposed metadata model to describe various specific spatial earth data displayed in Table 1. Figure 5 shows the structure and content of one of the spatial earth metadata files, and the other types of metadata files are shown in Figure S1, Appendix A. As Figure 5 shows, the unified description of various spatial earth data can be realized with the proposed general metadata model, which proves the effectiveness of the model. From the metadata file, basic information, such as data name, category and spatial range, can be obtained.

Entity data organization
Spatial earth entity data are a numerical reflection of the objective world monitored by sensors and have significant research value (Sudmanns et al., 2020). To verify the unified organization capability of the data organization model for spatial earth data, this study utilizes the data organization model to organize various spatial earth data in the SEDF. Figure 6 shows the storage structure and content of single and multiple types of spatial earth data using the Panoply Data Viewer (https://www.giss.nasa.gov/tools/panoply/) provided by NASA. In Figure 6, "exper.h5" is the data name after format conversion. Based on SEDF, data in formats such as NetCDF, HDF, GeoTIFF and IONEX are converted and stored in groups according to data types. For instance, the "Atmosphere" group stores atmospheric wind field data; the "Ionosphere" group stores ionospheric total electron content data; the "Remote Sensing" group stores surface environment data, including remote sensing image data and products. In summary, single and multiple types of spatial earth data can be organized in a unified framework with SEDF based on the data organization model.

Validation
After the unified representation of spatial earth data, including metadata and entity data, the data quality should be evaluated. This study validates the data quality from two aspects: data integrity and data visualization.

Data integrity
This research compares all the values stored in the original format with those in SEDF. Taking three sample points of each data type as an example, the result is shown in Table 3. Specifically, gray values of Landsat 8 remote sensing data, reflectance values of MODIS surface reflectance data, wind field values of meridional wind data and values of total electron content data are compared.
It can be observed from Table 3 that for interdisciplinary spatial earth data, none of the values have been changed during data conversion, and the data accuracy and integrity are ensured.
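A value-by-value comparison of this kind can be sketched as below. This is not the authors' validation code; it assumes both versions have already been read into nested Python lists of the same shape, uses exact equality, and the function names and sample values are hypothetical.

```python
from typing import Any, Iterable, Iterator, List, Tuple


def flatten(values: Iterable[Any]) -> Iterator[Any]:
    """Yield the scalar values of an arbitrarily nested list or tuple in order."""
    for value in values:
        if isinstance(value, (list, tuple)):
            yield from flatten(value)
        else:
            yield value


def find_mismatches(original: Iterable[Any], converted: Iterable[Any]) -> List[Tuple[int, Any, Any]]:
    """Return (flat_index, original_value, converted_value) for every differing value."""
    a = list(flatten(original))
    b = list(flatten(converted))
    if len(a) != len(b):
        raise ValueError("the two datasets hold a different number of values")
    # Exact comparison: NaN values would be reported, since NaN != NaN.
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


# Hypothetical values standing in for the original file and its SEDF copy.
original_values = [[1, 2], [3, 4]]
converted_values = [[1, 2], [3, 4]]
```

An empty result from `find_mismatches(original_values, converted_values)` corresponds to the situation reported in Table 3, where no value changed during conversion.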

Data visualization
After checking the spatial earth data integrity, this study validates the data quality from the perspective of data visualization. To display spatial earth data vividly and intuitively, we develop a tool to visualize different types of spatial earth data based on WebGL (Evangelidis, Papadopoulos, Papatheodorou, Mastorokostas, & Hilas, 2018). The metadata elements are parsed first, and the visualization effect is shown in Figure 7.
As can be observed from Figure 7, there is no loss of data features during data visualization, and the efficiency of data rendering is high, which proves that the unified representation of spatial earth data can ensure data quality.

Conclusions and discussion
To overcome the deficiencies existing in the unified representation of interdisciplinary spatial earth data in earth environment research and enhance data management and collaborative analysis, this paper proposes a unified representation method that includes a general metadata model, entity data organization model, and data storage format. Through unified representation experiments and validation on interdisciplinary spatial earth data, the availability and practicability of the proposed method are demonstrated. The following conclusions can be drawn:
• By establishing the general metadata model, this study realizes the unified description of spatial earth data in XML, which proves the effectiveness of the model;
• By building the data organization model, this study achieves the unified organization of spatial earth entity data using SEDF according to the grouping rules, which proves the availability of the model;
• The validation, including data integrity checking and data visualization, shows that all the values remain unchanged during the data conversion and that SEDF can ensure data accuracy.
The proposed method can realize the unified representation of interdisciplinary spatial earth data from disciplines such as geography, atmospheric science and space physics, which provides a new idea for data management and has practical value. However, the proposed method still has limitations. Due to the complexity of the earth system, the experimental data in this study are still insufficient. Therefore, for future work, we plan to investigate more spatial earth data structures from various disciplines and enrich the data format to verify the applicability of the general metadata model and the entity data organization model. In addition, to further verify its availability, the proposed method will be applied in earth environment research to realize the collaborative analysis of spatial earth data in scenarios such as typhoons and earthquakes.