Developing big ocean system in support of Sustainable Development Goals: challenges and countermeasures

ABSTRACT The ocean is a critical part of the global ecosystem. The marine ecosystem balance is crucial for human survival and sustainable development. However, due to the impacts of global climate change and human activities, the ocean is rapidly changing, which poses an enormous threat to human health and the economy. “Conserve and sustainably use the oceans, seas and marine resources” (SDG 14) is one of the 17 Sustainable Development Goals (SDGs). Therefore, it is urgent to construct transformative marine scientific solutions to promote sustainable development. Marine data are the basis of ocean cognition and governance, and marine science has entered the era of big data with continuous advances in modern marine data acquisition. While big data provides abundant material for SDG research, it simultaneously brings unprecedented challenges. From the perspective of marine data management and application, this study introduces the overall framework of a system for solving the problems currently faced by marine data serving the SDGs. It also articulates how the system supports the SDGs through two application cases: managing fragmented marine data and developing global climate change data products.


Introduction
The 17 Sustainable Development Goals (SDGs), adopted by all United Nations Member States in 2015, involve an enormous, complex, diverse, dynamic, and interrelated research system. Systematic measurement and monitoring of each goal is the most effective approach to achieving the SDGs. SDG research is interdisciplinary, involving resource management, the environment, biology, ecology, society, the economy, humanities, and many other domains. Moreover, the scope of research is typically global, and achieving the SDGs increasingly relies on systematic and reliable primary global data (Guo, 2017, 2020; Jia, 2020; Merritt, Bi, Davis, Windmill, & Xue, 2018; Wang & Yan, 2020).
Oceans cover 71% of the Earth's surface and absorb over 90% of the excess heat in the climate system, and the global ocean circulation system is a crucial regulator of the weather system.
2) Current data quality control (QC) methods cannot guarantee data quality. A large proportion of abnormal observational data exist due to human error, instrument error, varying observation environments, and other causes, resulting in uneven data quality. Many countries and agencies have released ocean observing databases, such as the WOD, the Global Temperature and Salinity Profile Programme (GTSPP), and the Argo float database (Argo, 2020; Boyer, Baranova, & Coleman et al., 2018; Good, Martin, & Rayner, 2013; Gouretski, 2018; UNESCO-IOC, 2010). These databases were released after QC using different techniques and have been widely used in physical oceanography, climate change, and other studies. Nevertheless, the international consensus is that severe quality problems remain in these widely used international ocean databases. The International Quality-controlled Ocean Database (IQuOD) project comprehensively evaluated commonly used QC methods worldwide by comparing their true-positive rates (TPR) and false-positive rates (FPR). The results demonstrate that current mainstream QC techniques do not achieve the optimal QC state of a high TPR and a low FPR (Simon, 2019). Thus, current data QC techniques fail to meet the high-quality data needs of modern marine science.
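As an illustration of the TPR/FPR comparison described above, the following sketch (with hypothetical flag lists; not IQuOD's actual evaluation code) scores a QC scheme's flags against expert-labelled truth:

```python
def qc_rates(flags, truth):
    """Compare a QC scheme's flags against expert-labelled truth.

    flags, truth: equal-length lists of bools (True = flagged / actually bad).
    Returns (true-positive rate, false-positive rate).
    """
    tp = sum(f and t for f, t in zip(flags, truth))          # bad data, caught
    fp = sum(f and not t for f, t in zip(flags, truth))      # good data, wrongly flagged
    fn = sum(t and not f for f, t in zip(flags, truth))      # bad data, missed
    tn = sum(not f and not t for f, t in zip(flags, truth))  # good data, passed
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr
```

A good QC scheme pushes the TPR toward 1 while keeping the FPR near 0; a scheme that flags everything trivially reaches TPR = 1 but only at the cost of a high FPR, which is the trade-off the IQuOD comparison exposes.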
3) The current multi-source data analysis and application competencies are insufficient, and the application of artificial intelligence (AI) and big data in the marine field is still in its infancy. The research objects of modern marine science are broader and deeper in time and space, and developing multi-scale marine information mining techniques and efficient, fast, and intelligent applications remains a bottleneck in marine science. AI is an essential driving force of the new round of scientific and technological revolution and industrial transformation, and it is widely applied across science, technology, and the economy. However, AI and big data applications in the marine sciences remain at an early stage of development. The National Oceanic and Atmospheric Administration (NOAA) of the United States released its AI strategy in 2021 (NOAA, 2021), aiming to leverage AI in demand-driven, prioritized tasks. Through this strategy, NOAA strives to reduce marine data processing costs and provide high-quality, timely scientific products and services for society. Google, IBM, and other international companies have applied AI techniques to weather and precipitation forecasting. The integration of AI with marine big data analysis and mining technologies has become a new growth point for the practical application of marine big data (Li et al., 2020).
The Center for Ocean Mega-Science, the Chinese Academy of Sciences (CAS), constructed a marine big data management and application service system. This study first introduces the overall framework of the marine big data management and application service system. Then, we focus on the demonstrated applications of this system in data aggregation and management, marine data QC, and ocean disaster prevention to address several challenges in current marine data management and application.

Marine big data management and application service system
The proposed marine big data management and application service system encompasses three sections (Figure 1): a data management platform, a data processing platform, and a data service platform. The system's primary function is to automatically perform multi-source data aggregation, integration, QC, and product development. Additionally, the system can provide high-quality marine observational data and data products for scientific researchers, government agencies, and social groups through a unified data service platform (Figure 2). Further, the findings of different researchers, such as mathematical models and AI algorithms, can be transformed into a visualized decision support system for governments, enterprises, etc., offering services for marine health assessment, disaster prevention, and mitigation.

Data management platform
The data management platform is designed to perform real-time aggregation and standardization of multi-source data and to construct the initial ocean observation database. SDG research is typically large-scale and multidisciplinary, requiring data support from various sources; it typically involves marine survey data, such as ship-based survey data, buoy data, and ocean station data, as well as model reanalysis data. The proposed system formulates an automatic data format conversion method to standardize the multi-source and multi-type historical marine environment data. The data are simultaneously sorted and marked according to the data source, subject, data usage, etc.

Figure 1. Overall framework of the marine big data management and application service system. The green numbers (1, 2, and 3) represent the three platforms of the system. The contents of the orange numbers are described in Sections 3.1 and 3.2.
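A minimal sketch of such automatic format standardization, assuming a hypothetical alias table (real systems maintain far larger mapping and unit-conversion tables than this illustration):

```python
# Hypothetical field aliases: each standard field name maps to the raw
# names it may carry in different source formats.
ALIASES = {
    "lat": ["lat", "latitude", "LAT"],
    "lon": ["lon", "longitude", "LON"],
    "temp": ["temp", "temperature", "T"],
}

def standardize(record, source):
    """Map one raw observation record onto the unified schema and tag its source."""
    out = {"source": source}
    for std_name, raw_names in ALIASES.items():
        for raw in raw_names:
            if raw in record:
                out[std_name] = record[raw]
                break  # first matching alias wins
    return out
```

With all edits applied, each incoming record, whatever its original column names, lands in one schema and carries its source tag, which is what makes the downstream sorting and marking step possible.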
The primary data types are as follows:
• Public data. This data type refers to data released by different data centers or agencies with fixed data sources and regular updates. These data have been standardized in a fixed format. The platform cooperates with data publishing agencies to automatically obtain the latest public data daily. The obtained data are reorganized according to the system format, and the data source is marked in the data service system. We require users to cite the data in accordance with the requirements of the data owners.
• Ocean dark data. Dark data is data that has not been sufficiently explored or understood. In marine scientific research, dark data refers to the scattered data generated by researchers that cannot be effectively archived or saved. The ocean dark data management platform, Easy Data-Management (EasyDM; see Section 3.1 for details), was developed to handle this data type. Data owners can independently integrate and manage the data and authorize the data-use scope.
These data management methods effectively expand data source channels, standardize and unify data from different sources, and shorten data collection and compilation times. To ensure data security, all data collected by the data center are protected by the institutional firewall and the data center firewall. When the EasyDM system is used for personal data aggregation, all operations are carried out on the institution's internal network, and the data are included in the data resource pool only after being checked by the data center. We also manage data resources at different levels: public sharing, protocol sharing, and classified data. Publicly shared data can be accessed directly. For protocol-shared data, users can access the metadata but cannot directly access the data resources; these data are distributed to users by managers after authorization. Classified data are stored on a standalone secure computer and cannot be accessed over the Internet.
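The three-level access policy can be sketched as a simple check (a hypothetical simplification; the production system additionally involves firewalls, audits, and manager workflows):

```python
def can_download(level, user_authorized):
    """Decide whether a user may directly download a dataset.

    public:     direct access for everyone
    protocol:   metadata is visible, but data is released only after
                a manager authorizes the request
    classified: kept on a standalone machine, never served online
    """
    if level == "public":
        return True
    if level == "protocol":
        return user_authorized
    return False  # classified (or unknown) levels are never served
```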

Data processing platform
The primary functions of the data processing platform are to control the quality of multi-source data and build a high-quality ocean observation database. High-quality marine observational data are the foundation of SDG research. Due to the quality issues of commonly used global data, priority should be given to parameters with a high usage frequency and a wide range of applications. Therefore, we developed a multi-parameter data QC technique covering marine hydrology, marine geology, and marine chemistry, described in detail as follows.
• Marine hydrology (temperature and salinity) data exhibit regional variations and periodic seasonal characteristics. Therefore, we developed a new data QC module covering different data characteristics: range, peak, equivalence, gradient, maximum depth, regional climatology, and regional gradient climatology checks.
• Marine geology (water depth) data are predominantly fixed and unchanging. The platform incorporates satellite altimetry data to determine the initial data fields and uses an expert-machine combination method to perform QC, forming a QC module that includes an extreme value check, eigenvalue distribution check, track map diagnosis, and adjacent voyage comparison check.
• Marine chemistry (oxygen, nitrate and nitrite, silicate, pH, chlorophyll, and phosphate) data exhibit regional and seasonal variations, particularly in coastal areas, which are greatly affected by human activities and land-based materials. Therefore, the platform checks this data type for completeness, data scope, regional climatology, geographical correlation, time-scale correlation, and parameter correlation.
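As one example of the checks listed above, a vertical gradient check for temperature profiles might look like the following sketch (the threshold value is illustrative, not the platform's operational setting):

```python
def gradient_check(depths, temps, max_grad=0.5):
    """Flag profile points whose vertical temperature gradient is implausible.

    depths: depths in metres, increasing; temps: temperatures in degrees C.
    max_grad: illustrative limit on |dT/dz| in degrees C per metre.
    Returns a list of bools, True where the point is suspicious.
    """
    flags = [False] * len(temps)
    for i in range(1, len(temps)):
        dz = depths[i] - depths[i - 1]
        if dz > 0 and abs(temps[i] - temps[i - 1]) / dz > max_grad:
            flags[i] = True  # jump too sharp to be physical at this spacing
    return flags
```

The operational module combines this with the range, peak, equivalence, maximum-depth, and climatological checks, so a point is only accepted when it passes the whole battery.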
We constructed a high-quality ocean observing database after the QC and deviation correction of the initial ocean observing data (Zhang, Cheng, et al., 2020). The data-sharing service is provided through the data service platform described in Section 2.3.

Data service platform
The data service platform provides convenient data acquisition and system platform services for different user groups. The platform comprises two parts, the data resource service platform and the data visualization service platform, which are described in detail below.
• The data resource service platform offers external data management services and conducts online operation processes (msdc.qdio.ac.cn) such as data collection, browsing, application, and audit. In addition, this platform provides a high-quality global ocean observing database (Section 2.2) with download services, map-based search, and diversified data retrieval. Users can locate target data through multiple parameters, such as classification label, attribute, spatiotemporal range, and survey platform, and download the results online. The platform also provides an online Digital Object Identifier (DOI)/China Science and Technology Resource (CSTR) registration service. All data submitted to the platform are registered with a DOI/CSTR to identify the data source, and users are asked to acknowledge the data source and cite the data when using it. In this way, we strengthen data intellectual property protection as far as possible and protect the rights and interests of data owners.
• The data visualization service platform. Based on the Cesium 3D visualization system (https://cesium.com/platform/cesiumjs/), which is free and offers rich data visualization plug-ins, we present a component-based interactive visualization structure for multidimensional, spatiotemporal ocean data. Users can customize the system on this structure to rapidly visualize other scientific data. The platform can display all marine environmental monitoring data types in real time, and it can integrate warning models to forecast marine disasters, such as offshore marine ecological and dynamic disasters (http://casearthocean.qdio.ac.cn/oceanVisual). Thus, the platform intuitively displays early warning information for marine disaster prevention and mitigation.

System architecture
The overall architecture of the marine big data management and application service system includes a resource layer, a platform layer, and an application layer (Figure 3).
• The resource layer is the foundation of the whole service architecture. Its main function is to provide computing, storage, and data service resources for the platform layer. The resource layer is further divided into a physical layer, a data layer, and a virtual layer. The physical layer includes computing servers, physical storage, a high-speed network, and computing models. The data layer integrates the existing data resources and stores data in the storage system based on a unified data storage model. Through virtualization technology, the virtual layer integrates different nodes and heterogeneous physical resources to form a large resource pool for the platform layer.
• The platform layer is the middle layer connecting the resource layer and the application layer. It includes three parts: cloud services, the cloud platform, and resource registration and monitoring. Cloud services mainly provide data services, data loading services, data query services, computing services, and model services. The platform contains a distributed data storage structure, which stores the data on each node in a distributed way; data are read and processed directly from each node during data access. In the master-slave database, the master handles the write load, while read operations are completed by the slaves to ensure data security.
• The application layer is the top layer of the whole framework. It provides an interactive interface for users and a management interface for administrators to manage user permissions and all resources in the system. Users can query data resources in various forms to meet their own needs. The system also provides an online model-calling function to realize complex services such as online data processing, computational simulation, and analysis.
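The master-slave read/write split mentioned in the platform layer can be illustrated with a small routing sketch (hypothetical class and node names; the actual platform runs a distributed database, not this toy router):

```python
class MasterSlaveRouter:
    """Route writes to the master node and reads to slave replicas (round robin)."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves
        self._next = 0

    def route(self, op):
        """Return the node that should handle the operation ('read' or 'write')."""
        if op == "write":
            return self.master  # all writes go through the master
        node = self.slaves[self._next % len(self.slaves)]
        self._next += 1         # round-robin over the read replicas
        return node
```

Separating the write path from the read path is what lets the read-heavy data services scale across nodes while the master keeps a single authoritative copy.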

Applications of the data system
An integrated marine big data management and application service system is built to aggregate and share marine data effectively. This system also offers enormous marine data and technical support for SDG research, particularly to achieve the Ocean Decade goals. This section explains the applications of the system through two cases.

EasyDM serves Ocean Decade
"One accessible ocean", which refers to open and fair access to ocean-related data, information, technology, and innovation, is one of the seven goals of the Ocean Decade. Inequalities in marine science must be eliminated by simultaneously advancing data acquisition and QC. Additionally, data collection, knowledge generation, and technology development skills must be enhanced, particularly in the least developed countries, small island developing states, and landlocked developing countries.
Each developed coastal country has established a national data center to aggregate and integrate its marine data effectively. However, much data remains scattered throughout various research institutions or held by individual researchers. These data can be categorized as ocean dark data, which cannot be aggregated into a data center for unified management. Data owners often lack the workforce and technology to manage data effectively, leading to wasted data resources. This problem is more prominent in less developed countries.
Therefore, an intuitive ocean dark data management platform, Easy Data-Management (EasyDM), was designed and developed to standardize data management and sharing. The platform adopts a client/server (C/S) architecture. The primary server is located in the data center, so researchers do not need to purchase new equipment for data storage. The system is jointly constructed and managed by the data owners and the data center. In addition, the system is embedded with common marine survey data keywords and data record formats (e.g. longitude and latitude record formats, measurement units, record lengths, etc.). Users can dynamically customize a data storage sheet by selecting the embedded keywords according to the topic data. After converting the original data to the selected format, users upload the data independently with one click, and the system automatically establishes the database for data storage.
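The keyword-driven sheet customization can be sketched as generating a table definition from user-selected keywords (the registry below is a hypothetical subset of EasyDM's embedded vocabulary, and the SQL target is illustrative):

```python
# Hypothetical keyword registry: each embedded survey keyword carries a
# storage type. EasyDM's real vocabulary and formats are much richer.
KEYWORDS = {
    "station_id": "TEXT",
    "obs_time": "TEXT",
    "longitude": "REAL",
    "latitude": "REAL",
    "depth": "REAL",
    "temperature": "REAL",
}

def build_table_sql(table, selected):
    """Generate a CREATE TABLE statement from the user's selected keywords."""
    cols = ", ".join(f"{name} {KEYWORDS[name]}" for name in selected)
    return f"CREATE TABLE {table} ({cols})"
```

Because the schema is derived from a controlled vocabulary rather than typed freely, every user-customized sheet stays consistent with the center's unified formats.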
The system contains built-in data and a user management system (Figure 4). The data and users are classified and set using different labels. Each user can customize his/her data usage (e.g. public sharing, protocol sharing, personal use, etc.) and scope. The public sharing data is incorporated into the unified management platform owned by the data center, and the data service platform offers external services. Regarding the data sharing protocol, the metadata information is released through the data service platform. When other users apply for data usage, the data center provides data sharing services with the data owner's consent. Concerning data for personal use, only the data owner can browse and use the data. This system solves the problem of data resource loss caused by a lack of data management technology. Additionally, the proposed system organizes massive historical data resources, which helps users manage data and significantly increases user initiatives for data sharing, thus expanding the data resource pool.
At present, the system has been applied to the marine chemistry data management of the Institute of Oceanology, CAS. After integrating the self-owned data with the international shared data, a marine chemistry database containing over 19 million data points was established to support global ocean acidification research (Figure 5) (Zhong et al., 2020; Wang et al., 2021). Marine researchers in different countries can further optimize this system to promote the management and sharing of marine data and support the Ocean Decade.

Building marine grid data products to serve SDG 13 research
Research on ocean environmental change is a critical part of SDG 13: climate action. Gridded temperature/salinity databases with complete global ocean coverage, such as WOA, EN4, and Ishii, are a critical foundation for studying climate change and sea-level rise. However, due to limited data quantity and quality, analysis results exhibit high levels of uncertainty. For example, mainstream ocean temperature data products cannot identify a reliable long-term trend in the upper ocean (0-700 m depth), and the uncertainty for the deep sea (700-2000 m) is even greater. This uncertainty is predominantly due to the systematic deviation of observational data, the lack of observational data from the last century, and data quality issues (Zhang, Wang, & Cheng et al., 2017).
Researchers from the Institute of Atmospheric Physics, CAS, and the Institute of Oceanography, CAS, constructed an automatic data product development system (Figure 6) that provides monthly ocean temperature and salinity gridded data products.
Researchers called the temperature and salinity observational data through the data service platform. After data QC and deviation correction, the researchers employed the ensemble optimal interpolation method with a dynamic ensemble (EnOI-DE) to build global ocean temperature (Figure 7) and salinity data products at 1° and 0.5° resolution (Cheng et al., 2017). We automated these data products from data acquisition and processing to gridding and the final product release service, which improves data product development efficiency and provides faster and more accurate data support for global climate change research (Cheng et al., 2021).
• Data resources: In addition to the international shared data commonly used in marine scientific research, such as WOD, Argo, and GTSPP, the data resources include special Chinese offshore survey voyage data from the Western Pacific Ocean and the Eastern Indian Ocean, Chinese offshore observational buoy data, and deep-sea buoy data from the Western Pacific scientific observing network. These data were obtained from the CAS, the Ministry of Natural Resources, the Ministry of Education, and other institutions in China, and they were collected and compiled using the data platform.
• Data preprocessing: The original, independently collected Chinese observational data underwent integrity verification, preprocessing, metadata association, duplicate removal, format standardization, etc., before being integrated with the international shared data. Thus, a unified ocean temperature and salinity database was constructed with data from over 15 million stations.
• Data quality control: We optimized the data QC methods commonly used worldwide, especially the local climate state check module, which is considered one of the most effective means of identifying temperature and salinity data errors. We divide the whole ocean into 1°×1° subregions and use the statistical distribution characteristics of historical observations in each subregion to define the threshold range (Figures 8 and 9). If an observation exceeds the set threshold, it is marked as suspicious. For data with a normal statistical distribution, the threshold (the upper and lower bounds of the climate state) is determined using the PauTa criterion (3σ extreme value test). For skewed distributions, the quantile method is used to determine the threshold.
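The two threshold rules described above can be sketched as follows (using the classical 3σ form of the PauTa criterion, and an IQR-style quantile rule as an illustrative stand-in for the platform's exact quantile method):

```python
import statistics

def pauta_bounds(values):
    """PauTa criterion: mean +/- 3 standard deviations, for near-normal data."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return mu - 3 * sigma, mu + 3 * sigma

def quantile_bounds(values, k=3.0):
    """Quantile-based bounds for skewed data (illustrative IQR variant;
    the platform's exact quantile rule is not specified here)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_suspicious(value, bounds):
    """Mark an observation suspicious when it falls outside the climate-state bounds."""
    lo, hi = bounds
    return not (lo <= value <= hi)
```

In the operational system, such bounds are computed once per 1°×1° subregion (and per depth/season stratum) from the historical archive, then applied to every new observation falling in that subregion.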
We used the WOD data to test the accuracy of the data quality control. The test data were downloaded from NOAA/NCEI (https://www.ncei.noaa.gov/products/world-ocean-database), covering 73.5°E to 74.5°E and 37.5°N to 38.5°N, from 1940 to the present. Figure 9a shows all profiles of the original temperature observations, Figure 9b shows the correct data after NCEI quality control, and Figure 9c shows our quality control results. The comparative analysis shows that our quality control system identified more abnormal data.
• Data deviation correction: Regarding the deviations in ocean observational data, the system adopted the expendable bathythermograph (XBT) and mechanical bathythermograph (MBT) data deviation correction schemes proposed by the Institute of Atmospheric Physics, CAS, to correct historical data deviations. The XBT bias correction scheme is recommended by the international XBT research group. The scheme corrects XBT systematic bias and ensures the accuracy of historical ocean subsurface observations. At present, the IQuOD uses only the XBT scheme.
• Gridded data technology: A new spatial interpolation scheme was adopted that uses the EnOI-DE method and the fifth phase of the Coupled Model Intercomparison Project multi-model historical simulations to provide dynamic ensemble samples. The aim was to provide an improved initial field (as a prior estimate) and background error covariance (defining how information is transferred from the observed area to the non-observed area). Thus, we can more accurately identify the anisotropic correlations between different grid points and reconstruct the temperature/salinity field in non-observed areas.
• Data product publishing and sharing: Data products are published and shared through the marine data resource service platform (http://msdc.qdio.ac.cn/). The ocean data visualization platform presents these data products for scientific advancement (Figure 10).

Figure 9. Comparison of quality control results for ocean temperature data: a) the original temperature observation data; b) data after NCEI quality control; c) data after our quality control.
The proposed system provides comprehensive process services from data acquisition to data product development and sharing. For example, the global ocean heat content change gridded data products have been used in over 90 countries and are widely cited in the literature.
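The end-to-end chain just described can be sketched as a pipeline of stages (all stage bodies below are placeholders standing in for the system's actual algorithms, with made-up numbers):

```python
# A minimal sketch of the automated product chain: acquisition -> QC ->
# bias correction -> gridding. Stage names mirror the steps in the text;
# the implementations are illustrative placeholders only.
def acquire():
    # stand-in for pulling observations from the data service platform
    return [{"t": 18.2}, {"t": 999.0}, {"t": 17.9}]

def quality_control(obs):
    # drop records failing a simple range check (placeholder for the QC battery)
    return [o for o in obs if -2.5 <= o["t"] <= 40.0]

def bias_correct(obs):
    # placeholder for XBT/MBT-style systematic bias correction
    return [{"t": o["t"] - 0.1} for o in obs]

def grid(obs):
    # placeholder gridding: here, just a mean field over the surviving records
    return sum(o["t"] for o in obs) / len(obs)

def run_pipeline():
    """Chain the stages in the order the automated system applies them."""
    return grid(bias_correct(quality_control(acquire())))
```

Because each stage is a function of the previous stage's output, the whole product can be regenerated automatically whenever new observations arrive, which is the source of the efficiency gain described above.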

Conclusions
The ocean is the primary driving force and regulator of global climate and environmental changes and a critical part of the Earth system. Therefore, it can provide solutions to climate change mitigation and sustainable marine economic development (including the future of food and energy). However, due to human activities on land and the interactions between the ocean, atmosphere, and cryosphere, the oceans face many issues, such as frequent marine ecological disasters, severe marine pollution, and ecosystem degradation, which pose an enormous threat to human health and the economy.
Ocean data are critical for ocean cognition and governance. Therefore, a marine data management and application service system has been established to effectively aggregate and standardize data from different sources and subjects in real time. Additionally, high-quality, multidisciplinary data QC techniques have been developed to provide high-quality scientific data for other users and effectively support SDG research.
The proposed system exhibits three superior characteristics compared with a traditional data management system:
• In addition to conventional data aggregation and management, an intuitive data management platform, EasyDM, has been developed for users lacking data management experience. The platform assumes that data centers and users jointly build and manage data resources, which improves users' data-sharing initiative and the aggregation and management of fragmented data. Furthermore, it expands the data resource pool.
• In addition to providing a data-sharing service similar to that of a traditional data management system, the proposed system is sufficiently extensible. It can further integrate relevant data product development modules to automatically perform data acquisition, QC, deviation correction, and product production. Thus, the efficiency of data product development is improved, which can accelerate scientific research.
• To further serve the SDGs, the findings of different researchers, such as scientifically developed marine dynamical and ecological disaster models and early warning methods, can be transformed into platforms. For example, one such platform, CASEarth-Ocean, may use the ocean data visualization platform to provide intuitive disaster early warning information for the government and the public.
The development of SDG research increases the demand for interdisciplinary data integration and deep data mining applications, and traditional data analysis and mining methods increasingly fail to meet the needs of modern marine science. Therefore, future research can focus on the critical technologies of AI oceanography: integrating and developing relevant AI algorithms according to typical marine targets, marine data shapes, and sample distribution characteristics, and constructing corresponding AI deep learning network models to conduct deep information mining of marine big data and provide updated information technology support for SDG research.

Data availability statement
The data that support the findings of this study are available in the Marine Science Data Center, Chinese Academy of Sciences, at http://msdc.qdio.ac.cn. These data were derived from the following resources available in the public domain: 1. Partial pressure of carbon dioxide in the Southern Ocean: http://english.casodc.com/data/metadata-special-detail?id=1409813943244832769; and 2. Global ocean temperature grid data product: http://english.casodc.com/data/metadata-special-detail?id=3.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Notes on contributors

From 2015 to 2017, he was a postdoctoral researcher at the Institute of Oceanology, Chinese Academy of Sciences (CAS), Qingdao, China, where he was an engineer from 2017 to 2019, and he has been a senior engineer since 2019. His current research interests include ocean data quality control, big data analysis and mining, and artificial intelligence applications.

Fuchao Li is a Professor at the Institute of Oceanology, Chinese Academy of Sciences. He is engaged in genomics-based research on marine microbial communities, functions, and product diversity. In recent years, he has organized the construction of the Oceanographic Data Center, which integrates marine data resources and builds a comprehensive platform of marine scientific data for multi-scenario applications such as marine disaster prevention and mitigation, marine economic development, and marine scientific research.
Dr. Li is an Associate Editor of the IEEE Transactions on Geoscience and Remote Sensing and the International Journal of Remote Sensing. He is an Editorial Board Member of the International Journal of Digital Earth, Big Earth Data, and the Journal of Oceanology and Limnology. He is also the Executive Editor-in-Chief of the Journal of Remote Sensing (a Science Partner Journal).