A global process-oriented sea surface temperature anomaly dataset retrieved from remote sensing products

ABSTRACT Once formed, a sea surface temperature anomaly (SSTA) evolves in space and time until it dissipates. Although many SST products are available, directly exploring the evolution of SSTAs remains challenging. To address this problem, we developed a global SSTA dataset that records the spatial structure of SSTAs and their temporal evolution. This dataset is called GDPoSSTA. GDPoSSTA comprises three datasets and two relationship files and covers the period from January 1982 to December 2009. The three datasets are in SHP format and consist of a dataset of processed object-oriented SSTAs named DSPOSSTA, a dataset of sequenced object-oriented SSTA series named DSSOSSTA, and a dataset of variation object-oriented SSTAs named DSVOSSTA. The two relationship files, which are in CSV format, store the evolving behavior of the SSTA sequence objects and SSTA variation objects. Finally, geographic spatiotemporal statistics are derived for DSPOSSTA, and a comparison of applying TITAN to DSVOSSTA and DSPOSSTA is carried out; together these demonstrate the feasibility and applicability of GDPoSSTA. The GDPoSSTA dataset is available on the ScienceDB platform (http://www.doi.org/10.11922/sciencedb.j00076.00090).


Introduction
The sea surface temperature (SST) is one of the most important marine climate variables (GCOS, 2011; Hollmann et al., 2013) and plays an essential role in climate change monitoring, weather forecasting, and marine fishery monitoring (Dai, 2016; Murtugudde et al., 2004). Advanced Earth-observing technologies make it possible to acquire lengthy time series of SSTs from multiple remote-sensing images (Yang et al., 2013), and many algorithms have been developed to produce SST products from satellite imagery in recent decades (Cao et al., 2021; Banzon, Reynolds, Stokes, & Xue, 2014; Legeckis & Zhu, 1997; Liao, Dong, Xue, Bi, & Wan, 2017; McClain, Pichel, & Walton, 1985; Merchant, Borgne, Borgne, Marsouin, & Roquet, 2008; Ping, Su, & Meng, 2015; Reynolds, Rayner, Smith, Stokes, & Wang, 2002; Walton, Pichel, Sapper, & May, 1998). A large number of widely used global SST datasets are produced both in China and abroad; some of these are listed in Table 1. Changes in SST anomalies (SSTAs) in space and time can be a driver of extreme regional climate events such as extreme rainfall (Guo et al., 2021; Yu, Fan, Zhang, Zheng, & Li, 2021). This spatiotemporal evolution of SSTAs may be able to provide information that is more important than information about the SST itself for studying global climate change (McPhaden, Zebiak, & Glantz, 2006; Saulquin et al., 2014; Wu et al., 2008). However, methods for studying the evolution of SSTAs in space and time are lacking.
Based on these satellite-derived SST datasets (Reynolds et al., 2002; Saha et al., 2018; Wentz et al., 2014), many studies have focused on new approaches to identifying the dynamic characteristics of SSTs. For example, Steinbach et al. (2006) proposed a cluster-based method for finding the time-averaged spatial distribution of SSTAs and tried to construct patterns showing the spatial relationships between SSTAs. Kawale et al. (2013) treated time as an additional dimension and designed an SRNN method for finding the dipole modes of SSTAs in the ocean. Xue, Dong, and Qin (2015b) proposed a cluster-based method for identifying the sensitive spatial regions and temporal durations of SSTAs. These studies found that SSTA variations have a definite spatial coverage. The movement of this spatial coverage has also been used to derive further findings: e.g. Song, Dong, and Xue (2016) used variations in SSTAs to define a new ENSO (El Niño Southern Oscillation) index and identify ENSO events; Ding et al. (2019) analyzed the relative contributions of North and South Pacific SSTAs to ENSO events; Xue, Wu, Liu, and Su (2019a) analyzed the merging and splitting of SSTAs and found a close relationship between the evolution of SSTAs and the strength or weakness of the ENSO. Thus, knowing where, when and how SSTAs vary and evolve is important to understanding regional and global climate change. Unfortunately, until now, there have been no datasets concerned with the evolution of SSTAs.
Based on these considerations, we designed a process-oriented algorithm to develop a global dataset that describes the evolution of SSTAs based on commonly used satellite-derived SST products. We named this dataset GDPoSSTA (Global Dataset of Process-Oriented SSTA). GDPoSSTA differs from existing widely used SST products in that it includes information about the spatial structure of SSTAs and their temporal evolution: that is, GDPoSSTA is a dynamic dataset. GDPoSSTA consists of three datasets and two relationship files. The datasets consist of a dataset of processed object-oriented SSTAs named DSPOSSTA, a dataset of sequenced object-oriented SSTA series named DSSOSSTA, and a dataset of variation object-oriented SSTAs named DSVOSSTA. One of the two relationship files describes the evolution of SOSSTAs (SSTA sequence objects), and the other describes the evolution of VOSSTAs (SSTA variation objects). Figure 1 gives an example of the GDPoSSTA and shows the relationships between GDPoSSTA, POSSTAs and VOSSTAs. For simplicity, the diagram in Figure 1 does not include SOSSTAs. Thus, in the diagram, GDPoSSTA consists of just two datasets: DSPOSSTA and DSVOSSTA. DSPOSSTA includes five POSSTAs, POSSTA1 to POSSTA5. DSVOSSTA includes 16 VOSSTAs. Thirteen development behaviors and two splitting behaviors are also included. At each time snapshot, several independent VOSSTAs exist; e.g. there are four VOSSTAs at time T1 and five VOSSTAs at time T2. VOSSTAs with the same POID (process object identifier) belong to the same POSSTA; e.g. POSSTA1, POSSTA2 and POSSTA4 each include four VOSSTAs, POSSTA3 includes five VOSSTAs, and POSSTA5 includes three VOSSTAs.

Table 1. Some widely used global SST datasets.
No. | Dataset | Spatial resolution | Temporal coverage | Temporal resolution
1 | (Saha et al., 2018) | 4 km | 1981.01-2020.12 | Sub-daily
2 | AMSR-E (Wentz, Meissner, Gentemann, & Brewer, 2014) | 0.25° | 2002.06-2011.10 | Daily/Monthly/Weekly
3 | NOAA OI SST V2 (Reynolds et al., 2002) | 1° | 1981.12-2020.12 | Monthly/Weekly
4 | NOAA OI SST V2 High Resolution Dataset (Reynolds et al., 2007) | | |
Compared with other commonly used SSTA products, e.g. Kaplan extended V2 SSTA (Kaplan et al., 1998), the GDPoSSTA dataset has the following advantages.
(1) Each VOSSTA has a clear spatial coverage and a thematic attribute at a given time snapshot.
(2) Within each SOSSTA, the evolution between successive VOSSTAs is similar.
(3) Each POSSTA shows the changes in an SSTA with time as well as its spatial coverage and thematic characteristics.
(4) Each POSSTA evolves from the time of its appearance through its development until it eventually merges, splits or dissipates. That is, from a POSSTA, it is possible to determine when and where SSTAs are generated and dissipate, and also how they develop, merge and split.

Methods
An SSTA can be defined as an abnormal increase or decrease in the SST over a specific spatial domain for a specific time (Xue, Song, Qin, Dong, & Wen, 2015a). SSTAs develop in space and time until they dissipate (Liu, Xue, Dong, Wu, & Xu, 2019; Xue et al., 2019a). Here, we define this evolution in terms of a process object named a POSSTA.

Input data
The original SST product used in this study consisted of monthly global AVHRR data for the period January 1982 to December 2009. These data have a spatial resolution of 4 km. The dataset was obtained from the AVHRR Pathfinder Version 5 and Version 5.1 SST Project (https://data.nodc.noaa.gov/pathfinder/); further details can be found in Kilpatrick, Podestá, and Evans (2001).

Data processing
The process of producing the POSSTA dataset (DSPOSSTA) from a time-series of satellite-based SST products consisted of four stages, and three intermediate datasets (an SSTA dataset, a VOSSTA dataset and a SOSSTA dataset) were produced during this process. The output also included two relationship files, a SequenceRelationshipFile and an ObjectRelationshipFile, in which the evolution behaviors between SOSSTAs and between VOSSTAs were stored. The overall workflow is shown in Figure 2. Within a DSPOSSTA, a POSSTA describes the evolution of an SSTA as it develops from its appearance through to eventual dissipation; this process includes one or more SOSSTAs. A SOSSTA consists of several VOSSTAs within successive time snapshots. All VOSSTAs belonging to a particular SOSSTA are at a similar stage of their evolution, e.g. developing, expanding or weakening. All VOSSTAs belonging to the same SOSSTA are tagged with the same sequence identifier (SID), and all SOSSTAs belonging to the same POSSTA are tagged with the same process object identifier (POID). SequenceRelationshipFile stores details of the type of evolving behavior between SOSSTAs, and ObjectRelationshipFile stores details of the type of evolving behavior between VOSSTAs.

Key steps
Four key steps are shown in Figure 2.
(1) Removal of seasonal variations in grid-based SST time-series to generate monthly averaged SSTAs. SSTs vary with the season. These variations are mainly due to changes in solar radiance and need to be removed from climatological sequences before anomalous events can be identified (Xue et al., 2015a). The standard monthly averaged anomaly algorithm, denoted as the z-score (Zhang et al., 2005), is suitable for removing seasonal fluctuations. For each calendar month from January to December, the z-score takes all the values for that month from a long time-series of images and calculates their mean and standard deviation. Each of the original values is then standardized by subtracting the mean and dividing by the standard deviation, as shown in Equation (1):
$$X_{ij} = \frac{X^{0}_{ij} - \bar{X}_{j}}{\delta_{j}} \qquad (1)$$
where $i$ is the year, $j$ is the month from January to December, $\bar{X}_{j}$ and $\delta_{j}$ are the average value and standard deviation for the $j$th month, respectively, and $X^{0}_{ij}$ and $X_{ij}$ are the raw and transformed values (of the monthly SST anomaly) obtained from the long-term series of images, respectively. The intermediate dataset produced in this way consists of anomalies in the monthly average SST, i.e. SSTAs.
(2) Extracting VOSSTAs from spatiotemporal SSTA data. In this step, first the spatiotemporal neighborhoods of SSTAs are reconstructed by taking into account the spatial and temporal continuity of SSTAs as well as their thematic attributes. Second, density-based clustering based on spatial connectivity and temporal evolution is used to define a spatiotemporal neighborhood search window. The spatial proximity, continuity in time and homogeneity in thematic attributes are taken into consideration to obtain the spatiotemporal clustering cores and spatiotemporal density of each grid cell. Finally, the grid cells belonging to the same cluster at each time are taken as being an object, defined as a VOSSTA. More details about the density-based spatiotemporal clustering method can be found in our earlier study (Liu et al., 2018).
(3) Matching and tracking successive VOSSTAs to form a SOSSTA. In this step, it is assumed that VOSSTAs that are spatially connected, are present at successive times, and have consistent thematic attributes belong to the same SSTA sequence, i.e. a SOSSTA. Thus, the spatial structures, times of appearance and thematic features of VOSSTAs are used as the input features to construct the cost function of VOSSTAs and then to match and track a SOSSTA. The spatial features include the eccentricity, rectangularity, soundness, and shape index of the VOSSTAs; the thematic feature used is the mean SST value of the VOSSTA. For details of the process used to construct the cost function for tracking successive VOSSTAs, please refer to our earlier study (Xue, Liu, Yang, & Wu, 2019b). The Hungarian method is used to solve the resulting assignment problem (Kuhn, 2005).
(4) Linking and reconstructing SOSSTAs to form a POSSTA. The SOSSTAs generated using the Hungarian assignment method (Kuhn, 2005), as described in step (3), exhibit no merging or splitting behavior. It is known that SOSSTAs that overlap in space and time are part of the same SSTA evolution process and belong to a POSSTA. Thus, at this step, the spatiotemporal topological relationships between SOSSTAs are used to reshape and construct a POSSTA. During this linking process, one SOSSTA may be linked with two or more SOSSTAs if they consist of the same VOSSTAs. Thus, a recursive loop that links all the related SOSSTAs is designed to reconstruct a new SOSSTA until the new sequence object exhibits no changes. The linking strategy is designed so that if two SOSSTAs from the same time overlap in space, the original SOSSTAs will be replaced by the union of the two SOSSTAs. This is the key to the linking in the recursive loop.
(5) Identifying different types of evolving behavior from a POSSTA. Once a POSSTA has been generated, a process-oriented graph model with a pairwise node-edge structure is used to represent and store the POSSTA; i.e. each VOSSTA is stored as a node, and the evolution between two VOSSTAs is stored as an edge (Xue et al., 2019a). The type of an edge can be obtained directly from two quantities: the out-degree of the preceding node of the edge, named PreNOD, and the in-degree of the next node of the edge, named NextNID. PreNOD is the number of edges leaving the preceding node, and NextNID is the number of edges entering the next node. The rules used to identify the four types of evolving behavior are described below.
Development: This describes how one object moves from the previous to the current and then to the next snapshot; i.e. there is no interaction with other objects at three successive snapshots. In this case, both PreNOD and NextNID are equal to 1.
Merging: This describes the merging of two or more objects from the previous snapshot into one object at the current snapshot. In this case, PreNOD is equal to 1 and NextNID is not less than 2.
Splitting: This describes an interaction in which one object at the current snapshot splits into two or more objects at the next one. In this case, PreNOD is not less than 2 and NextNID is equal to 1.
Splitting-merging: This describes a situation in which one object splits into two or more objects and, simultaneously, one of these objects merges with another object to form a new object. In this case, both PreNOD and NextNID are not less than 2.
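The standardization in step (1) above can be sketched for a single grid cell as follows. This is a minimal illustration, not the authors' implementation: the function name `monthly_zscore`, the use of the population standard deviation, and the sample values are all assumptions.

```python
from statistics import mean, pstdev


def monthly_zscore(series):
    """Standardize a monthly series calendar month by calendar month.

    series: raw values in time order, starting in January, covering
            whole years (length is a multiple of 12).
    Returns the standardized values in the same order.
    """
    if len(series) % 12 != 0:
        raise ValueError("expected whole years of monthly values")
    out = [0.0] * len(series)
    for j in range(12):                    # j indexes the calendar month
        idx = range(j, len(series), 12)    # positions of month j in every year
        vals = [series[k] for k in idx]
        mu = mean(vals)
        sigma = pstdev(vals)               # population std (an assumption)
        for k in idx:
            # Guard against a constant month, where sigma is zero.
            out[k] = (series[k] - mu) / sigma if sigma > 0 else 0.0
    return out


# Three hypothetical years at one grid cell: a seasonal cycle plus a
# year-dependent offset (0, +1 and -1 degrees).
raw = [20 + m + y for y in (0.0, 1.0, -1.0) for m in range(12)]
ssta = monthly_zscore(raw)
```

Because the mean and standard deviation are computed per calendar month, the seasonal cycle in `raw` cancels and only the year-to-year offsets remain in `ssta`.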
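The matching in step (3) above is an assignment problem: each VOSSTA at one snapshot is paired with at most one VOSSTA at the next so that the total cost is minimal. The sketch below is only illustrative. The Euclidean distance between feature vectors is an assumed stand-in for the cost function of Xue et al. (2019b), the feature values are invented, and an exhaustive search over permutations replaces the Hungarian method (it reaches the same optimum but is only practical for a handful of objects).

```python
from itertools import permutations
from math import dist


def match_objects(prev_feats, next_feats):
    """Return the minimum-cost one-to-one assignment prev -> next.

    Each feature vector is a tuple of hypothetical shape and thematic
    features. Assumes len(prev_feats) <= len(next_feats).
    """
    n_prev, n_next = len(prev_feats), len(next_feats)
    # cost[i][j]: dissimilarity between previous object i and next object j
    cost = [[dist(p, q) for q in next_feats] for p in prev_feats]
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n_next), n_prev):
        total = sum(cost[i][perm[i]] for i in range(n_prev))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(enumerate(best_perm)), best_cost


# Hypothetical (eccentricity, rectangularity, mean anomaly) per object
prev_feats = [(0.80, 0.60, 1.5), (0.30, 0.90, -1.2)]
next_feats = [(0.32, 0.88, -1.1), (0.78, 0.62, 1.6)]
pairs, total_cost = match_objects(prev_feats, next_feats)
```

Here `pairs` links previous object 0 to next object 1 and previous object 1 to next object 0, because those pairs have the most similar features. For realistic object counts, a polynomial-time solver such as the Hungarian method is needed.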
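The linking loop in step (4) above repeats a union operation until nothing changes. A minimal sketch of that fixed-point idea follows, under a simplifying assumption: each SOSSTA is represented as a set of (time index, cell identifier) pairs, so that two sequences "overlap in space at the same time" exactly when their sets intersect. The representation and the name `link_sequences` are hypothetical.

```python
def link_sequences(sequences):
    """Merge sequences that share a (time, cell) element until stable.

    sequences: iterable of sets of (time_index, cell_id) tuples, a
    hypothetical stand-in for SOSSTA footprints.
    Returns a list of merged sets, one per linked process.
    """
    groups = [set(s) for s in sequences]
    changed = True
    while changed:                  # repeat until a full pass makes no union
        changed = False
        merged = []
        for g in groups:
            for m in merged:
                if m & g:           # overlap at the same time and place
                    m |= g          # replace both by their union
                    changed = True
                    break
            else:
                merged.append(set(g))
        groups = merged
    return groups


a = {(1, "c1"), (2, "c2")}
c = {(3, "c3"), (4, "c4")}
b = {(2, "c2"), (3, "c3")}          # bridges a and c
d = {(1, "c9")}                     # unrelated sequence
linked = link_sequences([a, c, b, d])
```

In this example `a` and `c` do not overlap directly, so the first pass only merges `b` into `a`; a second pass is needed to absorb `c`, which is why the loop runs until no further change occurs.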
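The four rules above depend only on two degree counts per edge, so they can be expressed compactly. The following sketch assumes that the POSSTA graph is available as a list of (preceding node, next node) pairs; the node labels and the function name `classify_edges` are hypothetical.

```python
from collections import Counter


def classify_edges(edges):
    """Label each (prev_id, next_id) edge using PreNOD and NextNID."""
    out_deg = Counter(p for p, _ in edges)   # PreNOD of each preceding node
    in_deg = Counter(n for _, n in edges)    # NextNID of each next node
    labels = {}
    for p, n in edges:
        pre_nod, next_nid = out_deg[p], in_deg[n]
        if pre_nod == 1 and next_nid == 1:
            kind = "development"
        elif pre_nod == 1:                   # next_nid >= 2
            kind = "merging"
        elif next_nid == 1:                  # pre_nod >= 2
            kind = "splitting"
        else:                                # both >= 2
            kind = "splitting-merging"
        labels[(p, n)] = kind
    return labels


# a1 develops into a2, a2 splits into b1 and b2, which merge into c1.
edges = [("a1", "a2"), ("a2", "b1"), ("a2", "b2"), ("b1", "c1"), ("b2", "c1")]
labels = classify_edges(edges)

# x splits into y and z while w merges into z.
mixed = classify_edges([("x", "y"), ("x", "z"), ("w", "z")])
```

In the second example the edge from `x` to `z` has PreNOD = 2 and NextNID = 2, so it is the only edge labeled as splitting-merging.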

Data records
The dataset produced in this study was given the name GDPoSSTA. GDPoSSTA consists of three datasets in SHP format (ArcGIS shapefile format), based on the WGS-84 coordinate system, and two relationship files in CSV format. The first dataset is DSPOSSTA. This dataset stores details of POSSTAs under file names of the form ProcessObjectDatasets.shp. Each record gives the extent of the spatial coverage, the duration time and the attributes of a process object. The second dataset is DSSOSSTA, which stores SOSSTAs in files with names of the form SequenceObjectDatasets.shp. Each of these records contains details of the spatial extent, duration, attributes and type of a sequence object. The last of the three datasets is DSVOSSTA, which contains files with names of the form VariationObjectDatasets.shp. These files store details of VOSSTAs, and each record contains details of the spatial extent, time of occurrence and thematic attributes of these objects. The first type of relationship file has names of the form SequenceRelationship.csv and stores details of the evolving behavior between two SOSSTAs; the other type has names of the form ObjectRelationship.csv and stores details of the evolving behavior between two VOSSTAs. There are a total of 87 records in DSPOSSTA. The spatial distribution of these records is shown in Figure 3, and the fields included in the ProcessObjectDatasets.shp files are shown in Table 2.
There are a total of 492 records in DSSOSSTA. The fields in the SequenceObjectDatasets.shp files are shown in Table 3. The SequenceObjectDatasets.shp files are associated with the ProcessObjectDatasets.shp files through a POID field.
There are a total of 1232 records in DSVOSSTA. The fields included in the VariationObjectDatasets.shp files are shown in Table 4. The VariationObjectDatasets.shp files are associated with the SequenceObjectDatasets.shp files by a SQID and with the ProcessObjectDatasets.shp files by a POID. A total of 326 evolving behaviors of SOSSTAs are stored in the SequenceRelationship.csv files. The fields that comprise these files are shown in Table 5. A total of 1066 evolving behaviors of VOSSTAs are stored in the ObjectRelationship.csv files. The fields that comprise these files are shown in Table 6.
All of the above datasets are available from the ScienceDB record associated with this publication. Further details of the available files are listed in Table 7.

Technical validation
GDPoSSTA consists of three kinds of objects: POSSTAs, SOSSTAs and VOSSTAs. A SOSSTA is an intermediate object that consists of a series of VOSSTAs; thus, evaluating the VOSSTAs and the POSSTAs is sufficient to test GDPoSSTA.

Validation of VOSSTAs
The core idea behind the extraction of VOSSTAs from long-term grid-based SST products is the use of a density-based spatiotemporal clustering algorithm, DcSTCA. This algorithm was developed in our earlier study (Liu et al., 2018). In DcSTCA, the grid cells that belong to the same cluster at a given time are taken as forming one VOSSTA. As a VOSSTA relates to an abnormal increase or decrease in SST over a specific spatial domain, we used k-Standard Deviation, which is a spatiotemporal statistical algorithm, to identify objects and evaluate the VOSSTA datasets. The identified objects were given the name VOSSTA-Ks. The k-Standard Deviation algorithm first identifies the abnormal grid cells in a time-series, where abnormal means that their values are greater than k times the standard deviation of the long-term SST values. The abnormal grid cells from a given time are then spatially connected to form an object, i.e. a VOSSTA-K. Figure 4(a) shows five VOSSTAs that were retrieved from a remote-sensing SST dataset from March 1983 using the proposed DcSTCA. The background consists of the monthly SSTA data. Figure 4(b) shows a comparison between the results obtained using the k-Standard Deviation algorithm and our proposed algorithm. The results show that our VOSSTAs fall between the objects obtained with k = 1 and k = 2 (VOSSTA-1s and VOSSTA-2s).
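The k-Standard Deviation baseline described above has two parts: thresholding at k times the standard deviation, then grouping adjacent abnormal cells. A minimal sketch for one time snapshot is given below. It makes several assumptions that the text does not specify: a single standard deviation for the whole grid, 4-connectivity between cells, and positive (warm) anomalies only; the grid values are invented.

```python
from collections import deque


def k_std_objects(grid, sigma, k):
    """Group cells whose value exceeds k * sigma into connected objects.

    grid: 2-D list of anomaly values for one time snapshot.
    sigma: long-term standard deviation (one value here for brevity).
    Returns a list of objects, each a sorted list of (row, col) cells.
    """
    rows, cols = len(grid), len(grid[0])
    abnormal = [[grid[r][c] > k * sigma for c in range(cols)]
                for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            if not abnormal[r][c] or seen[r][c]:
                continue
            # Breadth-first search over 4-connected abnormal neighbours.
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and abnormal[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            objects.append(sorted(cells))
    return objects


grid = [
    [0.2, 1.4, 1.6, 0.1],
    [0.0, 2.3, 1.2, 0.3],
    [0.1, 0.2, 0.0, 2.5],
]
objs_k1 = k_std_objects(grid, sigma=1.0, k=1)
objs_k2 = k_std_objects(grid, sigma=1.0, k=2)
```

With k = 1 the four adjacent cells in the upper middle form one object and the isolated cell in the lower right forms another; with k = 2 only the two strongest cells remain, each as its own object. Raising k therefore shrinks the detected objects, which is the behavior the comparison in Figure 4(b) relies on.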

Validation of POSSTAs
The tracking of a POSSTA and its construction from successive VOSSTAs depends on the PoAIR algorithm that was introduced in our earlier study (Xue et al., 2019b). PoAIR has been evaluated using 10 simulated process-oriented datasets and the TITAN algorithm (Dixon & Wiener, 1993). Table 8 shows a comparison of the tracking of process objects in terms of the probability of detection (POD). This table is modified from the results shown as Figure 9 and Table 3 in Xue et al. (2019b). Table 9 shows a comparison of the results for the identified evolving behaviors between process objects, also given in terms of the POD. The results shown in these two tables demonstrate that PoAIR performs better than TITAN in terms of POD for both basic and complicated process objects, and that PoAIR clearly outperforms TITAN when dealing with the splitting, merging, and splitting-merging behaviors of VOSSTAs.

Table 8. Comparison of the tracking of process objects by the proposed PoAIR algorithm and TITAN (modified from Figure 9 and Table 3 in Xue et al. (2019b)).

In Table 8, NTO (number of target objects) means the number of real abnormal variation objects belonging to the same process object, and NDO (number of detected objects) means the number of abnormal variation objects detected by TITAN or the PoAIR algorithm. If a process object is tracked as two or more independent process objects, NDO will have two or more values; e.g. process object 1 was tracked as two process objects by both TITAN and the PoAIR algorithm. In this case, one process object includes six variation objects and the other includes four variation objects. In such a situation, the POD for the larger NDO is taken as the probability of detection; thus, the POD for both TITAN and the PoAIR algorithm is 60.00%. If a noise object is identified as an object belonging to a process object, the NDO is taken as the number of detected objects plus the number of noise objects; e.g. for process object 2, the NDO for TITAN is 12 + 1. Here, 12 is the number of detected objects and 1 is the number of noise objects; both types of object are regarded by TITAN as belonging to process object 2. NTE (number of targeted evolving behaviors) means the number of real evolving behaviors linking all the process objects. NDE (number of detected evolving behaviors) means the number of evolving behaviors detected by TITAN or by the PoAIR algorithm.
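The POD convention for a fragmented process object can be illustrated with the numbers quoted above. The sketch below assumes that POD is the ratio of NDO to NTO expressed as a percentage, and that NTO for process object 1 is 10 (the sum of the two fragments); neither is stated explicitly in the text, but both are consistent with the reported value of 60.00%. The treatment of noise objects is not covered.

```python
def probability_of_detection(nto, ndo_parts):
    """POD in percent, using the largest detected fragment.

    nto: number of target objects in the true process object.
    ndo_parts: detected object counts, one per tracked fragment.
    """
    if nto <= 0:
        raise ValueError("nto must be positive")
    return 100.0 * max(ndo_parts) / nto


# Process object 1: tracked as two fragments of six and four objects.
pod = probability_of_detection(nto=10, ndo_parts=[6, 4])
```

Under these assumptions `pod` evaluates to 60.0, matching the value given for process object 1.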

Analysis of a specific POSSTA
A POSSTA describes the evolution of an SSTA from its origin through its development until it dissipates in space and time. The evolution of SSTAs is closely related to global climate phenomena such as ENSO and the Indian Ocean Dipole (IOD) (McPhaden et al., 2006; Song et al., 2016; Xue et al., 2019a). Thus, we analyzed the dynamic evolution of a specific POSSTA and its relationship with the ENSO to indirectly test the applicability of POSSTAs. Figure 5 depicts the evolution of the selected POSSTA in the Eastern Pacific Ocean from November 1982 to August 1983. Using the area covered by the VOSSTA at a particular time to represent the intensity of the POSSTA, Figure 6 shows that there is a close relationship between the evolution of the POSSTA and the ENSO event that lasted from May 1982 to September 1983. From Figures 5 and 6, it is possible to determine when and where the SSTA originated and dissipated, as well as when and where it became stronger (merging) and weaker (splitting). Analyzing the evolution of SSTAs will give a better understanding of the evolution of the ENSO, and POSSTAs provide more information than static SSTAs.

User notes
In contrast to SSTA remote-sensing datasets, the GDPoSSTA dataset provides information not only about the spatial distribution of SSTAs at a given time (i.e. SSTA snapshots) but also about the evolution of SSTAs with time. The evolution of SSTAs in space and time plays a much more important role in global or regional climate change than their static characteristics in space. For example, the origin or dissipation of a POSSTA, the spatial distribution of its evolution relationships, and its migration in space are all closely related to different types of ENSO: i.e. the Eastern Pacific ENSO, Central Pacific ENSO and MIX ENSO. The files in the three datasets that make up the GDPoSSTA are in SHP format. This is a commonly used format in GIS, which means that the files can easily be read by commercial GIS software such as ArcGIS and MapGIS, as well as open-source GIS software such as QGIS and GRASS. The two types of relationship file, which are in CSV format, can be opened in Microsoft Excel or read by any software that handles CSV.
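As an illustration of reading a relationship file outside a spreadsheet, the sketch below counts evolving behaviors by type using only the Python standard library. The column names (`PreID`, `NextID`, `Behavior`) and the sample rows are hypothetical; the actual fields are those listed in Tables 5 and 6, and the in-memory text stands in for a file such as ObjectRelationship.csv.

```python
import csv
import io
from collections import Counter

# Stand-in for ObjectRelationship.csv; the column names are hypothetical.
sample = io.StringIO(
    "PreID,NextID,Behavior\n"
    "v1,v2,development\n"
    "v2,v3,splitting\n"
    "v2,v4,splitting\n"
)


def count_behaviors(handle, column="Behavior"):
    """Count relationship records per evolving-behavior type."""
    return Counter(row[column] for row in csv.DictReader(handle))


counts = count_behaviors(sample)
```

To apply the same function to a downloaded file, the in-memory sample would be replaced by a handle from `open(path, newline="")`, with `column` set to the real field name.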
The GDPoSSTA source code will be provided upon request for the purpose of replicating the reanalysis data described in this paper. The code may be requested from the corresponding author by email.