The first 10 m resolution thermokarst lake and pond dataset for the Lena Basin in the 2020 thawing season

ABSTRACT Climate warming rates in the Arctic are far greater than the global average, exerting stronger impacts on permafrost degradation and thermokarst landform development. Thermokarst lakes and ponds (TLPs), which are widely distributed in the Lena Basin in the Russian Arctic, play a vital role in altering local ecosystem. However, the detailed distribution of TLPs in the Lena Basin still remains poorly known. In this study, we built the first 10 m resolution TLP dataset for the Lena Basin in the 2020 thawing season by utilizing 4902 Sentinel-2 images. A robust mapping workflow was developed and implemented in the Google Earth Engine (GEE) platform. The accuracy assessment demonstrates a satisfactory accuracy (93.63%), and our results exhibit a better consistency with real TLPs than global water body products. A total of 380,477 TLPs (~0.53% of the total surface area of the Lena Basin) were identified, showing an uneven distribution in the five sub-basins. The TLPs were found to be mainly located within plain areas, with an active layer thickness in the range of 80–100 cm. The higher ground ice content and mean annual ground temperature were favorable for TLP development. This dataset will be valuable for investigating the complex interaction between TLPs and permafrost. It will also serve as a baseline product for better incorporating thermokarst processes into permafrost-climate models.


Introduction
The Arctic has experienced accelerated climate warming (nearly four times the global average) over the past four decades (IPCC, 2021;Rantanen et al., 2022).Permafrost, which underlies most of the Arctic terrestrial surface, has undergone extensive degradation due to climate warming (Biskaborn et al., 2019;Smith et al., 2022).The thermokarst processes, caused by the thawing of ice-rich permafrost, would cause collapse, erosion and surface subsidence, and further trigger the development of distinctive thermokarst landforms, including wetland, lake and hillslope landscapes (Grosse et al., 2013;Olefeldt et al., 2016).Therefore, the thermokarst processes profoundly alter the geomorphological configurations of the Arctic landscape (Farquharson et al., 2019;Kokelj & Jorgenson, 2013).Among various landforms, the thermokarst lakes and ponds (hereafter referred to as "TLPs") are one of the most significant thermokarst landforms in the Arctic and the Qinghai-Xizang Plateau (QXP) (Luo et al., 2015;Morgenstern et al., 2011;Niu et al., 2014).In areas with thick unconsolidated organic sediment and massive ground ice, surface subsides due to permafrost degradation and then meltwater fills in the topographic depressions, forming TLPs (Jones et al., 2011).The life cycle of TLP is mainly divided into four stages: lake initiation (ice wedge thaw, shallow depression formation), development (horizontal and lateral expansion, talik development), drainage and re-development of drained basins (permafrost development, ice aggradation) (Kokelj & Jorgenson, 2013;Morgenstern et al., 2011;Olefeldt et al., 2016).The TLPs are also classified into three categories depending on their surface area: depression (<10 m 2 ), ponds (10-1000 m 2 ), and lakes (>1000 m 2 ) (Zabelina et al., 2021).The surface area of TLPs showed significant variance, ranging from a few square meters to several square kilometers in the Arctic and the QXP (Mu et al., 2023;Muster et al., 2017;Wei et al., 2021).
TLPs play a vital role in the hydrological-thermal energy exchange between atmosphere and subsurface (Jones et al., 2022) and serve as a key indicator in understanding permafrost degradation (Xu et al., 2023).In addition, TLPs provide the necessary wildlife habitats (Bouchard et al., 2016), as well as water sources for industrial activities and human society (Bogdanova et al., 2023;White et al., 2007).However, it has been confirmed that TLPs have exacerbated permafrost degradation through thermal and mechanical processes in the circumpolar Arctic (Brouchkov et al., 2004;West & Plug, 2008), as well as in the QXP (Lin et al., 2016;Ling et al., 2012;Luo et al., 2022;Peng et al., 2021;Șerban et al., 2021).The developing TLPs triggered permafrost thaw and further surface subsidence, which can cause irreversible damage to the stability of infrastructures (Instanes et al., 2016;Xu et al., 2023).As an important sequestration pool for organic carbon, TLPs played an unignorable role in forming positive climate feedback system through emitting greenhouse gases (GHGs) into the atmosphere (Abnizova et al., 2012;Mu et al., 2023;Serikova et al., 2019).Under the scenario of rapid permafrost degradation and enhanced anthropogenic activities in the cold regions, it is of great importance to obtain the spatiotemporal variability of TLPs and further investigate the driving factors (Desyatkin et al., 2007;Tian et al., 2017).
The Lena Basin, as one of the five largest river basins in the Arctic (Grosse et al., 2013), is extensively underlain by complex Yedoma ice deposits in Russia (Olefeldt et al., 2016;Strauss et al., 2013).An obvious year-on-year increment in soil temperature and moisture has been witnessed in the Lena Basin, resulting in a rapid increase in permafrost thaw (Dzhamalov et al., 2012;Gautier et al., 2021;Iijima et al., 2010).The recent warming and wetting trends of climate, together with the impact from changing permafrost, have altered the dynamics of TLPs in the Arctic (Bouchard et al., 2016;Morgenstern et al., 2011;Muster et al., 2017).However, the distribution of TLPs for the entire Lena Basin still remains relatively unknown.In order to better identify hot spot areas to precisely monitor TLP dynamics, there is an urgent need to map and obtain quantitative information about the extensively distributed TLPs in the Lena Basin.
The rapid development of Earth observation technology has facilitated the monitoring of TLPs in remote data-sparse regions.Owing to the large discrepancy in surface reflectance between water bodies and land, optical images are considered the main data source for monitoring TLPs.Several early attempts have been made to provide global or regional water body thematic products, including the Global Surface Water (GSW) product (Pekel et al., 2016), the HydroLAKE product (Messager et al., 2016), and the circum-Arctic Permafrost Region Pond and Lake (PeRL) database (Muster et al., 2017).The 30 m spatial resolution GSW and HydroLAKE datasets included all global water bodies which brought unexpected discrepancies in distinguishing TLPs from other water bodies, and the mapping accuracy in polar regions was not as satisfactory as those in mid-latitude regions.The PeRL dataset provided the detailed distribution of TLPs on local scale in the circum-Arctic, yet there was only one study site located in the Lena Basin (Muster et al., 2017).The relatively coarse resolution and incomplete spatial coverage of these products cannot meet the demand for obtaining high-resolution TLP distribution information at a basin scale, and introduced biases in quantitatively evaluating the environmental impacts of TLPs (Grosse et al., 2008).
Concerning the various methods to identify TLPs, the traditional manual interpretation of TLPs based on aerial and satellite images is highly time-consuming when mapping TLPs over large areas.Several spectral indices are capable of mapping water bodies with a higher efficiency, such as the normalized difference water index (NDWI) (Gao, 1996) and the modified normalized difference water index (MNDWI) (Xu, 2005).However, the accuracy of these indices is greatly hampered when mapping shallow lakes in areas of complex topography (Feyisa et al., 2014;Kaplan & Avdan, 2017).The random forest classifier, as a machine learning method in the form of an ensemble of decision trees (Pal, 2005), has been widely applied in land-cover classification due to its robustness and high accuracy (Belgiu & Drăguţ, 2016).Recently, the random forest classifier, when combined with a manual interpretation method, has exhibited remarkable precision in identifying water bodies in cold regions (Hu et al., 2021;Wei et al., 2021).The Google Earth Engine (GEE) platform, which is a cloud-based platform supported by Google's massive remotely sensed data and computational capabilities, made long time-series and largescale geospatial analysis possible (Gorelick et al., 2017).Due to their high resolution and short revisit cycle, the twin Sentinel-2 satellites from the European Space Agency (ESA) provide us with an opportunity to monitor TLPs, especially small TLPs based on their spectral information (Drusch et al., 2012;Șerban et al., 2020).The GEE platform has collected considerable Sentinel-2 images since March, 2017.
In this study, we aimed to build the first 10 m resolution TLP dataset for the Lena Basin based on 4902 Sentinel-2 images obtained in the 2020 thawing season.First, we developed a robust mapping workflow specific to TLPs based on its characteristics in the GEE platform.Then, the accuracy of the mapping results was comprehensively evaluated based on comparisons with widely used land-cover products as well as a validation data set.Finally, we analyzed the spatial distribution of TLPs and investigated the role of climatic and environmental factors in TLP development in the Lena Basin.

Study area
The Lena River is the longest river entirely located within Russia (~4294 km) and the easternmost of the three large rivers (Ob, Yenisei, and Lena) in Siberia which enter the Arctic Ocean.The surface area of this basin is nearly 2.46 million km 2 .This magnificent river transports 485-530 km 3 of freshwater to the Laptev Sea every year (McClelland et al., 2004;Paffrath et al., 2021).As shown in Figure 1(b), the Lena River flows northeast from the Baikal Mountains and then it is joined by the Vitim and Olekma rivers.In the Central Yakutian Lowland, it is joined by the Aldan and Vilyuy rivers.The Lena River then flows to north until it expands into a large delta, one of the largest pristine river deltas in the world, finally it flows into the Laptev Sea.Various types of permafrost occupy about 90% of the land area in the Lena Basin, with a majority being continuous permafrost (McClelland et al., 2004;Obu et al., 2019).The tree-cover is the dominating land cover type in the Lena Basin, followed by the grassland (Figure 1(c)) (Zanaga et al., 2021).The Lena Basin has witnessed an increase of mean annual air temperature (0.74°C/decade) during 2003-2019 (Liu & Wang, 2022).The active layer thickness (ALT) was estimated with a mean value of 1.69 m during 1961-1990(Zhang et al., 2005)).Based on the in-situ monitoring data from the Circumpolar Active Layer Monitoring (CALM) program (Brown et al., 2000), the mean annual ALT during 2008-2022 varied from 30.86 cm in the northern part to 202.53 cm in the southern part.(Lena, Vilyuy, Aldan, Olekma, and Vitim).The permafrost type is derived from the 1 km 2 permafrost zonation data (Obu et al., 2019).The base map exhibits topographic relief based on the ALOS 30 m DEM data (Tadono et al., 2016).(c) the land cover is derived from the ESA 10 m WorldCover dataset (Zanaga et al., 2021).

Sentinel-2 satellite imagery
In 2015 and 2017, the European Space Agency (ESA) launched the twin Sentinel-2 satellites (S2A and S2B, respectively) of the second Copernicus Mission (Drusch et al., 2012).The Sentinel-2 constellation provides dense time-series observations at a five-day revisit cycle with a swath width of 290 km.The optical payload, namely the Multi-Spectral Instrument (MSI), covers 13 spectral bands at different spatial resolutions of 10-60 m, and the four 10-m resolution bands are within the visible (Vis) and near-infrared (NIR) domain.A total of 4902 Level-2A orthorectified atmospherically corrected images obtained in the 2020 thawing season (mid-June to mid-October) were obtained through the Google Earth Engine (GEE) platform.The spatial availability of the Sentinel-2 images across the study area is exhibited in Figure 2, where it can be seen that the available Sentinel-2 images are concentrated in high-latitude regions.

Permafrost, climate, and auxiliary data
As shown in Table 1, we collected various data which were related to permafrost properties, namely permafrost type, mean annual ground temperature, active layer thickness and volumetric ground ice content, to investigate the spatial distribution relationship between these key characteristics and TLPs.The 1 km 2 spatial resolution permafrost zonation data, which were mainly developed based on remotely sensed land surface temperature and reanalysis data, were obtained from Obu et al. (2019).In addition, we collected 1 km spatial resolution mean annual ground temperature (MAGT) data and active layer thickness (ALT) data during 2000-2016 in the Northern Hemisphere, which were generated by a statistical learning model based on in-situ borehole observations and satellite data (Ran et al., 2022).We also downloaded the 1 km spatial resolution volumetric ground ice content (GIC) data, which indicated the pore and segregated ice contents in the top 5 m of permafrost based on a statistical model (Karjalainen et al., 2022).
Due to the uneven distribution of monitoring sites, limited in-situ observations of climate factors (temperature, precipitation, evaporation, etc.) failed to meet the demand to consistently provide available data with sufficient spatial coverage of the Lena Basin.The fifth generation atmospheric reanalysis from the European Centre for Medium-Range Weather Forecasts (ECMWF), namely ERA5-Land (hereafter referred as "ERA5L"), were then collected.It provides hourly climatic variables at a spatial resolution of ~9 km, from 1979 to the present (Muñoz-Sabater et al., 2021).Since the TLPs are initiated by the thermalhydrological energy exchange between atmosphere and permafrost.We collected eight critical atmospheric variables (monthly averaged) to investigate their impact on the spatial distribution of TLPs (Table 1).Two land-cover/water body products were obtained to construct a training sample dataset for the TLP classification.The first was the ESA WorldCover 10 m 2020 product, which is a new global baseline product which consists of 10 types of land cover, with an overall accuracy of 75% at a 10 m spatial resolution based on Sentinel-1/-2 images obtained in 2020 (Zanaga et al., 2021).The second was the 30 m spatial resolution Global Surface Water (GSW) product from the European Commission's Joint Research Centre (JRC), which provides detailed spatio-temporal variations of global surface water bodies since 1984 based on over three million Landsat images (Pekel et al., 2016).The GSW yearly data were collected through the GEE platform.The ALOS World 3D (AW3D) 30 m digital elevation model (DEM) data (Tadono et al., 2016) were employed to obtain the ground elevation and slope angles.The 30 m spatial resolution Global River Widths from Landsat (GRWL) Database (Allen & Pavelsky, 2018) was applied to mask the non-TLP water bodies which were directly connected to rivers.To comprehensively evaluate the accuracy of the mapped TLPs, two global/regional 30 m spatial resolution water body products were employed, namely the GSW data (Pekel et al., 2016), the HydroLAKE data (Messager et al., 2016).

Definition of TLPs
Given that permafrost is a subsurface cryosphere component, it is difficult to confirm whether a lake is initiated by thermokarst processes solely based on Earth observation data (Mu et al., 2020).In general, the surface area is the most reliable parameter when identifying TLPs based on remotely sensed imagery (Muster et al., 2017).As shown in Table 2, published literature illustrates that the surface area of identified TLPs ranged from tens of square meters to several square kilometers (Jones et al., 2011;Mu et al., 2023;Muster et al., 2017;Șerban et al., 2020;Wei et al., 2021).
The reliable water body identification normally required at least five or six consecutive pixels (Polishchuk et al., 2018), so the minimum surface area is set as 500 m 2 based on the Sentinel-2 imagery (100 m 2 for a single pixel).As summarized in Table 2, the maximum surface area of identified TLPs varied from 0.5 to 3 km 2 in the Arctic and the QXP.Therefore, the surface area of TLPs is determined from 0.0005 to 3 km 2 in the Lena Basin.Another key characteristic of TLPs is that they are mostly isolated and not directly connected to hydrological networks (Manasypov et al., 2020).

Mapping workflow of TLPs
In this study, a mapping workflow specific for TLPs was developed, which consisted of four main parts: 1) data filtering and pre-processing; 2) training sample data set construction; 3) TLP identification; and 4) post-processing (Figure 3).

Data filtering and pre-processing
The Sentinel-2 images acquired in the thawing season (15 June to 15 October) with a cloud cover of less than 5% were selected in the GEE platform.We applied the QA60 band, which is a bitmask band with opaque and cirrus cloud information, to mask the cloud pixels.To increase the processing efficiency, the Lena Basin was divided into 341  grid cells (size of 100 × 100 km, except for those grid cells in the fringe area).We then calculated the median value of each pixel to synthesize a single composite image for each grid cell.All 10 bands (B1-B8, B11, and B12) of the composite images were resampled to a spatial resolution of 10 m.

Construction of the training sample dataset
Prior to the TLP classification, we established a training sample dataset, containing two categories: "TLP" for water bodies and "noTLP" for other land-cover types.For the visual interpretation, we carefully selected 100-150 sample points in each grid cell, depending on its surface area and water body density.The selected points were evenly distributed in each grid cell and were manually checked on Google Earth.The traditional manual selection of sample points is believed to be the most accurate method and widely employed in researches related to water bodies extraction (Hu et al., 2021;Liu et al., 2018); however, it is time-consuming and highly dependent on the quality of satellite images.Besides the manually selected sample points, we also automatically selected training sample points.Based on the GSW dataset, for each grid cell, we randomly selected 100 points which were categorized as permanent water.As for the ESA WorldCover product, in order to improve the sampling representativeness and reduce the error, the stratified random sampling method was employed.The count of the selected sample points for the different land-cover types within each grid cell is provided in Table 3. Due to the inherent classification errors in the two afore-mentioned products (GSW, ESA WorldCover), the automatically generated training sample dataset likely included misclassified points.Therefore, we modified this dataset by employing the self-organizing map (SOM) method (Kohonen, 1990).As an artificial neural networkbased algorithm, the SOM method is mainly trained by unsupervised competitive learning (Kohonen, 1990), and is well known for its clustering capacities (Vesanto & Alhoniemi, 2000).A SOM network normally consists of two layers, namely the input layer and the competitive layer, with each neuron in the competitive layer corresponding to all the neurons in the input layer by various weights (0-1) (Hu & Weng, 2009).In this study, a network of 100 neurons was built based on the batch weight/ bias rules algorithm (Avci & Abdeljaber, 2016), and this network was then iterated 200 times for the preliminary training.The input dataset of the water class consisted of 2N TLP points and N randomly selected noTLP points, and vice versa, for the input dataset of the no-water class.A purity threshold of 80% was set to filter out those clusters not exhibiting significant differences between water and no-water sample points.Finally, by combining the manually and automatically selected points, we established a training sample point dataset of 153,212 points, including 65,267 TLP points and 87,945 noTLP points in the Lena Basin.The spectral, topographic, and textural features for each sample point were also obtained.

Identification of TLPs
The random forest classifier was employed to map TLPs in GEE.Random forest is an ensemble classifier that uses classification and regression trees (CARTs), where the number of CARTs is of great significance in determining the classification accuracy (Breiman, 2001).Due to its robustness and satisfactory accuracy, this classifier has been widely used in land-cover classification (Belgiu & Drăguţ, 2016;Pal, 2005) as well as TLP mapping (Janiec et al., 2023;Șerban et al., 2020;Wei et al., 2021).A total of 21 features were selected, including 11 spectral features, seven textural features, and three topographic features.These features were calculated for each pixel and were utilized as the input for TLP classification in the random forest classifier.
Based on the principal component analysis (PCA) algorithm (Wold et al., 1987), the first three components (PC1, PC2, and PC3) of the Sentinel-2 spectral bands within the visible domain were obtained.In addition, three tasseled cap transformation (TCT) indices (Greenness, Wetness, and Brightness, hereafter referred as "TCG", "TCW" and "TCB", respectively) (Crist & Cicone, 1984) as well as the enhanced vegetation index (EVI) (Jiang et al., 2008), the normalized difference vegetation index (NDVI) (Rouse et al., 1974), the normalized difference built-up index (NDBI) (Zha et al., 2003), the modified normalized difference water index (MNDWI) (Xu, 2006), and the bare soil index (BSI) (Diek et al., 2017) for each pixel were obtained.The calculation formulas for the five spectral indices are as follows: The seven textural features were derived based on the gray level co-occurrence matrix (GLCM) algorithm (Haralick et al., 1973).These features were the angular second moment (asm), contrast (con), correlation (corr), variance (var), inverse difference moment (idm), sum average (savg), and entropy (ent).The window size of the GLCM is a vital parameter which determines the quality of the textural features.Based on several examinations and recommendations from previous studies (Chrysafis et al., 2019;Mohammadpour et al., 2022), a window size of 5 × 5 was chosen.The three topographic features were elevation, slope, and aspect from the ALOS 30 m DEM data.
For each grid cell, 70% of the points of the sample dataset were randomly employed to train the random forest classifier, and the remaining 30% of the points were set as the test dataset.We examined different CART values, and found that the performance of the random forest classifier became stable when the value reached 30.

Post-processing
During the post-processing procedures, a surface area mask (0.0005 to 3 km 2 ) was employed to remove the noTLP water bodies, namely the water bodies with their surface area in the range of 0.0005-3 km 2 were classified as potential TLPs.Based on morphological operation (opening) processing, the small isolated water pixels were eliminated, and the boundaries of the larger potential TLPs were smoothed without significantly changing their shape (Haralick et al., 1987;Said et al., 2016).
A study conducted along the Qinghai-Xizang Highway (QXH) and Railway (QXR) investigated the distribution of TLPs in mountain, plateau and basin regions, and results indicated that the number (6.0%) and total surface area (5.9%) of TLPs in mountain are significantly lower than those in the basin and plateau regions (Niu et al., 2011).In addition, it has been suggested that there is little probability of TLPs being located in areas with a slope steeper than 15° in the QXP (Wang et al., 2017).Considering the relatively flat topography in the study area, the slope threshold was set to 20° to mask mountainous area, and the grid cells with ground slopes of over 90% of the area exceeding 20° were removed.Moreover, a GRWL river mask was applied to remove small rivers, as well as the connected water bodies.We examined several values of buffer radius and finally set this to 50 m.Due to the coarse resolution of the GRWL data (30 m), we applied a shape mask for the filtered water bodies based on the landscape shape index (LSI).The calculation of the LSI is as follows: where C is the perimeter and S is the surface area of a single TLP.This index examines the divergence of landscape features from a circle, where a larger LSI value indicates a linear shape (McGarigal et al., 2009).In this study, a water body was marked as a river feature when its LSI exceeded 5. Finally, the remaining water bodies after post-processing were identified as TLPs.

Accuracy assessment
To evaluate the performance of the random forest classifier, the confusion matrix was employed.By comparing the percentage of classified sample points of each class with the ground truth data, the confusion matrix calculated four indices which were widely applied to illustrate the classification accuracy (Congalton, 1991).The four indices are the overall accuracy (OA), user accuracy (UA), producer accuracy (PA) and kappa coefficient (k), respectively.Due to the uneven distribution of TLPs, direct selection of validation points severely diminished the probability of the points being located within TLPs.Therefore, we set up a buffer based on a method proposed by Xu et al. (2022), which considered the perimeter and surface area of features.It was proved to be effective in randomly generating validation sample points for the impervious area classification in the Arctic.The radius of buffer area was calculated as follows, where C is the perimeter and S is the surface area of the TLP: Within the buffer area, we randomly selected 8000 validation points, and conducted the manual interpretation for each validation point.
A visual assessment was also conducted.Due to the lack of available in-situ observations, the mapped TLPs and the corresponding water bodies from the three land-cover products were overlaid with Sentinel-2 false-color images to examine the spatial consistency of our results, compared with these land-cover products.

Processing of the ERA5 land reanalysis data
In-situ monitoring sites cannot provide sufficient and continuous observations in the wide Lena Basin.Thus, the gridded ERA5L reanalysis monthly averaged data was employed, for which there was a total of 88,101 points in the Lena Basin.In order to ensure the reliability of the analysis, a buffer of 2 km to the centroid point of each TLP was constructed for selecting the nearest ERA5L gridded points, and 9024 matching pairs of TLPs and corresponding ERA5L points were finally filtered out.

Evaluation of the mapping performance
In this study, the OA is 93.63% and the kappa coefficient reaches 0.87.The PA and UA are 94.43% and 93.30%, respectively.To encapsulate, the UA is slightly lower than the PA, indicating that the classification errors can be attributed more to misclassifying no-water pixels as water pixels.We also compared our results with three land-cover products (Figure 4).When compared with the HydroLAKE and GSW dataset, our results effectively reduce the false identification of TLPs and show a better performance in mapping the boundaries of TLPs.The HydroLAKE and the GSW datasets included all kinds of global water bodies, however, large water bodies cannot be classified as TLPs based on the definition of TLPs, even these water bodies located in the permafrost zone.Our workflow fully considered the key characteristics of TLPs, which effectively excluded large water bodies (i.e.upper-left corner in Figure 4(f, j, n, r) and lower-right corner in the Figure 4(g, k, o, s)).There is a more reasonable consistency between our results and the ESA WorldCover product.However, as illustrated in Figures 4(g,  s), our results provide more reliable information about TLP boundaries.Moreover, there is less misclassification for our results in areas with a dense distribution of small TLPs and small rivers (Figure 4(h, t)).These comparisons confirm the satisfactory mapping accuracy of the developed workflow, especially in the aspects of identifying small TLPs.

Spatial distribution of TLPs
A total of 380,477 TLPs were mapped, for which the surface area is ~1.23 × 10 4 km 2 , accounting for ~0.53% of the total surface area of the Lena Basin.The mean surface area of the TLPs is 32,171 m 2 , with a median of 3,765 m 2 .Concerning the permafrost type, the distribution of the TLPs exhibits strong differentiation, namely the majority of the TLPs were distributed in the continuous permafrost zone.Figures 5(b-h) show several examples of the mapping results obtained under different ground elevations.
As shown in Figures 6(a-b), there is clear longitudinal and latitudinal zonality for the distribution of the TLPs.The majority of the count as well as the surface area of TLPs are within the longitude interval of 118°E-128°E and the latitude interval of 61°N-67°N.
Considering the spatial distribution of the TLPs in the five sub-basins, the surface area and count of the TLPs in the Lena and Vilyuy sub-basins are much higher than those in the other three sub-basins combined (Table 4).In addition, the mean surface area of the TLPs in the Lena and Vilyuy sub-basins are 35,211 and 37,760 m 2 , which are far beyond the The mapped TLPs were then classified into three types (small, medium, and large), depending on their surface area: 0.0005 to 0.01, 0.01 to 0.1, and 0.1 to 3 km 2 , respectively.There is a uniform pattern of TLP distribution in the five subbasins, namely, small TLPs predominate (Figure 7).This is consistent with previous research by Șerban et al. (2020) in the Headwater Area of Yellow River on the northeastern QXP, which concluded that small-size TLPs took up large part (93% in count).The maximum proportion of small TLPs occurs in the Olekma sub-basin, reaching 78.89%.Although small TLPs make up a lower proportion (64.60%) in the Lena sub-basin, there are still over 100,000 small TLPs.Concerning the TLPs surface area, the large TLPs, which make up a minority of the quantity, generally account for the largest part of the total surface area in the five sub-basins (Figure 7).

Impacting factors for the distribution of TLPs
The development of TLPs is influenced by various climatic factors, especially the hydrological-thermal conditions of permafrost.In this study, a point-based impact analysis of the various factors based on the centroid points of each TLP was conducted.
It was found that 84.76% and 74.69% of the count and surface area of TLPs are in areas with the elevation ranging from 0 to 400 m (Figure 8(a)).Only 3.49% of the TLPs are distributed in areas with elevation over 1000 m.The average and median elevation of the TLPs is 257 and 185 m, respectively.In the aspect of slope, we found a similar spatial pattern (Figure 8(b)).A large number of TLPs (63.66% in count and 81.02% in surface area) are distributed in plain with a local slope of less than 5°.The average and median slope of the TLPs is 6° and 4°, respectively.
As depicted in Table 5, 89.53% of the TLPs (with their surface area accounting for 96.66% of the total area) are distributed in the continuous permafrost zone.The proportion of TLPs, regardless of count and surface area, is less than 5% within the other three types of permafrost.
As shown in Figure 9(a), in the central plain, the active layer is normally shallower than that in the mountainous areas, even in the same latitude.We found that the ALT for over half of the count (58.83%) and surface area (59.62%) of the TLPs is in the range of 80-100 cm (Figure 9d).However, the distribution of TLPs is rather sparse over the deeper active layer (>120 cm).As for the MAGT, it exhibits a stronger latitudinal zonality than ALT, varying from − 9.4°C in the northern region to 2°C in the southernmost part (Figure 9(b)).Over 55% of the total count and 61% of the total surface area of TLPs are in regions with the MAGT in the range of − 4 to − 2°C (Figure 9(d)).Due to its significant latent heat effects, the ground ice content (GIC) alters the permafrost thermal regime and triggers ground subsidence (Noetzli & Gruber, 2009;Smith et al., 2022).Ice-rich permafrost is widely distributed in the Lena Basin, with the GIC exceeding 40% (Figure 9(c)).Among the sub-basins, the GIC is much higher in the Lena and Vilyuy sub-basins, which occupy the largest proportion in the count of TLPs.In general, the higher the GIC within the permafrost layer, the more advantageous it is for the initiation and development of TLPs.We further analyzed several atmospheric and soil variables (Table 1) by processing the ERA5L gridded data to investigate their impacts on the spatial distribution of TLPs.The average 2 m air temperature (T2M) and skin temperature (SKT) in the 2020 thawing season for 61.63% and 62.42% of the TLPs are within 10-12°C, followed by 8-10°C (Figures 10(a-b)).Only a small part of the Lena Basin had a temperature exceeding 12°C during thawing seasons.Therefore, it illustrated that higher air and surface temperature were more favorable for the development of TLPs. Figure 10(c) exhibits a pattern of net evaporation for the TLPs, indicating that the water level of the TLPs is not elevated due to the weak summer rainfall.This suggests that the thawing of ground ice in permafrost, rather than summer precipitation, was still the main source for TLP development and expansion.Figure 10(d) displays the distribution of the TLPs under different soil water volume (SWV) intervals within four soil layers (0-7, 7-28, 28-100, and 100-289 cm).The TLPs with SWV larger than 0.6 m 3 /m 3 are concentrated in the fourth layer, and TLPs with SWV lower than 0.3 m 3 /m 3 are mainly distributed in the first layer.The highest density of TLPs is found where the SWV reaches 0.3-0.4m 3 /m 3 .

Importance of the evaluation on classification features
We employed multi-source features which reflected the different properties of TLPs and effectively provided more available information for the classification.The input features influenced the classification accuracy to different degrees in the Lena Basin.Therefore, it was necessary to evaluate the importance of each feature.
The importance of various features employed in the random forest classifier was calculated in GEE based on the Gini index (Zhi et al., 2022).We found that the TCW plays the most significant role in the random forest classifier (Figure 11).
Comparing the other land-cover types, the water bodies exhibit unique characteristics in the wetness index, due to the higher moisture.This TCT component effectively eliminates other landscapes which are easily misclassified as water bodies, such as shadows, dense vegetation, and burned bare land.The EVI, MNDWI, and NDVI indices are also major contributors, which is consistent with previous studies related to water body mapping (Acharya et al., 2018;Yang & Chen, 2017).In contrast, the BSI, NDBI, and GLCM features were found to be less important in TLP classification.
During the post-processing procedure, we employed several masks (river, slope) and the LSI index to eliminate water bodies that were misclassified as TLPs.Finally, based on the validation dataset and the comparison with the three land-cover products, we concluded that our results exhibit reasonable accuracy (93.63%).Thus, the proposed mapping workflow for TLPs could be implemented over large spatial scales, such as the whole of the Arctic.

Interaction between TLP dynamics and permafrost degradation
Under the scenario of climate warming, TLPs can in turn have a significant impact on permafrost degradation through the strong thermal disturbance (Li et al., 2014).The TLPs altered local hydrological connectivity based on lateral and basal heat erosion, therefore triggered further permafrost degradation in the Arctic and the QXP (Gao et al., 2017;Karlsson et al., 2012;Peng et al., 2021;Qin et al., 2023).Some TLPs are not frozen up during wintertime, and the temperature at the lake bottom is above 0°C throughout the whole year (Lin et al., 2010;O'Neill et al., 2020).The lake depth, together with the lake bottom temperature, poses impacts on the lateral erosion of TLPs (Lin et al., 2017), and the intensifying lateral erosion triggers lake expansion and other permafrost regional disturbances (Nitze et al., 2018).Moreover, deep TLPs serve as a heat source and trigger permafrost degradation, as well as greenhouse gas emissions during wintertime (Walter Anthony et al., 2016).
However, the insufficient knowledge of TLP distribution over large spatial scales has diminished the reliability of modeling and assessing the contribution of TLPs to permafrost degradation based on Earth system models (Nitzbon et al., 2020;Schuur et al., 2015).Benefiting from the development of interferometric synthetic aperture radar (InSAR) and time-series InSAR, permafrost degradation can be investigated by retrieving the periodic ground displacement (Chen et al., 2022;Rouyet et al., 2019;Short et al., 2014).However, the existing studies have mainly concentrated on a single TLP or a few TLPs for investigating the interaction between TLP dynamics and ground displacement (Liu et al., 2014;Niu et al., 2011;Wen et al., 2016).A recent study examined this interaction at a basin scale (~4600 km 2 ) on the QXP, and the results indicated that the density of TLPs is positively correlated with ground subsidence as well as ALT (Xu et al., 2023).However, there is still a lack of such investigations in the circumpolar Arctic.This baseline dataset allows us to identify hot spot areas with a dense distribution of TLPs under complex topography, which provides an opportunity for investigating the interaction between TLP dynamics and permafrost degradation at a regional or even larger scale.

Limitations and future work
In this study, the grid cell size of 100 km 2 is relatively coarse, and the various landforms and unevenly distributed TLPs within a single grid cell increase the uncertainty of accurately mapping TLPs.In some grid cells, the sparse distribution of training sample points may fail to comprehensively reveal the characteristics of the TLPs.The applied masks helped to filter out those water bodies connected to rivers and removed the shadow areas.However, the employment of empirical thresholds as well as the river products with a relatively coarse resolution cause bias in TLP mapping.In future work, we will investigate the adoption of more robust partitioning strategies and masks with locally adaptive thresholds under different topographical conditions.
As for the proposed workflow, the maximum surface area threshold of 3 km 2 introduces omission of larger TLPs which form in deep surface depressions with serious permafrost degradation.The minimum threshold for TLP identification was set as 500 m 2 .Undoubtedly, the limitation of inherent spatial resolution (10 m) of Sentinel-2 images led to the underestimation of TLPs, especially failing to identify small lakes and ponds during their initiation stage.Therefore, the thermokarst lakes and ponds were not discriminated in this study.The advanced image processing technologies, such as the super-resolution imaging and the deep learning algorithms, were proved to be efficient in mapping water bodies (Isikdogan et al., 2017;Lanaras et al., 2018;Yang et al., 2020).These rapidly developing technologies provide new opportunities to discriminate thermokarst ponds from lakes and improve the accuracy of large scale TLP mapping based on remote sensing images.
As for the TLPs themselves, the floating aquatic plants (e.g.algae, phytoplankton) can cause the water surface to be highly similar to the surrounding vegetation (Schmidt & Skidmore, 2003).In addition, we found that differences in water color due to various biochemical composition are quite normal, even within adjacent TLPs in the Lena Basin.As depicted in the accuracy assessment, 3.5% of validation samples were misclassified as TLPs, this is mainly found in the mountain shadow area, which exhibited resemblance in spectral information (dark color) as water bodies.Moreover, the water level during thawing seasons may exhibit large variations due to the impact of snowmelt, ground ice thawing, summer precipitation and evaporation.We calculated the median of available S2 images during thawing season for each grid; however, the selection of study period as well as the median filter may cause differences in TLP mapping results.
The data collected from the field surveys, i.e. the unmanned aerial vehicles (UAV) and the very high resolution (VHR) aerial photos, are critical for the verification of TLP classification based on remote sensing images.It is suggested that validation based on field surveys effectively improved the accuracy of TLP mapping results along the Qinghai-Xizang Highway (QXH) (Mu et al., 2023).Unfortunately, we did not collect available UAV or VHR images covering the Lena Basin.In future work, by combining the long time-series Landsat image archive with the modified workflow, a reliable annual TLP dataset spanning over 30 years will become possible.This could help remedy the lack of thermokarst processes for climate and permafrost models, which will be helpful for comprehensively understanding permafrost degradation as well as carbon emissions in the Arctic.

Conclusions
In this study, a 10 m resolution TLP dataset for the Lena Basin, Russian Arctic, in the 2020 thawing season was produced based on 4902 Sentinel-2 images.The construction of the training sample dataset took full advantage of the widely used land-cover products, and it was then filtered based on a SOM network.The random forest classifier was employed for TLP identification based on various spectral, terrain, and textural features.After a series of post-processing procedures specific to TLPs, the mapping results were examined and were found to reach an OA of 93.63% and a kappa coefficient of 0.87.Since our workflow is highly automated and especially designed for mapping TLPs in permafrost zone, we are confident that our method can be easily migrated to other permafrost zone, including the circum-Arctic and the QXP.
Overall, a total of 380,477 TLPs were mapped, with their total surface area reaching ~1.23 × 10 4 km 2 and a mean surface area of 32,171 m 2 .We then analyzed the distribution of TLPs within the five sub-basins of the Lena Basin, and found that 73.28% of the TLPs are found in the Lena and Vilyuy sub-basins.The small TLPs took up most of the total count, while large TLPs dominated the surface area.Nearly 77% of the TLPs were found to be located within continuous permafrost zone, and a majority of the TLPs were found to be distributed in plain with a local elevation of less than 400 m (84.75%) and a slope angle of less than 5° (63.66%).The TLPs in the Lena Basin are mainly distributed in regions with the ALT in the range of 80-100 cm, while higher GIC and MAGT are more favorable for TLP development.
This dataset provides the first quantitative spatial distribution of TLPs in the Lena Basin.It could serve as a baseline product for investigating the interaction between TLP dynamics and permafrost degradation.It could also make up for the shortcomings of insufficient incorporation of thermokarst processes in climate and permafrost models and improve the underestimation of organic carbon emission modelling.

Notes on contributors
Yining Yu received his M.S. degree in Global Environment Change from Beijing Normal University, Beijing, China, in 2020.Currently, he is pursuing the Ph.D. degree with the School of Geospatial Engineering and Science, Sun Yatsen University.His research interests primarily focus on investigating the dynamics of key features of Arctic permafrost using a variety of remote sensing data and methods, including machine learning and time-series InSAR.

Figure 1 .
Figure 1.Overview of the study area.(a) the location of the Lena Basin (orange) in the circumpolar Arctic.(b) the Lena Basin and its five sub-basins (Lena, Vilyuy, Aldan, Olekma, and Vitim).The permafrost type is derived from the 1 km 2 permafrost zonation data(Obu et al., 2019).The base map exhibits topographic relief based on the ALOS 30 m DEM data(Tadono et al., 2016).(c) the land cover is derived from the ESA 10 m WorldCover dataset(Zanaga et al., 2021).

Figure 2 .
Figure 2. The pixel-based distribution of the available Sentinel-2 images (cloud cover less than 5%) in the Lena Basin in the 2020 thawing season.

Figure 3 .
Figure 3. Overview of the TLP mapping workflow.The workflow was divided into four main parts: data filtering and pre-processing, training dataset construction, TLP identification, and post-processing.

Figure 4 .
Figure 4. Comparison between our results (e-h) and three land-cover products, which are the HydroLAKE data set (i-l), the global surface water data set (m-p) and 10 m ESA WorldCover data set (q-t).The left column exhibits the Sentinel-2 false-color composite images (near-infrared, red, and green as RGB, a-d), which represent the real condition of TLPs.

Figure 5 .
Figure 5. (A) the spatial distribution of the mapped TLPs (pink) as well as rivers (blue) in the Lena Basin.(b)-(h) examples of the detailed TLP distribution under different ground elevations.The elevation showed in (a)-(h) was derived from the ALOS World 3D (AW3D) 30 m DEM data (Tadono et al., 2016).

Figure 6 .
Figure 6.The proportion of the count and surface area of the mapped TLPs within different intervals of longitude (a) and latitude (b).

Figure 8 .
Figure 8.The count and surface area of the TLPs within different intervals of ground elevation (m) and slope (degree).

Figure 9 .
Figure 9. Maps of three key features of permafrost and the TLP statistics: (a) active layer thickness (ALT); (b) mean annual ground temperature (MAGT); (c) ground ice content (GIC); (d) the proportion of count and surface area of the TLPs under different classes of ALT, GIC, and MAGT.

Figure 11 .
Figure 11.The importance of the 21 features used in the random forest classifier.

Fengming
Hui received his Ph.D. degree in Cartography and Geography Information System from Nanjing University, Nanjing, China, in 2007.He currently serves as the Assistant Dean of the School of Geospatial Engineering and Science, and Professor of Polar Remote Sensing and Climate Change at Sun Yat-sen University, Zhuhai, China.His research interests primarily focus on mapping physical parameters of snow and ice in polar regions using remote sensing data and polar sea ice dynamics changes.Xiao Cheng received his Ph.D. degree in Cartography and Geographical Information System from the Graduate School of Chinese Academy Sciences in 2004.He played a pivotal role in establishing the School of Geospatial Engineering and Science (SGES at Sun Yat-Sen University), where he currently serves as the Dean.He has published over 100 peer-reviewed journal papers.His research interests include polar remote sensing technology and the study of climate change in polar regions.Chong Liu received his Ph.D. degree in remote sensing science from the State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University in 2015.He is now a research associate Professor of School of Geospatial Engineering and Science, Sun Yat-sen University.Dr. Liu's research interests focus on environmental impacts of land use/cover and climate change with remote sensing techniques and methods.Yu Zhou received the B.S. and M.S. degrees in geodesy from Wuhan University, Wuhan, China, in 2010 and 2012, respectively, and the Ph.D. degree in Earth sciences from the University of Oxford, Oxford, U.K., in 2016.He is currently a Professor with the Guangdong Provincial Key Laboratory of Geodynamics and Geohazards, School of Earth Sciences and Engineering, Sun Yat-sen University, Zhuhai, China.His work focuses on using geodetic techniques to measure the deformation throughout the earthquake cycle.

Table 1 .
The detailed information of various datasets used in this study.

Table 2 .
The threshold settings for surface area of TLPs in published literature.

Table 3 .
The randomly stratified selected sample points for each land-cover type from the ESA 2020 WorldCover product within one grid cell.

Table 4 .
The statistics for the TLPs in the five sub-basins.

Table 5 .
The proportion of count and surface area of the TLPs among the four types of permafrost.