Automatic segmentation algorithm for high-spatial-resolution remote sensing images based on self-learning super-pixel convolutional network

ABSTRACT Super-pixel algorithms based on convolutional neural networks with fuzzy C-means clustering are widely used for high-spatial-resolution remote sensing image segmentation. However, such models require the number of clusters to be set manually, and the complexity of the iterative clustering process results in a low degree of automation. To address this problem, a segmentation method based on a self-learning super-pixel network (SLSP-Net) and modified automatic fuzzy clustering (MAFC) is proposed. SLSP-Net performs feature extraction, non-iterative clustering, and gradient reconstruction. A lightweight feature embedder is adopted for feature extraction, expanding the receptive field and generating multi-scale features. Automatic matching is used for non-iterative clustering, and overfitting of the network model is overcome by adaptively adjusting the gradient weight parameters, providing a better irregular super-pixel neighborhood structure. An optimized density peak algorithm is adopted for MAFC. Based on the obtained super-pixel image, this maximizes the robust decision-making interval, which enhances the automation of regional clustering. Finally, prior entropy fuzzy C-means clustering is applied to optimize the robust decision-making and obtain the final segmentation result. Experimental results show that the proposed model reduces computational complexity and achieves both automatic operation and high-quality segmentation results.

use changes (Li, Huang, and Gong 2019) and digital city evolution. Accurate segmentation is the premise and foundation for the extraction, analysis, and interpretation of ground object information, and is also the key to data mining and informatization (Pal and Pal 1993; Thakur and Nileshsingh 2013; Santos and Gosselin 2012; Sirmacek and Unsalan 2011; Troya-Galvis et al. 2015). Accurate segmentation results are of great significance for subsequent steps such as image understanding. In the industrial field, accurate segmentation improves the efficiency and accuracy of industrial production, and thereby the level of social productivity, for example through character extraction on workpiece surfaces or the segmentation of images of industrial parts. In the agricultural field, accurate segmentation results have great application potential in crop disease detection and yield prediction. In marine applications, good segmentation results provide effective information for economic development and environmental protection, such as ship extraction and marine oil spill identification.
At present, there are several low-, mid-, and high-level semantic segmentation techniques. Low-level semantic segmentation is effective for grayscale image segmentation, but can cause problems when segmenting color HSRSIs containing large amounts of ground object information, such as object edges that do not meet requirements, missing contours, and position deviations. Among mid-level semantic segmentation methods, super-pixel segmentation is considered a representative example. Mid-level semantic segmentation includes graph theory-based methods and gradient descent-based methods. These approaches optimize a target energy function by cutting or adding edges to generate super-pixel sub-images, which reduces the complexity of subsequent processing. However, spatial information is not considered in these methods, so the segmentation results are fragmented in the case of complex textures and backgrounds. Clustering can effectively merge super-pixel regions, and both K-means clustering (Liu et al. 2019b; Sinaga and Yang 2020) and fuzzy C-means (FCM) clustering (Liu and Xu 2008; Xu, Zhao, and Feng 2021) have been applied for this purpose. Because K-means clustering is extremely sensitive to the initial clustering center or membership degree, the merged ground objects are incomplete. In contrast, FCM divides pixels according to the degree to which each pixel belongs to different regions. Thus, FCM improves on the low accuracy of K-means clustering by adding iterations, albeit at the cost of increased computational complexity. Therefore, many scholars have combined the super-pixel and FCM algorithms to optimize the segmentation results. Chen, Li, and Huang (2017) used FCM clustering to merge super-pixel over-segmented regions, resulting in the effective segmentation of target regions and improving the segmentation efficiency of traditional FCM. Lei et al.
(2018) used super-pixels to define a multi-scale morphological gradient reconstruction operation, which provided a better local spatial domain for FCM and further improved the algorithm's efficiency. However, it is necessary to balance running efficiency against segmentation accuracy, so Kumar, Fred, and Varghese (2019) developed a super-pixel FCM method with spatial constraints to overcome the impact of uneven spatial information and improve the segmentation accuracy of the model. Wang et al. (2020) proposed an FCM method based on morphological reconstruction that weights the target and suppresses noise through morphological reconstruction and band conversion. Jia et al. (2020) proposed a robust self-sparse fuzzy clustering algorithm (RSSFCA) for image segmentation. RSSFCA solves the problems of outlier sensitivity and over-segmentation of ground object categories, and obtains satisfactory segmentation results at relatively low computational cost. Super-pixel pre-segmentation can effectively reduce the burden of image processing and accelerate segmentation, but it is difficult to maintain the accuracy of ground object clustering; that is, it is difficult to find a balance between spatial information and computational cost (Jia and Zhang 2014; Ji et al. 2014; Gu et al. 2018; Wang et al. 2019; Himabindhu and Anusha 2020).
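As a hedged illustration of this general super-pixel-plus-FCM strategy (a minimal sketch, not the exact formulation of any of the cited methods), the following NumPy code merges super-pixel regions by running standard fuzzy C-means on their mean feature vectors; the 15 hypothetical super-pixel means and the choice of three clusters are purely illustrative:

```python
import numpy as np

def fcm_merge_superpixels(sp_means, c=3, m=2.0, iters=30):
    """Plain fuzzy C-means over super-pixel mean features (K regions x d dims).

    Centers are seeded from evenly spaced samples; returns one hard
    cluster label per super-pixel once the soft memberships settle.
    """
    K = sp_means.shape[0]
    centers = sp_means[np.linspace(0, K - 1, c).astype(int)].copy()
    for _ in range(iters):
        # Distance of every super-pixel to every cluster center.
        d = np.linalg.norm(sp_means[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        # Membership update: u_kl proportional to d^(-2/(m-1)).
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
        # Center update: membership-weighted means.
        Um = U ** m
        centers = (Um.T @ sp_means) / Um.sum(axis=0)[:, None]
    return U.argmax(axis=1)

# Toy example: 15 super-pixels whose mean colours form three clear groups.
feats = np.vstack([np.zeros((5, 3)), np.full((5, 3), 5.0), np.full((5, 3), 10.0)])
labels = fcm_merge_superpixels(feats, c=3)
```

Operating on region means rather than raw pixels is what makes this class of methods fast: the clustering problem shrinks from millions of pixels to a few thousand super-pixels.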
High-level semantic segmentation can effectively extract low-, mid-, and high-level semantic information from images, and aids pixel classification through classifiers such as convolutional neural networks (CNNs) (Suzuki 2020), recurrent neural networks (RNNs) (Zhang, Wang, and Liu 2014), and generative adversarial networks (GANs) (Goodfellow et al. 2014). CNNs achieve feature extraction through multiple convolution and pooling operations, gradually transforming low-level rough features into high-level fine features. The high-level features are then segmented when passing through the fully connected and output layers. Compared with RNNs and GANs, CNNs have a more stable structure and offer greatly improved performance. Jampani et al. (2018) proposed a super-pixel sampling network (SSN) model that improves accuracy by clarifying the segmentation criteria. Tu et al. (2018) proposed a super-pixel model based on subdivision loss that preserves the boundaries of ground objects more effectively than traditional CNNs by improving the loss function. Suzuki (2020) developed an improved CNN model by optimizing the random initialization method, adaptively changing the number of super-pixels, and improving the running efficiency. Yang et al. (2020) used a fully convolutional network (FCN) to segment images. Their model adjusts the weights of the color and distance of ground objects, thus achieving more flexibility and regularity. Although these studies have improved CNN-based super-pixel models and obtained better segmentation results than previous super-pixel algorithms, they rely on a large number of manually labeled samples to supervise the training. Manual labeling is time-consuming and laborious in the case of large-scale and multi-target high-resolution remote sensing images, and greatly reduces the flexibility and automation of the network model.
Although many HSRSI segmentation algorithms have been proposed, there is still no unified framework that can achieve segmentation quickly and effectively. The complexity of HSRSIs and the lack of simple linear features make automatic segmentation difficult. First, image segmentation is a multi-solution problem; that is, there are multiple valid segmentations for any given image. Second, compared with mid- and low-resolution remote sensing images, the contours and shape information of ground objects are clearer and their spatial relationships are more diverse, which increases the segmentation difficulty. Therefore, developing a general segmentation framework for complex HSRSIs remains a challenging task.
To effectively preserve the segmentation accuracy of FCM and improve the automation of CNN models, a segmentation method based on a self-learning super-pixel network (SLSP-Net) combined with modified automatic fuzzy clustering (MAFC) is proposed in this paper. First, a lightweight feature embedding regulator is adopted in SLSP-Net to generate ground features, and non-iterative clustering and gradient reconstruction are used to automatically allocate the clustered pixels to obtain super-pixel over-segmented images. The density peak (DP) clustering algorithm is then optimized in MAFC as a means of reducing the similarity matrix and obtaining the robust decision-making (RDM) image. Finally, FCM clustering based on prior entropy (PE-FCM) is used to improve the RDM image, thus obtaining the final segmentation results. The main innovations of the proposed method are as follows: (1) SLSP-Net generates super-pixels through self-learning of feature extraction, non-iterative clustering, and gradient reconstruction modules, which improves the automation of the overall network model. (2) The super-pixel information is fused into the fuzzy clustering algorithm, which improves the efficiency of the algorithm and provides an effective neighborhood spatial structure for MAFC. (3) MAFC combines the effective spatial structure and prior entropy information of the image, which creates better merging of regions in complex backgrounds and helps to achieve accurate segmentation.
The remainder of this paper is organized as follows. Section 2 describes the proposed SLSP-Net over-segmentation method and the MAFC merger algorithm. In Section 3, the proposed method is applied to image segmentation tasks and its superiority over existing methods is evaluated. Section 4 discusses the results in the context of the state-of-the-art. Finally, Section 5 presents the conclusions from this study and identifies possible future directions of study.

Methods
The segmentation method proposed in this paper adopts SLSP-Net to over-segment the image through feature extraction, non-iterative clustering, and gradient reconstruction. The MAFC model is used to cluster the over-segmented results, where super-pixels are introduced into the DP algorithm to produce the RDM image, and then PE-FCM is used to improve the RDM image and obtain the final segmentation results. The overall process is shown in Figure 1.

SLSP-Net
SLSP-Net includes feature extraction, non-iterative clustering, and gradient reconstruction modules. Specifically, the feature extraction module embeds the original features into adjacent clusters, before the non-iterative clustering module assigns labels to pixels through seeds and automatically calculates the index of the seed nodes. Finally, the gradient reconstruction module adaptively adjusts the gradient of each weight parameter according to the spatial context, as shown in Figure 2.

Feature extraction module
To improve the running efficiency of the super-pixel convolutional network and reduce redundancy, two convolution layers and atrous spatial pyramid pooling (ASPP) are employed to expand the receptive field and better preserve the details of ground objects. As shown in Figure 2(A), color features and pixel positions are integrated into the ASPP structure to obtain multi-scale information (Eq. (1)), where * is the convolution operator, X ∈ ℝ^(N×5) is the input feature, X_m ∈ ℝ^(N×C_m) is the multi-scale feature, *_d is the 3×3 convolution with dilation rate d, and σ is the ReLU activation function. Two 3×3 convolutions are then applied to the multi-scale features X_m (Eq. (2)), where Z ∈ ℝ^(N×C_2) is the clustering feature and W_1 ∈ ℝ^(C_m×C_1) and W_2 ∈ ℝ^(C_1×C_2) are the parameter matrices.
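The structure described above can be sketched in PyTorch as follows. This is a hedged, minimal illustration rather than the authors' implementation: the channel widths (c_mid, c_out), the dilation rates, and the way colour and normalized (row, col) position are stacked into the 5-channel input are all assumptions:

```python
import torch
import torch.nn as nn

class LightweightEmbedder(nn.Module):
    """Sketch of the feature-extraction module: a 5-channel input
    (3 colour channels + 2-D pixel position) passes through parallel
    dilated 3x3 convolutions (an ASPP-style block), then two plain
    3x3 convolutions produce per-pixel clustering features Z."""

    def __init__(self, c_mid=16, c_out=8, dilations=(1, 2, 4)):
        super().__init__()
        # Parallel dilated branches; padding = dilation keeps spatial size.
        self.aspp = nn.ModuleList(
            nn.Conv2d(5, c_mid, 3, padding=d, dilation=d) for d in dilations
        )
        self.head = nn.Sequential(
            nn.Conv2d(c_mid * len(dilations), c_mid, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, 3, padding=1),
        )

    def forward(self, x):  # x: (B, 5, H, W)
        multi = torch.cat([torch.relu(b(x)) for b in self.aspp], dim=1)
        return self.head(multi)  # (B, c_out, H, W)

# Build the 5-channel input: random "image" plus normalized row/col grids.
img = torch.rand(1, 3, 32, 32)
yy, xx = torch.meshgrid(
    torch.linspace(0, 1, 32), torch.linspace(0, 1, 32), indexing="ij"
)
inp = torch.cat([img, yy[None, None], xx[None, None]], dim=1)
z = LightweightEmbedder()(inp)
```

Concatenating branches with different dilation rates is what yields the multi-scale feature X_m without the cost of deeper networks or larger kernels.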

Non-iterative clustering module
The extracted clustering feature Z is combined with the initial clustering center S_c to obtain the super-pixel image. Conventionally, this process must be iterated, which is time-consuming. To improve the running efficiency, a seed estimation layer (SEL) is adopted in the non-iterative clustering module to obtain the cluster center distribution S by learning an offset, and the seeds are moved to a reasonable position according to the center offset, as shown in Figure 2(C). By constraining the uniformity C(i, j, S_c) within the super-pixel (see Eq. (3)), Z adaptively calculates the cluster-constrained parameters (see Eq. (4)), giving the distance clustering image M.
where D_c(i, j) and D_Lc(i, S) are the distances from pixel i to adjacent pixel j and to cluster center S, respectively; λ and φ are parameters; and D_Lg(i, S) is the gradient distance from pixel i to center S. M is fused to low resolution as Z_k ∈ ℝ^(K×C_2), where K is the number of target super-pixels. A sigmoid activation function is used to calculate the offset of the two-dimensional vector (Eq. (5)), where W_s ∈ ℝ^(C_2×2) is the parameter matrix, F = {rr, cc} gives the two dimensions, rr is the lateral displacement ratio, and cc is the longitudinal displacement ratio.
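The seed-moving step can be sketched numerically. In this hedged illustration (the function name, the recentring to (-0.5, 0.5), and the toy seed grid are assumptions, not the paper's exact formulation), a sigmoid bounds each learned offset logit so that a seed can move anywhere within its own grid cell but not beyond it:

```python
import numpy as np

def move_seeds(seeds, offsets, cell_h, cell_w):
    """Shift initial grid seeds by learned, sigmoid-bounded offsets.

    seeds:   (K, 2) initial (row, col) seed positions on a regular grid.
    offsets: (K, 2) raw offset logits (the SEL output before activation).
    The sigmoid maps each logit into (0, 1); recentring to (-0.5, 0.5)
    and scaling by the cell size keeps every seed inside its own cell.
    """
    ratio = 1.0 / (1.0 + np.exp(-offsets)) - 0.5  # (rr, cc) in (-0.5, 0.5)
    return seeds + ratio * np.array([cell_h, cell_w])

# Two seeds on a 16x16 grid; large logits push a seed to its cell edge.
seeds = np.array([[8.0, 8.0], [8.0, 24.0]])
logits = np.array([[0.0, 10.0], [-10.0, 0.0]])
moved = move_seeds(seeds, logits, cell_h=16, cell_w=16)
```

Because the offset is computed in one forward pass, the expensive center-update iterations of classical super-pixel methods are avoided, which is the point of the non-iterative design.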

Gradient reconstruction module
To overcome model over-fitting, a gradient reconstruction method is proposed. First, a gradient adaptive layer (GAL) perceives the weight of the gradient in different feature channels. Second, a gradient bidirectional layer (GBL) generates confrontation through the spatial context to improve robustness. The GAL reconstructs the original input features according to Eq. (6), where X_r ∈ ℝ^(N×5) are the reconstructed features (mainly color and spatial features) and W_r ∈ ℝ^(C_2×5) are the parameters of the linear terms. The GAL readjusts the gradients of the weight parameters by recording the intensity of each channel, but it does not consider the spatial context of pixels. Therefore, the GBL readjusts the gradients based on the spatial context. The forward and backward propagation of the GBL can be described by Eq. (7), where R_b(·) is a pseudo-function representing the GBL, whose derivative with respect to X_{n,c} defines the backward pass. The GBL performs an identity mapping in the forward propagation stage; in the backward propagation step, it generates bidirectional gradients for different pixels n according to their contour map B_n.
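The pseudo-function behaviour of the GBL follows the same pattern as a gradient-reversal layer: identity in the forward pass, modified gradients in the backward pass. The following hedged PyTorch sketch shows that pattern only; the per-element weight here stands in for whatever contour-derived factor the GBL actually applies, and is not the paper's exact rule:

```python
import torch

class GradientScale(torch.autograd.Function):
    """Identity map in the forward pass; in the backward pass the
    gradient of each element is rescaled by a per-element weight
    (e.g. derived from a contour map B_n), mimicking the GBL's
    bidirectional gradients."""

    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(weight)
        return x.clone()  # identity mapping in forward propagation

    @staticmethod
    def backward(ctx, grad_out):
        (weight,) = ctx.saved_tensors
        # Scale (and possibly flip the sign of) each incoming gradient.
        return grad_out * weight, None

x = torch.ones(4, requires_grad=True)
w = torch.tensor([1.0, -1.0, 2.0, 0.5])  # hypothetical contour-derived weights
y = GradientScale.apply(x, w)
y.sum().backward()  # x.grad is now the weight vector itself
```

A negative weight reverses the gradient direction for that pixel, which is one way a layer can create the "confrontation" the text describes without changing the forward output at all.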

Loss functions
The overall loss function consists of two parts: the clustering loss and the reconstruction loss. The clustering loss L_c is given by Eq. (8), where l(·) is the regularization term and P̂ is the probability that pixel i is assigned to seed point k. L_c can be formulated as a regularized Kullback-Leibler divergence between the limited-range soft assignment P̂_ik and its reference distribution Q̂_ik. The reconstruction loss L_r is the key means of adjusting the weight parameters according to gradient reconstruction. By reconstructing the color and spatial features of the input image, L_r is given by Eq. (9), where L_rc is the reconstruction loss of the color features, L_rs is the reconstruction loss of the spatial features, and φ controls the weights of L_rc and L_rs. However, as the gradient is reconstructed as a bidirectional gradient, L_r can be rewritten as Eq. (10), where V_b = {n | B_n > 1} is the set of contour pixels; this form avoids over-reliance on the spatial characteristics of the pixels in V_b during the clustering process. Finally, the overall loss of the proposed method is given by Eq. (11), where β balances the two loss terms.
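The two-part objective can be sketched compactly. This is a hedged approximation under stated assumptions: MSE stands in for the paper's reconstruction terms, the first three input channels are taken as colour and the last two as position, and the φ/β weighting mirrors Eqs. (9) and (11) only in general shape:

```python
import torch
import torch.nn.functional as F

def total_loss(P, Q, x_rec, x_in, phi=0.5, beta=1.0):
    """Sketch of the two-part objective: a KL clustering term between
    the soft assignment P and its reference distribution Q, plus a
    reconstruction term split into colour (first 3) and spatial
    (last 2) channels, weighted by phi and balanced overall by beta."""
    # kl_div expects log-probabilities as input and probabilities as target.
    l_c = F.kl_div(P.clamp_min(1e-12).log(), Q, reduction="batchmean")
    l_rc = F.mse_loss(x_rec[:, :3], x_in[:, :3])  # colour reconstruction
    l_rs = F.mse_loss(x_rec[:, 3:], x_in[:, 3:])  # spatial reconstruction
    l_r = phi * l_rc + (1 - phi) * l_rs
    return l_c + beta * l_r

# Sanity check: perfect assignment and perfect reconstruction give zero loss.
P = torch.full((2, 4), 0.25)
x = torch.zeros(1, 5, 4, 4)
loss = total_loss(P, P, x, x)
```

The useful property to notice is that the clustering term vanishes exactly when P matches Q and the reconstruction term vanishes when the input is reproduced, so the gradient pressure from each part is separable.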

MAFC
MAFC is combined with SLSP-Net to obtain the over-segmented image. First, the DP algorithm is optimized and the similarity matrix is reduced. Then, decision-making images are generated and the data redundancy is reduced, which improves the operation speed. Finally, PE-FCM is used to achieve the final image segmentation, as shown in Figure 3.

Optimized DP algorithm
The DP algorithm selects the number of clusters according to decision images, but produces a huge similarity matrix. This leads to poor efficiency and ignores the spatial information of the images. Therefore, an optimized DP algorithm is proposed in which the local density ρ_I is expressed by Eq. (12), where N′ is the total number of samples in the dataset, S_J is the total number of pixels in the J-th super-pixel region, d_c is the cutoff distance between ground objects, and D_IJ is the Euclidean distance between super-pixel regions ∂_I and ∂_J. The optimized DP algorithm maps the original clustering image r_j to a new clustering image F_j, and the decision conditions z(x_e) are calculated according to Eq. (13). Under the conditions of Eq. (14), the new clustering image is obtained using Eq. (15), where w_j is the clustering condition, x_e is the clustering interval, r_j is the normalized result, and F_j is the new clustering image.
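A hedged sketch of the density-peak quantities over super-pixels follows. The Gaussian kernel and the size weighting (large regions contribute more to their neighbours' density) illustrate the general idea of Eq. (12); the exact kernel form is an assumption, and the toy features are purely illustrative:

```python
import numpy as np

def local_density(sp_feats, sp_sizes, dc=1.0):
    """Super-pixel density-peak quantities: rho is a Gaussian-kernel
    density over pairwise feature distances, with each neighbour
    weighted by the pixel count of its super-pixel; delta is the
    distance to the nearest super-pixel of higher density."""
    D = np.linalg.norm(sp_feats[:, None, :] - sp_feats[None, :, :], axis=2)
    W = np.exp(-((D / dc) ** 2)) * sp_sizes[None, :]
    np.fill_diagonal(W, 0.0)  # exclude each region's self-contribution
    rho = W.sum(axis=1)
    delta = np.zeros_like(rho)
    for i in range(len(rho)):
        higher = rho > rho[i]
        delta[i] = D[i, higher].min() if higher.any() else D[i].max()
    return rho, delta

# Three tightly packed regions plus one small outlier region.
feats = np.array([[0.0], [0.1], [0.2], [5.0]])
sizes = np.array([10, 10, 10, 1])
rho, delta = local_density(feats, sizes, dc=0.5)
```

Cluster centers are then the points with simultaneously large rho and large delta, which is what the decision image visualizes; working at the super-pixel level keeps the similarity matrix K×K instead of N×N.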

PE-FCM
By analyzing the clustering images obtained by the optimized DP algorithm, it was found that merging regions purely according to their similarity can leave the ground objects incompletely merged. Therefore, a PE-FCM algorithm is proposed, with the objective function given by Eq. (16), where (1/S_l) Σ_{p∈∂_l} x_p is the mean value of the super-pixel region ∂_l; Σ_k is the covariance matrix of variable dimensions; v_k denotes the k-th clustering center; p_k is the prior probability that the super-pixel region mean belongs to v_k; and u_kl is the membership degree of each super-pixel. By considering a Gaussian distribution and Gaussian density function, and noting that Σ_{k=1}^c u_kl and Σ_{k=1}^c p_k are both constants, Eq. (17) can be rewritten and the objective function improved to Eq. (18). Setting the derivatives of Eq. (18) with respect to π_k, v_k, and u_kl to zero yields the update equations (19)-(21). According to Eqs. (19)-(21), PE-FCM integrates the neighborhood information of the prior probability distribution and the distribution characteristics of the image, and accurately merges the ground objects to obtain the final segmented image.
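The exact forms of Eqs. (19)-(21) are those derived in this paper; purely as a hedged illustration of the general shape such alternating updates take, a typical entropy-regularized fuzzy clustering with a prior term updates the membership, center, and prior as (λ here is an assumed regularization weight, and x̄_l denotes the super-pixel mean (1/S_l) Σ_{p∈∂_l} x_p):

```latex
u_{kl} = \frac{p_k \exp\!\big(-\lVert \bar{x}_l - v_k \rVert^2 / \lambda\big)}
              {\sum_{k'=1}^{c} p_{k'} \exp\!\big(-\lVert \bar{x}_l - v_{k'} \rVert^2 / \lambda\big)},
\qquad
v_k = \frac{\sum_{l} u_{kl}\, \bar{x}_l}{\sum_{l} u_{kl}},
\qquad
p_k = \frac{1}{L} \sum_{l=1}^{L} u_{kl}
```

The prior p_k couples the clusters across iterations, which is what distinguishes this family of methods from plain FCM, where each membership depends only on distances.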

Dataset
At present, there is no published dataset for the edge segmentation of high-resolution remote sensing images; existing published datasets cover only natural scenes. Therefore, in addition to the BSDS500 dataset containing 500 images (Arbeláez et al. 2011), 200 images of size 500×500 pixels were manually constructed for the experiments. The combined dataset was split into 280 training, 140 validation, and 280 test images, with each image labeled by multiple annotators.
To verify the effectiveness of the proposed method, four types of widely used high-resolution remote sensing images were considered, and experiments were conducted across three groups with a total of 11 scenes. The images include UAV aerial remote sensing images (UAV), the WuHan building dataset (WHU) (Ji, Wei, and Lu 2018), the Object Detection in Aerial Images dataset (DOTA) (Xia et al. 2018), and domestic autonomous satellite images from Gaofen-2 (GF-2), obtained from the China Center for Resource Satellite Data and Applications (CRESDA). Detailed information about the datasets is presented in Table 1 and Figure 4, and reference images are shown in Figure 5.
As shown in Figure 4, the 11 images include common ground objects such as buildings, roads, low vegetation, trees, water bodies, cultivated land, and bare land. The details of the ground objects are clear, although their types are complicated and the distribution is dense and uneven. There is considerable image noise and redundancy, which conforms to the characteristics of HSRSIs and introduces significant challenges for accurate and rapid segmentation.
The first data group contains dense areas of buildings, including rural tile-roofed houses, urban residential houses, factories, office buildings and complex buildings. The building materials are diverse, the roof color characteristics are rich, and the shapes are very different. Moreover, in image I4, parts of the buildings are rendered in 3D stereo. Due to the shooting angle and illumination, the target textures and geometric information are relatively rich, and the differences between classes such as buildings, water bodies, and shadows are relatively small.
The second data group covers mixed areas of water body, cultivated land, and vegetation. The intraclass variance is lower than in the first group, but there are more types of ground objects, such as paddy fields, irrigated land, cultivated land, fallow land, and bare land. The internal texture distribution is rich and the boundaries are fuzzy.
The third data group covers mixed areas of ground objects. The spatial relationships among the ground objects are more chaotic, the adjacency relationships are more complex, and the scales of the ground objects are highly variable. For instance, there are long and narrow winding rivers and roads, vast and calm oceans, regular houses and terraces, and irregular cultivated land and vegetation.

Accuracy evaluation
To verify the accuracy of the experimental results, qualitative and quantitative methods are applied. The qualitative analysis compares the boundary adhesion, shape heterogeneity, and the over- and under-segmentation of the resulting images. The quantitative analysis uses nine evaluation indicators to compare and analyze the segmentation algorithms considered in this paper (see Table 2). These are divided into super-pixel over-segmentation indicators (UE, ASA, and BR) and cluster-merge indicators (R_u, P_s, P_r, R_e, DC, and JS). The actual boundaries of the ground objects are obtained by visual interpretation, as shown in Table 2.
where BR measures the degree of coincidence between the super-pixel segmentation boundary and the real ground boundary; ASA is the upper limit of the achievable target segmentation accuracy; and UE is the under-segmentation error rate, which evaluates the quality of the segmentation boundary by penalizing super-pixels that overlap multiple objects. R_u is the regional gray consistency; P_s is the peak signal-to-noise ratio; B and C are the regional gray variance and regional binary variance, respectively; MSE is the mean squared error; P_r is the segmentation accuracy (Shen et al. 2020); and R_e is the boundary recall rate (Hou et al. 2020). TP is the number of ground object pixels that are correctly segmented; FP is the number of background pixels incorrectly assigned to ground objects; and FN is the number of ground object pixels incorrectly assigned to the background. DC is the Dice similarity index (Milletari, Navab, and Ahmadi 2016) and JS is the Jaccard similarity, which evaluates region quality (Goh et al. 2021).
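The overlap-based indicators follow directly from the TP/FP/FN counts. The sketch below uses the standard definitions of precision, recall, Dice, and Jaccard; whether these coincide exactly with the paper's P_r and R_e formulas is an assumption:

```python
def segmentation_scores(tp, fp, fn):
    """Precision-style accuracy, recall, Dice coefficient (DC) and
    Jaccard similarity (JS) from pixel counts of a binary mask."""
    pr = tp / (tp + fp)            # fraction of predicted object pixels that are correct
    re = tp / (tp + fn)            # fraction of true object pixels that were found
    dc = 2 * tp / (2 * tp + fp + fn)
    js = tp / (tp + fp + fn)
    return pr, re, dc, js

# Example: 80 correct object pixels, 20 false alarms, 20 misses.
pr, re, dc, js = segmentation_scores(tp=80, fp=20, fn=20)
```

Note that DC and JS are monotonically related (DC = 2·JS/(1+JS)), so they rank methods identically; reporting both mainly aids comparison with prior work.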
The experiments were performed on a Windows 10 platform with an Intel Core i7-10700 2.9 GHz CPU and 16 GB RAM. All algorithms and experimental evaluations were implemented in PyTorch and MATLAB 2018b.

Analysis of results
To verify the effectiveness of the proposed method, simple linear iterative clustering (SLIC), LSC, MGW, SAW, SoAW, SFFCM, the automatic fuzzy clustering framework (AFCF), the robust self-sparse fuzzy clustering algorithm (RSSFCA), and super-pixel segmentation with a fully convolutional network (S-SFM) are selected for comparison. Details of these algorithms are presented in Table 3.

Test results for super-pixel over-segmentation
SLIC, LSC, MGW, and the proposed SLSP-Net method were used to segment the original images, as shown in Figures 6-8. When the texture spectrum and color information between buildings are similar, MGW struggles to distinguish complex ground objects, and the boundary adhesion rate is not high. The proposed method, LSC, and SLIC provide better expressions of the texture spectral information between objects, resulting in better attachment of the object boundaries. To determine the most effective over-segmentation method, we consider the BR, ASA, UE, and Time indices. The results are given in Table 4.
According to Table 4, SLSP-Net and SLIC are the least time-consuming and most efficient methods among the four considered in this experiment. In terms of BR, ASA, and UE, SLSP-Net is superior to the other three methods, mainly because of the high adhesion rate of ground objects. Based on the above analysis, the over-segmentation method proposed in this paper achieves high efficiency and good boundary preservation, and thus provides better pre-segmentation images for later segmentation and merging.

Test results for full-segmentation
The SAW, SoAW, SFFCM, AFCF, RSSFCA, S-SFM, and SLSP-Net methods were used to segment images in three experiments. The results are shown in Figures 9-11. Furthermore, the R u , P s , P r , and R e metrics were used to evaluate the segmentation results. The evaluation results are summarized in Tables 5-7.

Test results from experiment 1
(1) Qualitative Evaluation. Figure 9 shows the segmentation results for the first group of experimental images. This group of images mainly shows dense building areas with complex building types and sizes, rich roof colors, and diverse texture structures. Such objects are easily confused with surrounding objects, which increases the difficulty of segmentation.
According to images I1 and I3, S-SFM and RSSFCA produce serious over-segmentation and poor boundary adhesion. The segmentation shapes given by SAW, SoAW, SFFCM, and AFCF are regular and complete, but the segmentation patches are fragmented, indicating that the over-segmentation problem has not been fully solved. The proposed method completely retains small blocks such as houses and green spaces while still accurately segmenting the buildings.
According to images I2 and I4, the segmentation patches produced by AFCF, RSSFCA, and S-SFM have different sizes and shapes. Although the main building boundaries fit the segmentation boundaries well, serious over-segmentation can be observed. SAW, SoAW, and SFFCM produce serious under-segmentation and over-merging of super-pixel regions, and the segmentation produces irregular rectangles. Some positions with fuzzy boundaries are also merged incorrectly. The segmentation effect of SLSP-Net is good, with the building boundaries fitting the segmentation boundaries well; however, there are some instances where roads and buildings are merged incorrectly.
(2) Quantitative Evaluation. According to the segmentation evaluation results in Table 5, S-SFM and RSSFCA perform poorly because of their serious over-segmentation. SAW and SoAW combine built and non-built areas incorrectly, and do not improve the over-segmentation of vegetation. The segmentation of buildings by SFFCM and AFCF contains numerous noise points and the boundaries of ground object segmentation are obviously offset, but the roads have been merged quite well. Compared with the other algorithms, the proposed SLSP-Net obtains the best experimental results, demonstrating its superior performance in terms of ground object segmentation. The segmentation effect is best in experiment I1, mainly because the internal spectral details of ground objects in image I1 are simple, the boundaries are obvious, and the contours are simple, which makes it easy to distinguish different objects.

Test results from experiment 2
(1) Qualitative Evaluation. Figure 10 shows the segmentation results for the second group of experimental images. This group mainly shows vegetation and water bodies, where the intraclass variance is relatively low and there are more types of ground objects than in experiment 1, such as paddy fields, irrigated land, cultivated land, fallow land, and bare land. For these ground types, the internal texture distribution is rich and the boundaries are fuzzy.
Images I5 and I6 mainly cover water bodies and grassland. AFCF and S-SFM produce a large number of noise points when segmenting such ground types, which worsens the overall segmentation effect. The segmentations given by RSSFCA, SAW, SoAW, and AFCF fragment the water and vegetation areas into irregular rectangles, and areas with dense ground objects have been mistakenly merged. The SLSP-Net method improves on this over-segmentation, so the size and shape of the segmented objects are more consistent with the reference images. However, there are still some mis-segmentations at individual positions, such as between water bodies and vegetation.
Images I7 and I8 mainly cover farmland. S-SFM and RSSFCA exhibit high adhesion to ground objects, but produce serious over-segmentation. For SAW, SoAW, and SFFCM, many segmentation errors occur because the spectra of grassland boundary areas and farmland are similar, resulting in low accuracy. AFCF suffers from serious under-segmentation and over-merging of super-pixel regions. Although the proposed method produces some errors, the fragmentation of the segmentation has been reduced to a certain extent, which is more in line with the visual effect of segmentation. Table 6 compares the R_u, P_s, P_r, R_e, DC, and JS indices for experiment 2 across the seven full-segmentation algorithms; larger values are better.
(2) Quantitative Evaluation. Compared with the segmentation results of experiment 1, the index values are slightly lower for experiment 2. This is mainly because of the diversification of ground object types. Although ground objects of the same kind in different periods and with different functions are extremely similar, they have slightly different spectral, texture, and spatial characteristics. Additionally, the heterogeneity within each class is further increased and the texture structures are more complex, so the segmentation is more difficult. In summary, compared with the other experimental methods, SLSP-Net achieves the highest P_r, R_u, P_s, and R_e values, demonstrating the applicability of the proposed method.

Test results from experiment 3
(1) Qualitative Evaluation. Figure 11 shows the third set of data covering a mixed area of ground objects. Compared with the images in experiments 1 and 2, the spatial relationships among the ground objects are more chaotic, the adjacency relationships are more complex, and the scales of the ground objects are more variable. RSSFCA and AFCF produce a large amount of over-segmentation: because of the low spectral heterogeneity of buildings in the segmented area, it is difficult to distinguish buildings with similar spectra. Compared with RSSFCA and AFCF, the segmentation regions obtained by SAW, SoAW, S-SFM, and SFFCM are closed and independent of each other, and the segmentation accuracy is further improved. However, the problem of image over-segmentation has not been fully solved; in particular, changes in light and shade around buildings due to uneven illumination result in over-segmentation. The proposed SLSP-Net method is better able to distinguish houses from roads, and preserves small areas such as houses and green spaces relatively well. However, it also has some shortcomings, such as high sensitivity to the spectral characteristics of water, which produces a small number of segmentation fragments. SLSP-Net effectively alleviates the mis-segmentation caused by the phenomenon of 'different objects with the same spectrum and different spectra from the same objects,' and further improves the segmentation accuracy compared with the other six methods considered in this experiment.
In general, the proposed algorithm takes the uncertainty of homogeneous region features into account and enhances the expression of pixel-category uncertainty, thus improving the boundary segmentation. This ensures greater accuracy of the segmentation results and makes the size and shape of the segmented ground objects more consistent with the reference images.
(2) Quantitative Evaluation. As can be seen in Table 7, when the map width, ground object type, spatial distribution, image resolution, and contour complexity increase at the same time, SAW and SoAW both give poor segmentation accuracy. SFFCM, AFCF, RSSFCA, and S-SFM produce stable results, but are strongly affected by the ground object types. In summary, the evaluation indices for the three groups of experimental images decrease as the segmentation difficulty increases; however, compared with the other algorithms, the proposed method maintains higher values on these evaluation metrics. From the above analysis, it can be concluded that the visual quality of the segmentation results is consistent with the experimental indices, and that the segmentation quality and accuracy of SLSP-Net combined with MAFC are better than those of existing techniques.

Discussion
A segmentation method has been proposed based on a self-learning super-pixel network and modified automatic fuzzy clustering. Both the visual impression and quantitative results demonstrate that our proposed method generates better super-pixels than existing methods and has a relatively high degree of automation.
SLSP-Net offers better generalizability with much lower complexity, achieving the best BR, ASA, UE, and runtime results across a series of experiments (see Tables 5-7), which demonstrates the applicability of the proposed method. However, our SLSP-Net method still has some drawbacks. Because of the sequential training strategy, the model cannot achieve complete convergence, unlike other learning-based methods. This leads to trivial regions in the super-pixels generated by SLSP-Net, which require post-processing to remove. SLSP-Net uses a lightweight convolutional network and achieves real-time segmentation on a GPU, but the computational load is still high. Therefore, the next step is to merge the super-pixels.
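To make one of these evaluation metrics concrete, the sketch below computes ASA (achievable segmentation accuracy) for a super-pixel label map against a ground-truth map. It is a minimal reference implementation of the standard definition, not the exact evaluation code used in our experiments; under this definition, one common 'leakage' form of the under-segmentation error UE is simply 1 - ASA.

```python
import numpy as np

def asa(sp_labels, gt_labels):
    """Achievable Segmentation Accuracy: assign every super-pixel to
    the ground-truth segment it overlaps most, and count the fraction
    of pixels that are then correctly labelled.
    Assumes both label maps use contiguous non-negative integers."""
    sp, gt = sp_labels.ravel(), gt_labels.ravel()
    # joint overlap histogram: joint[i, j] = |super-pixel i ∩ segment j|
    joint = np.zeros((sp.max() + 1, gt.max() + 1), dtype=np.int64)
    np.add.at(joint, (sp, gt), 1)
    # best achievable overlap per super-pixel, normalised by pixel count
    return joint.max(axis=1).sum() / sp.size
```

Boundary recall (BR) additionally requires extracting boundary pixels and matching them within a distance tolerance, which is omitted here for brevity.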
MAFC has the ability to resist noise and effectively merge super-pixels. Figures 9 and 10 (g1-g4) show good area characteristics due to the employment of MAFC. Figure 11 (g1-g4) shows that S-SFM and our proposed method obtain similarly good segmentation results close to expectations, but the latter provides more accurate details than the former. To illustrate these experimental results further, Tables 5-7 compare the performance of the different algorithms. We can see that our proposed method obtains the best performance indices. Therefore, the proposed approach can further improve the accuracy of HSRSI segmentation and play an important role in the development of automatic HSRSI segmentation.
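For readers unfamiliar with the density-peak (DP) step inside MAFC, the following sketch shows the generic density-peak procedure with a cut-off kernel, in which the number of clusters emerges automatically from points that combine a high local density with a large distance to any denser point. This is a plain illustration of the standard DP idea, not our optimized variant with robust decision-interval maximization; the cut-off distance dc and the threshold gamma_thresh are illustrative parameters.

```python
import numpy as np

def density_peaks(X, dc, gamma_thresh):
    """Generic density-peak clustering: centres are points with both
    high local density (rho) and a large distance to any denser point
    (delta), so the cluster count need not be preset. Assumes the
    densest point qualifies as a centre."""
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    rho = (d < dc).sum(axis=1) - 1            # cut-off-kernel local density
    n = len(X)
    delta = np.full(n, d.max())               # densest point keeps d.max()
    nearest = np.zeros(n, dtype=int)
    order = np.argsort(-rho)                  # indices from densest down
    for rank, i in enumerate(order[1:], start=1):
        denser = order[:rank]
        j = denser[np.argmin(d[i, denser])]   # nearest denser point
        delta[i], nearest[i] = d[i, j], j
    centres = np.where(rho * delta > gamma_thresh)[0]
    labels = np.full(n, -1)
    labels[centres] = np.arange(len(centres))
    for i in order:                           # densest first, so nearest[i]
        if labels[i] < 0:                     # is already labelled
            labels[i] = labels[nearest[i]]
    return labels, centres
```

In MAFC the "points" would be super-pixel descriptors rather than raw pixels, and the decision threshold is derived automatically instead of being a fixed gamma_thresh.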

Conclusion
To address the problems of an insufficient degree of automation in CNN-based super-pixel models and low FCM clustering accuracy, a method that combines SLSP-Net with MAFC has been described in this paper. First, SLSP-Net accurately retains the boundaries of ground objects and obtains over-segmentation results through feature extraction, non-iterative clustering, and gradient reconstruction. Second, an optimized DP method and PE-FCM are incorporated into MAFC to further improve the accuracy of the segmented images. This allows the clustering blocks to be obtained automatically and the over-segmented regions to be merged effectively. Experimental results show that, compared with other methods, the proposed approach accurately segments small ground objects and achieves the highest level of accuracy. Moreover, it reduces the experimental complexity and exhibits good all-round performance.
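To make the final clustering step concrete, the sketch below implements plain fuzzy C-means, i.e. the classical alternating membership/centre updates without the prior-entropy regularization term that PE-FCM adds; the function name, the default initialization, and the parameter values are illustrative only.

```python
import numpy as np

def fcm(X, c, m=2.0, iters=100, eps=1e-6, init=None):
    """Plain fuzzy C-means: alternate the membership update
    u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)) and the weighted
    centre update until the centres stop moving."""
    centres = (X[:c] if init is None else init).astype(float).copy()
    for _ in range(iters):
        # distances from every sample to every centre (n x c)
        d = np.linalg.norm(X[:, None] - centres[None, :], axis=-1) + 1e-12
        u = 1.0 / d ** (2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)     # memberships sum to 1 per sample
        w = u.T ** m                          # fuzzified weights (c x n)
        new = (w @ X) / w.sum(axis=1, keepdims=True)
        done = np.linalg.norm(new - centres) < eps
        centres = new
        if done:
            break
    return u, centres
```

PE-FCM extends this objective with a prior-entropy term that regularizes the memberships; in our pipeline the samples X are super-pixel features and the cluster count c comes from the optimized DP step rather than being set manually.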
However, because MAFC relies on the color and texture characteristics commonly used to describe super-pixel regions, the morphological features of the images are ignored. Therefore, future work will focus on the inclusion of morphological features in the MAFC model to achieve high-precision automatic image segmentation.

Author contributions
Acquisition of the financial support for the project leading to this publication, H.N. and L.F. Application of statistical, mathematical, computational, or other formal techniques to analyze or synthesize study data, Z.Y., L.H., D.X., and X.W. Preparation, creation, and/or presentation of the published work by those from the original research group, specifically critical review, commentary, or revision, including pre- or post-publication stages, H.N., L.F., and Z.Y.

Data availability statement
The code used in this study is available from the corresponding author on request.

Disclosure statement
No potential conflict of interest was reported by the author(s).