RepDDNet: a fast and accurate deforestation detection model with high-resolution remote sensing imagery

ABSTRACT Forests are the largest carbon reservoir and carbon sink on earth. Thus, accurately mapping forest cover change is of great significance for achieving the global carbon neutrality goal. Accurate forest change information can be acquired by deep learning methods using high-resolution remote sensing images. However, deep-learning-based deforestation detection over large regions with high-resolution images requires huge computational resources, so there is an urgent need for a fast and accurate deforestation detection model. In this study, we propose an effective re-parameterization deforestation detection model, named RepDDNet. Unlike other existing models designed for deforestation detection, the main feature of RepDDNet is its decoupled design: the multi-branch structure used in the training stage can be converted into a plain structure in the inference stage, so computation efficiency is significantly improved at inference time while the accuracy remains unchanged. A large-scale experiment was carried out in Ankang City with 2-meter high-resolution remote sensing images (covering a total area of over 20,000 square kilometers), and the results indicate that computation efficiency can be improved by nearly 30% compared with the model without re-parameterization. Additionally, compared with other lightweight models, RepDDNet also displays a favorable trade-off between accuracy and computation efficiency.


Introduction
Forests are essential for climate resilience and global sustainable development and recovery. It is reported that the global forest area has suffered a net loss of about 180 million hectares since 1990 (Nesha et al. 2021). Dynamic forest change contributes substantially to global climate change and extreme weather outbreaks, and directly affects human life and economic development worldwide (Achard et al. 2002). Consequently, quantitative studies of global forest change have significant scientific value for forest protection.
Using remote sensing images to monitor global forest change is a very effective approach, primarily due to its low cost and short revisit period. Numerous studies have detected forest change using medium-resolution imagery (de Bem et al. 2020; Maretto et al. 2021; Mayfield et al. 2020; Torres et al. 2021; Zhang et al. 2022). Landsat images are the most popular data source for monitoring forest change because of their long historical archive and free-of-charge policy. Based on Landsat imagery, several large-scale forest change studies have been conducted, such as studies of forest change in tropical or subtropical regions (Chen et al. 2021; Fortin, Cardille, and Perez 2020). So far, the best-known forest change product is the global forest change product (GFC, Hansen Global Forest Change [Hansen et al. 2013]), which has a spatial resolution of 30 meters and covers the period from 2000 to 2022. Other excellent products, such as GlobeLand30 released by China (Chen et al. 2020), also provide forest cover at 30-meter spatial resolution. All the aforementioned methods and products give strong support to the global forest change research field.
However, a recent study indicates that the present forest change products based on medium-resolution images have large uncertainty in both location and area (Chen et al. 2020). To solve this issue, using higher-resolution images, such as 2-meter high-resolution images, to monitor deforestation is an effective way (Wang et al. 2023). However, high-resolution remote sensing imagery imposes a significant increase in computation resources, especially for advanced deep learning methods (Zhang et al. 2020). Moreover, as the spatial resolution of remote sensing images increases, the intensified 'same object with different spectra' phenomenon creates an enormous challenge for lightweight deep learning models, such as the Unet model (Ronneberger, Fischer, and Brox 2015). Therefore, a large model with strong feature extraction ability is preferable for these complicated scenarios (Zhao et al. 2017). However, the inference time of large models is also significantly increased. Although model pruning or model quantization can improve inference speed, the accuracy may be heavily decreased; moreover, advanced large deep learning models with multi-branch structures are also unfriendly to pruning or quantization (Ding et al. 2021).
Deforestation detection is a challenging task because it requires pixel-level classification, and both semantic segmentation models and change detection models can accomplish it. Recently, several carefully designed semantic segmentation models have been proposed, such as PSPnet (Zhao et al. 2017), DeepLabV3+ (Chen et al. 2018), and HRnet (Tao, Sapra, and Catanzaro 2020). All of these models achieve much higher accuracy than traditional machine learning methods such as the SVM classifier (Bovolo, Bruzzone, and Marconcini 2008) and the decision tree classifier (Tariq et al. 2022). Additionally, some excellent change detection models have also been proposed, for example, SiamFCN (Daudt, Saux, and Boulch 2018), Unet++ (Peng, Zhang, and Guan 2019), DTCDSCN (Liu et al. 2021), and ESCNet (Zhang et al. 2023). Most present semantic segmentation or change detection models focus on improving the final pixel-level segmentation accuracy but pay less attention to improving computation speed in the inference stage (Ding et al. 2021). Moreover, these models keep the same structure in the training and inference stages, and their multi-branch structures heavily increase inference time. Reducing the inference time of deep learning models is important, particularly for scenarios such as emergency relief and natural disaster grade assessment. An effective way to alleviate this issue is to train a heavyweight, high-precision deep learning model in the training stage and then, in the inference stage, use a mathematical transformation or other method to convert the heavyweight model into a lightweight one, thereby improving inference speed. Inspired by RepVGG, a decoupled deep learning model (Ding et al. 2021), we propose a new re-parameterization deforestation detection model in this paper, named RepDDNet, which is designed to maintain detection accuracy while improving computation efficiency in the inference stage.
The main contributions of this manuscript are as follows: (1) Drawing on the idea of decoupling the training stage from the inference stage, we propose a novel deforestation detection model for high-resolution imagery, named RepDDNet, which has higher inference speed and accuracy than other existing deep learning models. In addition, we designed a new deforestation detection monitoring software based on RepDDNet, which can be downloaded from https://drive.google.com/drive/folders/1xff_PnybrpBPv-Mhzn7OCtdfAHFB7OlW?usp=share_link. (2) A large-scale real-world scenario with 2 m high-resolution imagery was selected as the study area, and the final deforestation detection result proves the effectiveness of RepDDNet. Moreover, the model implementation idea can also provide a new reference for other downstream tasks on remote sensing imagery, such as semantic segmentation, object extraction, and super-resolution.

Deforestation detection models

Deep learning methods have been increasingly applied to deforestation detection (Zhao et al. 2022). The ForestNet model (Irvin et al. 2020), an encoder-decoder semantic segmentation structure, was used to investigate forest loss in Indonesia, the area with the most serious deforestation in the world. In optical remote sensing images, detection accuracy is affected by clouds and cloud shadows; to solve this problem, SAR images are an ideal data source. For example, dense time series of Sentinel-1 SAR images with an Unet model were used to map monthly forest harvesting in California, USA, and Rondonia, Brazil (Zhao et al. 2022). In existing studies, several excellent deep learning models have also been proposed for deforestation detection, such as attention-Unet (John and Zhang 2022), DeepLabV3+ (de Andrade, Mota, and da Costa 2022), and a modified Unet (Alzu'bi and Alsmadi 2022). In these studies, the Unet-style model was the most popular deforestation detection model.

Change detection models
Deforestation detection is a pixel-level segmentation task and belongs to the change detection research field. A fully convolutional neural network with an encoder-decoder structure is the most popular deep learning structure for this task (Fang et al. 2022; Peng, Zhang, and Guan 2019). Several related works demonstrate that a deeper network can achieve higher accuracy than shallow models (Chen, Qi, and Shi 2022; Wen et al. 2021; Zhang et al. 2020). The VGG model (Simonyan and Zisserman 2014) has proven to be a successful structure: it only uses a simple 3 × 3 convolutional kernel as its basic module but still obtains satisfactory classification results on the ImageNet dataset. However, due to its plain structure, VGG unavoidably faces a vanishing gradient problem as the model depth increases. To capture richer context, various multi-scale context aggregation modules have been proposed (Chen et al. 2018; Yuan, Chen, and Wang 2020). For example, the Pyramid Pooling Module (PPM) in PSPNet (Zhao et al. 2017) and the Atrous Spatial Pyramid Pooling (ASPP) module in DeepLabV3+ (Chen et al. 2018) have demonstrated their effectiveness in pixel-level segmentation tasks. In addition, attention modules have also proven effective recently, such as the PAM (position attention module) and CAM (channel attention module) (Chen and Shi 2020). However, most present attention modules require huge computation resources (John and Zhang 2022). For a pixel-level change detection task, an ideal model should focus on the changed regions, particularly hard-to-classify regions such as small objects. The OCR (object context representation) module was proposed to address this issue (Yuan, Chen, and Wang 2020).
The core idea of OCR is that each image pixel belongs to a specific object, and only pixels belonging to the same object need to be modeled together. Compared with a global attention module, OCR is more resource-efficient and can achieve higher pixel-level segmentation accuracy (Yuan, Chen, and Wang 2020). In remote sensing images, each pixel also belongs to a landcover type, such as water, roads, forests, or other landcover types. Naturally, for change detection tasks in remote sensing images, individual pixels can also be divided into two categories, changed or unchanged. Thus, using object modeling to fuse the features of changed pixels can theoretically improve model performance.
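To make the object-context idea concrete, the following is a minimal PyTorch sketch of OCR-style aggregation; the tensor shapes, the scaling factor, and the final concatenation are illustrative assumptions rather than the exact implementation used in RepDDNet.

```python
import torch
import torch.nn.functional as F

def object_context_aggregation(feats, coarse_logits):
    """Simplified OCR-style aggregation.

    feats:         (B, C, H, W) pixel features from the backbone.
    coarse_logits: (B, K, H, W) coarse scores for K object categories
                   (e.g. K = 2 for changed / unchanged).
    """
    b, c, h, w = feats.shape
    pixels = feats.flatten(2)                              # (B, C, HW)
    regions = F.softmax(coarse_logits.flatten(2), dim=-1)  # soft object regions, (B, K, HW)
    # Object region representation: weighted average of the pixels of each object.
    obj_repr = torch.bmm(regions, pixels.transpose(1, 2))  # (B, K, C)
    # Pixel-object relation: how strongly each pixel belongs to each object.
    relation = F.softmax(
        torch.bmm(pixels.transpose(1, 2), obj_repr.transpose(1, 2)) / c ** 0.5, dim=-1
    )                                                      # (B, HW, K)
    # Augment every pixel with the representation of the object it belongs to.
    context = torch.bmm(relation, obj_repr).transpose(1, 2).reshape(b, c, h, w)
    return torch.cat([feats, context], dim=1)              # (B, 2C, H, W)

# Toy usage: two object categories (changed / unchanged) on a small feature map.
out = object_context_aggregation(torch.rand(1, 64, 32, 32), torch.rand(1, 2, 32, 32))
```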

Deforestation detection dataset
The deforestation detection dataset was constructed by Wang et al. (2023). To build such a large deforestation detection dataset, several Chinese high-resolution optical remote sensing satellites were used, including 2 m GF-1, 1 m GF-2, and 2.1 m ZY-3. It is worth noting that the GF-2 images were resampled to 2 m using the nearest-neighbor resampling method. The dataset contains only three optical bands. The bi-temporal images were acquired in 2019 and 2020, respectively. The dataset was collected in the Yangtze River Economic Zone of China, including Hunan, Hubei, Jiangxi, Jiangsu, Guizhou, Guangxi, and other provinces. It contains 8330 true color images, and each sample image is 512 × 512 pixels. The ratio between the training, validation, and test datasets was 8:1:1. Model training was carried out on the training dataset, optimal model selection was conducted on the validation dataset, and the test dataset was used to compare model performance.

Study area
To test the model's performance in a real-world scenario, Ankang City, located in Shaanxi Province, China, was selected as the study area. Ankang is located in the heart of China, with a total area of about 23,000 km², and lies on the boundary between northern and southern China. The region is covered with various vegetation types, including evergreen broad-leaved forest, deciduous broad-leaved forest, and primary forest. The study area is displayed in Figure 1.

Method
In this study, a decoupled re-parameterization deforestation detection model, named RepDDNet, is proposed. Its overall structure is shown in Figure 2, in which a structural re-parameterization transformation is used to convert the multi-branch model into a plain structure in the inference stage. As shown in Figure 2, RepDDNet is an encoder-decoder structure. First, a backbone with a Siamese structure is used to extract rich context semantic features. Then, the OCR module is used to aggregate high-level context semantic features. Finally, in the decoder stage, simple nearest-neighbor upsampling with a 1 × 1 convolution kernel is applied to obtain the output semantic features. In addition, to capture small changed objects accurately and alleviate the class imbalance problem, we use a weighted cross-entropy loss and a dice loss to optimize the whole model in the training stage. Furthermore, an auxiliary loss, also consisting of a weighted cross-entropy term and a dice term, is used to speed up model training; a similar design can be found in PSPnet. To improve model performance, a concatenation operation is applied to fuse semantic features of different levels. Similar to the SegNeXt model (Guo et al. 2022), the proposed method only fuses context semantic features from stage 2 to stage 4, because low-level semantic features may compromise the segmentation performance of the whole model.
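As an illustration of the loss described above, the following is a minimal sketch of a weighted cross-entropy plus dice loss combined with an auxiliary term; the class weights, the auxiliary-loss weight of 0.4, and the smoothing constant are assumptions for the example, not values reported in the paper.

```python
import torch
import torch.nn.functional as F

def weighted_ce_dice_loss(logits, target, class_weights, smooth=1.0):
    """logits: (B, 2, H, W); target: (B, H, W) with 1 = deforested, 0 = unchanged."""
    ce = F.cross_entropy(logits, target, weight=class_weights)
    prob = torch.softmax(logits, dim=1)[:, 1]           # probability of the changed class
    tgt = target.float()
    inter = (prob * tgt).sum()
    dice = 1.0 - (2.0 * inter + smooth) / (prob.sum() + tgt.sum() + smooth)
    return ce + dice

# Toy usage: the main head and an auxiliary head are optimized jointly.
weights = torch.tensor([1.0, 5.0])                       # heavier weight on the rare changed class (assumed)
main_logits = torch.randn(2, 2, 64, 64)
aux_logits = torch.randn(2, 2, 64, 64)
label = torch.randint(0, 2, (2, 64, 64))
loss = weighted_ce_dice_loss(main_logits, label, weights) \
     + 0.4 * weighted_ce_dice_loss(aux_logits, label, weights)
```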

Deep semantic feature extraction
For pixel-level deforestation detection tasks, there are two common ways to encode the input bi-temporal images. One is image stacking (Peng, Zhang, and Guan 2019), and the other is a Siamese network structure (Chen and Shi 2020). Image stacking uses a concatenation operation to merge the two time-phase images into a single multi-band image and then treats the following steps as a semantic segmentation task. The Siamese network structure is specifically designed for change detection tasks. Several studies have indicated that the Siamese network achieves slightly higher accuracy than the image stacking method (Daudt, Saux, and Boulch 2018). Therefore, we also adopted a Siamese structure to design a new backbone for extracting deep semantic features.
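The two encoding options can be sketched as follows; a single shared convolution stands in for the real backbone, and the feature differencing used to compare the Siamese branches is one common choice assumed here for illustration.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())        # toy shared (Siamese) encoder
stack_encoder = nn.Sequential(nn.Conv2d(6, 64, 3, padding=1), nn.ReLU())  # toy encoder for stacked input

img_t1 = torch.rand(1, 3, 256, 256)
img_t2 = torch.rand(1, 3, 256, 256)

# (a) Image stacking: concatenate along the channel axis and segment the 6-band image.
stacked_feat = stack_encoder(torch.cat([img_t1, img_t2], dim=1))

# (b) Siamese encoding: the same weight-shared encoder processes each date,
# and change information comes from comparing the two feature maps.
feat_t1, feat_t2 = encoder(img_t1), encoder(img_t2)
change_feat = torch.abs(feat_t1 - feat_t2)
```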
A series of re-parameterization backbones is proposed for deep feature extraction. Considering that deforestation detection tasks are commonly run on dedicated GPU servers, such as the Tesla V100, RTX 3090, or other NVIDIA GPU devices, a relatively large model structure, RepVGG-B2g4, was selected as the backbone in this study. There are five stages in RepVGG-B2g4, and after each stage the feature map size is reduced by 1/2, so the final high-level feature size is 1/32 of the input image size. However, such a small high-level context feature size significantly limits model performance on small objects because their semantic features disappear at that scale. Therefore, the last two layers in the encoder stage were modified by changing the stride parameter of the convolution operator to 1, so that the output size of the high-level features is 1/16 of the input image size.
The basic module of the RepVGG block in RepDDNet for the training and inference stages is shown in Figure 3. The basic module in the training stage is a multi-branch structure consisting of a 3 × 3 convolution, BN (batch normalization), and a 1 × 1 convolution, while in the inference stage there is only a 3 × 3 convolution followed by an activation layer. To accomplish this conversion, the intermediate structure in Figure 3 is the key, and structural re-parameterization is introduced to implement the transformation. The mathematical form of BN, shown in formula (1), contains four parameters: mean, standard deviation, scale factor, and bias:

$$\mathrm{bn}(M, u, s, g, b)_{:,i,:,:} = (M_{:,i,:,:} - u_i)\,\frac{g_i}{s_i} + b_i \quad (1)$$

where $i$ denotes the feature channel index, $u$ denotes the mean, $s$ the standard deviation, $g$ the scale factor, and $b$ the bias. If these parameters can be fused into the convolution kernel, the model parameters will be significantly reduced and the computation speed heavily improved in the inference stage. Expanding formula (1) gives the fused kernel and bias in formulas (2) and (3):

$$W'_{i,:,:,:} = \frac{g_i}{s_i}\,W_{i,:,:,:} \quad (2)$$

$$b'_i = b_i - \frac{u_i\, g_i}{s_i} \quad (3)$$

Combining formulas (1), (2), and (3), the BN operator can be merged into the 3 × 3 convolution kernel, yielding formula (4):

$$\mathrm{bn}(M \ast W, u, s, g, b)_{:,i,:,:} = (M \ast W')_{:,i,:,:} + b'_i \quad (4)$$

The 1 × 1 convolution branch can be converted into a 3 × 3 convolution by simply zero-padding its kernel. Therefore, the whole multi-branch structure can be transformed into a plain structure, which is fast in the inference stage and also friendly to deployment in a production environment.
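A minimal PyTorch sketch of formulas (1)-(4) is given below: it folds a BN layer into its preceding convolution, zero-pads a 1 × 1 kernel to 3 × 3, and sums the branches into one equivalent kernel. The identity/shortcut branch and the grouped convolutions that RepVGG-B2g4 also uses are omitted for brevity, so this is an illustrative sketch rather than the full conversion code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold BN(mean u, std s, scale g, bias b) into the preceding conv: W' = (g/s) W, b' = b - u g / s."""
    std = torch.sqrt(bn.running_var + bn.eps)            # s
    scale = bn.weight / std                               # g / s
    fused_w = conv.weight * scale.reshape(-1, 1, 1, 1)    # formula (2)
    fused_b = bn.bias - bn.running_mean * scale           # formula (3)
    return fused_w, fused_b

def pad_1x1_to_3x3(w1x1):
    """A 1x1 kernel equals a 3x3 kernel whose border weights are zero."""
    return F.pad(w1x1, [1, 1, 1, 1])

# Merge a (3x3 conv + BN) branch and a (1x1 conv + BN) branch into one plain 3x3 conv.
conv3, bn3 = nn.Conv2d(64, 64, 3, padding=1, bias=False), nn.BatchNorm2d(64)
conv1, bn1 = nn.Conv2d(64, 64, 1, bias=False), nn.BatchNorm2d(64)
bn3.eval(), bn1.eval()

w3, b3 = fuse_conv_bn(conv3, bn3)
w1, b1 = fuse_conv_bn(conv1, bn1)
merged = nn.Conv2d(64, 64, 3, padding=1)
merged.weight.data, merged.bias.data = w3 + pad_1x1_to_3x3(w1), b3 + b1

# The plain conv reproduces the two-branch output (up to numerical precision).
x = torch.rand(1, 64, 32, 32)
with torch.no_grad():
    multi_branch = bn3(conv3(x)) + bn1(conv1(x))
    assert torch.allclose(merged(x), multi_branch, atol=1e-4)
```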

Context feature aggregation
Several excellent context feature aggregation modules, such as the PPM (Zhao et al. 2017) and the ASPP module (Chen et al. 2018), produce robust segmentation results on natural images; their common feature is that they use fixed convolution or pooling parameters to obtain semantic features at different spatial scales. For example, the dilation rates in ASPP are (1, 6, 12, 18). Although PPM and ASPP acquire satisfactory results on natural images, their performance on remote sensing images still needs further investigation. Following the modified RepVGG-B2g4 backbone, the OCR module is used to aggregate high-level context semantic features. The primary advantage of OCR is that only pixels belonging to the same object need to be modeled, which also improves segmentation performance on small objects and hard-to-classify regions. As shown in Figure 4, there are three multiplication operations in the whole structure; these operations capture global and local context semantic information, which is important for segmentation tasks because they propagate global semantic information to every pixel.

Accuracy assessment

Model accuracy is evaluated with precision, recall, F1-score, and overall accuracy (OA), which are computed from the confusion matrix:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{OA} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP indicates that the predicted pixel is a positive sample and the ground truth pixel is also a positive sample, TN indicates that the predicted pixel is a negative sample and the ground truth pixel is also a negative sample, FP indicates that a pixel that is a negative sample in the ground truth is predicted as a positive sample, and FN indicates that a pixel that is a positive sample in the ground truth is predicted as a negative sample.
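These metrics can be computed directly from a pair of binary prediction and ground-truth maps; below is a small NumPy sketch, where the random toy arrays and the epsilon guard against division by zero are illustrative additions rather than part of the paper's evaluation code.

```python
import numpy as np

def confusion_metrics(pred, gt, eps=1e-12):
    """pred, gt: binary arrays where 1 = deforested (positive) and 0 = unchanged (negative)."""
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    oa = (tp + tn) / (tp + tn + fp + fn + eps)
    return precision, recall, f1, oa

# Toy usage on random 512 x 512 masks.
pred = np.random.randint(0, 2, (512, 512))
gt = np.random.randint(0, 2, (512, 512))
print(confusion_metrics(pred, gt))
```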

Model efficiency comparison
To evaluate the computational efficiency of various deep learning models, the Theo FLOPs (theoretical floating-point operations) indicator is used (Ding et al. 2021). It should be noted that Theo FLOPs is only a reference indicator, because in practical scenarios many factors affect model speed, such as MAC (memory access cost) and computational parallelism. Thus, we also use the inference time to evaluate model efficiency in this manuscript, which is more realistic and objective. To comprehensively consider the impact of multiple indicators, a new index (IPT) was constructed to evaluate model inference efficiency, where IT is the inference time, PM is Params, and TF is Theo FLOPs.
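A sketch of how wall-clock inference time can be measured on a GPU is shown below; the warm-up runs, the toy bi-temporal model, and the explicit CUDA synchronization reflect common measurement practice and are assumptions about the protocol rather than the exact code used in this study.

```python
import time
import torch

@torch.no_grad()
def measure_inference_time(model, inputs, n_warmup=5, n_runs=10):
    """Average forward time in seconds; synchronization avoids timing only the kernel launch."""
    model.eval()
    for _ in range(n_warmup):                   # warm up cuDNN autotuning and caches
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        model(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / n_runs

# Toy stand-in for a bi-temporal change detection model and a batch of 10 image pairs.
class ToyChangeNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.head = torch.nn.Conv2d(6, 2, 3, padding=1)
    def forward(self, t1, t2):
        return self.head(torch.cat([t1, t2], dim=1))

net = ToyChangeNet()
t1, t2 = torch.rand(10, 3, 512, 512), torch.rand(10, 3, 512, 512)
print(measure_inference_time(net, (t1, t2)))
```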

Parameter setting
PyTorch 1.8.2 was selected as the deep learning framework; the operating system was Windows 10 with 64 GB DDR4 RAM and a Tesla V100 GPU with 32 GB memory. In the training stage, the learning rate was reduced to 1/10 of its value after every 10 epochs, and the batch size was set to 6. Data augmentation was used to improve the model's generalization ability, including random scale transformation, random luminance transformation, and rotation transformation. A multi-scale training strategy was also utilized to improve model performance. Concretely, during training the image size of each sample was varied from 448 to 832 pixels with an interval of 32. Only the original resolution was used in the inference stage to improve computational efficiency.
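The learning-rate schedule and multi-scale sizes described above can be sketched as follows. The toy model, optimizer type, initial learning rate, and the one-random-batch-per-epoch loop are placeholders, while the StepLR factor of 0.1 every 10 epochs and the 448-832 size range with a step of 32 follow the text.

```python
import random
import torch
import torch.nn.functional as F

model = torch.nn.Conv2d(6, 2, 3, padding=1)                 # placeholder for RepDDNet
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
# Learning rate reduced to 1/10 of its value every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

# Multi-scale training: the sample size is drawn from 448..832 pixels with an interval of 32.
scales = list(range(448, 832 + 1, 32))

for epoch in range(30):                                      # toy loop with random data
    size = random.choice(scales)
    images = torch.rand(6, 6, size, size)                    # batch size 6, stacked bi-temporal RGB
    labels = torch.randint(0, 2, (6, size, size))
    loss = F.cross_entropy(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```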

Results
The bi-temporal images of Ankang City were acquired in the second season of 2020 and 2021, respectively. Their spatial resolution is 2 m with only RGB bands, and the image size is 121,614 × 118,670 pixels. The experimental images and deforestation results are displayed in Figure 5. Although the spectral difference between the bi-temporal images is large, verifying the model's performance and spatial-temporal transferability in such a large real-world scenario is all the more convincing. As shown in Figure 5, on the whole, the locations of deforestation in Ankang City are fairly uniformly distributed, and RepDDNet can detect most deforestation regions in the study area. The deforestation area detected by RepDDNet is 14.92 km², while the ground truth area is 11.9092 km². The difference between them is primarily due to some crop rotation at forest edges being wrongly predicted as forest change by RepDDNet. An effective way to improve the accuracy is to use a land cover product, such as GlobeLand30 (Chen et al. 2015), to mask these commission alarms. Several detailed sub-region deforestation detection results in Ankang City can be seen in Figure 6 (subregions A, B, C, and D in Figure 5(c)).
In Figure 6, it can be seen that deforestation in these regions is mainly caused by human activities, such as road construction in subregions A and C, urban expansion in subregion B, and agricultural activity in subregion D. In summary, the boundaries of the extracted deforestation show satisfactory agreement with the ground truth (GT). To quantitatively evaluate model performance, we consider RepDDNet in two cases: one with re-parameterization and one without. The accuracy and efficiency results are displayed in Table 1. Compared with RepDDNet (without re-param), RepDDNet (re-param) remains the same in terms of accuracy, but its computation efficiency is improved by nearly 30%. Although the Params and FLOPs indicators of RepDDNet (re-param) do not decrease noticeably, the inference time decreases significantly. The main reason is that inference time is not only related to the number of model parameters but also strongly related to MAC and hardware optimization: the 3 × 3 convolution kernel is carefully optimized on NVIDIA GPUs, and a plain structure is much faster than a multi-branch structure. Moreover, RepDDNet differs from other model acceleration methods, such as model pruning (Liu et al. 2018) or model quantization (Zafrir et al. 2019); its main advantage is that it maintains accuracy unchanged while improving computation efficiency.

Discussion
The efficiency of RepDDNet has been demonstrated in Ankang City, but several issues need further discussion. First, what is the advantage of RepDDNet over other re-parameterization models? Second, compared with other deep learning methods, can RepDDNet still maintain its advantages in terms of efficiency and accuracy? Last but not least, RepDDNet uses OCR as its contextual feature aggregation module; how does it compare with other classic contextual feature aggregation modules? These questions are discussed in the following sections.

Compared with other re-parameterization models
What are the advantages of RepDDNet over other structural re-parameterization methods? We select two classical models for comparison: ACB (Asymmetric Conv Block) (Ding et al. 2019) and DiracNet (Sergey and Nikos 2017). The principles of these two models are as follows:

ACB module
The principle of the ACB module is also decoupling between the training stage and the inference stage. During the training stage, three convolutional kernels are used to extract semantic features simultaneously, with shapes of 3 × 3, 1 × 3, and 3 × 1, respectively. In the inference stage, the three convolution kernels can be converted into a single new 3 × 3 convolution kernel. In theory, this is equivalent to a reinforced 3 × 3 convolution. Notably, the inference stage does not add any parameters or computational effort, while the weights of the three convolution kernels are preserved.
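A minimal sketch of the ACB kernel fusion is shown below, assuming plain kernels without the per-branch BN folding that ACNet also performs: since convolution is linear, summing the outputs of the 3 × 3, 1 × 3, and 3 × 1 branches equals a single convolution with the sum of their zero-padded kernels.

```python
import torch
import torch.nn.functional as F

def fuse_acb_kernels(w3x3, w1x3, w3x1):
    """Pad the asymmetric kernels to 3x3 and add them to the square kernel."""
    w1x3_p = F.pad(w1x3, [0, 0, 1, 1])   # (out, in, 1, 3) -> (out, in, 3, 3), pad height
    w3x1_p = F.pad(w3x1, [1, 1, 0, 0])   # (out, in, 3, 1) -> (out, in, 3, 3), pad width
    return w3x3 + w1x3_p + w3x1_p

w3x3 = torch.randn(8, 4, 3, 3)
w1x3 = torch.randn(8, 4, 1, 3)
w3x1 = torch.randn(8, 4, 3, 1)
x = torch.rand(1, 4, 16, 16)

three_branches = (F.conv2d(x, w3x3, padding=1)
                  + F.conv2d(x, w1x3, padding=(0, 1))
                  + F.conv2d(x, w3x1, padding=(1, 0)))
one_branch = F.conv2d(x, fuse_acb_kernels(w3x3, w1x3, w3x1), padding=1)
assert torch.allclose(three_branches, one_branch, atol=1e-5)
```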

DiracNet
The core idea of DiracNet is to adapt the ordinary 3 × 3 convolution by adding two learnable parameters that re-scale the convolution kernel. The main difference between DiracNet and the proposed RepDDNet is that our structural re-parameterization uses different structures in the training and inference stages, whereas in DiracNet the training-stage structure is the same as the inference-stage structure.
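A simplified sketch of this Dirac parameterization is given below; the per-channel scale parameters, the weight normalization, and the initialization values are illustrative assumptions drawn from the original DiracNet formulation, not from RepDDNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiracConv3x3(nn.Module):
    """Effective kernel = alpha * delta + beta * normalize(W); identical in training and inference."""
    def __init__(self, channels):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        self.alpha = nn.Parameter(torch.ones(channels))       # scales the identity (Dirac) kernel
        self.beta = nn.Parameter(torch.full((channels,), 0.1))
        delta = torch.zeros(channels, channels, 3, 3)
        nn.init.dirac_(delta)                                  # convolution with delta is the identity mapping
        self.register_buffer('delta', delta)

    def forward(self, x):
        w_norm = F.normalize(self.weight.flatten(1), dim=1).view_as(self.weight)
        w = self.alpha.view(-1, 1, 1, 1) * self.delta + self.beta.view(-1, 1, 1, 1) * w_norm
        return F.relu(F.conv2d(x, w, padding=1))

y = DiracConv3x3(16)(torch.rand(1, 16, 32, 32))   # output shape: (1, 16, 32, 32)
```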
Specifically, for a fair comparison, the basic RepVGG block in RepDDNet is replaced with ACB or DiracNet, and the training parameters and data augmentation methods are kept unchanged. Several deforestation detection results from the test dataset were selected for qualitative comparison, as displayed in Figure 7. The overall visual results of RepDDNet are more satisfactory than those of the other methods. In terms of detailed performance, there are a few commission alarms in both the ACB and DiracNet models, especially at the edges of the changed regions.
The quantitative accuracy of the various re-parameterization models is shown in Table 2. Compared with ACB and DiracNet, RepDDNet acquires higher accuracy; its F1-score improves by about 1.24% compared with DiracNet. The overall quantitative accuracy of the different models demonstrates that the multi-branch structure can extract more effective high-level semantic features than the other re-parameterization methods. The OA indicator of all models is relatively high, primarily due to the overwhelming proportion of negative sample pixels (de Bem et al. 2020).

Compared with other deep learning models
Can RepDDNet still maintain its advantages in terms of efficiency and accuracy compared with other deep learning models? Several advanced models were selected for comparison. Since RepDDNet is a pixel-level segmentation method, which essentially belongs to the category of semantic segmentation in the field of computer vision, some classical semantic segmentation models were also selected for comparison. The principles of the different models are summarized in Table 3, and the deforestation detection results are displayed in Figure 8.
In Figure 8, from the visual aspect, ESCNet, HRnet, DeepLabV3+, and PSPnet indicated relatively satisfactory results, and all of them could detect most deforestation regions. However, there were a few omission alarms in these models because narrow road changes could not be detected. In the other models, such as Unet, SiamFCN, Unet++, and DTCDSCN, the deforestation detection results were worse than those of ESCNet, HRnet, or PSPnet in the visual comparison. For instance, there were a few omission alarms in the Unet model due to its shallower model depth; as described by He et al. (2016), a shallower model cannot extract effective high-level features. From the visual comparison, RepDDNet indicated a much better result with fewer omission and commission alarms than all other models, and a further detailed comparison can be seen in Appendix A.

A quantitative evaluation of the various models is displayed in Table 4. On the whole, compared with other excellent models, RepDDNet achieved better quantitative accuracy. In terms of the Params and FLOPs indicators, the theoretical computation of RepDDNet was the heaviest, but its inference time was better than those of ESCNet, DTCDSCN, HRnet, DeepLabV3+, and PSPnet. Compared with relatively lightweight models such as Unet, SiamFCN, and Unet++, RepDDNet indicated a trade-off between accuracy and model inference efficiency: the layers of Unet and SiamFCN are relatively shallow, and their basic module is a 3 × 3 convolution with a batch norm operation, so their computation speed is faster than that of heavy models. Although the theoretical computation of RepDDNet was the heaviest among all models, it still ran much faster than the existing heavyweight models and achieved the highest accuracy, since the 3 × 3 convolution kernel used in RepDDNet is carefully optimized on NVIDIA GPUs. The IPT indicator shows that RepDDNet achieved the best performance in computation efficiency.

Compared with other context feature aggregation modules
The proposed RepDDNet uses OCR as its contextual feature aggregation module. What are its advantages compared with other classic contextual aggregation modules, such as ASPP and PPM? An auxiliary experiment was conducted to answer this. Specifically, the OCR module in RepDDNet was replaced with the ASPP module from DeepLabV3+ or the PPM module from PSPnet, and the experiment was conducted on the test dataset. The results are shown in Figure 9.
In Figure 9, the OCR module achieved the best visual result. In terms of detailed comparison, OCR could effectively capture slight changes, for example, the change of a narrow road, whereas the PPM and ASPP modules had difficulty detecting such changes effectively. In principle, OCR is capable of modeling high-level semantic features over long distances, so it can use these rich and effective high-level features to capture change areas even when they are hard to distinguish. The quantitative evaluation of the different contextual aggregation modules is shown in Table 5. In Table 5, the OCR module achieved the best F1-score, improving by 1.30% over ASPP and 2.07% over PPM, respectively. Although the computation indicator of OCR is heavier than that of PPM or ASPP, its actual inference time was much faster than ASPP and only slightly slower than PPM, because PPM uses pooling operators, which are also optimized on NVIDIA GPUs and have no parameters to learn, making them faster than convolution operators. However, considering both accuracy and speed, OCR was still the best choice for aggregating high-level context features. The IPT indicator also supports this conclusion.

Feature extraction ability of RepDDNet
The modified RepVGG-B2g4 was used as the backbone in RepDDNet to extract deep semantic features, which has a large impact on the final change detection results. Although the aforementioned experiments demonstrated that the modified RepVGG-B2g4 is effective, the underlying mechanism of its feature extraction ability is still unclear. To further understand it, a feature visualization method was used (Han et al. 2020). In addition, to quantitatively describe the feature extraction ability of RepDDNet, several excellent backbone modules were also selected for comparison: ResNet101 (He et al. 2016) and ResNeXt-101 (Saining et al. 2017). Specifically, we replaced the modified RepVGG-B2g4 with ResNet101 or ResNeXt-101 in RepDDNet to construct new deforestation detection models. The results are displayed in Figure 10, where it can be seen that the strongest feature response is obtained by the modified RepVGG-B2g4 (the strongest feature response after the final 'differencing' operation in RepDDNet). In terms of module principles, although both ResNet101 and ResNeXt-101 use a multi-branch structure, the modified RepVGG-B2g4 is relatively wider and deeper, and recent studies have demonstrated that a deeper model has stronger feature extraction ability (Alhichri et al. 2021). How does the modified RepVGG-B2g4 compare with ResNet101 and ResNeXt-101 in terms of inference speed? A comparison experiment was conducted to validate this. As shown in Table 6, although the Params and Theo FLOPs indicators show that the modified RepVGG-B2g4 is heavier than ResNet101 or ResNeXt-101, its inference time is relatively faster, because model inference time is affected by several factors, such as MAC and hardware optimization.

Conclusion
In this paper, we have proposed a novel deforestation detection model, named RepDDNet. A very large study area in China was selected to validate its performance, and the results indicate that the inference efficiency of RepDDNet with re-parameterization is significantly improved compared with the model without re-parameterization, while the accuracy remains unchanged. Moreover, compared with other deep learning models, including re-parameterization models and segmentation models, the proposed RepDDNet also shows excellent performance.
To the best of our knowledge, RepDDNet is the first re-parameterization model designed for the deforestation detection research field. To help other researchers conveniently conduct deforestation detection on high-resolution images, we also developed new software, which is free to use and supports very large image processing. Moreover, the model design idea of RepDDNet can provide new insights for other landcover interpretation tasks, such as object detection or super-resolution on remote sensing images. In addition, we believe that RepDDNet can support SDG (Sustainable Development Goal) 13: take urgent action to combat climate change and its impacts.

Figure 2 .
Figure 2. The model structure of the proposed RepDDNet.

Figure 4 .
Figure 4. Object context representation used to aggregate deforestation semantic features at different levels.

Figure 5 .
Figure 5. Experimental images. (a-b) The bi-temporal images of Ankang City. (c) The deforestation detection result of Ankang City (the green color represents missing pixels, and the magenta represents false-detected pixels).

Figure 6 .
Figure 6. (a-d) The deforestation detection results in subregions A, B, C, and D (the green color represents missing pixels, and the magenta represents false-detected pixels).

Figure 7 .
Figure 7. Deforestation detection results of various re-parameterization models. (a) The former time-phase image, (b) the latter time-phase image, (c) ACB, (d) DiracNet, (e) RepDDNet, (f) ground truth. (The green color represents missing pixels, and the magenta represents false-detected pixels).

Figure 8 .
Figure 8. Deforestation detection results of different deep learning models (a further detailed comparison is presented in Appendix A; the green color represents missing pixels, and the magenta represents false-detected pixels).

Figure 9 .
Figure 9. Deforestation detection by various context feature aggregation modules (the green color represents missing pixels, and the magenta represents false-detected pixels).

Figure 10 .
Figure 10. Deforestation detection with different feature extraction backbones. (a) to (c): input images and ground truth. (d) to (f): deep features extracted by the different backbones.

Table 1 .
Accuracy assessment and efficiency test of the proposed RepDDNet. (The inference time, Params, and Theo FLOPs are computed with 10 images on a Tesla V100 device; the size of each image is [3, 512, 512] pixels).

Table 2 .
Accuracy assessments of different re-parameterization models.

Table 3 .
The principles of the compared models.

Table 4 .
Model parameters and accuracy assessment. (The batch size and input image size are 16 and [3, 512, 512] pixels, respectively. All indicators are calculated on a Tesla V100 device).

Table 5 .
Quantitative evaluation of different context aggregation modules. (The batch size is 16, and the image size is [3, 512, 512]).

Table 6 .
Accuracy assessment and efficiency comparison. (The batch size is 16, and the size of each image is [3, 512, 512] pixels).