An unsupervised heterogeneous change detection method based on image translation network and post-processing algorithm

ABSTRACT The change detection (CD) of heterogeneous remote sensing images is an important but challenging task. The difficulty is to obtain the change information by directly comparing the different statistical characteristics of the images acquired by different sensors. This paper proposes an unsupervised method for heterogeneous image CD based on an image domain transfer network. First, an attention mechanism is added to the Cycle-generative adversarial networks (Cycle-GANs) to obtain a more consistent feature expression by transferring bi-temporal heterogeneous images to the common domain. The Euclidean distance of the corresponding pixels is calculated in the common domain to form a difference map, and a threshold algorithm is applied to get a rough change map. Finally, the proposed adaptive Discrete Cosine Transform (DCT) algorithm reduces the noise introduced by false detection, and the final change map is obtained. The proposed method is verified on three real heterogeneous CD datasets and compared with the current state-of-the-art methods. The results show that the proposed method is accurate and robust for performing heterogeneous CD tasks.


Introduction
Change detection (CD), an important technique in remote sensing, analyzes images of the same geographic location acquired at different times and identifies changed areas from the ground information they contain (Gong et al. 2019). With rapid developments in remote sensing science and technology, real-time images of the earth's surface can be obtained easily, resulting in wider use of CD methods in applications such as urban change monitoring (Ji et al. 2019; Zhang et al. 2019b; Wu et al. 2021), natural disaster assessment (Luppino et al. 2019; Qiao et al. 2020; Kim and Lee 2020), and geological resource surveys (Hou, Wang, and Liu 2017; Yokoya, Chan, and Segl 2016).
Traditional CD methods usually generate a change map from two images of the same location acquired at different times, either through arithmetic operations such as difference (Singh 1986) or ratio (Howarth and Wickware 1981), or through algorithms such as Change Vector Analysis (CVA) (Singh and Talwar 2015) or Slow Feature Analysis (SFA) (Wu, Du, and Zhang 2014) that compute the changed regions and their features. However, traditional methods struggle with the noise caused by pseudo-changed pixels, and the extracted features express the image content only weakly, resulting in poor CD performance in complex scenarios. At present, the availability of many homogeneous remote sensing data samples has received extensive attention from researchers. Due to the flexibility of neural networks, different types of change features can be learned through nonlinear transformations. Therefore, researchers use complex networks instead of simple mathematical operations to obtain more robust and effective detection results (Luppino et al. 2021).
Current work on remote sensing CD is mainly devoted to homogeneous CD, which has achieved competitive results with the development of deep learning technology (e.g. Zhang et al. 2020; Chen and Shi 2020; Daudt, Le Saux, and Boulch 2018; Gong et al. 2016; Samadi, Akbarizadeh, and Kaabi 2019; Wang et al. 2021). Homogeneous CD methods rely on homogeneous data, that is, a set of images acquired by the same or similar sensors, such as optical to optical or synthetic aperture radar (SAR) to SAR (Gong et al. 2019). Compared with homogeneous images, heterogeneous images have different domains, different statistical distributions, and inconsistent category features (Luppino et al. 2020), which makes the CD task difficult. In practical applications, however, the acquisition speed of multi-temporal homogeneous images is limited, making it difficult to detect and effectively evaluate real-time changes caused by emergencies and natural disasters, whereas the easier availability of heterogeneous data offers a solution to such problems. For example, when geological disasters or building damage occur, the disaster situation must be assessed quickly. In such cases, remote sensing images that are homogeneous with the previously archived data may not be available, so the change information can only be obtained from bi-temporal heterogeneous images. Therefore, CD methods based on heterogeneous remote sensing images are particularly important for real-time tasks.
According to their processing steps, most deep learning-based bi-temporal heterogeneous CD methods can be classified into post-feature extraction comparison and post-domain transfer comparison. 1) The principle of post-feature extraction comparison is direct and simple to implement. First, spectral classification or target segmentation is performed on the registered images from the two dates. Then, either the classification or segmentation results are compared pixel by pixel to determine the distribution and characteristics of the change information (e.g. Wu et al. 2017; Hedhli et al. 2016; Wan, Xiang, and You 2019), or deep features are extracted from the two kinds of images by a neural network and their differences are compared to generate change maps (e.g. Saha, Ebel, and Zhu 2022; Zhang et al. 2016). However, the results of post-feature extraction comparison depend largely on the type of features and the accuracy of classification. Different ground objects have different feature expressions, and parameters need to be adjusted empirically to achieve the best results. 2) The method based on post-domain transfer comparison transfers the bi-temporal heterogeneous images into the same feature domain and generates the final change map using homogeneous CD methods. Domain transfer is generally regarded as a generalized texture synthesis problem, in which texture is extracted and transferred from the source domain to the target domain (Jing et al. 2020), modifying the style of the image while retaining its content. In the existing heterogeneous CD methods based on domain transfer, the frequently used models mainly include auto-encoders (AE), generative adversarial networks (GANs), and their variants. AE is an unsupervised deep learning model that can be effectively applied to data dimensionality reduction and feature extraction (Hinton and Salakhutdinov 2006).
To meet different task requirements, several variants of AE have been proposed, such as the denoising auto-encoder (DAE), stacked denoising auto-encoders (SDAE), and variational auto-encoders (VAE) (Doersch 2016). Liu et al. (2018) proposed an unsupervised symmetric convolutional coupling network (SCCN) that uses two SDAEs for CD on two heterogeneous images acquired on different dates by optical and radar sensors. Zhao et al. (2017) proposed an approximate symmetrical deep neural network (ASDNN) by improving SCCN and building the network with SDAE. Luppino et al. (2022) designed a Code-Aligned Auto-encoder (CAA) to transform heterogeneous remote sensing images into the same code space. In CAA, a change prior is derived in an unsupervised manner from cross-domain comparable pixel-pair affinities, which capture relational pixel information through domain-specific affinity matrices at the input, effectively realizing CD on heterogeneous data such as multi-spectral and multi-polarization images. Zhan et al. (2018) proposed an unsupervised CD method for heterogeneous SAR and optical images based on a logarithmic transformation feature learning framework using SDAE.
One of the most important methods in the domain adaptation and data transfer literature is the GAN (Goodfellow et al. 2014), an unsupervised deep learning model that realizes the transfer between two feature spaces through the mutual game between a generator and a discriminator. Gong et al. (2019) proposed a coupling translation network (CPTN) based on GAN, which converts heterogeneous images into homogeneous images and detects changes in the observation space. Niu et al. (2019) used conditional GANs to transform heterogeneous SAR and optical images into a space where the information has a more consistent representation for CD. Luppino et al. (2021) proposed a new heterogeneous CD network architecture based on Cycle-GANs, named Adversarial Cyclic Encoder Network (ACE-Net), which guarantees the correlation between the latent space and the original space by adding a cyclic consistency loss. Saha, Bovolo, and Bruzzone (2019) also achieved CD between multi-sensor and multi-temporal data in an unsupervised manner using a CycleGAN consisting of two generators. However, most of these methods use GANs or their variants directly for the domain transformation step, and it is difficult to fully extract the latent change details from few training samples. Among the existing heterogeneous CD methods, the domain transfer-based methods combined with deep learning give promising results. The domain transfer in the above methods falls mainly into two types: 1) both bi-temporal heterogeneous images are transferred to a third, common space and compared; 2) each of the two bi-temporal heterogeneous images is transferred to the feature space of the other, and the difference information is compared. We noticed that mapping the heterogeneous images to a third latent space produces an inevitable error; these errors accumulate in the next step, when the comparison is made and the change map is generated, and thus affect the CD result.
However, the domain transfer between heterogeneous images in the existing literature is only suitable for simple data, such as visible optical, infrared, and SAR images (Gong et al. 2019; Niu et al. 2019). Performing domain transfer between SAR and multispectral images requires a more complex network. In addition, almost all of the existing methods directly compare the transferred heterogeneous images and generate the final change map through a threshold algorithm. The lack of adequate noise processing and the errors caused by domain transfer make the detected change area imprecise and less compact.
In order to solve the above problems, based on the domain transfer method Cycle-GANs, this paper proposes a new CD method for bi-temporal heterogeneous remote sensing images. First, the heterogeneous image is converted to the same feature domain using the style transfer network. An attention mechanism is added to the generator and the discriminator of Cycle-GANs to speed up model convergence and effectively complete the transfer of the heterogeneous space. Secondly, the Euclidean distance between the bi-temporal images in the same feature domain is calculated channel by channel, called the feature difference matrix. Then, the maximum value of the channel dimension is taken as the difference map, and finally, a series of threshold calculations and noise reduction operations generate the CD map. The heterogeneous CD framework proposed in this paper is unsupervised, as collecting labeled data takes much time and effort (Zhan et al. 2018b) in supervised learning. Unsupervised deep learning methods are usually preferred because they do not require any supervised information about the change.
The rest of the paper is organized as follows: Section 2 reviews the theoretical background of domain transfer methods. Section 3 introduces the structure and process of the proposed method in detail. Section 4 introduces the datasets and other state-of-the-art (SOTA) methods, conducts experimental comparison and analyses the results, and conducts ablation experiments for the proposed attention and adaptive DCT module. Section 5 discusses the experimental results and Section 6 summarizes the work done in the paper.

Related works
This section mainly introduces the principle and structure of GANs and their variants, which are commonly used deep learning models in domain transfer and style transfer. In a GAN model, the powerful nonlinear expression of the neural network is used to fit different mapping functions, and back-propagation driven by the loss function continuously learns more reliable features so that the generated data come close to the target domain.
The GAN model mainly consists of a generator and a discriminator, both of which are generally implemented as deep neural networks in practical applications. The generator learns the distribution of the real images in order to generate realistic images that deceive the discriminator, while the discriminator judges the authenticity of the images it receives. With the development of GANs in domain transfer and image generation, researchers have proposed Cycle-GAN, which uses a bidirectional network to achieve efficient transfer. Considering that the cycle consistency of Cycle-GAN can improve the efficiency of domain transfer (Luppino et al. 2021; Luppino et al. 2022), Cycle-GAN with an attention mechanism is used as the transfer network for heterogeneous images; its structure is shown in Figure 1. The entire network has a dual structure with two generators, Generator (A→B) and Generator (B→A), and two discriminators, Discriminator A and Discriminator B. The model takes the input Image A from domain A and passes it to the first generator (A→B), whose task is to transfer the given image from domain A to an image in the target domain B. Discriminator B of domain B is used to judge its authenticity. The newly generated image is then passed through the other generator (B→A), whose task is to transfer it back to domain A, producing the image Cycle A, so that Cycle A is similar to the original input image. This process satisfies the cycle consistency. The entire network is trained continuously until all generators and discriminators reach a dynamic balance, at which point the transformation of the heterogeneous images is complete. The model has three main losses: adversarial loss, cyclic consistency loss, and identity loss.
The adversarial loss is applied to both mapping functions to match the distribution of the generated images with the data distribution of the target domain. For the Generator (A→B) and Discriminator B, following the standard Cycle-GAN formulation, the objective is expressed as:
$$L_{GAN}(G_{AB}, D_B, A, B) = \mathbb{E}_{y \sim p_{data}(B)}\left[\log D_B(y)\right] + \mathbb{E}_{x \sim p_{data}(A)}\left[\log\left(1 - D_B(G_{AB}(x))\right)\right]$$
where $G_{AB}$ is the generator that transforms an image of domain A into an image of domain B, $x$ and $y$ are the input images from domains A and B, respectively, and $D_B$ aims to distinguish the translated sample $G_{AB}(x)$ from the real sample $y$. During training, $G_{AB}$ and $D_B$ compete against one another: $G_{AB}$ tries to minimize this objective while $D_B$ tries to maximize it, that is, $\min_{G_{AB}} \max_{D_B} L_{GAN}(G_{AB}, D_B, A, B)$. A similar adversarial loss is introduced for the generator that maps from domain B to domain A and its discriminator: $\min_{G_{BA}} \max_{D_A} L_{GAN}(G_{BA}, D_A, B, A)$.
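The adversarial loss alone does not guarantee that the learned function can accurately map an image from domain A to domain B. The cyclic consistency loss prevents the learned mappings $G_{AB}$ and $G_{BA}$ from contradicting each other. For domain A, the image translation cycle should bring $x$ back to the original image, namely $x \rightarrow G_{AB}(x) \rightarrow G_{BA}(G_{AB}(x)) \approx x$, and symmetrically for domain B. Therefore, the cyclic consistency loss is used to incentivize this mapping; in its standard Cycle-GAN form it reads:
$$L_{cyc}(G_{AB}, G_{BA}) = \mathbb{E}_{x \sim p_{data}(A)}\left[\lVert G_{BA}(G_{AB}(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(B)}\left[\lVert G_{AB}(G_{BA}(y)) - y \rVert_1\right]$$
The generator $G_{AB}$ is used to generate images in the style of domain B. To prevent a change in the tone of the generated image, $G_{AB}(y)$ and $y$ should be as close as possible (and likewise $G_{BA}(x)$ and $x$). Therefore, the identity loss is introduced as follows:
$$L_{id}(G_{AB}, G_{BA}) = \mathbb{E}_{y \sim p_{data}(B)}\left[\lVert G_{AB}(y) - y \rVert_1\right] + \mathbb{E}_{x \sim p_{data}(A)}\left[\lVert G_{BA}(x) - x \rVert_1\right]$$
Combining the above losses, the final objective function of the transfer network is:
$$L = L_{GAN}(G_{AB}, D_B, A, B) + L_{GAN}(G_{BA}, D_A, B, A) + a\,L_{cyc}(G_{AB}, G_{BA}) + b\,L_{id}(G_{AB}, G_{BA})$$
where $a$ and $b$ are the weights of the cycle consistency loss and the identity loss, set to 10 and 5, respectively.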
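As a rough illustration of how the cycle consistency and identity terms enter the final objective with the weights 10 and 5, the following minimal sketch evaluates them for one toy image pair. It assumes an L1 distance and symmetric terms for both directions, as in the standard Cycle-GAN formulation; the toy generators `toy_g_ab` and `toy_g_ba` and all function names are hypothetical and stand in for the trained networks.

```python
def l1(u, v):
    """Mean absolute difference between two equally sized flat images."""
    return sum(abs(p - q) for p, q in zip(u, v)) / len(u)


def cycle_and_identity_terms(x, y, g_ab, g_ba):
    """Cycle and identity losses for one pair (x from domain A, y from domain B)."""
    cyc = l1(g_ba(g_ab(x)), x) + l1(g_ab(g_ba(y)), y)
    idt = l1(g_ab(y), y) + l1(g_ba(x), x)
    return cyc, idt


def total_objective(adv_ab, adv_ba, cyc, idt, a=10.0, b=5.0):
    """Weighted sum of the two adversarial terms, cycle loss, and identity loss."""
    return adv_ab + adv_ba + a * cyc + b * idt


# Hypothetical toy generators: A->B adds 1 to every pixel, B->A subtracts 1.
def toy_g_ab(img):
    return [p + 1.0 for p in img]


def toy_g_ba(img):
    return [p - 1.0 for p in img]


x = [0.0, 0.5, 1.0]   # toy image from domain A
y = [1.0, 1.5, 2.0]   # toy image from domain B
cyc, idt = cycle_and_identity_terms(x, y, toy_g_ab, toy_g_ba)
# Adversarial terms are set to 0 here only to isolate the weighted part.
total = total_objective(0.0, 0.0, cyc, idt)
```

In this toy case the two generators are exact inverses, so the cycle term vanishes, while the identity term penalizes each generator for altering an image that is already in its target domain.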

Details of proposed transformation network
The Transformation Network proposed in this paper has two sub-networks of Generator and Discriminator. A lightweight attention module is added to these two sub-networks, aiming to accelerate the convergence of the model and improve the efficiency of the heterogeneous image domain transfer.
(A) Attention module: The attention mechanism refers to a method that pays attention to the important information of the current task while ignoring irrelevant information. In recent years, it has been used widely in various deep learning tasks (Hu 2020;Zeyer et al. 2018;Wang et al. 2020), and it has achieved good results. In the image translation method based on the GAN series, the attention mechanism is introduced to realize the efficient transfer of image style, which has been confirmed in the literature (Chen et al. 2018;Emami et al. 2020;Zhang et al. 2019a). Since our method is unsupervised and has little data, it is difficult to guarantee that the transformation of the entire feature domain is fully realized in Cycle GAN. Therefore, the efficiency of domain transformation is improved by adding an attention module and, at the same time, can accelerate the convergence of the model. Inspired by the work of (Woo et al. 2018), a lightweight attention module is designed from two dimensions, channel, and space. This has been applied to the generator and the discriminator of the above transfer network to achieve more accurate transfer effects. Figure 2 shows the proposed attention module structure used in the transfer network, (a) is the channel attention module, and (b) is the spatial attention module.
The Channel Attention Module (CAM) aims to extract meaningful information from each channel in the feature map. First, each channel in the feature block is average-pooled to compress the original feature into a C × 1 × 1 vector. A one-dimensional convolution is then applied, followed by the activation function, to obtain the channel weights. The original feature is multiplied by the weight coefficients in the channel dimension to obtain a globally enhanced channel attention feature map ($M_C$), whose expression is as follows:
$$M_C(F) = \sigma\left(\mathrm{conv1d}\left(\mathrm{Avgpool}(F)\right)\right) \otimes F$$
where $F$ is the original feature map, $\otimes$ represents channel-by-channel multiplication, $\sigma(\cdot)$ is the sigmoid activation function, and conv1d and Avgpool represent the one-dimensional convolution and average pooling operations, respectively. The Spatial Attention Module (SAM) aims to extract meaningful information from each pixel in the feature map. First, the spatial elements in each feature block are max-pooled along the channel dimension, so that the feature block is compressed into a 1 × W × H feature map. The spatial weight coefficients are then formed using a two-dimensional convolution and the activation function. Each pixel position in the original feature is multiplied by its spatial weight coefficient to obtain a locally enhanced spatial attention feature map ($M_S$), whose expression is as follows:
$$M_S(F) = \sigma\left(\mathrm{conv2d}\left(\mathrm{Maxpool}(F)\right)\right) \otimes F$$
where $\otimes$ represents element-wise multiplication, and conv2d and Maxpool represent the two-dimensional convolution and maximum pooling operations, respectively. In the designed CAM and SAM, average pooling and maximum pooling are used to reduce the spatial and channel dimensions, respectively. The two modules obtain the change information after global and local enhancement. To achieve a better image translation effect, they are connected to the generator and the discriminator in different ways.
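The two attention operations described above can be sketched in plain Python on a small C × H × W feature block stored as nested lists. This is only a sketch: the convolution kernels (a size-3 kernel for the one-dimensional convolution and a 3 × 3 averaging kernel for the two-dimensional one), the zero padding, and the function names are assumptions for illustration, since in the actual network these weights are learned.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def channel_attention(feat, kernel=(0.25, 0.5, 0.25)):
    """Scale each channel by sigmoid(conv1d(avgpool(F))). Kernel is hypothetical."""
    c = len(feat)
    # Global average pooling: one value per channel (a C x 1 x 1 vector).
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    half = len(kernel) // 2
    weights = []
    for i in range(c):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < c:          # zero padding at both ends
                acc += w * pooled[idx]
        weights.append(sigmoid(acc))
    return [[[weights[i] * v for v in row] for row in feat[i]] for i in range(c)]


def spatial_attention(feat, kernel=None):
    """Scale each pixel by sigmoid(conv2d(maxpool over channels)). Kernel is hypothetical."""
    h, w = len(feat[0]), len(feat[0][0])
    if kernel is None:
        kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
    # Max pooling along the channel dimension: a single H x W map.
    pooled = [[max(ch[r][q] for ch in feat) for q in range(w)] for r in range(h)]
    half = len(kernel) // 2
    weights = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for q in range(w):
            acc = 0.0
            for dr in range(-half, half + 1):
                for dq in range(-half, half + 1):
                    rr, qq = r + dr, q + dq
                    if 0 <= rr < h and 0 <= qq < w:   # zero padding at the border
                        acc += kernel[dr + half][dq + half] * pooled[rr][qq]
            weights[r][q] = sigmoid(acc)
    return [[[weights[r][q] * ch[r][q] for q in range(w)] for r in range(h)]
            for ch in feat]


# Toy feature block: 2 channels of 2 x 2 pixels, all ones.
feat = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
m_c = channel_attention(feat)
m_s = spatial_attention(feat)
```

Both functions keep the shape of the input, so they can be inserted at any point of a feature pipeline without changing the dimensions seen by later layers.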
(B) Generator: The generator consists of encoding, transformation, and decoding stages. The structure of the generator proposed in this paper is shown in Figure 3. First, features are extracted from the input image (real image) through convolution and downsampling operations. Then, the feature vector of the image in domain A is transferred to a feature vector in domain B by the attention-based ResNet module. Finally, deconvolution layers upsample the feature vector to generate the transferred image (fake image).
In Figure 3, 'ResBlock' is the key module that realizes the feature transformation. The ResBlock is the basic unit of the ResNet backbone (He et al. 2016) and is divided into two parts: the direct mapping and the residual. We add channel attention and spatial attention to these two parts, respectively. The two parts are then added and sent to the next ResBlock for further feature transformation, forming a residual module based on the attention mechanism. Its structure is shown in Figure 4. By fusing the feature maps that have passed through channel and spatial attention, both global and local information are taken into account, and the convergence of the generator is accelerated while the image transfer is completed accurately. Nine ResBlocks are connected in series: deepening the network extracts high-dimensional features, and the larger number of attention modules enhances useful information and suppresses irrelevant information.
(C) Discriminator: The role of the discriminator is to take an image as input and predict whether it is the original image (Real image) or the output image (Fake image) of the generator. Figure 5 shows the proposed discriminator network based on the attention mechanism, which extracts features from the input image through a simple convolutional neural network to determine whether the image belongs to the specific category (real image or fake image). The essence of a discriminator is a binary classifier, in which an attention mechanism is added to improve the classification accuracy.

CD framework for transformed images
In the transformation network proposed in this paper, two heterogeneous images are transformed from a lower-dimensional feature domain to a higher-dimensional feature domain and then obtain the changed regions in the high-dimensional feature space. Since the high-dimensional feature domain contains additional information, it is helpful to detect more changed features, so the high-dimensional feature domain in the heterogeneous image pair is selected as the common domain for subsequent change detection.
Although the above transfer network can transform heterogeneous images into similar feature domains, their pixel differences may still be significant, and unnecessary noise is introduced during the transformation process. Therefore, a three-step post-processing algorithm is proposed to obtain an accurate heterogeneous change map.
1) Calculate the difference map: For each channel, we calculate the Euclidean distance at each corresponding spatial position between the image transferred to domain B (Generate image) and the real image of domain B (Image B). Then, the maximum distance over the channels is taken as the difference value of each spatial pixel to generate a single-channel difference map.
2) Threshold segmentation: A segmentation threshold on pixel intensity is determined, only the pixels of the difference map with intensity above the threshold are retained, and the difference map is thereby converted into a binary change map. The maximum between-class variance method (OTSU) (Otsu 1979) is used to determine the threshold value. It divides the image into two parts, background and foreground, according to the grayscale characteristics of the image. Variance is a measure of the uniformity of the gray distribution: the larger the inter-class variance between the background and the foreground, the greater the difference between the two parts of the image, so a segmentation with a large inter-class variance indicates a low probability of misclassification. OTSU is computationally simple, is not affected by image brightness and contrast, and does not require additional parameters compared with other segmentation methods. This method has been shown to give excellent segmentation results in the literature (Luppino et al. 2022).
3) Reduce false detection noise: Since the change area of interest in CD tasks is usually large, we filter out high-frequency noise to remove isolated small points in the binary change map to generate a more accurate final change map. A Discrete Cosine Transform (DCT)-based filter is used to remove the false detection noise from the binary change map. The DCT-based filtering method has a good noise suppression effect in the homogeneous area of the image and can retain the edges and detailed features (Ning and Ke 2012). Moreover, the calculation speed of the DCT transform is fast. Therefore, for image noise filtering with unknown noise-related parameters and non-stationary conditions, the adaptive denoising technology of the DCT transform can get excellent results.
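Steps 1) and 2) above can be sketched as follows. This is a minimal pure-Python illustration rather than the implementation used in the experiments: images are nested lists of shape C × H × W, the per-channel distance is taken as the absolute difference at each pixel, and the OTSU threshold is searched over a 256-bin histogram of the difference values. The function names and the number of bins are assumptions.

```python
def difference_map(generated, real):
    """Per-pixel maximum over channels of the per-channel distance |generated - real|."""
    h, w = len(real[0]), len(real[0][0])
    return [[max(abs(g[r][q] - t[r][q]) for g, t in zip(generated, real))
             for q in range(w)] for r in range(h)]


def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of a histogram."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return lo
    width = (hi - lo) / bins
    hist = [0] * bins
    for v in values:
        hist[min(int((v - lo) / width), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_var, best_bin = -1.0, 0
    w0, sum0 = 0, 0.0
    for i in range(bins - 1):
        w0 += hist[i]                 # pixels in the background class
        sum0 += i * hist[i]
        w1 = total - w0               # pixels in the foreground class
        if w0 == 0 or w1 == 0:
            continue
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2   # proportional to between-class variance
        if var > best_var:
            best_var, best_bin = var, i
    return lo + (best_bin + 1) * width


def binary_change_map(diff):
    """Keep only pixels whose difference exceeds the OTSU threshold."""
    t = otsu_threshold([v for row in diff for v in row])
    return [[1 if v > t else 0 for v in row] for row in diff]


# Toy example: 2 channels of 2 x 2 pixels; the bottom row has changed.
generated = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
real = [[[0.1, 0.0], [0.9, 0.0]], [[0.0, 0.2], [0.1, 1.0]]]
diff = difference_map(generated, real)
change = binary_change_map(diff)
```

For real images, a library routine such as an existing OTSU implementation would normally replace the hand-written histogram search; the sketch only makes the between-class variance criterion explicit.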
DCT is a special form of Discrete Fourier Transform (DFT) (Oktem and Ponomarenko 2007). If in the Fourier series expansion, the function to be expanded is a real even function, then only the cosine term is included in the Fourier series, and thus the DCT transform can be obtained. After DCT, the binary change map forms the DCT coefficient map. The upper left corner represents the low-frequency part. The farther away from the upper left corner, the higher the frequency. The position where the more prominent value component appears represents the frequency distribution of the image. A patch proportional to the image size is used to intercept the low-frequency part of the DCT coefficients to filter out high-frequency noise. For the size of the image patch, a series of reference values are set. According to experiments, it is found that the smaller the size of the selected image patch, the more pronounced the noise removal, but the more the loss in the details of the changed image. The larger the selected image patch, the more details of the changed image are retained, but the noise removal effect is poor. Therefore, the image patch size has become an important factor in the quality of the generated change map.
In order to select the appropriate image patch size, an adaptive method based on the pixel intensity ratio is proposed to determine this parameter. The initial patch size is 1/100 of the original image, and the proportion of white pixels in the patch is calculated and recorded. The patch is then expanded outwards in steps of five pixels. These steps are repeated for each expansion, and the proportion of white pixels in the patch is recorded each time, until the patch size reaches half of the original image. During this process, the aspect ratio of the patch is always kept consistent with that of the original image. Finally, the values obtained at each expansion are plotted, as shown in Figure 6. The abscissa represents the size of the patch, and the ordinate represents the proportion of white pixels in the patch. When the patch is small, the proportion of white pixels is high and changes drastically. As the patch size increases, the proportion decreases and tends to stabilize. The point marked in Figure 6 is where the polyline begins to flatten. Up to this point, the proportion of low-frequency information in the image decreases; beyond it, the proportion is nearly constant, so further expansion of the patch adds no new meaningful low-frequency information and instead introduces high-frequency noise. Therefore, the patch size corresponding to the point where the slope of the polyline stabilizes is considered the optimal value for restoring the low-frequency part of the image. Finally, the DCT coefficients inside the window are retained and the others are set to zero, and the inverse DCT is performed to obtain the change map with the high-frequency noise (false detection noise) removed.
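The low-pass filtering step can be sketched as below: a 2D orthonormal DCT is applied to the binary change map, only the top-left patch of coefficients is kept, and the inverse DCT is applied. The sketch rests on several assumptions that are not stated in the text: the result is re-binarized at 0.5, and the flattening point of the recorded proportion curve is picked by a simple tolerance rule. The helper names (`dct_lowpass`, `first_flat_index`) and the tolerance value are hypothetical, and the plain matrix products are only suitable for small maps.

```python
import math


def _basis(n):
    """Orthonormal DCT-II basis matrix (rows index frequency, columns index position)."""
    return [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
             * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
             for i in range(n)] for k in range(n)]


def _matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def _transpose(m):
    return [list(r) for r in zip(*m)]


def dct2(img):
    """2D DCT: transform along rows and columns with the orthonormal basis."""
    ch, cw = _basis(len(img)), _basis(len(img[0]))
    return _matmul(_matmul(ch, img), _transpose(cw))


def idct2(coef):
    """Inverse 2D DCT (the basis is orthonormal, so the inverse is the transpose)."""
    ch, cw = _basis(len(coef)), _basis(len(coef[0]))
    return _matmul(_matmul(_transpose(ch), coef), cw)


def dct_lowpass(binary_map, patch_h, patch_w, level=0.5):
    """Keep the top-left patch of DCT coefficients, invert, and re-binarize (assumed)."""
    coef = dct2(binary_map)
    kept = [[c if (r < patch_h and q < patch_w) else 0.0
             for q, c in enumerate(row)] for r, row in enumerate(coef)]
    smooth = idct2(kept)
    return [[1 if v > level else 0 for v in row] for row in smooth]


def first_flat_index(proportions, tol=0.01):
    """Hypothetical rule: first expansion step whose change in proportion is below tol."""
    for i in range(1, len(proportions)):
        if abs(proportions[i] - proportions[i - 1]) < tol:
            return i
    return len(proportions) - 1


# An isolated false-detection pixel disappears when only the DC coefficient is kept.
isolated = [[0] * 8 for _ in range(8)]
isolated[3][5] = 1
cleaned = dct_lowpass(isolated, 1, 1)
```

Keeping all coefficients reproduces the input map exactly, while shrinking the patch removes progressively more detail, which mirrors the trade-off between noise removal and loss of detail described above.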
The above heterogeneous CD post-processing algorithm is summarized in Algorithm 1, where A and B represent the spatial domains of the heterogeneous bi-temporal images; W, H, and C are the length, width, and number of channels of the image, respectively; and patch is the image block determined according to the DCT coefficients.

Experiments
In this section, the effectiveness of the proposed method is verified using three typical datasets of bi-temporal heterogeneous images. Section 4.1 introduces the experimental details. Section 4.2 introduces the bi-temporal heterogeneous datasets used in this paper for remote sensing CD. Section 4.3 introduces several state-of-the-art heterogeneous CD methods and the evaluation metrics. A comprehensive comparative analysis of the experimental results is given in Section 4.4, and Section 4.5 presents the ablation studies.

Implementation details
The proposed domain transformation network is trained and tested on a server with a Titan RTX GPU and an Intel(R) Xeon(R) W-2245 CPU (3.9 GHz, 256 GB RAM). To reduce the computational load, the entire image is cropped into small patches that serve as the input of the network. Patches of 32×32, 64×64, 128×128, and 256×256 pixels were used for training, and the optimal patch size was selected according to the experimental results. We use the PyTorch deep learning framework with the batch size set to 16 and the number of training epochs set to 200. The model is optimized with the stochastic gradient descent (SGD) algorithm with momentum 0.99 and weight decay 0.0005. The initial learning rate is 0.01, and the learning rate decreases linearly as the number of iterations increases.
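The patch cropping and the learning-rate schedule described above might look like the following sketch. Several details are assumptions, since the text does not specify them: patches are taken without overlap, border pixels that do not fill a whole patch are dropped, and the learning rate decays linearly to zero at the last step. The function names are hypothetical.

```python
def patch_corners(height, width, size):
    """Top-left corners of non-overlapping size x size patches that fit in the image."""
    return [(r, c)
            for r in range(0, height - size + 1, size)
            for c in range(0, width - size + 1, size)]


def linear_lr(step, total_steps, base_lr=0.01):
    """Learning rate that falls linearly from base_lr to 0 (end value assumed)."""
    return base_lr * max(0.0, 1.0 - step / total_steps)


# Example: a 593 x 921 image cut into 256 x 256 patches.
corners = patch_corners(593, 921, 256)
lr_start = linear_lr(0, 200)
lr_mid = linear_lr(100, 200)
lr_end = linear_lr(200, 200)
```

With a smaller patch size the same image yields many more training samples, which is one reason the choice among 32, 64, 128, and 256 pixels has to be settled experimentally.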

Datasets introduction
The CD experiments are conducted on three heterogeneous datasets, among which the California dataset is an actual case of heterogeneous CD for emergency situations. It is not easy to obtain short-term heterogeneous bi-temporal images in public datasets, so we use the proposed method to conduct experiments on two heterogeneous datasets with a long interval, Shuguang and Toulouse. It is found that the proposed method achieves excellent CD results (Section 4.4). In addition, before the CD of heterogeneous images, the co-registration of images is an indispensable work. The publisher has co-registered all datasets in this section, and the proposed method directly performs CD experiments on the registered bi-temporal heterogeneous images.
(1) California dataset: Figure 7 shows the heterogeneous remote sensing images before and after the floods in the area near California between January and February 2017. (a) is a multi-spectral image collected by Landsat 8 before the flood, including 11 spectral bands from dark blue to thermal infrared. (b) is a SAR image collected by Sentinel-1A after the flood. Its three channels are the polarizations VV and VH and the ratio between the two intensities.
(2) Shuguang dataset: Figure 8 shows the bi-temporal heterogeneous remote sensing images of Shuguang Village, Dongying City, China. (a) is the SAR image taken by the RADARSAT-2 satellite in June 2008, (b) is the optical image of the same area taken by the QuickBird satellite in September 2012, and (c) is the ground truth map of the changes. The two temporal images are registered, and both are 593×921 pixels with a spatial resolution of 8 meters. This pair was obtained from Liu et al. (2018).
(3) Toulouse dataset: Figure 9 shows the bi-temporal heterogeneous remote sensing images of Toulouse, France. (a) is the SAR image taken by the TerraSAR-X satellite in February 2009, (b) is the optical panchromatic image taken by the Pleiades satellite in July 2013, and (c) is the ground truth map of the changes. The two temporal images were registered to 4404 × 2604 pixels, and the SAR image was resampled and registered by Prendes (2015) using the optical imagery with a spatial resolution of 2 m.

Benchmark methods and evaluation metrics
In order to evaluate the effectiveness of the proposed method, we use the following benchmark methods to perform heterogeneous CD and compare their performance on the above three datasets: (1) CVA (change vector analysis) (Johnson and Kasischke 1998): A classic method for pre-classification CD. The magnitude of the change is indicated by the magnitude of the vector between the two temporal images, and a threshold on this magnitude separates the changed and unchanged areas. at the input using the field-specific affinity matrix forces the code space's alignment and reduces the impact of changing pixels on the learning target.
Four evaluation metrics are used to evaluate the performance of the proposed CD method: Precision (Pr), Recall (Re), F1 score (F1 Score), and the Kappa coefficient (Kappa). In the CD task, Pr indicates the accuracy of the detected changed pixels, and Re indicates the completeness of the changed pixels detected by the model. Both are defined in terms of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). F1 Score and Kappa reflect the model's overall performance, and a larger value indicates better performance. The four evaluation metrics are defined as:
$$\mathrm{Pr} = \frac{TP}{TP + FP}, \qquad \mathrm{Re} = \frac{TP}{TP + FN}$$
$$\mathrm{F1\ Score} = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Re}}{\mathrm{Pr} + \mathrm{Re}}, \qquad \mathrm{Kappa} = \frac{P_o - P_e}{1 - P_e}$$
In the Kappa formula, $P_o$ represents the overall accuracy (OA), and $P_e$ represents the expected agreement between the true values and the predicted values under the given category distribution (El Amin, Liu, and Wang 2017). $P_o$ and $P_e$ are expressed as follows, where $N$ is the total number of image pixels:
$$P_o = \frac{TP + TN}{N}, \qquad P_e = \frac{(TP + FP)(TP + FN) + (FN + TN)(FP + TN)}{N^2}$$
Figure 10 shows the CD process of our method on the California dataset, where (a), (b), and (c) are the T1 image, the T2 image, and the ground truth map, respectively. After the images pass through the attention-based image transfer network, the Euclidean distance between corresponding pixels of the transferred image and the original image is computed and converted into a single-channel difference map, as shown in Figure 10 (d). (e) is the binary change map after OTSU threshold segmentation, and (f) is the final change map after DCT noise reduction. Compared with (e), the final change map (f) contains significantly fewer falsely detected pixels and less noise, and it delineates the change area between the SAR and multi-spectral bi-temporal images more accurately.
Figure 11 shows the CD results obtained by different methods on the California dataset. (a) is the ground truth map, and (b)-(g) are the change maps obtained by CVA, SCCN, NPSG, INLPG, CAA, and the method proposed in this paper, respectively.
A comparison shows that CVA (Figure 11 (b)) has the worst results and can barely handle the noise from spurious changes. SCCN (Figure 11 (c)) correctly detects large-area changes, but it produces both false and missed detections for small changes. NPSG (Figure 11 (d)) fails to capture the shape of the change area, misses large areas of changed pixels, and incorrectly detects subtle background pixels as changed. INLPG (Figure 11 (e)) significantly reduces the number of incorrect detections but still misses a few small areas of change. CAA (Figure 11 (f)) and the proposed method (Figure 11 (g)) give similar results: both detect the large areas of change covered by floods, but both have difficulty detecting smaller flood areas correctly. The difference is that the proposed method yields a more compact boundary for the change area, but it merges closely located small change areas into one connected region, resulting in more incorrectly detected pixels.
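The evaluation metrics used throughout the experiments (Pr, Re, F1 Score, and Kappa) can be computed directly from a binary change map and its ground truth. The following is a minimal sketch using the standard definitions of these metrics; the function names `confusion_counts` and `cd_metrics` are hypothetical and are not taken from the paper's implementation.

```python
def confusion_counts(pred, truth):
    """Count TP, TN, FP, FN for two binary maps given as nested lists (1 = changed)."""
    tp = tn = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif not p and t:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def cd_metrics(tp, tn, fp, fn):
    """Return Pr, Re, F1 Score, and Kappa from the confusion counts."""
    n = tp + tn + fp + fn
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * pr * re / (pr + re) if pr + re else 0.0
    p_o = (tp + tn) / n  # overall accuracy
    # Expected agreement under the given category distribution.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (p_o - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"Pr": pr, "Re": re, "F1": f1, "Kappa": kappa}


if __name__ == "__main__":
    # Toy 2x2 example, not data from the paper.
    counts = confusion_counts([[1, 0], [1, 0]], [[1, 0], [0, 1]])
    print(counts, cd_metrics(*counts))
```

Because changed pixels are usually a small minority of the scene, Kappa and F1 Score are more informative here than overall accuracy alone, which is why the comparisons in this section focus on them.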

Experiments on California dataset
To quantitatively evaluate the detection results of the above methods, Table 1 lists the evaluation metrics obtained by the different methods on the California dataset. The proposed method achieves the best detection performance, with the highest Re (0.7231), F1 Score (0.6099), and Kappa (0.5764). In addition, the deep learning-based methods (SCCN, CAA, and the proposed method) perform better on the California dataset than NPSG and INLPG. This shows that deep learning networks are more competitive for CD on multi-spectral heterogeneous images with a large amount of data. Compared with the next-best methods, CAA and SCCN, the F1 Score of the proposed method increases by 1.41% and 4.57%, respectively, and the Kappa increases by 1.06% and 6.04%, respectively. However, the Pr of the proposed method is 0.17% lower than that of CAA. This is because the small filter window in the adaptive DCT algorithm leaves some falsely detected pixels unfiltered. Nevertheless, the Re of the proposed method is significantly improved, so the comprehensive metrics F1 Score and Kappa compensate for the slight drop in Pr to a certain extent. In terms of computation time, CVA is the fastest and INLPG is the slowest, while the three deep learning-based CD methods (SCCN, CAA, and the proposed method) require similar time (training and testing time combined).
Figure 12 shows the CD process of the proposed method on the Shuguang dataset, where (a) and (b) are the SAR and visible optical images, respectively, (c) is the ground truth map, (d) is the difference map obtained after the transfer network, and (e) and (f) are the change maps obtained after the post-processing steps proposed in this paper.
Figure 13 shows the CD results obtained using different methods on the Shuguang dataset. (a) is the ground truth map, and (b)-(g) are the change maps obtained by CVA, SCCN, NPSG, INLPG, CAA, and the method proposed in this paper, respectively. In the change maps, black represents TN, white represents TP, green represents FP, and red represents FN. The comparison shows that CVA (Figure 13 (b)) has the most false and missed pixels. SCCN (Figure 13 (c)) has a high false detection rate and misses some pixels. NPSG (Figure 13 (d)) has the fewest missed pixels and therefore a higher recall, but its false detection rate is high. INLPG (Figure 13 (e)) has fewer falsely detected pixels, but it misses many truly changed pixels, loses the shape of the change area, and cannot properly recognize the change from farmland to building. CAA (Figure 13 (f)) can detect the boundary of the changed area, but it misidentifies many unchanged pixels and lacks an effective post-processing step to remove fine-grained misdetection noise. The proposed method (Figure 13 (g)) has fewer missed pixels and fewer false detections, and the resulting change area has a complete boundary and a compact interior. Its change map agrees best with the ground truth map among the above methods.

Experiments on Shuguang dataset
To quantitatively evaluate the detection results of the above methods, Table 2 lists the evaluation metrics obtained by the different methods on the Shuguang dataset. The proposed method achieves the best detection performance, with the highest Pr (0.8115), F1 Score (0.8314), and Kappa (0.8241). NPSG has the highest Re (0.9297), but its Pr is significantly lower, which reduces its overall performance. Because of insufficient network fitting, the deep learning-based methods SCCN and CAA produce poor image transformation results, and they also lack effective post-processing steps, so both perform poorly. Compared with the second-best method, INLPG, the proposed method improves the comprehensive metrics F1 Score and Kappa by 9.02% and 9.35%, respectively. Table 2 also shows that CVA and NPSG are computationally faster, CAA is the slowest, and the computational complexity of SCCN, CAA, and the proposed method is approximately the same.
Figure 14 shows the CD process of the proposed method on the Toulouse dataset, where (a) and (b) are the SAR and optical panchromatic images, respectively, (c) is the ground truth map, (d) is the difference map obtained after the transfer network, and (e) and (f) are the change maps obtained after the post-processing steps proposed in this paper.
Figure 15 shows the CD results obtained by different methods on the Toulouse dataset. (a) is the ground truth map, and (b)-(g) are the change maps obtained by CVA, SCCN, NPSG, INLPG, CAA, and the proposed method, respectively. The changed targets in the Toulouse dataset are mainly buildings and roads. Since both the SAR and panchromatic images are single-band images, the spectral information is limited, which makes CD on the Toulouse dataset challenging.
The comparison shows that the change map generated by the proposed method has the fewest falsely detected pixels and that the high-frequency noise is largely removed. In terms of detection completeness, the proposed method almost fully detects the small changes along the road (bottom of Figure 15 (g)), while the other benchmark methods miss this detail. In addition, for change areas with many buildings, such as the upper left of the image, the recall of the proposed method is significantly higher than that of CVA (Figure 15 (b)), NPSG (Figure 15 (d)), INLPG (Figure 15 (e)), and CAA (Figure 15 (f)), while SCCN (Figure 15 (c)) has significantly more falsely detected pixels in other regions than the proposed method. Because of the small filter window of the adaptive DCT denoising algorithm in the Toulouse experiment, some small change areas are not detected in the resulting change map.

Experiments on Toulouse dataset
To quantitatively evaluate the performance of the above methods, Table 3 lists the evaluation metrics obtained by the different methods on the Toulouse dataset. The proposed method achieves the best results on all four evaluation metrics compared with the other benchmarks. Compared with the second-best method, SCCN, F1 Score and Kappa are improved by 3.26% and 6.52%, respectively. The proposed method can therefore also achieve competitive detection results on the challenging Toulouse dataset. On this larger dataset, the CD time of both NPSG and INLPG exceeds 50 min, while the total training and testing time of the proposed method is about 15 min.

Parameter analysis
In the proposed transfer network for heterogeneous images, the size of the input image patch affects the transfer result. Therefore, different patch sizes were compared on the different datasets, and the results are shown in Figure 16. Training the network with 64×64 patches gives the best results on the California dataset, while 128×128 patches give the best results on the Shuguang and Toulouse datasets. An analysis of the change details in the original datasets suggests that the suitable patch size is related to the area and concentration of the changed regions. The Shuguang dataset has only one changed region, which is relatively large, and the overall image is small (593×921). Similarly, the Toulouse dataset has relatively regular change areas, and both the whole image and the changed areas are larger. Therefore, both use 128×128 patches to capture multiple change details and obtain a complete CD map. The California dataset has the most scattered changed regions, each with a relatively small area, which makes the CD task difficult, so the 64×64 patch is most suitable.
In our method, the difference map of the transferred images must be a single channel in order to form a binary change map, which requires converting the multi-channel difference map into a single channel. Channel dimensions are usually reduced with simple operations, such as taking the maximum, the minimum, or the average value across channels. To determine which operation to use, the change maps produced by each dimensionality reduction method were evaluated on the three datasets. The comprehensive metrics F1 Score and Kappa are shown in Table 4. The comparison shows that the change map obtained by maximum sampling is the best on all three datasets.
A large Euclidean distance between the two images at a pixel indicates that the pixel has likely changed, so taking the maximum value across channels preserves this evidence and helps ensure the accuracy of the CD. In heterogeneous CD in particular, corresponding pixels can differ considerably regardless of whether they have changed. If minimum sampling is selected, many incorrectly detected pixels are introduced. Therefore, the change map obtained using maximum sampling is more competitive.
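The three channel reduction options compared above (maximum, minimum, and average) can be illustrated with a short sketch. It assumes that the multi-channel difference map stores the per-channel absolute difference between corresponding pixels, and that images are nested lists of shape height × width × channels; the function names `channel_difference` and `reduce_channels` are hypothetical and not part of the paper's code.

```python
def channel_difference(img_a, img_b):
    """Per-pixel, per-channel absolute difference of two H x W x C images."""
    return [
        [[abs(a - b) for a, b in zip(px_a, px_b)] for px_a, px_b in zip(row_a, row_b)]
        for row_a, row_b in zip(img_a, img_b)
    ]


def reduce_channels(diff, mode="max"):
    """Collapse an H x W x C difference map to a single-channel H x W map."""
    if mode == "max":
        reducer = max
    elif mode == "min":
        reducer = min
    elif mode == "mean":
        reducer = lambda values: sum(values) / len(values)
    else:
        raise ValueError("mode must be 'max', 'min', or 'mean'")
    return [[reducer(pixel) for pixel in row] for row in diff]


if __name__ == "__main__":
    # Toy 1x2 image pair with three channels, not data from the paper.
    t1 = [[[10, 50, 90], [20, 20, 20]]]
    t2 = [[[20, 10, 90], [20, 20, 20]]]
    diff = channel_difference(t1, t2)
    for mode in ("max", "min", "mean"):
        print(mode, reduce_channels(diff, mode))
```

In the toy pair, the first pixel differs strongly in one channel only. Maximum sampling keeps that strong response, minimum sampling discards it entirely, and averaging dilutes it, which shows how the choice of reduction changes the single-channel difference map that is later thresholded.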

Effect of adaptive DCT
The DCT denoising algorithm is improved in the post-processing step of heterogeneous CD so that it adaptively determines the filtering parameters according to the properties of the difference map. To verify the effectiveness of the adaptive DCT method, its CD results are compared on the three datasets with those of other classical denoising methods (original DCT, Gaussian filtering, mean filtering, and wavelet transform) and of morphological filtering (open and close operations). The F1 Score and Kappa metrics obtained from the CD experiment are shown in Table 5. Since determining the optimal parameters of each algorithm requires many experiments, we set the parameters of these four denoising methods empirically to reduce the time cost. The comparison shows that the proposed adaptive DCT method is effective at removing pseudo-change pixels and yields a change map that is closer to the ground truth. Because the parameters of the other algorithms are set to empirical values, their denoising performance may not be optimal. In practical applications, however, CD is expected to run automatically, quickly, and accurately rather than relying on time-consuming manual parameter tuning. In addition, the F1 Score obtained by adaptive DCT on the California dataset is lower than that of the open operation. This is because adaptive DCT misjudges some small change areas in the California dataset as noise and filters them out, which lowers the Recall of the generated change map. In future research, the proposed algorithm will be improved to detect small changed regions.
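The general idea of DCT-based denoising of a change map can be sketched as follows. This is not the adaptive algorithm proposed in the paper, whose parameter selection rule is not reproduced here: the sketch assumes that the filter is applied to the binary change map, uses a fixed, hypothetical cutoff `keep` in place of the adaptively determined parameter, and re-binarizes the result at 0.5. It uses a naive orthonormal DCT written in pure Python for clarity, and all function names are hypothetical.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        out.append(scale * total)
    return out


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        total = 0.0
        for k in range(n):
            scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
            total += scale * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
        out.append(total)
    return out


def transform_2d(grid, fn):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [fn(list(r)) for r in grid]
    cols = [fn(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dct_denoise(change_map, keep):
    """Low-pass a binary change map in the DCT domain and re-binarize it.

    Coefficients with row or column frequency index >= keep are zeroed;
    `keep` is a fixed illustrative cutoff, not the paper's adaptive parameter.
    """
    coeffs = transform_2d(change_map, dct_1d)
    for u, row in enumerate(coeffs):
        for v in range(len(row)):
            if u >= keep or v >= keep:
                row[v] = 0.0
    smooth = transform_2d(coeffs, idct_1d)
    return [[1 if value > 0.5 else 0 for value in row] for row in smooth]


if __name__ == "__main__":
    # An 8x8 map containing a single isolated "changed" pixel.
    noisy = [[0] * 8 for _ in range(8)]
    noisy[3][4] = 1
    print(dct_denoise(noisy, keep=1))
```

A smaller `keep` removes more high-frequency content, which suppresses isolated false detections but can also erase small true change areas; this is the same trade-off noted above for the California dataset.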

Effects of attention and post-processing algorithm
This paper has two main contributions: one is an image transfer network based on the attention mechanism, and the other is a post-processing algorithm for CD based on adaptive DCT. To verify the roles of the proposed attention module and post-processing algorithm in heterogeneous CD, an ablation experiment is carried out in this section. Models with and without the attention module and with and without the post-processing algorithm are trained and tested on the three datasets, and the Kappa metric is calculated for each model, as shown in Table 6. After the attention module and the post-processing algorithm are added, the CD performance of the model improves significantly. Except on the Shuguang dataset, the post-processing algorithm contributes slightly more to the improvement than the attention module. This means that the proposed post-processing step is important for improving the detection accuracy of heterogeneous CD. The attention module makes the domain transfer network pay more attention to the essential features of the images, which ensures reliable domain transfer between heterogeneous images. Therefore, the attention mechanism and the DCT noise reduction algorithm proposed in this paper jointly improve the model's performance on heterogeneous CD, at the domain transfer stage and the post-processing stage respectively, and make the model more robust.

Discussion
In this section, we discuss the experimental results of the proposed method on the above three datasets. The discussion covers the CD performance on heterogeneous images with different spatial resolutions and different spectral characteristics, and the relationship between CD performance and time cost. Table 7 shows the spatial resolution of the three datasets, the number of spectral bands of the corresponding optical images, and the performance metrics of the proposed method on these three datasets.
CD performance at different spatial resolutions. The spatial resolutions of the three heterogeneous datasets selected in this paper after registration are 30, 8, and 2 m, respectively. Combined with the Kappa metric in Tables 1-3, the proposed method has a larger advantage over the other methods on the datasets with 8 and 2 m resolution. The Kappa of the proposed method on dataset 1 (30 m resolution) is 1.06% higher than that of the second-best method, while on dataset 2 (8 m resolution) and dataset 3 (2 m resolution) it outperforms the second-best method by 9.35% and 6.53%, respectively. This is because, during domain transformation, a higher spatial resolution provides the transformation network with richer ground object information, which improves the efficiency of the spatial domain transformation. Therefore, our method is more competitive for CD on medium- to high-resolution heterogeneous images.
CD performance with different spectral features. In the three heterogeneous datasets, the optical images have 11 (multi-spectral), 3 (RGB), and 1 (panchromatic) spectral bands, respectively. According to the experimental results, our method performs best on dataset 2, whose optical image has three bands. Generally speaking, spatial resolution and spectral resolution constrain each other, and the method in this paper achieves its best CD results when the two are balanced. In addition, the reference change maps of the three datasets show that the change area in dataset 2 has the most regular shape, while the changed pixels in the other two datasets are more scattered, so the behavior of our method is consistent with human subjective judgment of the change area. That is, dataset 2, which has the simplest change distribution, has the best CD performance.
The relationship between CD performance and time cost. The time cost of the proposed method on the three datasets is 1017.60 s, 387.65 s, and 908.76 s, respectively. Combined with the time cost of the other methods in Tables 1-3, the time cost of the proposed method is generally higher. Except on dataset 2, the time cost of our method is higher than that of the CAA method, but its F1 Score and Kappa are the best on all three datasets. Compared with the CAA method, the proposed method adds a series of post-processing steps, among which dimensionality reduction and denoising slow the algorithm to a certain extent. We therefore consider it reasonable to trade a small amount of time for the improved accuracy of the CD results.

Conclusions
This work studies the CD of bi-temporal heterogeneous remote sensing images. Because heterogeneous data are acquired with different imaging mechanisms, their image features do not lie in the same feature domain, so it is difficult to detect the changes between two heterogeneous bi-temporal images through direct comparison. An unsupervised method for heterogeneous CD based on a Cycle-GAN image transfer network was proposed. First, an attention mechanism is introduced to improve the domain transfer of Cycle-GANs so that heterogeneous images can be compared in the same feature domain. Second, a post-processing algorithm is designed, and the final change map is obtained through dimensionality reduction of the difference map, threshold segmentation, and adaptive DCT noise reduction. Experiments on three different types of heterogeneous image datasets yielded competitive detection results, which means that the proposed method can effectively detect change information in optical and SAR heterogeneous images. Compared with current SOTA algorithms, the proposed heterogeneous CD method has higher accuracy and stronger robustness. The optimal patch size for training the transfer network and the dimensionality reduction scheme for the difference map were determined through parameter analysis. The ablation experiments further verified the denoising effectiveness of the proposed adaptive DCT and the effectiveness of the attention module and the post-processing algorithm. In future work, we will focus on reducing the time cost of the algorithm, study CD methods better suited to medium- and low-resolution heterogeneous images, and further improve the efficiency of heterogeneous CD.