Towards Targeted Change Detection with Heterogeneous Remote Sensing Images for Forest Mortality Mapping

Several generic methods have recently been developed for change detection in heterogeneous remote sensing data, such as images from synthetic aperture radar (SAR) and multispectral radiometers. However, these are not well suited to detect weak signatures of certain disturbances of ecological systems. To resolve this problem we propose a new approach based on image-to-image translation and one-class classification (OCC). We aim to map forest mortality caused by an outbreak of geometrid moths in a sparsely forested forest-tundra ecotone using multisource satellite images. The images preceding and following the event are collected by Landsat-5 and RADARSAT-2, respectively. Using a recent deep learning method for change-aware image translation, we compute difference images in both satellites' respective domains. These differences are stacked with the original pre- and post-event images and passed to an OCC trained on a small sample from the targeted change class. The classifier produces a credible map of the complex pattern of forest mortality.


Introduction
The forest-tundra ecotone, the sparsely forested transition zone between northern-boreal forest and low arctic tundra, is changing rapidly with a warming climate (CAFF 2013). In particular, changes in the distribution of woody vegetation cover through shrub encroachment, tree line advance, and altered pressure from browsers and forest pests modify the structural and functional attributes of the forest-tundra ecotone, with implications for biodiversity and regional climate feedbacks.
In Northern Norway, mountain birch (Betula pubescens var. pumila) forms the treeline ecotone towards the treeless Low Arctic tundra. In this region, periodic outbreaks of forest defoliators are the most important natural disturbance factor and the only cause of large-scale forest die-off. In recent decades, such pest outbreaks have intensified due to range expansions thought to be linked to more benign climatic conditions (Jepsen et al. 2008; Jepsen et al. 2011).
Today, two species of geometrid moths, the autumnal moth (Epirrita autumnata) and the winter moth (Operophtera brumata), have overlapping population outbreaks at approximately decadal intervals, some of which cause regional-scale defoliation and tree and shrub mortality in the forest-tundra ecotone. Outbreaks can thus lead to a reduction in forested areas as well as cascading effects on other species (Biuw et al. 2014; Henden et al. 2020; Pedersen et al. 2021).
The Climate-ecological Observatory for Arctic Tundra (COAT) is an adaptive, long-term monitoring program aimed at documenting climate change impacts in Arctic tundra and treeline ecosystems in Arctic Norway. One of COAT's monitoring sites, shown in Figure 1, is located near lake Polmak, partially on the Norwegian and partially on the Finnish side of the border (28.0°E, 70.0°N). The chosen study site is subject to different reindeer herding regimes: the area on the Finnish side of the border is grazed all year round (but mostly during summer), while the Norwegian side is mainly winter grazed (Biuw et al. 2014). The site's subarctic birch forest suffered a major outbreak of both the autumnal moth and the winter moth between 2006 and 2008, with effects that are still clearly visible in the form of high stem mortality (Biuw et al. 2014).
Remote sensing imagery is an important tool to observe and understand changes in the forest-tundra ecotone, both for large-scale monitoring and for mapping on a local scale. In this work, we develop a method to find areas with forest mortality after the geometrid moth outbreak, based on satellite images and limited ground reference data. This is a challenging task for several reasons, of which three significant factors stand out and guide our approach to the problem. Firstly, there are few remote sensing images available from our study site. This is due to the high cloud coverage at the high latitudes of subarctic Fennoscandia, which limits the imaging opportunities of optical satellites. The available cloud-free optical images are relatively few and far between, and consecutive images are often from different sensors. Synthetic aperture radar (SAR) is an active sensor largely uninfluenced by clouds, which can be utilised to monitor defoliation and deforestation (Perbet et al. 2019; Bae et al. 2021). However, planned acquisition of SAR images of the Polmak study site did not start until after the outbreak. Detecting changes between images from different modalities (e.g. SAR and optical), and even between two images from different sensors of the same modality, is very challenging. If these challenges can be overcome, heterogeneous change detection would enable us to use all available historical data sources for long-term monitoring and increase the temporal resolution and responsiveness of the analysis. The second factor is that changes in canopy state are difficult to detect in medium-resolution imagery. At this scale we do not observe the aggregated landscape-level effect as in low-resolution satellite images, nor are the individual canopies visible as in high-resolution aerial photographs.
For optical images, the loss of "greenness" or normalised difference vegetation index (NDVI) response caused by forest mortality can be offset by the understorey vegetation that becomes increasingly visible as the canopy disappears. For SAR imagery, the change in scattering mechanisms may help detect forest mortality. However, depending on the forest density, these changes can be very subtle compared to other changes in the scene. Thus, when using unsupervised change detection methods, the presence of other man-made or natural changes can easily drown out less distinct signs of forest mortality. Unsupervised methods tend to detect these strong change signatures and attenuate the weaker ones. Masking of pixels susceptible to such changes by manual inspection, creation of detailed databases of such areas, or pre-classification of images would add another layer of complexity to the task. The accuracy of the final detection result would also be very dependent on the ability of the masking operation to find all relevant areas, and the spatial resolution would be limited to that of the mask. Furthermore, masking does not prevent other uninteresting changes in vegetation (i.e. not related to canopy state) from appearing in an unsupervised result. We therefore create a targeted change detection algorithm to learn the change signature for forest mortality based on field data. The third factor is that learning these signatures with a supervised approach requires labelled data to train the classifier. While an ecological, field-based monitoring of forest structure and forest regeneration in the Polmak study site was started in 2011, the scale of the field data is unsuitable to generate training labels for medium-resolution satellite images, and does not cover the full extent of all ground cover classes contained in the images.
To label data for all classes would be a tedious manual process, and we want to generate just enough training data for our application of detecting forest mortality, since exhaustive classification is unnecessary for our application and could adversely affect the classification accuracy for the class of interest.
We therefore select one-class classification (OCC) to delineate the targeted change in an approach with two main steps:
1. Change-aware image-to-image translation that allows direct comparison of heterogeneous pre- and post-event images through differencing.
2. OCC applied to a stack of difference images (from step 1) and original input images to detect defoliation.
We use the recently developed code-aligned autoencoders (CAE) algorithm (Luppino et al. 2022) to perform the image-to-image translation. CAE was designed for unsupervised change detection, but since it obtains difference images by translating images between domains, it can also be used purely as a translation method. Furthermore, because it is designed with change detection in mind, the network learns to preserve the changes in the translations.
OCC is a semi-supervised learning approach that utilises the available labels, but does not require a big training set or access to labels for all classes. By learning the change signature of the phenomenon of interest, we can perform targeted change detection by solving a classification problem with limited and incomplete ground reference data. For OCC we select a flexible approach that utilises all available ground reference data, i.e. also from outside the class of interest. Our approach is summarised in Figure 2.
The main contributions of this work are:
• We propose a method to detect a specific change in heterogeneous remote sensing images based on limited ground reference data and in the presence of other changes.
• We adapt a deep learning method recently developed for unsupervised change detection to translate the images between domains while preserving the changes.
• We adapt a method developed to identify reliable negative samples in OCC for the text domain to work on image data.
• We analyse the effect of the number of labelled samples on benchmark datasets and show that relatively little training data is needed to achieve a good classification.
• We provide an ablation study for the various components of our method using benchmark datasets.
• Our approach allows a post-hoc study of change events, that is, study of areas that were not originally subject to persistent monitoring, by using any available combination of satellite imagery from before and after the event.
• Our method allows more responsive detection of changes in areas with high cloud cover.

Figure 2: Illustration of our approach. The input optical and SAR images at times t1 and t2 are translated to the other domain using code-aligned autoencoders (CAE). The originals and the differences with the translated versions are stacked as feature vectors for every image pixel. A limited amount of training data in the form of known areas with forest mortality is then used to train a one-class classifier (OCC) to map the change for a large area in the presence of both unchanged areas and other changes.

Theory and related work
In this section we provide an overview of relevant theory for our application and work related to detection of canopy damage caused by defoliating insects.

Remote sensing of insect induced canopy defoliation
Previous studies of canopy defoliation in the forest-tundra ecotone have mostly utilised low-resolution (250 m) MODIS data (Jepsen et al. 2009; Jepsen et al. 2013; Biuw et al. 2014; Olsson et al. 2016b; Olsson et al. 2016a). This agrees with the findings in a review of remote sensing of forest degradation by Gao et al. (2020), which shows the prevalence of optical data in general and MODIS in particular. A literature review by Senf et al. (2017) found that studies of disturbances by broadleaved defoliators mainly used low- or medium-resolution data, with Landsat being the most used sensor. These works typically use spectral indices and dense time series to detect the defoliation. Senf et al. (2017) found that 82% of studies mapping insect disturbance of broadleaved forest used a single spectral index, typically NDVI. This is consistent with the observation by Hall et al. (2016) that image differencing of vegetation indices derived from spectral band ratios was most frequently used. However, problems with the low resolution of MODIS for mapping forest insect disturbance in the fragmented Fennoscandian forest landscape were emphasised by Olsson et al. (2016a). Limitations of low spatial resolution sensors for detection of pest damage were also pointed out by Hall et al. (2016), due to the large number of different surface materials that can be contained in a pixel, only some of which are affected by the outbreak. Relying on a single spectral index makes results susceptible to changes from sources other than forest mortality, or dependent on accurate forest masks to avoid this. The sole use of NDVI also ignores the information contained in other bands of the sensor.
When it comes to the use of SAR data, none of the studies of broadleaf defoliation listed by Senf et al. (2017) used SAR, nor did the works on remote sensing for assessing forest pest damage reviewed by Hall et al. (2016). A study by Mitchell et al. (2017) of approaches for monitoring forest degradation did include a number of SAR data applications. However, these were mainly for L- and X-band SAR data. Most of the summarised papers dealt with the scenario where entire trees (stems and branches) had been removed, for instance by fires or logging, or they used proxy indicators, such as detection of forest roads, to monitor degradation (Mitchell et al. 2017). It was found that only a few studies had investigated the use of C-band data (Mitchell et al. 2017). The study of C-band SAR remains interesting, particularly because the Sentinel-1 satellites provide free data in that frequency band.
A study of insect-induced defoliation using C-band SAR was presented by Bae et al. (2021), which calculated the correlation between defoliation risk and smoothed time series of backscatter values averaged over five-hectare (50 000 m²) plots. In a precursor to this work, we discriminated between live and dead canopy based on an accurate estimation of the polarimetric covariance from a single, full-polarimetric C-band image (Agersborg et al. 2021).

Heterogeneous change detection
Traditionally, change detection has been based on images from the same sensor and preferably with the same acquisition geometry before and after the change event. This puts some limitations on the use of such methods. Firstly, the response time is at a minimum limited to the revisit time of that particular satellite or constellation. Secondly, the area of interest (AOI) might not be covered frequently by the sensor. Furthermore, the time period which can be studied is also limited to the active time period for that particular sensor. One way to alleviate these issues is to use heterogeneous change detection on imagery from two different sensors. This comes at the cost of solving a problem that is methodologically more challenging, especially in the unsupervised case, but is still our chosen solution given the practical constraints.
To enable reliable, long-term, persistent monitoring of forest mortality in our AOI, we need to use images from different sensors due to frequent cloud cover. SAR sensors have significantly more imaging opportunities because of their near weather-independent nature. For our AOI, we do not have SAR imagery from before the geometrid moth outbreak, and we thus have to use Landsat data for the pre-event image. Furthermore, we do not have enough data to use time series for smoothing or monitoring gradual changes, as in the studies based on low-resolution MODIS NDVI (Jepsen et al. 2009; Jepsen et al. 2013; Biuw et al. 2014; Olsson et al. 2016b; Olsson et al. 2016a) or the approach by Bae et al. (2021). Hence, we must rely on bi-temporal change detection using pairs of images.
Heterogeneous change detection has received growing interest in recent years (Touati et al. 2019; Sun et al. 2021). Fully heterogeneous change detection should work in a range of settings, from the easiest, where the images are acquired with the same sensor but with different sensor parameters or under different environmental conditions, to the most challenging, where images are obtained by sensors that use different physical measurement principles (e.g. SAR and optical) (Sun et al. 2021). The advances of deep learning have in recent years opened up several new directions for heterogeneous change detection. Image-to-image translation in particular offers the interesting prospect of comparing the images directly, once one or both have been translated to the opposite domain. Many traditional change detection methods involve a step for homogenising the images, such as radiometric calibration, even for images from the same sensor (Coppin et al. 2004). Image-to-image translation can be seen as an evolution of this traditional preprocessing step; one (or both) images are "re-imagined" as they would appear in the other domain.

Image-to-image translation
Code-aligned autoencoders (CAE) is a recently developed method for general-purpose change detection designed to work with heterogeneous remote sensing data (Luppino et al. 2022). The change detection is based on translating images U and V into V̂ and Û, respectively, such that the difference images U − Û and V − V̂ can be computed. The CAE algorithm must ensure that changed areas are not considered when learning the translation across domains, to avoid falsely aligning the changed areas. CAE identifies changed areas in an unsupervised manner and reduces their impact on the learnt image-to-image translation. The unsupervised nature of the algorithm means that we do not need training data. While CAE performs change detection directly, it will detect all changes between the two acquisition times, such as those caused by human activity, fluctuating water levels, seasonal vegetation changes, etc., and not just forest mortality. We therefore utilise it as an image-to-image translation method that does not require labelled training data, is change-aware, and is designed for heterogeneous remote sensing data.

One-class classification
Canopy defoliation has a weak change signature compared to many other vegetation changes that can be discerned in medium-resolution imagery, meaning that the imposed change in the radiometric signal is much smaller than for disturbances such as forest fires or clear-cutting.
Thus, it will not in general be detected by the CAE. An unsupervised change detection method will tend to highlight certain strong changes and ignore weaker ones. If we lower the threshold for detection, the forest mortality could be found, but it would be surrounded and accompanied by many unrelated changes in the final change map. We therefore use a semi-supervised approach to detect the change phenomenon of interest while ignoring all other changes.
Traditional supervised classification algorithms require that all classes that occur in the dataset are exhaustively labelled. To obtain sufficient training data for all possible classes by manual labelling would be both time-consuming and costly, and does not necessarily improve the classification accuracy with respect to our class of interest.
While we cannot collect ground reference data for all classes, we still want to utilise the available data to train a change detection algorithm to find changes in a larger region extending beyond the study area. We instead use techniques from one-class classification (OCC) to learn the change signature from a very limited number of labelled samples of ground reference data containing forest mortality information.
OCC is framed as a binary classification problem, where the class of interest is referred to as the positive class (or target class), with label y = 1. In this setting, the negative class, y = 0, is either absent from the training data or the instances available "do not form a statistically representative sample" (Khan and Madden 2014). The negative class is usually a mixture of different classes (Li et al. 2020), defined as the complement of the positive class. The typical case for OCC is that the full dataset X is divided into P, the set of labelled positive samples, and an unlabelled set U (also called mixed set) that consists of data from both the positive and the negative class. OCC seeks to build classifiers that work in such scenarios, which often naturally arise in real world applications (Bekker and Davis 2020). In general, the results will not be as good as in true binary classification, where statistically representative samples from both the positive and negative class are available for training.
In our case the reference data is not randomly sampled, but selected systematically from a spatially limited area with a sampling bias that is somewhat related to landscape attributes.
The lack of random sampling limits our choice of OCC algorithm to the so-called two-step techniques and discards the other possibilities mentioned in the taxonomy of Bekker and Davis (2020). Two-step techniques only require two quite general assumptions: smoothness and separability (Bekker and Davis 2020). Smoothness means that samples that are close are more likely to have the same label, while separability means that the two classes of interest can be separated (Bekker and Davis 2020). The steps are:
1. Given labelled positive samples P from a dataset X, reliable negative (RN) samples are identified from the remaining set of unlabelled samples U.
2. A classifier is trained with the provided labelled positive samples P, using the RN samples from Step 1 as the (initial) set of negative training data, N.
Any (semi-)supervised classifier can be used in Step 2, as both positive and RN labels are available.
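As a minimal sketch of the two-step scheme on synthetic data, once the RN samples are in hand, Step 2 reduces to ordinary supervised training. The classifier, data, and shapes below are illustrative stand-ins, not the paper's setup (the paper uses an MLP ensemble in Step 2):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins: P = labelled positives, RN = reliable negatives
# assumed to come out of Step 1. Well-separated clusters for illustration.
rng = np.random.default_rng(4)
P = rng.normal(2.0, 0.5, (40, 3))      # positive class samples
RN = rng.normal(-2.0, 0.5, (200, 3))   # reliable negative samples

# Step 2: train any supervised classifier on P (label 1) vs. RN (label 0).
X_train = np.vstack([P, RN])
y_train = np.r_[np.ones(len(P)), np.zeros(len(RN))]
clf = LogisticRegression().fit(X_train, y_train)

# Classify new, positive-like samples.
pred = clf.predict(rng.normal(2.0, 0.5, (10, 3)))
```

Because both label sets are available after Step 1, the second step is a plain binary classification problem and the classifier is interchangeable.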

Feature selection
Feature engineering is an important part of machine learning, as selecting the right features and normalisation may have a big influence on the final classification result. For our bi-temporal change detection problem we originally have the co-registered pre-event (U) and post-event (V) images, which may be from different sensors and even different physical measurement principles.
We want the feature vectors to be as descriptive as possible since we have a very limited amount of training data, and only from the change class of interest. A common change detection technique for homogeneous images is to subtract one image from the other to obtain the difference image, D ∈ R^(h×w×c). The exact steps for finding the changes from the features of D vary, but often involve a form of thresholding.
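For homogeneous images, this differencing-plus-thresholding pipeline can be sketched as follows. The data is synthetic, and the mean-plus-two-standard-deviations threshold is an arbitrary illustrative rule, not one used in this work:

```python
import numpy as np

# Two co-registered homogeneous images with a synthetic "change" patch.
rng = np.random.default_rng(1)
h, w, c = 64, 64, 3
X1 = rng.normal(size=(h, w, c))
X2 = X1.copy()
X2[10:20, 10:20] += 3.0          # inject a change in a 10x10 patch

D = X1 - X2                      # difference image, shape (h, w, c)
mag = np.linalg.norm(D, axis=-1) # per-pixel change magnitude
thr = mag.mean() + 2 * mag.std() # illustrative global threshold
change_map = mag > thr           # boolean change mask
```

For heterogeneous images this direct subtraction is meaningless, which is what motivates the translation-based differencing described next.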
We seek to combine original image features and difference vectors for heterogeneous images without creating a specific method for weighting the contributions. Using CAE for image-to-image translation, we can obtain Û, which is the post-event image translated to the domain of U, and V̂, the pre-event image translated to the domain of V. This allows us to compare the pre- and post-event images and utilise the difference images in our features:

D_U = U − Û,    (1)
D_V = V − V̂.    (2)

If U is an image with c_u channels and V has c_v channels for each pixel position in the h × w images, Eqs. (1) and (2) correspond to the element-wise differences

d_u(i) = u(i) − û(i),  i = 1, …, c_u,    (3)
d_v(i) = v(i) − v̂(i),  i = 1, …, c_v,    (4)

where u ∈ R^(c_u×1) and v ∈ R^(c_v×1) are the multi-channel pixel vectors in the co-registered input images at a given position, û ∈ R^(c_u×1) and v̂ ∈ R^(c_v×1) are the corresponding pixel vectors in the translated images, and the parentheses are used to index the channels of the image. For an optical image, the channels will be spectral bands, while for SAR they will typically contain polarimetric information.
By stacking the difference vectors from Eqs. (3) and (4) with the original multichannel input data, we form an array X ∈ R^(h×w×c), where at each of the h × w pixel positions the feature vector x can be written as

x = [u^T, v^T, d_u^T, d_v^T]^T,    (5)

where (·)^T denotes the transpose operation and the dimension of the feature vector, c = 2(c_u + c_v), is the sum of the dimensions of its components. The feature vector x contains information about both the change and the state before and after the event. Since we use labelled training data, we avoid hand-crafted features or dimensionality reduction methods such as PCA, thus allowing the machine learning algorithm to learn which features are important for detecting the change of interest.
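The stacking of original channels and translation-based differences can be sketched as follows; the image sizes, channel counts, and random data are placeholders, and the translated images stand in for CAE outputs:

```python
import numpy as np

# Synthetic stand-ins: U is the optical pre-event image (c_u bands), V the
# SAR post-event image (c_v channels); U_hat and V_hat play the role of the
# CAE translations into the respective domains.
h, w, c_u, c_v = 8, 8, 7, 2
rng = np.random.default_rng(2)
U = rng.random((h, w, c_u));  U_hat = rng.random((h, w, c_u))
V = rng.random((h, w, c_v));  V_hat = rng.random((h, w, c_v))

# Stack originals and differences along the channel axis:
# per pixel this is x = [u, v, u - u_hat, v - v_hat].
X = np.concatenate([U, V, U - U_hat, V - V_hat], axis=-1)

# Flatten to (n_pixels, n_features) for pixel-wise classifiers.
X_flat = X.reshape(-1, X.shape[-1])
```

Each pixel then carries 2(c_u + c_v) features combining pre-event state, post-event state, and change evidence from both domains.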

Building the OCC
We select the two-step approach to OCC for our application since it is flexible and makes no assumptions about random sampling of the positively labelled training data. A further benefit is that it enables the use of ground reference data collected from other classes (i.e. not forest mortality) by adding it to the RN samples to augment the negative class. When selecting the methods for each step, we want to avoid those that have many parameters requiring tuning.
To further guide our choice, the OCC should work for a range of different numbers of positive samples. We also exclude methods that are highly specialised toward the text domain.
Step 1. To obtain RNs in the first step, we use a Gaussian mixture model (GMM) updated once with the expectation maximisation (EM) algorithm (Dempster et al. 1977). This is inspired by the first step of the Spy algorithm (Liu et al. 2002), which uses a naïve Bayes (NB) classifier and assigns a sample x to the negative class if its posterior probability P(y = 0|x) exceeds P(y = 1|x). Rewriting these posterior probabilities with Bayes' rule and cancelling the evidence p(x) in the denominators gives the decision rule

P(y = 0) p(x|y = 0) > P(y = 1) p(x|y = 1),    (7)

where P(y) is the prior probability of the positive or negative class, and p(x|y) is the likelihood.
The approach of NB is to make the so-called naïve assumption that all the features of x are mutually independent when conditioned on the class y. Then p(x|y) can be written as the product of the marginal univariate probability density functions (PDFs) of all features in x. The use of NB by Liu et al. (2002) and other works stems from document classification, where the marginals are modelled as discrete probability mass functions which represent the probability of a given word occurring in a document of class y. These can be readily estimated by counting word occurrences, while estimating the joint p(x|y) for the set of words in a document is intractable unless the vocabulary is small. In our case, the naïve assumption of mutual independence of features does not make it easier to calculate Eq. (7), as that requires assumptions about the PDFs. Instead, we choose to use a parametric model for the conditional probability density functions of the feature vectors for each class, p(x|y). Compared to the naïve approach, this allows us to account for correlation between the features of x. Recall that the feature vectors consist of all channels from the original images as well as the differences obtained with the translations, as given by Eq. (5), so we must assume correlation between features. Since the goal is to perform classification in order to obtain reliable negative samples which can be used to train a better, final classifier in the second step, we do not attempt to optimise the selection of p(x|y). We instead argue that the Gaussian is a reasonable choice of PDF in this setting, since its parameters are readily estimated by the sample mean and the sample covariance matrix, with the latter capturing the correlation between features. Thus, we model the density for the positive class as p(x|y = 1) ∼ N(µ1, Σ1), and likewise for the negative class p(x|y = 0) ∼ N(µ0, Σ0).
This is a two-component GMM, and we can use EM to provide an initial classification of the data in order to find RNs. The initial estimates for the mean and covariance of the first mixture component,μ 1 andΣ 1 , are based on the labelled positive training set, while the estimatesμ 0 andΣ 0 are based on the unlabelled set, which contains both positive and negative samples. The standard sample mean and sample covariance matrix estimators are used. Using results from EM for GMM (see e.g. Bishop (2006)) we can now find refined estimates for the parameters.
The new estimate of the expectation for mixture component k is then

μ̂_k = (1/n_k) Σ_{j=0}^{N−1} γ(z_jk) x_j,    (8)

where γ(z_jk) ∈ [0, 1] is called the responsibility and denotes the posterior probability of sample x_j belonging to mixture component k, and n_k = Σ_{j=0}^{N−1} γ(z_jk) is a normalisation factor. The term indicates how much "responsibility" mixture component k has for explaining sample j. The γ(z_jk) are calculated as the posterior probabilities of sample x_j belonging to mixture component k given the parameters of the mixture components calculated in the previous (initial) iteration of the EM algorithm. The prior probabilities in Eq. (7) are initialised as equal (uninformative): P(y = 1) = P(y = 0) = 0.5. Since we want mixture component k = 1 to model the positive class, we set γ(z_j1) = 1 and γ(z_j0) = 0 when x_j is from the positive training set. The updated estimates for the covariance matrices are given in a similar manner as Eq. (8) by

Σ̂_k = (1/n_k) Σ_{j=0}^{N−1} γ(z_jk) (x_j − μ̂_k)(x_j − μ̂_k)^T.    (9)

We note that the EM estimates are the maximum likelihood estimators (MLEs) for the expectation and covariance matrix weighted by γ(z_jk). These can easily be found by functions that calculate weighted versions of the mean and sample covariance matrix, e.g. the average and cov functions from the Python numpy library. The reliable negatives are selected as the samples x where the probability of belonging to the mixture component used to model the negative class is greater than that of the positive class, according to Eq. (7). The prior probabilities are calculated as the proportion of samples assigned to each class, and the estimates in Eqs. (8) and (9) for the positive and negative class are then used to evaluate the Gaussian PDFs. We used the multivariate_normal function from the scipy.stats package.
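A compact sketch of this single EM update on synthetic data, using the weighted numpy mean and covariance functions and scipy.stats.multivariate_normal mentioned above. The data, class geometry, and cluster parameters are invented for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Synthetic data: P = labelled positives (compact cluster), U_set = the
# unlabelled set containing hidden positives and a "wide" negative class.
rng = np.random.default_rng(3)
P = rng.normal(3.0, 0.5, (50, 2))
U_set = np.vstack([rng.normal(3.0, 0.5, (20, 2)),     # hidden positives
                   rng.normal(0.0, 1.5, (200, 2))])   # negatives
X = np.vstack([P, U_set])

# Initial estimates: component 1 from P, component 0 from the unlabelled set.
mu = [U_set.mean(0), P.mean(0)]
cov = [np.cov(U_set, rowvar=False), np.cov(P, rowvar=False)]
pi = [0.5, 0.5]                                       # uninformative priors

# E-step: responsibilities gamma[j, k]; labelled positives are clamped
# to component k = 1 (gamma = 1 for positive, 0 for negative).
lik = np.stack([pi[k] * multivariate_normal(mu[k], cov[k]).pdf(X)
                for k in (0, 1)], axis=1)
gamma = lik / lik.sum(1, keepdims=True)
gamma[:len(P)] = [0.0, 1.0]

# M-step: weighted mean and covariance, i.e. Eqs. (8) and (9).
for k in (0, 1):
    mu[k] = np.average(X, axis=0, weights=gamma[:, k])
    cov[k] = np.cov(X, rowvar=False, aweights=gamma[:, k])
pi = gamma.mean(0)                # priors = proportion assigned to each class

# RNs: unlabelled samples more likely under the negative component, Eq. (7).
score = np.stack([pi[k] * multivariate_normal(mu[k], cov[k]).pdf(U_set)
                  for k in (0, 1)], axis=1)
rn_mask = score[:, 0] > score[:, 1]
```

On this toy geometry the wide negative component claims most of the unlabelled set, while the hidden positives near the compact positive cluster are largely kept out of the RN set.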
Intuitively, the mixture component for the negative class will be "wide" and have the largest value for most of the support, while the mixture component of the positive class will be "compact" and only have higher values close to where the positive training samples are located.
Step 2. For the second step, any (semi-)supervised method could be used, as we now have training data for both classes with the labelled positives and RNs (Bekker and Davis 2020). We choose to base our second step on multi-layer perceptron (MLP) feed-forward neural networks. Neural networks are a good out-of-the-box method for a problem such as this, due to their high flexibility as function approximators. Unfortunately, there are no general guidelines for selecting the number of hidden network layers, or the number of neurons in each. Our initial data exploration revealed some variation in the classification results depending on how the network architecture was selected. We therefore opted to combine five different MLPs in an ensemble model.
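A hedged sketch of such an ensemble with scikit-learn's MLPClassifier; the five hidden-layer layouts below are illustrative guesses, not the architectures used in the paper, and the data is synthetic:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic binary training data standing in for positives and RNs.
rng = np.random.default_rng(5)
X_train = np.vstack([rng.normal(1.5, 0.5, (100, 4)),
                     rng.normal(-1.5, 0.5, (100, 4))])
y_train = np.r_[np.ones(100), np.zeros(100)]

# Five MLPs with different (hypothetical) hidden-layer layouts.
layouts = [(16,), (32,), (16, 16), (32, 16), (64,)]
ensemble = [MLPClassifier(hidden_layer_sizes=h, max_iter=500,
                          random_state=i).fit(X_train, y_train)
            for i, h in enumerate(layouts)]

# Ensemble prediction: average the class-1 probabilities, then threshold.
X_test = rng.normal(1.5, 0.5, (20, 4))
proba = np.mean([m.predict_proba(X_test)[:, 1] for m in ensemble], axis=0)
pred = (proba > 0.5).astype(int)
```

Averaging over several architectures reduces the sensitivity of the result to any single choice of layer layout.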

Illustrating targeted change detection
To illustrate targeted change detection, we test our method on a dataset used in the change detection literature. The Texas dataset consists of two 1534 × 808 pixel multispectral optical images of Bastrop County, Texas, where a destructive wildland-urban fire struck on 4 September 2011 (Volpi et al. 2015). The pre-event image is from the Landsat 5 Thematic Mapper (TM), with 7 spectral channels, and the post-event image is from the Earth Observing-1 Advanced Land Imager (EO-1 ALI), with 10 spectral channels. The ground truth was provided by Volpi et al. (2015).
We apply our method with 1000 randomly drawn positive samples as the training data. This corresponds to 0.76% of the positive ground truth data (0.08% of the total image pixels). In Figure 3 we zoom in on an area containing both the targeted change (where the fire has occurred) and other changes (clouds), and show the original image data and the change detection result.
In addition to the result from our approach, we show that of the unsupervised CAE change detection from Luppino et al. (2022). This illustrates how our semi-supervised OCC-based approach ignores changes not present in the labelled training set. The result in Figure 3c is reasonable, since the clouds (and their shadows) represent an actual change between the two acquisition times. However, it is the effects of the forest fire that we are interested in mapping, and thus the cloud-related changes are marked as false positives in the ground truth data.

Ablation study
We perform an ablation study to check the contributions of the various components of our approach. In this study we remove or replace a component and assess the result when using the ablated procedure. The objective is to check if all components contribute positively to the change detection, or if a simpler method performs equally well or better. This is an important exercise in the field of machine learning due to the complexity of the methods involved. Since we wish to objectively measure the effects of the ablation, we must use benchmark datasets where the ground truth is available and we can evaluate the performance of our method numerically.
As part of this study we also investigate the effect of the number of labelled positive training samples, and how many are needed for a stable result. We use three benchmark change detection datasets with heterogeneous pre- and post-event images for which ground truth is available.
Though these datasets are not directly related to the task of mapping forest mortality, this study allows us to assert that our approach is valid for problems beyond our AOI. We present how we performed the study before discussing the result for one of the benchmark datasets in detail, and then briefly summarise the two others.
We also include the F1 score for the GMM with one EM update used to find the reliable negatives (RNs) in the first step. The difference between this result and our proposed method is the contribution of the second step to the final OCC result.
We also consider an alternative method in the second step, which uses the same RNs as our proposed approach. The method is called iterative support vector machine (SVM) (Bekker and Davis 2020) and is a common choice for step two. It successively trains SVM classifiers based on the labelled positive and RN samples available, and then classifies the remaining unlabelled dataset. The samples which are classified as negative are added to the RN set and used to train a new SVM in the next iteration. After some convergence criterion is met, the final SVM trained is used to classify the entire dataset. Due to the large number of RNs found in the first step, we cannot use kernel SVM and must use the linear variant. The SVM penalty parameter (see e.g. Bishop (2006)) is set by an automated search procedure, as recommended by Hsu et al. (2003), to select the best value based on the false negative rate (FNR). The iterative SVM training stops when the FNR exceeds 5%. Both the MLP ensemble and the iterative SVM use the same RNs found in the first step by the GMM with one EM update.
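As a sketch of how the iterative SVM step operates, the snippet below trains linear SVMs on the positives and a growing RN set until the FNR on the labelled positives becomes too high. The function name, the `max_fnr` and `max_iter` parameters, and the use of scikit-learn's `LinearSVC` are our illustrative choices, not the exact implementation.

```python
import numpy as np
from sklearn.svm import LinearSVC

def iterative_svm(X_pos, X_rn, X_unlab, C=1.0, max_fnr=0.05, max_iter=10):
    """Iterative SVM for step two of a two-step technique (sketch).

    X_pos:   labelled positive samples
    X_rn:    reliable negatives from step one
    X_unlab: remaining unlabelled samples
    Training stops when the false negative rate on the labelled
    positives exceeds `max_fnr`; the previous model is then kept.
    """
    rn, unlab = X_rn, X_unlab
    model = None
    for _ in range(max_iter):
        X = np.vstack([X_pos, rn])
        y = np.hstack([np.ones(len(X_pos)), np.zeros(len(rn))])
        svm = LinearSVC(C=C).fit(X, y)
        fnr = np.mean(svm.predict(X_pos) == 0)  # misclassified positives
        if fnr > max_fnr and model is not None:
            break  # keep the previous, acceptable model
        model = svm
        if len(unlab) == 0:
            break
        pred = svm.predict(unlab)
        new_neg = unlab[pred == 0]
        if len(new_neg) == 0:
            break  # converged: no new negatives found
        rn = np.vstack([rn, new_neg])  # grow the RN set
        unlab = unlab[pred == 1]
    return model
```

The final SVM returned by the loop is then applied to all pixels to produce the binary map.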
Finally, we also include results for the popular one-class SVM (OCSVM) method (Khan and Madden 2014). While it is not an ablation of our approach, OCSVM is an alternative to the first-step method we use, and it is also a frequently used OCC method on its own. The error bars in Figure 4 represent the 10th and 90th percentile F1 scores. We see that the iterative SVM second step performs well when the number of labelled positive samples is small. However, the algorithm actually decreases the F1 score from the first step. The OCSVM has relatively consistent performance as a function of |P|, but a considerably lower F1 score than the GMM with one EM update used in the first step. The OCSVM was originally designed for anomaly detection, a setting in which there is often no unlabelled (or negative) data available. The poor performance in this setting compared to our GMM EM method illustrates the importance of utilising the unlabelled data. The ablation of difference features performs better than the ablation of original features, and also better than the full feature vector when |P| is small. Our proposed approach has the best F1 score when |P| ≥ 250.
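The contrast with OCSVM can be illustrated with scikit-learn: OCSVM fits a boundary around the labelled positives alone and never sees the unlabelled set. The 2-D clusters below are made up purely for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)
# Labelled positives: one tight cluster in a toy feature space.
X_pos = rng.normal(loc=[1.0, 1.0], scale=0.2, size=(200, 2))
# Unlabelled pixels: unseen positives plus a larger negative cluster.
X_unlab = np.vstack([
    rng.normal([1.0, 1.0], 0.2, (50, 2)),    # positives
    rng.normal([-1.0, -1.0], 0.2, (200, 2)), # negatives
])

# OCSVM is trained on the positives only; X_unlab plays no role here,
# which is exactly the information our GMM EM first step exploits.
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_pos)
pred = ocsvm.predict(X_unlab)  # +1 = inlier (positive), -1 = outlier
```

On well-separated toy clusters this works, but with overlapping classes the unused unlabelled data is what allows the class boundary to be placed well.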
In Table 1 we list the F1 scores for the different ablations when |P| = 1000. In addition to the result for the France dataset, which corresponds to the values for one x-value in Figure 4, we include the results for two additional heterogeneous change detection datasets where ground truth information is available. One is the Texas dataset used for the example in Figure 3. In the second step, our MLP ensemble improves the F1 score significantly compared to the first step for all datasets. The iterative SVM results are more varied: it decreases the F1 score for the France dataset, as seen in Figure 4, but achieves a slightly higher F1 score than the MLP ensemble for the Texas dataset.
We conclude that all components of our method contribute to the F1 score. Relatively few positive samples are needed for a stable classification result, which bodes well for our forest mortality classification. We also note that our method performs well for these datasets even though the changed areas in their pre-event images are more varied than for forest mortality mapping. Several different landcover types are affected by the construction, fire, and flooding, whereas for our application the changed areas are always live forest in the pre-event image.

Creating the forest mortality map
There are few cloud-free optical images available from our AOI, which limits the selection of imagery that could be used to map the forest mortality that has occurred. We found one such image. The polarimetric covariance matrices of the RADARSAT-2 data were then estimated using the guided nonlocal means method (Agersborg et al. 2021). This method preserves SLC resolution and was shown to give estimates of the polarimetric features that separate better between live and dead canopy than alternative methods (Agersborg et al. 2021). The features relevant for canopy state classification were extracted from the covariance matrix C. These are the intensities in the HH, HV, and VV channels, C_11, C_22 and C_33 respectively, and the cross-correlation between the complex scattering coefficients for HH and VV, C_13 = |C_13| e^(j∠C_13) (Agersborg et al. 2021).
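Extracting these features from an estimated covariance matrix is straightforward; the sketch below assumes the usual lexicographic ordering of the scattering vector (HH, HV, VV), and the function name is ours.

```python
import numpy as np

def canopy_features(C):
    """Extract canopy-state features from a 3x3 polarimetric
    covariance matrix C (lexicographic basis HH, HV, VV).

    Returns the channel intensities C_11, C_22, C_33 and the
    magnitude and phase of the HH-VV cross-correlation C_13.
    """
    c11 = np.real(C[0, 0])  # HH intensity
    c22 = np.real(C[1, 1])  # HV intensity
    c33 = np.real(C[2, 2])  # VV intensity
    c13 = C[0, 2]           # complex HH-VV cross-correlation
    return c11, c22, c33, np.abs(c13), np.angle(c13)
```

For a single-look example, C can be formed as the outer product of a scattering vector k with its conjugate transpose; in practice C is a local (here, guided nonlocal) average of such products.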
The Landsat-5 TM bands were upsampled from the original 30.0 m (120.0 m for Band 6) spatial resolution to give a pixel size of 10.0 m × 10.0 m using bilinear interpolation resampling during the coregistration process with the SAR data. In this process, both the LS-5 and RS-2 images were geocoded to the UTM 35N projection and mapped on a common grid using QGIS before cropping and extraction of overlapping images. This resulted in 1399×1278 pixel images which were the input for the analysis.
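The 30 m to 10 m upsampling step can be mimicked with a bilinear zoom; this is a simplified stand-in for the QGIS workflow, which also geocodes both images to UTM 35N before cropping, and the function name and defaults are our own.

```python
import numpy as np
from scipy.ndimage import zoom

def upsample_band(band, src_res=30.0, dst_res=10.0):
    """Bilinearly resample one band to a finer pixel grid.

    A hypothetical stand-in for the coregistration resampling:
    30 m Landsat pixels become 10 m pixels (zoom factor 3).
    """
    factor = src_res / dst_res          # 3x upsampling for 30 m -> 10 m
    return zoom(band, factor, order=1)  # order=1 selects bilinear
```

Note that upsampling adds no information; it only places the optical data on the same 10 m grid as the SAR features.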
The training data was created by experts carefully comparing high-resolution aerial photography from before (2005) and after the outbreak (2015), and drawing polygons covering areas with forest mortality. We chose this approach for three reasons. Firstly, it was easier than selecting a greater number of smaller areas from all over the scene, especially considering that the aerial photographs only covered parts of it. The areas need to be relatively large, as the original 30.0 m resolution of the LS-5 data sets a lower limit for the polygon size. Secondly, by extracting from within larger homogeneous areas we minimise the effect of any misalignment between the ground reference data and the satellite imagery, which could cause the training data to contain pixels from the negative class. Thirdly, manual creation of training data is tedious work, and we want to generate just enough training data for our OCC method to map the forest mortality for the entire scene. In total, 15 polygons of roughly rectangular shape and various sizes were created. Four of these intersect with transects studied in field work in 2017 (Agersborg et al. 2021). The 15 polygons contained 1536 pixels, which given the excerpt size of 1399 × 1278 constitutes 0.086% of all pixels.

The CAE translations in Figure 5 are reasonably similar to the other image in the translated domain, but can easily be discerned from the original data. This is expected, as it is not possible to exactly recreate the spectral information found in optical imagery from SAR data; likewise, the polarimetric information about scattering characteristics cannot be replicated from the Landsat TM reflectance measurements. The translation from the optical to the SAR domain seems to match the original RS-2 image quite well, and visually appears to be the better of the two translations, whereas the translation in Figure 5d appears somewhat blurry with muted colours.
There are some obvious translation artefacts, for instance the bright pixels in lake Polmak in the translated RS-2 result.

Unsupervised change detection with CAE
To illustrate the changes detected by a non-targeted approach based on the translations in Figure 5, we show the CAE result. Change detection with CAE is based on thresholding the magnitude of the difference vectors between the translations and the originals (Luppino et al. 2022). Figure 6 shows the confusion map for the CAE change detection based on the difference vectors obtained from the images shown in Figure 5.

Figure 6: CAE confusion map, with predicted changes in green, predicted unchanged in black, correctly classified forest mortality from our limited ground reference dataset marked white, and red showing missed detections from the same dataset.

Figure 6 also shows the 15 ground reference polygons with forest mortality created as training data for the OCC. Note that some of the polygons are too small to be discerned in Figure 6. Parts of the forest mortality polygons where CAE predicts no change are shown in red, while correct change predictions for these areas are white. Only 14% of the pixels with forest mortality are predicted as changed. We notice that parts of the predicted changes resemble outlines of the water bodies in Figure 5, which indicates that CAE detects changes in water levels. There are also changes in the agricultural and settled land, primarily along the Tana river.
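Thresholding the difference magnitudes can be sketched as below; we use Otsu's method as a simple stand-in for the thresholding in Luppino et al. (2022), and both function names are ours.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's threshold: maximise between-class variance of a histogram."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w = hist / hist.sum()
    best_t, best_var = centers[0], -1.0
    for i in range(1, bins):
        w0, w1 = w[:i].sum(), w[i:].sum()
        if w0 == 0 or w1 == 0:
            continue
        m0 = (w[:i] * centers[:i]).sum() / w0   # class means below/above
        m1 = (w[i:] * centers[i:]).sum() / w1
        var = w0 * w1 * (m0 - m1) ** 2          # between-class variance
        if var > best_var:
            best_t, best_var = centers[i], var
    return best_t

def change_map(img, translation):
    """Binary change map from per-pixel difference-vector magnitudes.

    img, translation: (H, W, C) arrays in the same domain.
    """
    d = np.linalg.norm(img - translation, axis=-1)  # (H, W) magnitudes
    return d > otsu_threshold(d)
```

As the text notes, such magnitude thresholding only flags the strongest differences, which is why the weak mortality signature is largely missed.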
The translation artefacts from lake Polmak in Figure 5d are also marked as changes. Compared to these phenomena, which result in high magnitude difference vectors, the death of the canopy layer of the fragmented forest-tundra ecotone is a subtle change at this resolution level. To detect it we need to train a classifier to look for it specifically. Figure 7 shows the forest mortality map provided by our targeted change detection method.

Targeted change detection with the proposed method
The feature vectors in Eq. (5) are obtained from the images and translations shown in Figure 5. The two-step OCC is then trained on the 1536 feature vectors from the 15 polygons with known forest mortality. The result indicates that large areas of forest have died following the outbreak.
In particular, the western side of lake Polmak, on the Finnish side of the border, has been heavily afflicted. This is in line with the observations by Biuw et al. (2014) that the regeneration of mountain birch stands appeared to have been severely hampered by the year-round grazing on the Finnish side of the border. Significant forest mortality is also detected north of the river. We note the fragmented nature of the forest mortality map, which is natural given the sparse nature of the forest-tundra ecotone. Contrary to the CAE result, there are no detections along the agricultural and settled land around the Tana river.
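The construction of the extended feature vectors described above can be sketched as follows; the channel ordering and function name are our assumptions about the layout of Eq. (5), which stacks the originals with the difference images in both domains.

```python
import numpy as np

def build_features(u, v, u_hat, v_hat):
    """Stack per-pixel feature vectors for the two-step OCC (sketch).

    u     : pre-event optical image, (H, W, Cu)
    v     : post-event SAR features, (H, W, Cv)
    u_hat : translation of v into the optical domain, (H, W, Cu)
    v_hat : translation of u into the SAR domain, (H, W, Cv)
    """
    du = u - u_hat  # difference image in the optical domain
    dv = v - v_hat  # difference image in the SAR domain
    x = np.concatenate([u, v, du, dv], axis=-1)
    return x.reshape(-1, x.shape[-1])  # one feature vector per pixel
```

Each pixel thus carries both its original multisource measurements and the change evidence from the translations.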
We do not have another map of the effects of the outbreak to evaluate against, so we cannot readily quantify the accuracy of the full forest mortality map. Using NDVI measurements from before and after the outbreak, acquired in the same period as the input imagery, and correlating them with the result in Figure 7, we find a noticeable NDVI decrease in the areas of classified forest mortality. For these areas, shown as white in Figure 7, the NDVI decreased by 0.145 on average.
In the other areas, the difference was virtually zero.
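This sanity check amounts to averaging the per-pixel NDVI change inside the predicted mortality mask, where NDVI = (NIR − Red) / (NIR + Red); the function below is a minimal sketch with names of our choosing.

```python
import numpy as np

def mean_ndvi_change(nir_pre, red_pre, nir_post, red_post, mask):
    """Average NDVI change inside a binary mask (e.g. mapped mortality).

    All band arrays share the same (H, W) shape; mask is boolean.
    """
    ndvi_pre = (nir_pre - red_pre) / (nir_pre + red_pre)
    ndvi_post = (nir_post - red_post) / (nir_post + red_post)
    return float((ndvi_post - ndvi_pre)[mask].mean())
```

A clearly negative value inside the mask, together with a near-zero value outside, supports the mortality map.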
We also evaluate our result on a systematic grid of 30 m × 30 m cells aligned with transects examined during field work in 2017 (Agersborg et al. 2021). The cells were classified by an expert into four classes based on a study of aerial photographs. Three of the classes were related to forest mortality, as they describe the canopy state as "live" (33 cells), "dead" (44 cells), and "damaged", with the remaining cells labelled "other". Each cell was compared with every pixel of Figure 7 that intersects the geographical coordinates defining the bounding box of that cell. Since the mortality map uses a 10 m × 10 m pixel spacing, each grid cell should correspond to 3 × 3 = 9 pixels, though a few will be larger, as we choose to include partially covered pixels. Note that the grid is not aligned north-south and has not been co-registered to the satellite imagery. Hence, there may be some inaccuracies due to misalignment, especially since the size of the cells is the same as the resolution of the LS-5 pre-event image.
For the classes "live" and "other", we observe a low false alarm rate, with TN_grid^live = 98.7% and TN_grid^other = 96.9% of the pixels correctly classified as no forest mortality. As part of the work to create the polygons with forest mortality, 15 polygons of live forest were also extracted.
These were similar in shape (rectangular), size and grouping, and located relatively close to areas with forest mortality. In total the 15 polygons with live forest consist of 2070 pixels. For the live polygons our approach performs well, with TN_poly^live = 97.2%, which is consistent with the result for the "live" class for the grid cells. Note that both the "live" and the "other" class are subsets of the negative superclass, and are not necessarily accurate estimates of the true negative rate for the complete negative class. Comparing the forest mortality map in Figure 7 with the optical image in Figure 5a and the unsupervised CAE change detection result in Figure 6 indicates that we avoid classifying changes in agricultural and settled areas as forest mortality, which is important for the true negative rate.
For the "dead" grid cells we observe a true positive rate of TP_grid = 66.4%. Misalignment could be a contributing factor to the missed detections, as 37 of 44 (84.1%) of the cells do contain pixels classified as forest mortality. The number of missed detections along with the low false alarm rate could also be an indication that our method is conservative in predicting forest mortality, though this was not seen for the training data, where the true positive rate was 96.8%.
While accuracy scores for the training data are always of questionable validity, this indicates that our method does not have to "sacrifice" much accuracy on the positive training set to increase accuracy on the RNs during training.
It is not obvious whether the grid cells with a damaged canopy state should be considered as the positive or negative class for evaluation. This state contains a mixture of dead stems and trees where parts of the canopy have re-sprouted. 28.1% of the pixels in the grid cells of this category were classified as forest mortality. Arguments could be made for considering the damaged state part of the positive class, since there is a clear decrease in canopy cover. However, for operational monitoring it could be more important to focus on areas where the forest has died completely, as it can be assumed that most of the forest will suffer canopy damage in a major outbreak, and our targeted change detection approach should be able to map only the former.
Nonetheless, that only 28.1% of the pixels in the damaged class were labelled as forest mortality may be another indication that our prediction of forest mortality is on the conservative side.
To evaluate how well the forest mortality map corresponds to the high-resolution aerial photographs, we overlay the map on them, as in Figure 8. The comparison indicates that our method appears to avoid misclassifying other landcover types as forest mortality.
We also note the challenge of accurately mapping forest mortality due to the sparse nature of the forest-tundra ecotone and the entangled pattern of live and dead trees. Since the resulting map appears quite good, setting the pixel spacing of the mortality map to 10 m × 10 m to match the RS-2 resolution seems warranted. However, we should keep in mind that the resolution of the LS-5 pre-event image is 30 m × 30 m, which will affect the accuracy of the mortality map. An example of the mortality map failing to delineate thin separations between classes is shown in Figure 9. Here, a thin separation between areas where the trees have died is incorrectly included in the mortality map. Similarly to the result in Figure 8, the map shown in Figure 9 misses parts of the dead forest bordering on the areas correctly mapped as dead. It also appears that there is an area falsely classified as forest mortality in the bottom right part of Figure 9.
The results in Figures 8 and 9, and the numerical evaluations on the grid cells and live polygons, indicate that the mortality map is conservative. One way of easily increasing the number of pixels predicted as forest mortality is to adjust the ensemble voting done in the second step. This is equivalent to considering the output of the ensemble as the average number of predictors which predict the positive class, a number between 0 and 1 that is thresholded to give the final binary output map. Reducing the number of votes required is then the same as reducing this threshold t. By reducing the number of votes required from three (majority) to two we obtain a higher true positive rate of TP_grid = 76.6%. In terms of the threshold, this corresponds to lowering t from 0.5 to 0.3. However, there is a decrease in the true negative rates, with TN_poly^live = 94.6%, TN_grid^live = 97.5%, and TN_grid^other = 94.6%. There was also an increase in the number of pixels from the "damaged" state classified as forest mortality, to 33.7%. When evaluating this result on the high-resolution aerial photographs, we see that more forest mortality is correctly detected, but at the cost of a higher number of false alarms. An example is shown in Figure 10, where the results of our method with t = 0.5 and t = 0.3 are both overlaid on the photographs from before and after the outbreak. The brightest overlay areas are predicted as forest mortality with both thresholds, while the more transparent overlay corresponds to the prediction from t = 0.3 alone. For the area in the upper right corner, t = 0.3 appears better, while the difference in prediction at the bottom right of Figure 10 appears to consist mostly of false alarms.
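The vote-fraction view of the ensemble can be sketched in a few lines; the function name is ours, and a five-member ensemble is assumed for the vote counts discussed above.

```python
import numpy as np

def mortality_map(votes, t=0.5):
    """Threshold the ensemble vote fraction into a binary map.

    votes: (n_models, H, W) binary predictions from the ensemble.
    For a five-member ensemble, t = 0.5 reproduces majority voting
    (three votes), while t = 0.3 requires only two of five votes.
    """
    frac = votes.mean(axis=0)  # fraction of positive votes, in [0, 1]
    return frac > t
```

Lowering t trades a higher true positive rate against more false alarms, as reported above.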

Conclusions and future work
In this work we have presented a method for detecting forest mortality from a pair of medium resolution heterogeneous remote sensing images to map the effect of a geometrid moth outbreak.
This is a challenging problem, and we were unable to achieve it using the unsupervised CAE results alone, since the phenomenon of interest has a weak signature compared to other changes.
By utilising the CAE for change-aware image-to-image translation, we obtained multitemporal difference vectors despite the heterogeneity of the input images. When combined with the original image features, a semi-supervised one-class classifier is able to learn to map the changes of interest from a very limited training set of these extended feature vectors, consisting of less than 0.1% of the image pixels. The ablation study shows that all the components of our targeted change detection approach contribute to the final output, and we achieve good results on benchmark datasets, despite the method not being intended as a general change detection method.
The evaluation for our AOI indicates that we achieve a low false alarm rate, but that the predicted forest mortality map may be somewhat conservative. It is possible to increase the true positive rate at the expense of more false alarms by adjusting a threshold in the ensemble voting done in the second step of the OCC method. However, in this trade-off we prefer a low false alarm rate to a higher number of detections, for instance if the map is used to determine new areas for field work to study the effect of the outbreak. Future work should seek to assess performance on datasets with complete ground truth data available. This should preferably be done on datasets suitable for targeted change detection, where changes unrelated to the phenomenon of interest are included in the negative class.
Our approach expands the potential for detecting the extent of changes that we know have occurred at one or more locations by using whatever satellite imagery is available from before and after the event, as long as it can be co-registered. This allows us to map a phenomenon of interest over large areas. Unlike NDVI-based approaches, it does not require a dense time series of data, which can be problematic for our AOI given the high cloud cover percentage. The modular nature of our approach means that components can be replaced if a particular dataset warrants it. This could also be investigated in future work, and our ablation study hints at some interesting directions.

Disclosure statement
The authors declare no conflicts of interest.

Funding
This work was supported in part by the COAT Tools project (www.coat.no/en/Education/COAT-

where the encoders and decoders are the same as for Equations (11) and (12). CAE enforces alignment of the code layers of the two autoencoders by adding a loss term that ensures their alignment in both distribution and location of land covers (Luppino et al. 2022).
This code correlation loss is a novel feature of CAE, and enforces that Z_u should be similar to Z_v (Luppino et al. 2022). The similarity is based on a cross-modal distance between training patches in the input domains. This allows pixels that have changed to be distinguished from those that have not, and the loss term seeks to preserve these relationships in the code layer.
Contrary to the other loss functions, which are used to train both the encoders and decoders, the code correlation loss is only used for the encoders.
A cycle-consistency loss enforces that data translated from one domain to the other, and then back again, should be identical to the input. In a sense it is similar to the reconstruction loss, except that the cycle-consistency loss involves all encoders and decoders of the network.
The final loss term requires that the translations Û and V̂ in Equations (13) and (14) should be similar to the data in the original domains U and V, except for pixels where changes have occurred. If there is a significant chance that a change has occurred for a particular pixel, its contribution to this loss term is strongly suppressed, whereas pixels of Û (V̂) from likely unchanged areas should be close to U (V). To distinguish between changed and unchanged areas, this loss term includes a weighting factor updated iteratively during training. This is based on preliminary change detection results obtained with the image translations at the current stage of training. Note that while the final change detection result of CAE is a binary change map, the weighting factor is a continuous variable between zero and one, where lower values indicate a high probability of change.
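Our reading of this weighted translation loss can be sketched in numpy; the squared-error form and function name are our simplifications of the loss in Luppino et al. (2022).

```python
import numpy as np

def weighted_translation_loss(u, u_hat, prior_change):
    """Sketch of a weighted translation loss.

    u            : original image, (H, W, C)
    u_hat        : translation of the other domain into this one
    prior_change : (H, W) preliminary change probability in [0, 1],
                   updated iteratively during training.

    Pixels likely to have changed (prior close to 1) contribute
    little; likely-unchanged pixels are pushed towards u.
    """
    sq_err = ((u - u_hat) ** 2).sum(axis=-1)  # per-pixel squared error
    w = 1.0 - prior_change                    # suppress changed pixels
    return float((w * sq_err).mean())
```

The weighting factor discussed above corresponds to w here: low values mark probable change and remove those pixels from the translation objective.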
We have made some adaptations to the CAE related to network training, with the aim of improving the visual quality and detail preservation of the translations, briefly summarised as follows: • The patch size used for training is reduced from 100 × 100 pixels to 20 × 20, so that the code correlation loss is calculated for all pixel pairs in the input patches. In the original implementation, the cross-modal distance between training patches was based on a 20 × 20 pixel excerpt from the centre of the full 100 × 100 training patch due to memory constraints (Luppino 2020). By reducing the patch size, the full training patch is used for the code correlation loss to better align the two domains.
• The number of patches per training batch is increased from 10 to 20 and the number of batches per epoch from 10 to 600, to compensate for the lower number of pixels seen during training due to the reduced patch size. A 100 × 100 patch contains 25 times as many pixels as a 20 × 20 one, and it is therefore necessary to increase the batch size and the number of batches per epoch compared to Luppino (2020).
• The preliminary evaluation of the difference image is changed from using the reconstructed versions (Ũ andṼ) to using the originals in the weighted translation loss to better preserve details in the translations.
In our experience, these modifications improve the visual quality of the translations, and through more meaningful difference images also the accuracy of the final classification.

One-class classification
OCC is framed as a binary classification problem, where the positive class of interest has label y = 1 and the negative class, y = 0, is usually defined as the complement of the positive class (Li et al. 2020). The full dataset X is typically divided into P, the set of labelled positive samples, and an unlabelled set U (also called mixed set) that consists of data from both the positive and the negative class.
Text classification and document retrieval are applications where OCC has seen much use, and several of the OCC methods that have been developed are customised for the text domain. In the taxonomy of OCC techniques proposed by Khan and Madden (2014), the applications were divided into two categories: "text/document classification" and "other applications". The terms single-class classification (SCC) (Yu 2005), partially supervised classification (Liu et al. 2002), and others (see Khan and Madden (2014) for a brief summary) have been used for the one-class classification problem. OCC is also related to positive and unlabelled learning (PUL), and the distinction between the two is somewhat blurry. Just as in the OCC setting, PUL assumes that a labelled positive training set and an unlabelled set containing mixed samples from both the positive and negative classes are available. Many PUL methods are based on estimating dataset attributes, such as the labelling frequency and the class prior probability of the positive class, and thus need to make assumptions about how the labelled positive samples were generated (Bekker and Davis 2020). Some methods assume that the labelled positive samples were selected completely at random (SCAR). PUL also considers two different settings: the single-training-set scenario, where the positive and unlabelled samples are from the same dataset, and the case-control scenario, where they are from different datasets. We choose to draw the distinction between OCC and PUL by saying that the latter makes assumptions about how the positive samples were labelled, or that it is designed to work in the case-control scenario, or both.
Further, unlike PUL, OCC also opens up for the possibility that some labelled negative data may be available, although not well enough sampled or statistically representative enough to build a traditional binary classifier. Note that the PUL acronym has also been used by Elkan and Noto (2008) about a learning algorithm using the SCAR assumption and a specially held-out validation set to estimate the probability of a positive sample being labelled.
The SCAR assumption does not hold for our application with the available ground reference data from our AOI. The weaker "selected at random" condition, which assumes that labelled examples "are a biased sample from the positive distribution, where the bias completely depends on the attributes" (Bekker and Davis 2020), also does not hold. Our ground reference data is selected systematically from a limited area, and the sample selection bias is related to the attributes of the feature vectors due to some spatial correlation in type of background vegetation, soil conditions, tree densities and mortality rates, etc. However, this bias is not completely dependent on the attributes of the feature vectors. Therefore, according to our definition above, we use the term OCC and not PUL in this work, although we acknowledge that the distinction between the terms is not well established.
There are many different OCC methods to choose from, and the choice should naturally depend on the application. The OCC taxonomy by Khan and Madden (2014) divides the methodology into "one-class support vector machine" (OCSVM) and "non-OCSVM". However, this taxonomy (Khan and Madden 2014) is focused on the text domain and only mentions in passing other application domains without summarising which OCC methods they use. A more useful taxonomy, at least when it comes to guiding our choice of methods, is provided in the review article by Bekker and Davis (2020), which lists three different categories of PUL methods: two-step techniques, biased learning, and class prior incorporation. The latter two invoke the SCAR assumption, and are thus not applicable in our case. We therefore end up using the two-step technique, which was previously reviewed.
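The first step of the two-step technique, finding reliable negatives with a GMM and a single EM update, can be sketched as below. This is a simplified reading: the positive component is initialised from the labelled positives, the negative component from the unlabelled data, followed by one EM iteration; the exact initialisation and threshold in the paper may differ, and the function name is ours.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def reliable_negatives(X_pos, X_unlab, thresh=0.99):
    """Find reliable negatives (RNs) with a 2-component GMM and one EM update."""
    d = X_pos.shape[1]
    # Initialise: component 0 from labelled positives, 1 from unlabelled data.
    mu = [X_pos.mean(0), X_unlab.mean(0)]
    cov = [np.cov(X_pos.T) + 1e-6 * np.eye(d),
           np.cov(X_unlab.T) + 1e-6 * np.eye(d)]
    pi = np.array([0.5, 0.5])

    # E-step: posterior responsibility of each component for each sample.
    lik = np.stack([pi[k] * mvn(mu[k], cov[k]).pdf(X_unlab)
                    for k in range(2)], axis=1)
    resp = lik / lik.sum(axis=1, keepdims=True)

    # M-step (the single EM update): re-estimate weights, means, covariances.
    nk = resp.sum(0)
    pi = nk / nk.sum()
    mu = [(resp[:, k, None] * X_unlab).sum(0) / nk[k] for k in range(2)]
    cov = [(resp[:, k, None, None]
            * np.einsum('ni,nj->nij', X_unlab - mu[k], X_unlab - mu[k])
           ).sum(0) / nk[k] + 1e-6 * np.eye(d) for k in range(2)]

    # Final E-step: RNs are unlabelled samples with high negative posterior.
    lik = np.stack([pi[k] * mvn(mu[k], cov[k]).pdf(X_unlab)
                    for k in range(2)], axis=1)
    resp = lik / lik.sum(axis=1, keepdims=True)
    return X_unlab[resp[:, 1] > thresh]
```

The RN set returned here is what the second step (the MLP ensemble or the iterative SVM) is trained against, together with the labelled positives.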
OCC has been used to extract a particular land cover type from remote sensing images, and it has also been applied for targeted change detection (Camps-Valls et al. 2008). An OCC method based on sparse representation of the features was tested for bitemporal change detection of a flood in a homogeneous multispectral dataset and compared with an RBF-kernel OCSVM by Ran et al. (2016). Ran et al. (2018) reported results using kernelised versions of the sparse representation classifiers. Ye et al. (2016) presented an OCC-based method for targeted change detection combining SVDD with results from change vector analysis (CVA) and PCC. However, due to difficulties in finding tight boundaries for the target cluster in the feature space, the changed class was subdivided into several subclasses depending on the spectral signature, each dependent on a balanced training dataset (Ye et al. 2016). The approach was tested on homogeneous image pairs, and good results were reported on selected balanced training and test datasets. Jian et al. (2021) used generative adversarial networks (GANs) to perform OCC change detection. It was based on "spatial-spectral" features extracted from a stack of three homogeneous RGB remote sensing images, where the change occurred in the latest image of the stack. Contrary to the other methods discussed, the training dataset was created from unchanged data from the first two images of the stack. Two GANs generate samples from a "change data" distribution, and a discriminator was trained to separate unchanged from generated change samples (Jian et al. 2021). This discriminator was then used to detect changes in the spatial-spectral features generated from the second and third images in the stack, where the change occurred.
The method performed comparably to deep learning based unsupervised change detection methods and other OCC methods when these, contrary to the normal setting of labelled positive data, were trained with data from the unchanged negative class. It should be noted that while the concept of using only unchanged data for OCC-based change detection is intriguing, it requires additional unchanged images and is unable to perform targeted change detection.