Deep convolutional transformer network for hyperspectral unmixing

ABSTRACT Hyperspectral unmixing (HU) is considered one of the most important ways to improve hyperspectral image analysis. HU aims to break down a mixed pixel into a set of spectral signatures, commonly referred to as endmembers, and to determine the fractional abundances of those endmembers. Deep learning (DL) approaches have recently received great attention for HU. In particular, convolutional neural network (CNN)-based methods have performed exceptionally well in such tasks. However, the ability of CNNs to learn deep semantic features is limited, and their computing cost increases dramatically with the number of layers. The transformer addresses these issues by effectively representing high-level semantic features. In this article, we present a novel approach for HU that utilizes a deep convolutional transformer network. Firstly, a CNN-based autoencoder (AE) is used to extract low-level features from the input image. Secondly, a tokenizer is applied for feature transformation. Thirdly, the transformer module is used to capture the deep semantic features derived from the tokenizer. Finally, a convolutional decoder is utilized to reconstruct the input image. Experimental results on synthetic and real datasets demonstrate the effectiveness and superiority of the proposed method compared with other unmixing methods.


Introduction
Hyperspectral imaging is a cutting-edge technology that makes it easier for humans to perceive the world and offers a new way to observe the earth. It has become a popular research area due to its wide range of applications, such as data fusion (Xiao Zhang et al., 2015; Hong et al., 2019; Xie et al., 2019), classification (Farooque et al., 2021, 2023; Hong et al., 2020; Roy et al., 2019), target detection (Nasrabadi, 2013), and environmental monitoring (Khan et al., 2018; Stuart et al., 2019). Hyperspectral sensors have a high spectral resolution, making it easier to identify ground materials. However, the limited spatial resolution of these sensors often results in multiple materials being mixed within a single pixel (Hong et al., 2020; Keshava & Mustard, 2002). Consequently, this mixing poses challenges in accurately characterizing hyperspectral data, leading to inconsistencies in scene understanding and estimation. Managing mixed pixels is a major obstacle in hyperspectral image (HSI) processing. Therefore, hyperspectral unmixing (HU) has gained importance in addressing the mixed-pixel problem. HU refers to separating the spectral signatures of distinct materials and determining the relative proportions of those materials within each pixel. In general, HU involves three key steps: estimating the number of endmembers in the HSI, extracting the endmembers, and estimating the endmember fractional abundances.
Various unmixing algorithms rely on linear or non-linear spectral mixing models (Bioucas-Dias et al., 2012). The linear spectral mixing model (LSMM) assumes that each incoming incident light ray interacts with only a single pure material (Xiangrong Zhang et al., 2018). The non-linear spectral mixing model (NLSMM) holds when various materials scatter the light in the scene. The LSMM is widely used in many applications (Dobigeon et al., 2009) because it is effective and simple. Under the LSMM, many HU methods can be classified as statistical-based (Berman et al., 2004), geometrical-based (Bioucas-Dias, 2009), or sparse-regression-based (Iordache et al., 2011, 2013) approaches. Statistical-based methods determine endmembers and their abundance maps using parameter estimation techniques. Geometrical-based methods can be divided into two subcategories: pure-pixel and non-pure-pixel methods. Well-known pure-pixel methods include the pixel purity index (PPI) (Boardman, 1993), vertex component analysis (VCA) (Nascimento & Dias, 2005), and N-FINDR (Winter, 1999), while robust collaborative nonnegative matrix factorization (RCNMF) (Jun Li et al., 2016) and minimum volume simplex analysis (MVSA) (Jun Li & Bioucas-Dias, 2008) are good examples of non-pure-pixel methods. Sparse-regression-based methods rely on the assumption that the hyperspectral data can be expressed as linear combinations of known endmembers from a spectral library (Iordache et al., 2012, 2013).
Deep learning (DL) techniques have garnered substantial attention within remote sensing, with an explicit focus on analyzing hyperspectral images. Over recent years, many advanced DL architectures have been designed for various applications, including HSI classification (Hong et al., 2020; Wu et al., 2021; Yao et al., 2023), fusion (Jiaxin Li et al., 2022), and unmixing (Hong et al., 2018; Shao et al., 2023). Among these applications, unmixing has attracted significant attention. At the forefront of unmixing techniques, autoencoders (AE) are the prominent architecture extensively adopted for unmixing tasks. An AE comprises two parts: an encoder, responsible for extracting the fractional abundances from the input data, and a decoder, which reconstructs the input data from the fractional abundances. By enforcing two physical constraints (i.e. nonnegativity and sum-to-one) on the fractional abundances, an efficiently trained AE can significantly reduce reconstruction errors. Nonnegative sparse and denoising AEs were utilized to simultaneously extract both endmembers and abundance maps for HU, with strong denoising capability and a self-adaptive sparsity constraint, accomplishing remarkable unmixing performance even in noisy environments (Guo et al., 2015; Su et al., 2017). Similarly, in the spectral unmixing method of (Su et al., 2019), stacked nonnegative sparse and variational AEs are employed to address HU problems. Several other AE methods (Borsoi et al., 2019; Jin et al., 2021; Ozkan et al., 2018; Palsson et al., 2018) have been developed for spectral unmixing; these methods only process spectral information while ignoring the important spatial information. Recently, AE-based HU methods (Gao et al., 2021; Palsson et al., 2019; Qi et al., 2022; Rasti et al., 2022) have been proposed to effectively integrate the spatial correlation between neighbouring pixels. In addition, a few convolutional AE approaches (Hadi et al., 2022; Khajehrayeni & Ghassemian, 2020) were proposed for supervised HU, assuming that the endmembers are known in advance. In (Rasti et al., 2021), an unmixing network using a deep image prior is introduced, which utilizes the simple volume maximization (SiVM) method for endmember extraction. Numerous other approaches have been developed for HU, but most of them fail to accurately estimate the endmembers when the HSI does not contain any pure pixels.
Although convolutional neural network (CNN)-based AE methods have found great success in HU problems regarding performance and accuracy, they still have some limitations. First, a major limitation of CNNs is the local convolutional operator, which prevents them from exploiting long-range semantic dependencies in the input image. Second, it is hard to build a CNN-based unmixing network that is both lightweight and effective: better unmixing results may be achieved using a deeper CNN-based approach, but this comes at the cost of increased network complexity. Recently, a model called the transformer has emerged, utilizing the multi-head self-attention (MHSA) mechanism to overcome the limitations of CNN-based methods. The transformer (Vaswani et al., 2017) was first applied in the natural language processing (NLP) field and achieved remarkable success; later, it was used in image processing (Dosovitskiy et al., 2020). It uses an attention mechanism to capture long-range semantic dependencies (Guo et al., 2022). Transformers have also shown promising performance on HSI (Hong et al., 2021; Roy et al., 2022; Yang et al., 2022; Yu et al., 2022; Zhao et al., 2022; Zhong et al., 2021).
Inspired by (Sun et al., 2022), this paper proposes a novel method based on a deep convolutional transformer network for hyperspectral unmixing (DCTNU), which can accurately estimate endmembers and abundance maps. First, a CNN-based AE is used to efficiently extract features of the HSI. Second, tokenization is employed to generate semantic tokens. Third, the transformer module is used to learn deep semantic features from the generated tokens. Finally, a decoder layer is employed to reconstruct the hyperspectral image.
The proposed method makes significant contributions, which can be summarized as follows: (1) A novel HU method combining a CNN-based AE and a transformer network is proposed. The CNN module consists of two 2D convolutional layers. The transformer network is then used to improve feature extraction and yield a more accurate estimation of endmembers and abundance maps.
(2) We introduce the concept of tokenization within the context of HU. This module converts the shallow features learned through the CNN-based AE into tokenized semantic features, which are then fed as input to the transformer network. The transformer network effectively exploits the intricate semantic features inherent in HSI, thereby significantly improving HU performance.
(3) The proposed approach is lightweight and simple, making computation more efficient. Its effectiveness has been examined on one synthetic and three real-world remote sensing datasets, and the results consistently show its superiority over existing state-of-the-art methods.
The remainder of this article is organized as follows. The proposed convolutional transformer architecture for HU is described in Section 2. The experimental results on four HSIs are detailed in Section 3. Section 4 offers a discussion, and finally, Section 5 presents a concise summary of the conclusions.

Methodology
In this section, we present the detailed architecture of the proposed network designed for HU. The unmixing method comprises five phases: notation and problem formulation, feature extraction through a CNN-based AE network, feature tokenization, the transformer module, and an unmixing block with a decoder.

Notation and problem formulation
The LSMM is used as the basis for the HU model in the proposed approach, with the following notation. The observed HSI $X = [x_1, x_2, \ldots, x_N] \in \mathbb{R}^{L \times N}$ consists of $N$ pixels and $L$ spectral bands. The height and width of the HSI are denoted by $H$ and $W$, respectively, such that $N = H \times W$. Each observed spectrum is represented by $x_i \in \mathbb{R}^{L}$ for the $i$th pixel. The endmember matrix $M = [m_1, m_2, \ldots, m_P] \in \mathbb{R}^{L \times P}$ contains the $P$ endmembers in the image. The abundance matrix $A = [a_1, a_2, \ldots, a_N] \in \mathbb{R}^{P \times N}$ represents the fractional abundances, where $a_i \in \mathbb{R}^{P}$ corresponds to the $i$th pixel. The LSMM can be expressed as follows:

$$x_i = M a_i + n_i,$$

which can be written in matrix form as

$$X = MA + N,$$

where $N \in \mathbb{R}^{L \times N}$ is the residual matrix (e.g. additive noise). The abundances are subject to two physical constraints: the abundance non-negativity constraint (ANC), which ensures that $a_{i,j} \geq 0$, and the abundance sum-to-one constraint (ASC), which ensures that the $a_{i,j}$ values for a given pixel sum to one. All the notations used in this article are listed in Table 1.
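To make the model concrete, here is a minimal NumPy sketch of the LSMM under the ANC and ASC constraints; the dimensions, random endmembers, and noise scale are illustrative assumptions rather than values prescribed by the model:

```python
import numpy as np

rng = np.random.default_rng(0)
L, P, N = 162, 5, 2500              # bands, endmembers, pixels (illustrative)

M = rng.random((L, P))              # endmember matrix, one spectrum per column
A = rng.random((P, N))
A /= A.sum(axis=0, keepdims=True)   # enforce ANC (a_ij >= 0) and ASC (columns sum to one)

noise = 0.01 * rng.standard_normal((L, N))   # residual term
X = M @ A + noise                   # linear mixing: X = MA + N
```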
Many HU approaches utilize an AE network to extract abundances and endmembers from the input image. In general, an AE is a neural network that reconstructs the input data from a hidden representation.
In this article, we investigate the HU problem using a CNN-based AE with a transformer network to accurately estimate endmembers and their fractional abundances. The proposed network for HU of remote sensing data is shown in Figure 1. As the network input, an HSI is passed through the CNN encoder layers to produce high-level discriminative features with a reduced number of bands. The extracted features are then flattened, and the tokenizer generates semantic feature tokens, which are fed into the transformer module. The transformer module takes the semantic tokens as input to learn the relationships between semantic features through multi-head self-attention (MHSA), and a multilayer perceptron (MLP) layer receives the learned features as input. The unmixing block then receives the features from the transformer module and upscales and reshapes them into the abundance map. Moreover, the ASC and ANC constraints are enforced, and a softmax activation function yields the abundance maps. Finally, the decoder restores the number of spectral bands of the image by applying a convolutional layer whose weights correspond to the endmembers.

HSI feature extraction via CNN-based AE
Recently, it has been demonstrated that CNNs are highly effective at extracting high-level feature representations from HSI. Owing to this high-level feature-learning ability, the proposed model utilizes two layers of CNN as a feature extractor. Each layer consists of a 2D convolutional layer with a 1 × 1 kernel. This configuration serves two main objectives: reducing the number of parameters and accelerating model training. After each convolutional layer, a batch normalization (BN) layer is employed to accelerate the learning process, a spatial dropout layer is used to reduce overfitting and the vanishing-gradient problem, and a leaky rectified linear unit (leaky ReLU) activation function introduces non-linearity into the network. A detailed summary of the CNN-based AE can be found in Table 2.
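As a rough illustration, the following PyTorch sketch assembles the two-layer encoder described above; the channel widths and dropout rate are placeholder assumptions (the actual configuration is listed in Table 2):

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, p_drop=0.05):
    # 1x1 convolution -> batch norm -> spatial dropout -> leaky ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
        nn.BatchNorm2d(out_ch),
        nn.Dropout2d(p_drop),      # spatial dropout over whole feature maps
        nn.LeakyReLU(),
    )

# two-layer feature extractor; the 162 input bands and widths are assumed
encoder = nn.Sequential(
    conv_block(162, 128),
    conv_block(128, 64),
)
```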

HSI feature tokenization
The CNN-based AE captures highly discriminative features from HSI, which contain valuable information suitable for HU tasks. To effectively incorporate these features into the transformer network, it is necessary to tokenize them. Consequently, the features are flattened and defined as $X_{\mathrm{flat}} \in \mathbb{R}^{uv \times z}$, where $u$ and $v$ denote the width and height of the HSI features, respectively, and $z$ indicates the number of feature channels. The HSI feature tokens are denoted by $T \in \mathbb{R}^{w \times z}$, where $w$ denotes the number of tokens. $T$ is obtained from the feature map $X_{\mathrm{flat}}$ as

$$T = \mathrm{softmax}\big((X_{\mathrm{flat}} W_a)^{\mathsf{T}}\big)\, X_{\mathrm{flat}},$$

where $W_a \in \mathbb{R}^{z \times w}$ denotes a learnable weight matrix and $X_{\mathrm{flat}} W_a$ corresponds to a point-wise product of size 1 × 1. The semantic group features are obtained as $G = X_{\mathrm{flat}} W_a \in \mathbb{R}^{uv \times w}$. $G$ is then transposed, and the softmax function is applied to concentrate on the relatively significant semantic parts. Finally, the semantic tokens $T$ are obtained by multiplying the result with $X_{\mathrm{flat}}$. The tokenizer process is shown in Figure 2.
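A minimal PyTorch sketch of this tokenizer, with assumed sizes for $z$ and $w$:

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Turns flattened CNN features (uv x z) into w semantic tokens (w x z)."""
    def __init__(self, z, w):
        super().__init__()
        self.W_a = nn.Parameter(torch.randn(z, w))   # learnable weight matrix W_a

    def forward(self, x_flat):                 # x_flat: (uv, z)
        G = x_flat @ self.W_a                  # semantic groups G: (uv, w)
        attn = torch.softmax(G.t(), dim=-1)    # transpose + softmax over positions
        return attn @ x_flat                   # tokens T: (w, z)

tokens = Tokenizer(z=64, w=4)(torch.randn(2500, 64))   # illustrative sizes
```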

Transformer module
The tokenized semantic tokens are fed into the transformer network to learn the correlations among semantic features. The transformer module involves three steps.
In the first step, position embedding is performed to annotate each semantic token's position information. We add a learnable class token and positional information $PE_{pos}$ for the unmixing task. The output embedded sequence is

$$Z_0 = \big[T^0_{cls};\, T_1;\, T_2;\, \ldots;\, T_w\big] + PE_{pos},$$

where $T_w$ represents the $w$th token, and $T^0_{cls}$ and $PE_{pos}$ are the learnable class token and positional information, respectively.
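A short sketch of this embedding step; the token count, channel width, and zero initialization are assumptions:

```python
import torch
import torch.nn as nn

w, z = 4, 64                                     # tokens and channels (illustrative)
cls_token = nn.Parameter(torch.zeros(1, z))      # learnable class token T_cls
pos_embed = nn.Parameter(torch.zeros(w + 1, z))  # learnable positions PE_pos

T = torch.randn(w, z)                            # semantic tokens from the tokenizer
Z0 = torch.cat([cls_token, T], dim=0) + pos_embed   # embedded sequence Z_0
```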
The second step is the transformer encoder (TE), shown in Figure 3, which contains MHSA, an MLP, and a pair of layer normalization (LN) layers. Residual skip connections are placed around the MHSA and MLP layers.
Figure 4 shows the MHSA mechanism, which is a core part of the transformer network. It is based on self-attention to capture the interactions among feature tokens. The feature tokens are multiplied by three learnable weight matrices, $W_Q$, $W_K$, and $W_V$, to obtain three linear mappings: queries $Q$, keys $K$, and values $V$. Dot products of $Q$ with all $K$ are computed first, and the softmax function is then used to determine the attention weights on the values. The output of the self-attention mechanism can be expressed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\mathsf{T}}}{\sqrt{d_K}}\right) V,$$

where $d_K$ represents the dimension of the matrix $K$.
In addition, the outputs of several separate self-attention heads are concatenated; this is known as MHSA and is given by

$$\mathrm{MHSA}(Q, K, V) = \mathrm{Concat}\big(\mathrm{head}_1, \ldots, \mathrm{head}_h\big)\, W^0,$$

where $h$ represents the number of heads and $W^0 \in \mathbb{R}^{h \cdot d_k \times d_w}$ denotes a learned parameter matrix, with $d_w = w$ the number of tokens.
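The following sketch hand-rolls MHSA as described by the two equations above; the sizes are illustrative, and the output projection here plays the role of the learned parameter $W^0$:

```python
import torch
import torch.nn as nn

class MHSA(nn.Module):
    """Multi-head self-attention over an (n, dim) token matrix."""
    def __init__(self, dim, heads):
        super().__init__()
        self.h, self.d_k = heads, dim // heads
        self.W_q = nn.Linear(dim, dim)   # learnable projections W_Q, W_K, W_V
        self.W_k = nn.Linear(dim, dim)
        self.W_v = nn.Linear(dim, dim)
        self.W_o = nn.Linear(dim, dim)   # output projection, the W^0 above

    def forward(self, x):                # x: (n, dim)
        n, dim = x.shape
        # split each projection into h heads of width d_k
        q = self.W_q(x).view(n, self.h, self.d_k).transpose(0, 1)   # (h, n, d_k)
        k = self.W_k(x).view(n, self.h, self.d_k).transpose(0, 1)
        v = self.W_v(x).view(n, self.h, self.d_k).transpose(0, 1)
        # scaled dot-product attention: softmax(Q K^T / sqrt(d_K)) V
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(n, dim)            # concat heads
        return self.W_o(out)

y = MHSA(dim=64, heads=4)(torch.randn(5, 64))   # 5 tokens of width 64 (illustrative)
```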
In the third step, the output of the MHSA is passed through the MLP layer to produce the final output of the transformer. The MLP consists of two fully connected layers with a nonlinear Gaussian error linear unit (GELU) activation function. GELU is a variant of the ReLU activation function widely used in transformer networks and can be computed as

$$\mathrm{GELU}(x) = x\,\varphi(x),$$

where $\varphi(x)$ denotes the standard Gaussian cumulative distribution function, which can be expressed via the error function $\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt$. Before the MHSA and MLP, LN is used to mitigate the vanishing-gradient problem and reduce training time. The two commonly used normalization mechanisms in deep learning are LN (Ba et al., 2016) and BN (Ioffe & Szegedy, 2015). BN has demonstrated excellent performance in computer vision but significantly degrades performance on NLP tasks. Shen et al. (2020) performed a systematic analysis of NLP transformer models and discovered that the statistics of NLP data across the batch dimension fluctuate significantly during training. Therefore, LN is an efficient mechanism for reducing fluctuations in transformer network training and also boosts efficiency in NLP tasks. In addition, the issue of overfitting, common in deep learning methods, is further addressed through the incorporation of residual connections.
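Putting the pieces together, a minimal sketch of one pre-norm encoder block with LN, residual skip connections, and a GELU MLP; nn.MultiheadAttention stands in for the MHSA module above, and the MLP expansion ratio is an assumption:

```python
import torch
import torch.nn as nn

class TEBlock(nn.Module):
    """Pre-norm encoder block: LN -> MHSA -> residual, LN -> MLP(GELU) -> residual."""
    def __init__(self, dim, heads, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads)   # stand-in for MHSA above
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                       # two FC layers with GELU
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                 # x: (n, dim), unbatched token sequence
        h = self.ln1(x)
        x = x + self.attn(h, h, h)[0]     # residual skip connection
        return x + self.mlp(self.ln2(x))  # residual skip connection

y = TEBlock(dim=64, heads=4)(torch.randn(5, 64))
```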

Unmixing block with decoder
In this block, the final output of the transformer module is forwarded to a convolutional layer to estimate the abundance matrix. The softmax activation function is used to satisfy the ANC and ASC constraints. For endmember estimation, the abundance matrix is sent to the decoder part, which has one convolutional layer with a 1 × 1 kernel. VCA initializes the weights of this layer, which are then refined to estimate the endmember matrix.
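A minimal sketch of this unmixing block and decoder; the band and endmember counts are illustrative, and the commented initialization assumes a hypothetical VCA endmember matrix M_vca computed beforehand:

```python
import torch
import torch.nn as nn

L_bands, P = 162, 5   # spectral bands and endmembers (illustrative)

softmax = nn.Softmax(dim=1)   # over the endmember axis: enforces ANC and ASC

# decoder: one 1x1 convolution whose P -> L weights act as the endmember matrix
decoder = nn.Conv2d(P, L_bands, kernel_size=1, bias=False)

# VCA initialization (M_vca: hypothetical (L_bands, P) array of VCA endmembers):
# decoder.weight.data = torch.as_tensor(M_vca, dtype=torch.float32).view(L_bands, P, 1, 1)

abundances = softmax(torch.randn(1, P, 50, 50))   # (batch, P, H, W)
reconstruction = decoder(abundances)              # (batch, L_bands, H, W)
```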

Overall loss function
In this article, the proposed DCTNU method incorporates three terms in the overall loss function to enhance HU performance: the spectral angle distance (SAD), the mean square error (MSE), and the spectral information divergence (SID). They are given below.

The SAD loss function is computed as

$$L_{SAD} = \frac{1}{N} \sum_{i=1}^{N} \arccos\!\left(\frac{\langle x_i, \hat{x}_i \rangle}{\|x_i\|_2\, \|\hat{x}_i\|_2}\right),$$

where $L_{SAD}$ measures the spectral angle between the input image and the reconstructed image; several AE-based methods use it as a loss function.

The MSE is given by

$$L_{MSE} = \frac{1}{N} \sum_{i=1}^{N} \|x_i - \hat{x}_i\|_2^2;$$

this function is widely used in deep neural networks.

The SID is computed as

$$L_{SID} = \frac{1}{N} \sum_{i=1}^{N} \Big( D\big(p_i \,\|\, q_i\big) + D\big(q_i \,\|\, p_i\big) \Big),$$

where $p_i$ and $q_i$ are the band-normalized input and reconstructed spectra of the $i$th pixel and $D(\cdot\|\cdot)$ denotes the Kullback-Leibler divergence; this loss measures the divergence between the input and reconstructed images. Note that these three functions are independent metrics; we use them in combination to improve the reconstruction. Finally, the overall loss function of DCTNU can be formulated as

$$L = L_{SAD} + \alpha L_{MSE} + \beta L_{SID},$$

where $\alpha$ and $\beta$ are regularization weights.
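A hedged PyTorch sketch of this combined objective, using the symmetric-KL form of SID given above; the default weights follow the sensitivity analysis reported later (α = 1 × 10⁻², β = 1 × 10⁻⁵):

```python
import torch

def sad_loss(x, x_hat, eps=1e-8):
    # mean spectral angle between input and reconstructed pixels; x: (N, L)
    cos = (x * x_hat).sum(-1) / (x.norm(dim=-1) * x_hat.norm(dim=-1) + eps)
    return torch.acos(cos.clamp(-1 + eps, 1 - eps)).mean()

def sid_loss(x, x_hat, eps=1e-8):
    # symmetric KL divergence between band-normalized spectra
    p = x / (x.sum(-1, keepdim=True) + eps) + eps
    q = x_hat / (x_hat.sum(-1, keepdim=True) + eps) + eps
    return ((p * (p / q).log()).sum(-1) + (q * (q / p).log()).sum(-1)).mean()

def total_loss(x, x_hat, alpha=1e-2, beta=1e-5):
    # overall objective: L = L_SAD + alpha * L_MSE + beta * L_SID
    mse = torch.mean((x - x_hat) ** 2)
    return sad_loss(x, x_hat) + alpha * mse + beta * sid_loss(x, x_hat)
```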

Synthetic dataset
The synthetic dataset was generated by selecting five endmember signatures from the USGS digital spectral library (Clark et al., 2003).Each spectrum in the dataset comprises 162 spectral bands, and the spatial dimension of the image is 50 × 50 pixels.Figure 5 depicts the colour image and the true endmembers.

Samson dataset
Among the most popular hyperspectral unmixing datasets is the Samson remote sensing dataset, collected with the SAMSON sensor. The image has a spatial size of 95 × 95 pixels and 156 spectral bands ranging from 401 to 889 nm. Three primary endmembers are observed in this dataset: tree, soil, and water. The ground-truth endmembers were manually selected from the image, and the ground-truth fractional abundances were generated using FCLSU. Figure 6 shows the RGB image and the true endmembers.

Houston dataset
This dataset was obtained in June 2012 using an ITRES CASI-1500 sensor over the University of Houston campus in Texas, USA. The original image has a spatial dimension of 349 × 1905 pixels, while the spectral dimension comprises 144 bands ranging from 0.364 to 1.046 μm. We investigate a 170 × 170 image cropped from the original one. Four materials are included in this dataset: concrete, asphalt, metal roofs, and vegetation. The RGB image and true endmembers are depicted in Figure 7.

Washington dc mall dataset
This dataset was obtained using the HYDICE sensor over the Washington DC Mall region in the United States. We consider a 290 × 290 image cropped from the original 1208 × 307 × 285 image. Because of noise and water vapour, only 191 bands covering the wavelength range from 400 to 2400 nm are usable. This dataset includes six endmembers: grass, tree, road, roof, water, and trail. Figure 8 shows the false-colour image and the ground-truth endmember signatures.

Hyperparameter settings
For the success of a deep learning-based unmixing model, the hyperparameters must be carefully selected and tuned, since appropriate hyperparameter values can substantially improve the results. To update the network parameters, we utilized the RMSProp optimizer. The dropout rate for our proposed method was investigated over the set {5%, 15%, 20%, 50%}; based on the unmixing results, a dropout rate of 5% is optimal for all the datasets. The batch size is set to one, and the model is trained for a total of 150 epochs.
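The training setup described here can be sketched as follows; the model and input are stand-ins, and only the optimizer, batch size, epoch count, and learning rate (taken from the learning-rate study below) reflect the paper:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(162, 162, kernel_size=1)   # stand-in network
x = torch.rand(1, 162, 50, 50)               # whole HSI as a single batch

optimizer = torch.optim.RMSprop(model.parameters(), lr=0.003)
for epoch in range(150):                     # 150 epochs, batch size of one
    optimizer.zero_grad()
    loss = torch.mean((model(x) - x) ** 2)   # placeholder reconstruction loss
    loss.backward()
    optimizer.step()
```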

Performance evaluation metrics
For the quantitative assessment of the algorithms, we use two evaluation metrics: the spectral angle distance (SAD) and the root mean square error (RMSE). These metrics are defined as follows:

$$\mathrm{SAD}_i = \arccos\!\left(\frac{m_i^{\mathsf{T}} \hat{m}_i}{\|m_i\|_2\, \|\hat{m}_i\|_2}\right),$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{PN} \sum_{i=1}^{P} \sum_{j=1}^{N} \big(a_{i,j} - \hat{a}_{i,j}\big)^2},$$

where $m_i$ and $\hat{m}_i$ represent the extracted and ground-truth endmembers, respectively, and $a_{i,j}$ and $\hat{a}_{i,j}$ denote the estimated and actual abundance values for the $i$th endmember of the $j$th pixel, respectively.
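These two metrics translate directly into a few lines of NumPy:

```python
import numpy as np

def sad(m, m_hat):
    # spectral angle between an extracted and a ground-truth endmember
    cos = np.dot(m, m_hat) / (np.linalg.norm(m) * np.linalg.norm(m_hat))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def rmse(a, a_hat):
    # root mean square error between estimated and true abundance maps
    return np.sqrt(np.mean((a - a_hat) ** 2))
```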

Experiment on synthetic dataset
To evaluate the robustness of the proposed approach, we add Gaussian white noise to the synthetic image to obtain data with signal-to-noise ratios (SNR) ranging from 10 to 40 dB. The quantitative results for all unmixing approaches at various noise levels are shown in Figure 9 in terms of SAD and RMSE. As anticipated, our proposed method surpasses the other unmixing methods, yielding optimal results. The worst results were achieved by the FCLSU and CyCU-Net methods, whereas the NMF-QMV approach yields better results than uDAS. The unmixing results of the uDAS approach degrade with increasing noise levels. The Collab and SGSNMF methods perform approximately the same but show poor results. In terms of SAD and RMSE, the proposed method obtains the best performance, and the DHTN method also obtains a sufficiently low error. As observed, the proposed method yields the best results in almost every circumstance. The extracted endmember comparisons are shown in Figure 10. A thorough examination of the endmember results indicates that the DCTNU method provides accurate estimates for all endmembers except End 3. Nevertheless, the results demonstrate the superiority of the proposed method over the other unmixing methods. Figure 11 shows the abundances produced by the various unmixing approaches on the synthetic dataset (40 dB). In abundance estimation, Collab and SGSNMF have similar moderate performance, while CyCU-Net performs worst; Figure 11 reveals that CyCU-Net conflates the End 1, End 3, and End 4 classes. The DCTNU and DHTN abundance maps are more accurate and less noisy than those obtained by the other methods. Overall, the DCTNU approach achieved highly promising performance compared with the other methods.
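The paper does not spell out the noise-generation procedure; a standard recipe for adding white Gaussian noise at a target SNR is:

```python
import numpy as np

def add_noise(X, snr_db, rng=np.random.default_rng(0)):
    # add white Gaussian noise so the result has the requested SNR (in dB)
    signal_power = np.mean(X ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return X + rng.normal(0.0, np.sqrt(noise_power), X.shape)

X_noisy = add_noise(np.random.rand(162, 2500), snr_db=40)
```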

Experiment on samson dataset
Table 3 presents the SAD scores obtained from the various algorithms on the real hyperspectral Samson dataset. For illustrative purposes, Figure 12 shows the extracted endmember signatures of the different algorithms together with the three corresponding ground-truth endmember signatures. The red curves correspond to the true endmembers, while the blue curves represent the extracted endmembers. According to Figure 12, the SGSNMF and DHTN methods achieved the best performance in estimating the soil and tree endmembers but had trouble with the water endmember. For all three endmembers, the proposed approach outperformed the competitors. Table 4 lists the RMSE values for the abundance maps obtained using all the algorithms. Based on the RMSE evaluation, the proposed method outperformed the other approaches. As can be seen in Figure 13, the abundance maps estimated by the DCTNU method closely resemble the ground-truth abundance maps.

Experiment on houston dataset
Tables 5 and 6 present the results of all the unmixing methods in terms of SAD and abundance RMSE, respectively. The DCTNU approach produces superior results to all the other unmixing approaches. DHTN and Collab rank as the second- and third-best performing methods on this dataset. The extracted endmembers and abundance maps of all unmixing approaches for the Houston remote sensing dataset are illustrated in Figures 14 and 15, respectively. Figure 14 visually compares the extracted endmembers of the various methods with their corresponding ground truth; the endmember signatures extracted by DCTNU mostly match the ground-truth endmembers. Figure 15 demonstrates that the abundance maps obtained by the DCTNU approach match the ground-truth abundance maps very well. In terms of overall SAD, the worst results were produced by FCLSU and CyCU-Net, respectively.

Experiment on washington dc mall dataset
Tables 7 and 8 report the SAD and abundance RMSE results of the various unmixing methods for the Washington DC Mall dataset. The results clearly demonstrate the superior performance of our proposed DCTNU method compared with the other approaches in terms of both overall SAD and RMSE. DHTN performed significantly better than the FCLSU, Collab, SGSNMF, NMF-QMV, CyCU-Net, and uDAS methods. The SAD results for Collab and SGSNMF are identical and can be considered poor. For illustrative purposes, Figure 16 displays the extracted endmember signatures of all unmixing techniques compared with the actual ones. It can be observed that the endmembers extracted by the DCTNU method are in good accordance with the original endmembers. All the unmixing methods except the proposed method have trouble estimating the water material correctly, whereas our proposed method estimated all materials accurately. Figure 17 shows the abundance maps extracted from the Washington DC Mall dataset by the various unmixing methods. As can be observed from Figure 17, the abundance maps obtained using the DCTNU approach closely match the ground-truth abundance maps.

Sensitivity analysis of the hyperparameters
This section explores the sensitivity of the DCTNU method to different hyperparameters. We thoroughly investigate four key aspects that can potentially influence the performance of the proposed method. First, we conduct numerous experiments on the overall loss function. Second, we investigate different learning rates. Third, we examine several common activation functions. Finally, we report the running times of all techniques to analyze their computational efficiency.

Impact of loss functions
The loss function of DCTNU is composed of three terms: SAD, MSE, and SID. An ablation study is performed on the synthetic dataset to evaluate the impact of the individual terms (MSE, SAD, and SID) within the overall loss function of the proposed method. Table 9 presents all the experimental results. The loss function is divided into four cases: the complete loss function, without SID, without SAD, and without MSE. The results demonstrate that the model exhibits superior unmixing performance in the first case. In addition, the regularization parameters $\alpha$ and $\beta$ influence the HU performance of the proposed method. We fixed the weight $\beta$ of $L_{SID}$ and explored the effect of the weight $\alpha$ of $L_{MSE}$ in the proposed model. The results for the synthetic dataset are illustrated in Figure 18. It can be observed from Figure 18(a) that there is only a small variation between index values, and that SAD and RMSE show almost no change. Similarly, we fixed the weight $\alpha$ of $L_{MSE}$ and explored the effect of the weight $\beta$ of $L_{SID}$ in the unmixing model. The SAD and RMSE results for different $\beta$ weights are illustrated in Figure 18(b); it can be observed that different weights have different impacts on the unmixing results. According to these results, the DCTNU approach achieves optimal performance when $\alpha = 1 \times 10^{-2}$ and $\beta = 1 \times 10^{-5}$.

Learning rate
The DCTNU network loss curves for different learning rates are shown in Figure 19(a). Our investigation shows that the loss curves are stable across different learning rates, indicating convergence of the network. Notably, the proposed method converges most efficiently when the learning rate is set to lr = 0.003.

Activation function
The performance of the proposed unmixing network under different activation functions is illustrated in Figure 19(b), which indicates that the leaky ReLU activation function outperforms the ReLU, SELU, and ELU activation functions.

Computational cost
Table 10 provides an analysis of the computational cost of the various hyperspectral unmixing approaches on one synthetic and three real remote sensing datasets. The processing time was measured on a computer equipped with an Intel Core i7-8550U CPU and a GPU with 4 GB of memory. The results indicate that our proposed method runs faster than all the others except FCLSU.

Discussion
DL-based methods have several advantages over other state-of-the-art techniques, including the automated extraction of features from HSI and the efficient utilization of computing resources such as graphical processing units (GPUs). The proposed method employs CNN and transformer networks, which are widely recognized as powerful backbone networks for several computer vision tasks, including HU. The proposed DCTNU method uses a CNN-based AE to obtain shallow features, and the transformer network is then applied to learn feature representations that improve the unmixing performance. According to the results on three real and one synthetic datasets, the proposed DCTNU can significantly enhance HU performance in terms of SAD and RMSE. This is achieved by jointly employing the CNN and transformer to obtain discriminative deep semantic features, which have strong discriminative capability and are very helpful for the HU task. In comparison, the uDAS approach is based on an AE that does not utilize spatial information, resulting in unsatisfactory overall performance. CyCU-Net includes spatial information, but its poor performance can be attributed to the absence of the ASC in the reconstruction process: CyCU-Net utilises a clamp function in place of softmax to optimise the abundances, so enforcing the ASC becomes imperative once the abundance estimation is completed. The other compared methods, such as FCLSU, Collab, SGSNMF, and NMF-QMV, also perform well, but their results vary across datasets. The transformer-based DHTN method achieved similar unmixing performance but required much more computational time than our proposed method. Overall, the proposed DCTNU method has lower computational complexity than all the state-of-the-art methods except FCLSU.

Conclusion
This paper introduces a novel convolutional transformer network for hyperspectral unmixing. The proposed method first extracts high-level discriminative features through a CNN and transforms these features into semantic tokens through a tokenization module. Then, the transformer module is used to learn the relationships between semantic tokens and improve unmixing performance. To validate the effectiveness and robustness of our approach, we conducted experiments on one synthetic and three real remote sensing datasets. The results demonstrate the superiority of our proposed method over existing unmixing methods.

In future work, the experiments can be extended toward designing and developing an innovative two-branch transformer network tailored specifically for blind hyperspectral unmixing. This forthcoming research will enhance efficiency and accuracy by addressing critical challenges in hyperspectral unmixing, such as the intricate issue of spectral variability.

Figure 1. The overall structure of the proposed deep convolutional transformer network for HU, composed of several key phases: CNN layers, a tokenization module, a transformer encoder (TE), and a decoder block.

Figure 2. Visual representation of the tokenization process.

Figure 10. Visual comparison of the five endmembers estimated by various unmixing approaches. Blue: estimated endmembers; red: true endmembers.

Figure 11. Comparison of the abundance maps of five materials estimated by various unmixing approaches (SNR = 40 dB).

Figure 12. Visual comparison of the three endmembers estimated by various unmixing approaches. Blue: estimated endmembers; red: true endmembers.

Figure 13. Comparison of the abundance maps of three materials estimated by various unmixing approaches.

Figure 14. Visual comparison of the four endmembers estimated by various unmixing approaches. Blue: estimated endmembers; red: true endmembers.

Figure 15. Comparison of the abundance maps of four materials estimated by various unmixing approaches.

Figure 16. Visual comparison of the six endmembers estimated by various unmixing approaches. Blue: estimated endmembers; red: true endmembers.

Figure 17. Comparison of the abundance maps of six materials estimated by various unmixing approaches.

Figure 18. Ablation study for the DCTNU method on the synthetic dataset under different loss-function weights: (a) different weights of $L_{MSE}$; (b) different weights of $L_{SID}$.

Figure 19. Unmixing results obtained by DCTNU on the synthetic dataset with different learning rates and activation functions: (a) learning rate; (b) activation function.

Table 1. Description of the important notations used in this article.

Table 2. A detailed summary of the CNN-based AE.

Table 3. Samson dataset: SAD obtained from various unmixing approaches. Best results in bold.

Table 4. Samson dataset: RMSE comparison of different unmixing approaches. Best results in bold.

Table 5. Houston dataset: SAD obtained from various unmixing approaches. Best results in bold.

Table 6. Houston dataset: RMSE comparison of different unmixing approaches. Best results in bold.

Table 7. Washington DC Mall dataset: SAD obtained from various unmixing approaches. Best results in bold.

Table 8. Washington DC Mall dataset: RMSE comparison of different unmixing approaches. Best results in bold.

Table 9. SAD and RMSE results using different loss functions on the synthetic dataset. Best results in bold.

Table 10. Computational time comparison (in seconds).