Highlight Removal from A Single Grayscale Image Using Attentive GAN

ABSTRACT The existence of specular highlights hinders high-level computer vision algorithms. In this paper, we propose a novel approach to remove specular highlights from a single grayscale image by regarding the problem as an image-to-image translation task between the highlight domain and the diffuse domain. We solve this problem within the generative adversarial network framework, where a generator removes highlights and a discriminator judges whether the outputs of the generator are clear and highlight-free. Specular highlight removal is challenging because highlights must be removed while keeping as many details as possible. Considering the similarity between the highlight image and the diffuse image, we adopt an attention-based submodule that generates a mask image, which we call the highlight intensity mask, to locate pixels that contain specular highlights and to help the skip-connected autoencoder remove them. A pixel discriminator and a Structural Similarity loss are utilized to ensure that more details are retained in the output images. For training and testing models, we build a dataset of grayscale highlight images. It consists of more than a thousand sets of grayscale highlight images with ground truth. Finally, quantitative and qualitative evaluations demonstrate that our method is more effective than the other generative adversarial network methods used for comparison.


Introduction
Most current computer vision algorithms assume that the surface of an object is fully diffuse. In reality, however, specular highlights are widely present on real-world objects and create obstacles for high-level algorithms. Industrial metal parts are especially prone to specular highlights because of their materials. Therefore, research on highlight removal from grayscale images of metal parts is of great significance.
According to the dichromatic reflection model (Shafer 1985), the observable intensity $I(p)$ of any pixel $p$ is formed by the linear superposition of the specular reflection component $I_s(p)$ and the diffuse reflection component $I_d(p)$ as:
$$I(p) = I_d(p) + I_s(p)$$
The purpose of specular highlight removal is to separate two intrinsic images from the highlight image: a diffuse image and a specular image. According to the number of input images, the existing methods can be divided into two categories, single-image-based methods and multiple-image-based methods, but both have certain limitations (Artusi, Banterle, and Chetverikov 2011). The generative adversarial network (GAN) (Goodfellow et al. 2014) has achieved great success in the field of image synthesis (Huang, Yu, and Wang 2018). Although it is difficult to obtain satisfactory results by applying those GAN models directly to specular highlight removal, they provide a new approach to this problem. That is, we can treat specular highlight removal as an image-to-image translation between the highlight domain and the diffuse domain, and then employ conditional GAN methods to solve the problem. Inspired by this, we propose a novel approach using the generative adversarial network, in which the generator produces the image without specular highlights, and the discriminator determines whether the image contains specular highlights.
We noticed that specular highlight removal has distinctive features. Compared to other image translation tasks, the highlight image and the diffuse image show a high degree of similarity, as shown in Figure 1. The vast majority of their pixels are identical, and they differ only in a few highlighted areas. This similarity leads to poor results in adversarial training. To address the problem, we propose to employ an attention module in front of the autoencoder. The attention module is used to predict the specular reflection intensity of each pixel in the highlight image, which is called the highlight intensity mask. It serves as effective auxiliary information that helps to remove specular highlights. Skip-connections are exploited in the autoencoder so that some of the information in the highlight image can be passed directly to the symmetric convolutional layer. For the discriminative network, we apply a pixel discriminator to prevent its accuracy from being degraded by an excessively large receptive field. The similarity property also causes the MSE between images to always be quite close to zero, which reduces the discriminative power of MSE as a measure of image similarity, so we adopt Structural Similarity (SSIM) (Z. Wang et al. 2004) as an additional content loss. These targeted designs allow our proposed method to greatly improve the retention of details in the single grayscale image highlight removal problem.
In summary, within the framework of GAN, we propose a novel end-to-end network for single grayscale image highlight removal. The model consists of a generator with skip-connections and a pixel discriminator. Considering the similarity between the highlight image and the diffuse image, we propose to locate highlights with a highlight intensity mask image, which is generated by an attention module, and to use SSIM loss and MSE loss as a combined content loss to train the network. With those features, specular highlights are removed effectively from grayscale images, and details are kept in the outputs.
The rest of this paper is organized as follows: Section 2 discusses the related work on specular highlight removal and on image applications based on generative adversarial networks. Section 3 describes our proposed model in detail, including both the network structure and the objective function. Section 4 presents our synthetic dataset and training details. Section 5 discusses the results of training and experiments. Finally, Section 6 concludes our paper.

Related Work
Our work involves two topics, specular highlight removal and generative adversarial networks, which are briefly discussed in this section.

Specular Highlight Removal
According to the number of input images, specular highlight removal methods can be divided into two main categories: single-image-based methods and multiple-image-based methods.

Single-image-based Method
The specular highlight removal method based on a single image aims to remove the highlights using only one input image. Klinker et al. analyzed the distribution of pixels, found that diffuse reflection and specular highlights present a T-shaped distribution in the RGB color space, and used this to remove specular highlights (Klinker, Shafer, and Kanade 1988). Later, other scholars extended the color space analysis to UV-space (Schlüns and Teschner 1995) and S-space (Bajcsy, Lee, and Leonardis 1996) based on Klinker's idea. Koirala et al. exploited principal component analysis and histogram equalization to remove specular highlights quickly (Koirala, Hauta-Kasari, and Parkkinen 2009). Shen et al. discovered that the intensity ratio between the maximum value and the range value (maximum minus minimum) is independent of surface geometry and applied this rule to achieve real-time highlight removal (Shen and Zheng 2013). Guo et al. proposed a sparse and low-rank reflection model for specular highlight detection and removal with a single input image (Guo, Zhou, and Wang 2018). As this problem is inherently ill-posed, prior knowledge or assumptions on the characteristics of natural images should be exploited to make the problem tractable. Although methods based on prior knowledge have achieved good results, they do not always achieve satisfactory solutions in an unconstrained environment.

Multiple-image-based Method
The diffuse reflection does not shift with the viewing angle or the relative position of the light source. According to this characteristic, it is natural to use image sequences from different points of view or multiple light positions to restore the diffuse reflection (Feris et al. 2004; Li and Ma 2006). Wolff et al. removed specular highlights from multiple images based on polarization (Wolff and Boult 1993). Sato et al. made additional use of time-dimensional information to separate the specular highlight component (Sato and Ikeuchi 1994). Xu et al. and Wang et al. utilized light field cameras to assist in removing specular highlights, as they provide depth information (H. Wang et al. 2016; Xu et al. 2015). Wei et al. estimated the angle of the incident light and the diffuse reflection component when the geometry of the object is known, and then removed the specular highlights (Wei et al. 2018). These representative methods can achieve good performance in the removal of specular highlights, but they depend on an external environment that controls the angle and number of light sources, or they need special equipment such as a light field camera. This requirement greatly limits the usage scenarios of multiple-image-based methods.

Generative Adversarial Network
Since the generative adversarial network (Goodfellow et al. 2014) was proposed in 2014, it has achieved great success in the field of deep learning (Gui et al. 2020). There are many applications of GAN in computer vision, such as image inpainting (Yeh et al. 2016; Yu et al. 2018), super-resolution (Ding et al. 2019; Xintao Wang et al. 2018b), and image translation (Isola et al. 2017; Zhu et al. 2017). Inspired by the success of GAN in image translation, we utilize it to implement specular highlight removal. Some scholars have made related attempts before. John Lin et al. used a multi-class discriminator to train the generator to remove the specular highlights from color images (Lin et al. 2019). Funke et al. employed GAN for highlight removal in endoscope images (Funke et al. 2018). In general, the research on the combination of GAN and specular highlight removal is still at an initial stage. As discussed in the introduction, this topic is of great research value, so we propose a GAN-based method to remove specular highlights from a single grayscale image in this paper.

Overview
In this paper, we define specular highlight removal as an image-to-image translation between the highlight domain and the diffuse domain, as shown in Figure 2. The highlight domain is a collection of images with specular highlights, and the diffuse domain is composed of diffuse images. With this definition, an input image can be regarded as a random sample from the highlight domain, and a corresponding diffuse image can be found in the diffuse domain. We propose to adopt a generative adversarial network to solve this image translation problem. The structure of our proposed network is shown in Figure 3. Following the framework of GAN, the network is divided into two sub-networks: a generator and a discriminator. The generator learns the mapping from the highlight domain to the diffuse domain from the data in order to remove specular highlights. The discriminator learns to distinguish the output of the generator from the ground truth (GT). The generator is trained alternately with the discriminator to reach a Nash equilibrium. The process can be expressed as a min-max optimization problem:
$$\min_G \max_D \; \mathbb{E}_{DI \sim DD}\left[\log D(DI)\right] + \mathbb{E}_{II \sim HD}\left[\log\left(1 - D(G(II))\right)\right]$$
where $G$ stands for the generator and $D$ stands for the discriminator. The input image $II$ and the diffuse image $DI$ denote random samples from the highlight domain $HD$ and the diffuse domain $DD$, respectively. The structures of the generator and the discriminator are described in the rest of this section.

The Generator
As shown in Figure 3, the generator is divided into two submodules: the attention module and the skip-connected autoencoder.

Attention Module
Attention is widely used in computer vision to locate regions of interest for better feature extraction (Gregor et al. 2015; Zhao et al. 2017). In the same way, our attention module is used to predict the location and intensity of pixels containing specular highlights in the input image and to produce the highlight intensity mask, as shown in Figure 4. The highlight intensity mask is a single-channel image of the same size as the input, and each pixel in the mask has a value in the range of 0 to 1, which signifies the reflection intensity at the corresponding location of the input. A higher value means a stronger reflection. The mask helps the autoencoder to remove highlights because its attention values indicate the pixels on which the autoencoder should focus.
The attention module consists of N recursive blocks, as shown in Figure 3. Figure 5 shows the internal structure of each recursive block. Each block starts with four residual layers (He et al. 2016) that extract features. The short-cut connections in the residual layers ensure that the semantic information is preserved. Next comes the convolutional LSTM unit (Qian et al. 2018), and a final convolution layer adjusts the number of channels to produce the highlight intensity mask.
The convolutional LSTM unit contains an input gate $i_t$, an output gate $o_t$, a forget gate $f_t$, and a cell state $C_t$, where $t$ represents time. The relationship between them and the input is defined by the convolutional LSTM update equations, in which $\circ$ represents the convolution operation, and $W$ and $b$ represent the weights and biases, respectively. $X_t$ represents the features extracted by the residual layers, and $H_{t-1}$ represents the output of the previous LSTM unit. The features are fed into the convolutional layers to generate the highlight intensity mask. Each recursive block accepts the input image and the tensor from the previous recursive block as input and outputs a new highlight intensity mask. We initialize the highlight intensity mask to a constant value of 0.5 as the input of the first block and select the output of the last block as the final highlight intensity mask.
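As a reference for the gate relationships described above, the following is a common convolutional LSTM formulation without peephole connections. It is a sketch under assumptions rather than the exact set of equations used in this work: $\circ$ denotes convolution as in the text, while $\odot$ (element-wise product), $\sigma$ (sigmoid), and the subscripted weight names are notation introduced here for illustration.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} \circ X_t + W_{hi} \circ H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} \circ X_t + W_{hf} \circ H_{t-1} + b_f\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} \circ X_t + W_{hc} \circ H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} \circ X_t + W_{ho} \circ H_{t-1} + b_o\right) \\
H_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

In this form, the forget gate controls how much of the previous cell state is carried into the current recursive block, which is what lets later blocks refine the mask produced by earlier ones.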

Autoencoder
The purpose of our autoencoder is to generate images that are free from specular highlights. The input of the autoencoder is the concatenation of the input highlight image and the mask generated by the last recursive block. The diffuse image should retain as many details from the input image as possible. That means the intensity and structure information in the input image needs to be preserved in the output image. For this reason, we adopt a network similar to the U-net (Long, Shelhamer, and Darrell 2015), whose most notable characteristic is the skip connections, as shown in Figure 6. The input image is first passed through a series of down-sampling convolution layers until a bottleneck layer. Then it is restored to the original size and number of channels by up-sampling deconvolution layers. Low-level information can be shared through those skip connections, which have been shown to give excellent results in image generation applications (Guan et al. 2020; Chengjia Wang et al. 2018a).

Objective Function
The objective function of the generator consists of four parts: the Mean Square Error (MSE) loss $L_{MSE}$, the Structural Similarity (SSIM) loss $L_{SSIM}$, the attention module loss $L_{ATT}$, and the adversarial loss $L_{GAN}$. Each $\lambda$ represents the scale factor of the corresponding part. We define the loss function as the weighted sum of these parts, where $O$ stands for the diffuse image generated by the generator and $T$ indicates the ground truth of the diffuse image. $\{M\}_1^N$ represents the highlight intensity masks generated by recursive blocks 1 to N. $MT$ represents the ground truth of the highlight intensity mask, which is obtained by taking the difference between the specular highlight image and the corresponding diffuse image and normalizing it. The MSE loss $L_{MSE}$ is widely used to measure the similarity between the generated image and the ground truth. It ensures that images generated by the generator are close enough to the ground truth. $L_{MSE}$ is denoted as:
$$L_{MSE}(O, T) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(O(i, j) - T(i, j)\right)^2 \tag{5}$$
where $m$ and $n$ represent the width and height of the input image, respectively. In practice, the MSE loss approaches zero so quickly that the term soon contributes little to backpropagation. To solve this problem, we introduce the SSIM as the second content loss, which is defined as Equation 6. The inclusion of the SSIM loss allows the autoencoder to take into account not only the average pixel error, but also information such as contrast, brightness, and structure. Experiments demonstrate that we obtain clearer results after using SSIM as a loss term.
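$$SSIM(O, T) = \frac{\left(2 \mu_O \mu_T + C_1\right)\left(2 \sigma_{OT} + C_2\right)}{\left(\mu_O^2 + \mu_T^2 + C_1\right)\left(\sigma_O^2 + \sigma_T^2 + C_2\right)} \tag{6}$$
where $\mu_O$ and $\mu_T$ indicate the mean values of all pixels in the selected window, $\sigma_O^2$ and $\sigma_T^2$ indicate the variances of the pixel values, $\sigma_{OT}$ represents the covariance between $O$ and $T$, and $C_1$ and $C_2$ are hyper-parameters. The loss of the attention module $L_{ATT}$ is defined as Equation 7. In effect, $L_{ATT}$ calculates the weighted MSE between the highlight intensity mask of each recursive block and the ground truth, where $\theta$ denotes the weight.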
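The difference in behavior between the two content losses can be illustrated with a small, self-contained sketch in plain Python. It evaluates MSE and a single-window SSIM on a hypothetical 5x5 patch in which only one pixel is saturated by a highlight. The patch values, the helper names, and the use of 1 - SSIM as the loss value are assumptions made for illustration; the constants are the C1 and C2 values listed in the training details.

```python
def mean(values):
    """Arithmetic mean of a flat list of numbers."""
    return sum(values) / len(values)


def mse(output, target):
    """Mean squared error between two equally sized flat pixel lists."""
    return mean([(o - t) ** 2 for o, t in zip(output, target)])


def ssim_window(output, target, c1=0.0001, c2=0.0009):
    """SSIM index of one window, given as two flat pixel lists in [0, 1]."""
    mu_o, mu_t = mean(output), mean(target)
    var_o = mean([(o - mu_o) ** 2 for o in output])
    var_t = mean([(t - mu_t) ** 2 for t in target])
    cov = mean([(o - mu_o) * (t - mu_t) for o, t in zip(output, target)])
    numerator = (2 * mu_o * mu_t + c1) * (2 * cov + c2)
    denominator = (mu_o ** 2 + mu_t ** 2 + c1) * (var_o + var_t + c2)
    return numerator / denominator


# Hypothetical flattened 5x5 window: a lightly textured diffuse patch and
# the same patch with a single pixel saturated by a specular highlight.
diffuse = [0.2 + 0.02 * (i % 5) for i in range(25)]
highlight = list(diffuse)
highlight[12] = 1.0

mse_value = mse(highlight, diffuse)
ssim_loss = 1.0 - ssim_window(highlight, diffuse)  # assumed loss form
print(f"MSE = {mse_value:.4f}, 1 - SSIM = {ssim_loss:.4f}")
```

Because 24 of the 25 pixels are identical, the MSE stays small, while the structural term reacts strongly to the changed variance of the window. This is the property that motivates combining the two losses.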
The adversarial loss $L_{GAN}$ is determined by the result of the discriminator and can be written as:
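Two forms of the adversarial term are common in GAN training, and both are given here as assumptions about the general setting, not as the exact expression used in this work. With a pixel discriminator, $D(O)$ is a score map of the same size as the image, so the term is averaged over the $m \times n$ pixels.

```latex
% Saturating form, matching a min-max objective:
L_{GAN}(O) = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \log\bigl(1 - D(O)_{i,j}\bigr)

% Non-saturating alternative that is often used in practice:
L_{GAN}(O) = -\frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \log D(O)_{i,j}
```

Both forms decrease as the discriminator assigns higher scores to the generated image; they differ mainly in the gradient they provide early in training.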

The Discriminator
The discriminator is employed to distinguish the diffuse image generated by the generator from the ground truth. Discriminators in some GANs compress the features to a single number or to an N-dimensional matrix (where N is much smaller than the original size of the image). Those discriminators have a relatively large receptive field, which makes them capture the overall information while ignoring the highlighted areas, which cover only a small part of the image. Such a discriminator is not suitable for specular highlight removal, where the images are highly similar. So, we adopt a pixel discriminator. Figure 7 shows the specific structure of the pixel discriminator we propose. The pixel discriminator is composed of convolutional layers with a stride of 1 and a padding of 0. The size of the convolution kernel is (1, 1), and there is no pooling layer in the network. This design ensures that the receptive field is equal to 1, that is, each point in the output corresponds to exactly one point in the input. The loss function of the discriminator, $L_D$, is a Binary Cross Entropy criterion evaluated on the discriminator's outputs for the ground truth and for the output of the generator.
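The receptive-field property can be checked with a minimal sketch in plain Python. It models a stack of 1x1 convolutions with a single channel per layer, so each layer reduces to one scalar weight and bias; the layer values, the LeakyReLU activation, and the function names are hypothetical choices for illustration and are not taken from Figure 7.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def pixel_discriminator(image, layers):
    """Apply a stack of 1x1 convolutions (stride 1, no padding, no pooling).

    `image` is a 2-D list of grayscale values and `layers` is a list of
    (weight, bias) scalars, one channel per layer to keep the sketch small.
    Every output score depends on exactly one input pixel.
    """
    scores = []
    for row in image:
        score_row = []
        for value in row:
            for index, (weight, bias) in enumerate(layers):
                value = weight * value + bias
                if index < len(layers) - 1:
                    value = leaky_relu(value)
            score_row.append(sigmoid(value))
        scores.append(score_row)
    return scores


def bce(score_map, label):
    """Binary cross entropy between a per-pixel score map and one label."""
    losses = [
        -(label * math.log(p) + (1 - label) * math.log(1 - p))
        for row in score_map
        for p in row
    ]
    return sum(losses) / len(losses)


layers = [(2.0, -0.5), (1.5, 0.1)]  # hypothetical weights and biases
image = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
changed = [row[:] for row in image]
changed[1][1] = 1.0  # perturb a single pixel

scores_a = pixel_discriminator(image, layers)
scores_b = pixel_discriminator(changed, layers)
differing = [
    (r, c)
    for r in range(3)
    for c in range(3)
    if scores_a[r][c] != scores_b[r][c]
]
# Discriminator loss with `image` treated as real and `changed` as fake.
d_loss = bce(scores_a, 1.0) + bce(scores_b, 0.0)
print(differing, round(d_loss, 4))
```

Perturbing one input pixel changes only the score at the same position, which is what allows the discriminator to react to small highlighted regions instead of averaging them away.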

Highlight Dataset
Like other current deep learning methods, our method requires a comparatively large amount of data with ground truth for training. However, there is no appropriate dataset for specular highlight removal from a grayscale image. Therefore, we construct a synthetic dataset using 3D modeling software. There are 1062 sets of images in this dataset. Each image set contains a specular highlight image and a diffuse image. The dataset contains 85 industrial parts of different sizes and shapes; most of them are assigned different metallic materials, and the others are assigned composite materials to enrich the variety of materials. For the rendering, two lighting scenes are constructed: a basic scene with only basic light and a high-lighting scene with the addition of a globe or flat light. In both scenes, we render the aligned images from 10 different angles, as shown in Figure 8. Finally, the rendered images are converted into grayscale for training and testing the models.

Training Details
We implemented our model and the comparative models with the PyTorch framework and trained them on an NVIDIA 2080Ti GPU. In training, we adopt the Adam optimizer with a batch size of 1, a learning rate of 0.0002, and exponential decay rates of $(\beta_1, \beta_2) = (0.6, 0.9)$. We train the generator and discriminator for 200 epochs. The weights of the objective function of the generator are set to $\lambda_{ATT} = 100.0$, $\lambda_{MSE} = 100.0$, and $\lambda_{SSIM} = 10.0$. The hyperparameters $C_1$ and $C_2$ in the SSIM loss are set to 0.0001 and 0.0009, and the size of the window is set to 5. The number of recursive blocks N is 4, and the weight $\theta$ is 0.5. Theoretically, a larger N improves the training of the attention module, but it also requires more memory.
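The settings above can be collected in one place, as in the following sketch. The class and field names are hypothetical. The per-block weighting $\theta^{N-t}$ used in `attention_weights` is an assumption borrowed from common attentive-recurrent designs, chosen so that later blocks count more; it is not confirmed as the exact form of Equation 7.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from the training details (field names are illustrative)."""
    learning_rate: float = 0.0002
    betas: tuple = (0.6, 0.9)
    batch_size: int = 1
    epochs: int = 200
    lambda_att: float = 100.0
    lambda_mse: float = 100.0
    lambda_ssim: float = 10.0
    ssim_c1: float = 0.0001
    ssim_c2: float = 0.0009
    ssim_window: int = 5
    num_blocks: int = 4
    theta: float = 0.5


def attention_weights(num_blocks, theta):
    """Assumed weight theta ** (N - t) for recursive block t = 1..N."""
    return [theta ** (num_blocks - t) for t in range(1, num_blocks + 1)]


def attention_loss(block_mses, theta):
    """Weighted sum of per-block mask MSE values under the assumed weighting."""
    weights = attention_weights(len(block_mses), theta)
    return sum(w * m for w, m in zip(weights, block_mses))


config = TrainConfig()
weights = attention_weights(config.num_blocks, config.theta)
# Hypothetical per-block MSE values between each mask and the mask ground truth.
example_loss = attention_loss([0.08, 0.04, 0.02, 0.01], config.theta)
print(weights, example_loss)
```

Under this assumption, the last block receives weight 1 and each earlier block half the weight of its successor, so the final mask dominates the attention loss while earlier masks are still supervised.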

MSE Loss and SSIM Loss
Many GANs choose MSE or L1 as the content loss to train neural networks. However, this is not a wise choice for specular highlight removal: because the highlight image and the diffuse image are so similar, the network can obtain a low loss even if it does nothing. In practice, the MSE loss decays very quickly and approaches zero during training, as shown in Figure 9. As can be seen, both MSE and SSIM show a decreasing trend, indicating that the generator does produce results closer to the ground truth. But SSIM still maintains a level of 0.4 to 0.5 in the latter part of the training, while MSE is already quite close to 0 in the middle of the training, which leaves it unable to contribute to backpropagation. With the addition of SSIM as a loss, the generated images improve in both quantitative and qualitative evaluation.

Highlight Intensity Mask
Figure 10 shows the highlight intensity mask generated by the attention module at different training steps, visualized as a heat map. As the number of training steps increases, the highlight intensity mask focuses more and more on the regions that contain specular highlights. The mask clearly indicates the distribution and intensity of highlight pixels in the highlight image, so it can be used as auxiliary information to help the autoencoder locate and remove highlights.
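The target that the attention module is trained against can be sketched as follows, using the description of the mask ground truth as a normalized difference between a highlight image and its diffuse counterpart. The function name, the clipping of negative differences to zero, and the normalization by the maximum difference are assumptions made for illustration.

```python
def highlight_mask_ground_truth(highlight, diffuse):
    """Normalized per-pixel difference between a highlight image and its diffuse pair.

    Both inputs are 2-D lists of grayscale values in [0, 1]. Negative
    differences are clipped to 0 and the result is divided by its maximum
    (both steps are assumptions), so the mask lies in [0, 1].
    """
    difference = [
        [max(h - d, 0.0) for h, d in zip(h_row, d_row)]
        for h_row, d_row in zip(highlight, diffuse)
    ]
    peak = max(max(row) for row in difference)
    if peak == 0.0:
        return difference  # no highlight present, mask is all zeros
    return [[value / peak for value in row] for row in difference]


# Hypothetical 3x3 image pair with a highlight centered on the middle pixel.
diffuse = [[0.2, 0.2, 0.2], [0.2, 0.3, 0.2], [0.2, 0.2, 0.2]]
highlight = [[0.2, 0.4, 0.2], [0.4, 0.9, 0.4], [0.2, 0.4, 0.2]]

mask = highlight_mask_ground_truth(highlight, diffuse)
for row in mask:
    print([round(value, 3) for value in row])
```

Pixels untouched by the highlight map to 0 and the strongest reflection maps to 1, matching the interpretation that a higher mask value means a stronger reflection.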

Highlight Removal Result
We trained our method and other GAN models on the dataset. ResNet-GAN is the baseline that uses ResNet (He et al. 2016) as the generator. UNet-GAN is similar to ResNet-GAN, but its generator is UNet (Long, Shelhamer, and Darrell 2015). Pix2Pix (Isola et al. 2017) is a landmark conditional GAN in image translation. It consists of a skip-connected autoencoder and a patch discriminator. We also conducted ablation studies by building networks with only some of the proposed features. In Ours-1, we replace the pixel discriminator with the same patch discriminator as in Pix2Pix. Ours-2 applies MSE loss alone without SSIM loss, and Ours-3 is a model without the attention module. Their other parts are the same as in the full method.

General Results
To measure the performance, we evaluate on the test set with SSIM, Peak signal-to-noise ratio (PSNR), and MSE. They are all important indicators of image similarity. The PSNR (in dB) is defined as:
$$PSNR = 10 \log_{10}\left(\frac{MAX_T^2}{MSE}\right)$$
where $MAX_T$ is the maximum value of the ground truth, and MSE is the same as $L_{MSE}$ in Equation 5. The results on all test images are shown in Table 1, where the MSE is multiplied by 1000 for better comparison. Our method gets the best results in all metrics, which demonstrates that it produces images closer to the ground truth than the other methods. We notice that the result of UNet-GAN is better than that of ResNet-GAN, while the only difference between them is that UNet-GAN uses the skip-connected autoencoder. This indicates that it is appropriate to employ the skip-connected autoencoder as the generator for the highlight removal problem. With a content loss and a patch discriminator, Pix2Pix achieves a further improvement. This further indicates that, compared with vanilla GAN, the conditional GAN for image-to-image translation is suitable for highlight removal. Compared to the other configurations of our method, Ours-1 (no pixel discriminator), Ours-2 (no SSIM loss), and Ours-3 (no attention), the full model gets the best outcomes, indicating the effectiveness of the combination of the features we propose.
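The PSNR definition translates directly into a few lines of plain Python. The function name and the example value are hypothetical; the example only shows how an MSE reported on the x1000 scale of Table 1 would be converted back before computing PSNR.

```python
import math


def psnr(mse_value, max_value=1.0):
    """Peak signal-to-noise ratio in dB for a given MSE and peak value."""
    if mse_value == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse_value)


# Hypothetical reported value of 1.0 on the x1000 scale, i.e. MSE = 0.001,
# for images normalized so that the ground-truth maximum is 1.0.
reported_mse_times_1000 = 1.0
example_psnr = psnr(reported_mse_times_1000 / 1000.0, max_value=1.0)
print(f"{example_psnr:.2f} dB")
```

Since PSNR is a logarithmic function of MSE, reducing the MSE by a factor of ten raises the PSNR by 10 dB, which is why small MSE differences between methods can still appear as visible PSNR gaps.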

Case Study
In this section, we picked five test cases from the test images for qualitative and quantitative evaluation. As can be seen in Figure 11, our method is considerably more effective in specular highlight removal than the other methods. Our method produces the clearest diffuse images. Especially in textured regions, the images generated by our method keep more details, while the images generated by other methods are blurred, as shown in Figure 12.
Figure 11. Comparison of 5 test cases of different GAN models in specular highlight removal.
We also made a quantitative evaluation of the five test cases, and the results are shown in Table 2. The quantitative and qualitative evaluations of five test cases demonstrate the performance of our method.

Conclusion
In this paper, we regard specular highlight removal as an image-to-image translation between the highlight domain and the diffuse domain. We propose a single-image-based highlight removal method that focuses on removing highlights from a grayscale image. This method utilizes the generative adversarial network, where the generator produces images that are free of specular highlights, while the discriminator is responsible for determining whether images are clear and highlight-free. We take the similarity of the images in specular highlight removal into account and design all parts of the network accordingly. In the generative network, the attention module is employed to generate the highlight intensity mask, which locates highlight pixels. The highlights are then removed by the skip-connected autoencoder, whose skip connections allow low-level information to be passed on directly. The pixel discriminator and the SSIM loss help to train the generator to produce results that keep more details. Finally, the effectiveness of the method is demonstrated by quantitative and qualitative evaluation.

Disclosure Statement
No potential conflict of interest was reported by the author(s).