Dual conditional GAN based on external attention for semantic image synthesis

Although existing semantic image synthesis methods based on generative adversarial networks (GANs) have achieved great success, the quality of the generated images is still not satisfactory. This is mainly caused by two problems. First, the information in the semantic layout is sparse. Second, a single constraint cannot effectively control the positional relationships between objects in the generated image. To address these problems, we propose a dual conditional GAN based on external attention for semantic image synthesis (DCSIS). In DCSIS, the adaptive normalization method uses the one-hot encoded semantic layout to generate the first latent space, and the external attention uses the RGB encoded semantic layout to generate the second latent space. The two latent spaces control the shapes of objects and the positional relationships between objects in the generated image. A graph attention (GAT) mechanism is added to the generator to strengthen the relationships between different categories in the generated image. A graph convolutional segmentation network (GSeg) is designed to learn information for each category. Experiments on several challenging datasets demonstrate the advantages of our method over existing approaches in terms of both visual quality and representative evaluation criteria.


Introduction
Conditional image synthesis mainly uses text, Gaussian noise or a semantic layout to generate constrained images. Typically, conditional generative adversarial networks (GANs) (Mirza & Osindero, 2014) are common approaches for conditional image synthesis. Within conditional image synthesis, semantic image synthesis aims to generate photorealistic images from semantic layouts. Since the information contained in a semantic layout is relatively sparse, semantic image synthesis poses a major challenge to image synthesis methods.
Semantic image synthesis is widely used; past work includes specified content creation (Mirza & Osindero, 2014; Ntavelis et al., 2020) and drawing editing (Park et al., 2019; Tang, Xu, et al., 2020; Zhu et al., 2020), among other related work. Its industrial applications are also broad, including virtual reality and AIGC-related applications.
Currently, semantic image synthesis methods based on GANs generally use noise as the input, and the semantic layout controls the image synthesis process through adaptive normalisation methods. SPADE (Park et al., 2019) is a representative semantic image synthesis method, and it effectively solves the problem of blurred boundaries between categories in the generated image. CC-FPSE (Liu et al., 2019) and SCGAN (Y. Wang et al., 2021) are improvements based on SPADE (Park et al., 2019), and these methods have achieved good results. However, since these methods use only a single constraint to control the synthesis process, the quality of the generated images still cannot meet the needs of users.
In addition, the discriminator also has an impact on the quality of the generated images. In GANs, the discriminator mainly consists of a convolutional network; PatchGAN (Isola et al., 2017) is a commonly used discriminator. Recently, some new discriminators have been proposed. OASIS (Schonfeld et al., 2021) proposed a novel discriminator based on a segmentation network, which can effectively prompt the generator to generate object shapes that conform to the semantic layout. However, to a certain extent, it ignores the positional relationships between different categories.
To solve the above problems, we propose a dual conditional GAN based on external attention for semantic image synthesis (DCSIS). In DCSIS, the adaptive normalisation module uses the one-hot encoded semantic layout to generate the first constraint, and the external attention uses the RGB encoded semantic layout to generate the second constraint. The two modules apply their constraints to the input in sequence, forming dual conditional attention (DCA). Compared with a single constraint, DCA makes better use of the category and boundary information in the semantic layout to synthesise finer details. Attention mechanisms have been widely used in image synthesis and effectively improve the quality of synthesised images (Tang et al., 2019; Q. Wang et al., 2020). A novel graph attention (GAT) mechanism is introduced into the generator to strengthen the relative positional relationships between objects of different categories.
DCSIS has two discriminators: the traditional discriminator SESAME (Ntavelis et al., 2020) and the proposed segmentation network based on a graph convolutional network. The latter can not only align semantic information but also better establish relationships between objects of different categories. An overview of the proposed DCSIS model is shown in Figure 1. We conduct experiments on three challenging datasets.
In general, the main contributions of this paper are as follows: (1) We propose two constraint methods to control the synthesis process: the RGB-format semantic layout and the one-hot encoded semantic layout are used to generate two constraints that form dual conditional control. (2) We design a segmentation network discriminator based on a graph convolutional network, which can better align semantic information. (3) We design a novel graph attention mechanism to enhance the relational information between objects of different categories.

Related work
Generative adversarial networks have achieved remarkable success on unconditional image synthesis tasks (Brock et al., 2019; Tero Karras et al., 2019; T. Karras et al., 2020). Since the result of unconditional image synthesis is uncontrollable, conditional image synthesis, which uses external control information to steer the result, was proposed. The semantic layout is commonly used control information in conditional image synthesis, mainly serving as the input of the generator. Pix2pix (Isola et al., 2017) and pix2pixHD (T.-C. Wang et al., 2018) are classical conditional image synthesis methods that take the semantic layout as the input of the generator. Edge-GAN (Tang, Qi, et al., 2020) used edge details to optimise structural information for image synthesis.
Due to the sparsity of the information contained in the semantic layout, directly using it as the input places a heavy burden on network learning and makes it difficult to effectively improve the quality of the generated images. Therefore, current mainstream conditional image synthesis methods generally use noise as the input of the generator and the semantic layout as a constraint to control the image synthesis process.
At present, adaptive normalisation methods that take the semantic layout as the input to constrain the image synthesis process have gradually become the mainstream form of constraint. AdaLIN (Kim et al., 2019), SPADE (Park et al., 2019), SEAN (Zhu et al., 2020), the class-adaptive normalisation method (D. Chen et al., 2020) and SAFM (Lv et al., 2022) are well-known adaptive normalisation methods. These methods take the semantic layout as the input and use it to constrain the features of the noise during normalisation: the layout generates the corresponding parameters that modulate the normalisation results. Since the semantic layout is only used to generate normalisation parameters, this design effectively mitigates the sparsity of the information it contains.
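To make this idea concrete, the following is a minimal numpy sketch of a SPADE-style normalisation step. The linear projections `w_gamma` and `w_beta` are hypothetical stand-ins for the small convolutional networks the published methods learn, and instance-style statistics are used here for simplicity; this illustrates the mechanism rather than any specific implementation.

```python
import numpy as np

def spade_norm(x, seg, w_gamma, w_beta, eps=1e-5):
    """SPADE-style adaptive normalisation (conceptual sketch).
    x: (C, H, W) features from the noise path; seg: (K, H, W) one-hot
    layout at the feature resolution. w_gamma/w_beta: (C, K) linear
    stand-ins for the learned modulation networks."""
    # Parameter-free normalisation per channel over spatial positions.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)
    # Per-pixel scale and shift predicted from the semantic layout.
    gamma = np.einsum("ck,khw->chw", w_gamma, seg)
    beta = np.einsum("ck,khw->chw", w_beta, seg)
    return (1 + gamma) * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
seg = np.zeros((3, 8, 8))
seg[0, :, :4] = 1.0   # left half is class 0
seg[1, :, 4:] = 1.0   # right half is class 1
out = spade_norm(x, seg, 0.1 * rng.normal(size=(4, 3)), 0.1 * rng.normal(size=(4, 3)))
```

Because the layout enters only through `gamma` and `beta`, the sparse one-hot map never has to be learned as a dense input, which is exactly the advantage described above.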
Besides the generator, the discriminator also has an impact on the quality of the generated images, and some new discriminators have been proposed.
CC-FPSE (Liu et al., 2019) proposed a pyramid discriminator, which jointly feeds the generated image and semantic labels into the discriminator and then discriminates real from fake at multiple resolutions. OASIS (Schonfeld et al., 2021) proposed a segmentation network discriminator supervised with semantic labels.
LGGAN (Tang, Xu, et al., 2020) proposed a local class-specific feature module to learn both the global appearance distribution of different objects and the generation of individual object categories. SC-GAN (Y. Wang et al., 2021) learns to generate normalisation parameters by convolving semantic vectors.
In addition to these GAN-based methods, there are also some non-GAN methods, such as CRN (Q. Chen & Koltun, 2017), which utilises refined cascaded networks for semantic image synthesis. More recently, the diffusion-based SDM (W. Wang et al., 2022) was applied to this task, combining the SPADE normalisation module with a diffusion model backbone to control image generation.
For conditional image synthesis, fully exploiting the information in the semantic layout is crucial to the quality of the generated image. However, most approaches use the semantic layout for only a single constraint. In contrast, DCSIS uses the RGB-format semantic layout and the one-hot encoded semantic layout to form two different constraints, further improving the utilisation of the semantic layout information.

Dual condition attention
Image synthesis methods that only use adaptive normalisation to constrain the generated images can no longer meet the needs of existing tasks. In this paper, the proposed dual condition attention (DCA) module is used to constrain the generated images. DCA applies two constraints: the RGB encoded semantic layout and the one-hot encoded semantic layout. Structurally, DCA contains SPADE and the proposed attention network. DCA and the architecture of the generator network are shown in Figure 2.
Inspired by the pre-trained vision-language models CLIP (Radford et al., 2021) and GLIDE (Nichol et al., 2021), this paper proposes an RGB semantic encoder to extract information from the RGB encoded semantic layout.
The RGB semantic encoder consists of a CNN encoder and a transformer module, as shown in Figure 3. The semantic latent space generated by the RGB semantic encoder is fed into the proposed attention network in DCA. Unlike ordinary attention modules, the proposed attention network is a conditional attention module named Seg_Attention, in which Q, K and V come from the feature maps while K_c and V_c come from the backbone network. Since SPADE only uses the parameters generated from the semantic layout to control the input, much of the object-category information in the semantic layout is lost, leaving the details of the generated images without further guidance. Hence, the RGB semantic latent space is used to enhance the semantic constraint information in DCA. In this way, the image synthesis process is controlled by two different forms of constraint, which improves the utilisation of the semantic layout information.
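The exact Seg_Attention formula is not reproduced in this text, so the sketch below shows one plausible form of such a conditional attention: the conditional keys and values K_c, V_c from the backbone path are added to the self-attention keys and values before standard scaled dot-product attention. All projection weights here are random illustrative stand-ins.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def seg_attention(feat, cond, seed=0):
    """Hypothetical Seg_Attention: Q, K, V are projected from the feature
    map; conditional K_c, V_c come from the backbone (RGB semantic) path
    and are added before dot-product attention. The paper's exact
    formulation may differ."""
    n, c = feat.shape
    rng = np.random.default_rng(seed)
    Wq, Wk, Wv, Wkc, Wvc = (rng.normal(size=(c, c)) / np.sqrt(c) for _ in range(5))
    Q, K, V = feat @ Wq, feat @ Wk, feat @ Wv
    Kc, Vc = cond @ Wkc, cond @ Wvc
    A = softmax(Q @ (K + Kc).T / np.sqrt(c))   # layout-conditioned attention map
    return feat + A @ (V + Vc)                  # residual connection

feat = np.random.default_rng(1).normal(size=(16, 8))   # 16 spatial tokens, 8 channels
cond = np.random.default_rng(2).normal(size=(16, 8))   # RGB semantic latents
out = seg_attention(feat, cond)
```

The key point the sketch captures is that the attention map itself is steered by the second (RGB) constraint rather than by the features alone.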

Graph attention
To strengthen the connections between categories in the generated images, two attention mechanisms are employed at the end of the generator: the spatial attention (Tang, Bai, et al., 2020) and the proposed graph attention. The purpose of the proposed graph attention is to learn the correlations between regions in the image, which can effectively improve the quality of each region in the synthesised image.
The proposed graph attention is shown in Figure 4, where FC denotes a fully connected layer. As shown in Figure 4, the proposed graph attention is composed of a patch embedding module and FC layers, where α represents the features processed by the embedding module, ⊕ represents element-wise addition, T represents matrix transposition, and ⊗ represents element-wise multiplication. F_{l−1} represents the feature output by the (l − 1)th Seg_Attention block and F_l represents the feature output by the lth Seg_Attention block.
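As a rough, hypothetical reading of the Figure 4 description (the precise wiring of α, T, ⊕ and ⊗ is not fully recoverable from the text), a region-to-region graph attention could look like the following: patch features are embedded to α, an affinity matrix built from α and its transpose relates every region to every other region, and the aggregated features are added back residually.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def graph_attention(F_prev, W_embed, W_fc):
    """Hypothetical sketch of the graph attention in Figure 4."""
    alpha = F_prev @ W_embed          # patch embedding: α
    A = softmax(alpha @ alpha.T)      # affinities between regions, from α and α^T
    out = (A @ alpha) @ W_fc          # FC layer on the aggregated features
    return F_prev + out               # ⊕: residual element-wise addition

rng = np.random.default_rng(0)
F_prev = rng.normal(size=(16, 8))    # 16 patches, 8 channels
out = graph_attention(F_prev, rng.normal(size=(8, 4)), rng.normal(size=(4, 8)))
```

Learning this dense region-to-region affinity is what lets every patch be refined using the content of every related patch, matching the stated goal of correlating regions.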

Graph convolution segmentation network
For semantic image synthesis tasks, the role of the discriminator is to distinguish the synthesised images from the ground-truth images. Generally, classification-based discriminators are used in image synthesis. However, they ignore the relationships between the objects in the image, and insufficient learning of these relationships easily leads to blurred object boundaries in the synthesised images. Therefore, this paper proposes a segmentation network based on graph convolution as a new discriminator, called GSeg. The role of GSeg is to align the category information in the synthesised images with the semantic layouts. SESAME is also used as a discriminator, so our method includes two different discriminators: GSeg and SESAME. The architecture of GSeg, which uses an encoder-decoder structure, is shown in Figure 5.
As shown in Figure 5, the encoder consists of the graph convolutional modules from Vision GNN (Han et al., 2022), and the decoder consists of convolutional modules. Compared with ordinary convolutional modules, graph convolutional modules can better learn the relationships between objects.
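For intuition, here is a sketch of a max-relative graph convolution in the spirit of Vision GNN (Han et al., 2022): each node (patch) finds its k nearest neighbours in feature space and aggregates the channel-wise maximum of the relative differences. This illustrates the graph-convolution idea only; the exact GSeg layer may differ.

```python
import numpy as np

def max_relative_graphconv(x, k=4):
    """Max-relative graph convolution sketch (Vision GNN style).
    x: (N, C) node features for N patches; returns (N, 2C) updates."""
    n, c = x.shape
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                          # exclude self-loops
    nbrs = np.argsort(d2, axis=1)[:, :k]                  # k nearest neighbours per node
    rel = x[nbrs] - x[:, None, :]                         # (N, k, C) relative features
    return np.concatenate([x, rel.max(axis=1)], axis=-1)  # (N, 2C)

x = np.random.default_rng(0).normal(size=(10, 6))
y = max_relative_graphconv(x, k=3)
```

Because the neighbourhood is built in feature space rather than on the pixel grid, distant patches of related objects can exchange information, which is the property the text credits for better object-relationship learning.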

Optimisation objective
In our method, the adversarial loss L_adv, the feature matching loss L_feat, the perceptual loss L_perc and the semantic alignment loss L_seg are used to control the quality of the synthesised images.
Adversarial loss: In GANs, the adversarial loss is very effective for image fidelity and has achieved good results in many image synthesis works. The adversarial loss is defined over the real image I_R, the noise z, the one-hot encoded semantic layout S_hot and the RGB semantic label S_RGB, where G represents the generator, D represents the discriminator, and E represents the RGB semantic encoder.
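The equation itself is not reproduced in this text; as an illustration, the hinge form below is one adversarial loss commonly used in modern image synthesis GANs (the paper may use a different variant).

```python
import numpy as np

def d_hinge_loss(d_real, d_fake):
    """Hinge adversarial loss for the discriminator: penalise real scores
    below +1 and fake scores above -1."""
    return np.maximum(0.0, 1.0 - d_real).mean() + np.maximum(0.0, 1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    """Generator side: push discriminator scores on fakes upward."""
    return -d_fake.mean()

# A confident discriminator (real >= 1, fake <= -1) incurs zero loss.
loss = d_hinge_loss(np.array([2.0, 1.5]), np.array([-2.0, -1.2]))
```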
Feature matching loss: Following (T.-C. Wang et al., 2018), the discriminator outputs multiple sets of feature maps. During training, an L1 loss constrains the feature maps across the different scale spaces, as shown in formula (6), where N_i denotes the number of features in D_i(I_R, S_hot).
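Following the description of formula (6), the loss can be sketched as a per-layer L1 distance between real and fake discriminator features, each normalised by the number of elements N_i in that feature map (the feature shapes below are illustrative).

```python
import numpy as np

def feature_matching_loss(feats_real, feats_fake):
    """Sum over discriminator layers of the L1 distance between real and
    generated features, normalised per layer: mean == sum / N_i."""
    return sum(np.abs(fr - ff).mean() for fr, ff in zip(feats_real, feats_fake))

# Discriminator feature maps at three scales (illustrative shapes).
rng = np.random.default_rng(0)
feats_real = [rng.normal(size=(8, 32, 32)),
              rng.normal(size=(16, 16, 16)),
              rng.normal(size=(32, 8, 8))]
feats_fake = [f + 0.1 for f in feats_real]   # a constant 0.1 offset per element
loss = feature_matching_loss(feats_real, feats_fake)
```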
Perceptual loss: In this paper, a pre-trained VGG (Qi et al., 2018) is used to extract the features of the real images and the synthesised images. The perceptual loss in the multi-scale feature space is shown in formula (7), where I_F represents the generated image and k indexes the feature map of the kth layer in the VGG model.
Semantic alignment loss: To constrain the semantic alignment between the synthesised images and the corresponding semantic layouts, our method employs the semantic alignment loss, in which the ground-truth label image S has three dimensions, the last two denoting spatial locations (j, k) ∈ H × W; each class in the one-hot semantic graph has a balance weight, and Seg represents the graph convolutional segmentation network.
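The semantic alignment loss as described is a class-balanced cross-entropy between GSeg's segmentation of the synthesised image and the ground-truth layout. The sketch below assumes a simple per-class weight vector; the paper's exact weighting scheme is not given here.

```python
import numpy as np

def semantic_alignment_loss(logits, seg_onehot, class_weights, eps=1e-12):
    """Class-balanced cross-entropy over spatial locations (sketch).
    logits: (K, H, W) segmentation scores; seg_onehot: (K, H, W) labels;
    class_weights: (K,) balance weight per class."""
    # softmax over the class dimension at every spatial location
    z = logits - logits.max(axis=0, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=0, keepdims=True)
    w = class_weights[:, None, None]
    h, wdt = seg_onehot.shape[1:]
    return float(-(w * seg_onehot * np.log(p + eps)).sum() / (h * wdt))

seg = np.zeros((2, 4, 4))
seg[0, :, :2] = 1.0   # left half: class 0
seg[1, :, 2:] = 1.0   # right half: class 1
good = semantic_alignment_loss(10 * seg, seg, np.ones(2))        # confident, correct
bad = semantic_alignment_loss(10 * (1 - seg), seg, np.ones(2))   # confident, wrong
```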
The weighted sum of these loss functions is shown in formula (10), where λ_adv, λ_feat, λ_perc and λ_seg are the corresponding weight parameters.

Experiment
Experiments are conducted to evaluate the performance of the proposed approach for image synthesis on various benchmark datasets. We compare qualitative and quantitative results with competing methods, including CRN (Q. Chen & Koltun, 2017), Pix2PixHD (T.-C. Wang et al., 2018), SPADE (Park et al., 2019), CC-FPSE (Liu et al., 2019), SCGAN (Y. Wang et al., 2021), SDM (W. Wang et al., 2022) and OASIS (Schonfeld et al., 2021). In addition, multiple sets of ablation experiments verify the benefits of each module.

Dataset and experiment details
Dataset: This paper conducts experiments on three challenging datasets, namely Cityscapes (Cordts et al., 2016), ADE20K (Tero Karras et al., 2019) and CelebAMask-HQ (Brock et al., 2019).
Experimental details: Our method uses the ADAM optimiser; the learning rate of the generator is set to 1 × 10^−4, the learning rate of the discriminator is set to 4 × 10^−4, and the RGB semantic encoder is trained together with the generator. λ_adv, λ_feat, λ_perc and λ_seg are set to 1, 10, 10 and 1, respectively. In the first half of the optimisation process, only the real images are fed into GSeg; once the epoch reaches half of the maximum value, both the real and the synthesised images are fed into GSeg. We perform 150 epochs of training on each of the Cityscapes, ADE20K and CelebAMask-HQ datasets. All experiments were performed on an Nvidia 2080 Ti GPU.
Evaluation metrics: This paper uses two metrics to evaluate network performance: Fréchet Inception Distance (FID) (Heusel et al., 2017) and mean Intersection-over-Union (mIoU). FID measures the distance between the distribution of the synthesised results and the distribution of the real images, while mIoU evaluates the semantic segmentation accuracy of the synthesised images. The higher the mIoU and the lower the FID, the better the method. Following previous methods (Liu et al., 2019; Park et al., 2019), we use the semantic segmentation models DRN-D-105 (Yu et al., 2017), UperNet101 (Xiao et al., 2018) and Unet (Lee et al., 2020; Ronneberger et al., 2015) for the semantic segmentation evaluation of Cityscapes, ADE20K and CelebAMask-HQ, respectively.
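The mIoU metric used above can be computed directly from the predicted and ground-truth label maps, as in this small sketch (classes absent from both maps are skipped, a common convention).

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union between two label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0, 1, 1],
                 [0, 0, 1, 1]])
target = np.array([[0, 0, 1, 1],
                   [0, 1, 1, 1]])
score = mean_iou(pred, target, num_classes=2)   # class 0: 3/4, class 1: 4/5
```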

Comparison with previous methods
In this section, we compare our method with several state-of-the-art semantic image synthesis methods on three datasets.
Quantitative results: The quantitative comparison of the synthesis models on the Cityscapes, ADE20K and CelebAMask-HQ datasets is shown in Table 1. From Table 1, the mIoU of DCSIS is superior to all baseline methods on all datasets: DCSIS achieves 68.8, 47.4 and 77.8 on Cityscapes, ADE20K and CelebAMask-HQ, giving relative improvements of 1.7, 2.1 and 2.5 over SIMS on Cityscapes, OASIS on ADE20K and SIMS on CelebAMask-HQ, respectively. Compared with the segmentation-network discriminator used by OASIS, GSeg has better segmentation performance, which also makes the mIoU of DCSIS significantly higher than that of OASIS. For FID, DCSIS achieves 48.4, 31.8 and 17.1 on Cityscapes, ADE20K and CelebAMask-HQ. On the CelebAMask-HQ dataset, DCSIS outperforms all baseline methods on FID; on the other two datasets, the FID of DCSIS is slightly behind that of OASIS, but our method has better segmentation performance.
The parameter size of DCSIS is 108M. Although DCSIS has 8M more parameters than SPADE, it gives relative improvements of (10.2, 6.9), (0.8, 10.2), (0.4, 0.1) and (6.0, 5.7) over SPADE in FID and mIoU. CRN has 21M fewer parameters than DCSIS, but its FID and mIoU on each dataset are much worse than those of DCSIS. DCSIS has 14M more parameters than OASIS, but gives relative mIoU improvements of 2.3, 2.1 and 2.7 over OASIS on the three datasets. Overall, DCSIS achieves good image synthesis results with a moderate parameter size.
Qualitative results: The qualitative comparisons of different methods on the CelebAMask-HQ, Cityscapes and ADE20K datasets are given in Figures 6-8, respectively. For all three datasets, the images synthesised by DCSIS not only have much better visual quality, but are also closer to the ground-truth images in overall colour distribution.
Compared with all baseline methods, our method produces realistic images while respecting the spatial semantic layout, and can generate diverse scenes with high image fidelity. The reason is that DCSIS uses dual constraints for finer control over the synthesised images, and GSeg also effectively improves the clarity of object boundaries. The experiments also show that it is difficult to effectively control the details of objects with a single adaptive normalisation method alone.

Ablation experiment
In this section, a set of experiments investigates the effect of each component on the performance of DCSIS. We conduct ablation experiments on the Cityscapes and CelebAMask-HQ datasets with several architecture variants, the full one being SPADE + SESAME + DCA + GSeg + GAT (DCSIS). The quantitative and qualitative results are presented in Table 2 and Figure 9, respectively. As shown in Table 2, DCA, GSeg and GAT each have a significant impact on the performance of DCSIS, and DCSIS obtains the best results among all the architectures mentioned above. FID and mIoU improve when the segmentation network is used as the discriminator: compared with Unet, GSeg brings relative improvements of (0.3, 1.3) on the Cityscapes dataset and (1.3, 1.9) on the CelebAMask-HQ dataset in FID and mIoU, showing that the segmentation performance of GSeg is better than that of Unet.
When DCA is introduced, it brings relative improvements of (2.0, 2.3) on the Cityscapes dataset and (0.4, 2.1) on the CelebAMask-HQ dataset in FID and mIoU. Compared with a single constraint, using two different forms of constraint to control the synthesis process helps improve the quality of the synthesised images.
GAT brings further improvements in FID and mIoU: (0.3, 0.8) on the Cityscapes dataset and (0.4, 0.4) on the CelebAMask-HQ dataset, respectively.
Overall, DCA has a greater impact on the quality of the synthesised images than GSeg and GAT.These components effectively improve the performance of DCSIS.
From Figure 9, the visual quality of the images synthesised by DCSIS is better than that of the other variants. On the CelebAMask-HQ dataset, human skin tones in the images synthesised by DCSIS appear more realistic than in the other methods. On the Cityscapes dataset, DCA makes the boundary information between categories clearer. Overall, our method makes the texture details in the images appear natural and realistic. For all datasets, the results suggest that DCA, GSeg and GAT effectively improve the visual quality of the synthesised images, proving that these components are useful for the final results and further improve the performance of DCSIS.

Conclusions
This paper presents a novel image synthesis approach, DCSIS, in which DCA, GSeg and GAT are used to enhance the information of the semantic layouts and improve the results of image synthesis. In DCSIS, DCA uses dual constraints to control image synthesis: besides adaptive normalisation as the first constraint, we also propose a semantic encoder, which takes the RGB encoded semantic layout as input, as the second constraint. DCA effectively improves the quality of the synthesised images. The generator learns the relationships between different categories using GAT to further improve quality. The proposed GSeg serves as the discriminator of DCSIS, aligning semantic information and establishing relationships between objects.
Experiments are conducted on three benchmark datasets to evaluate the performance of the presented approach. The experimental results indicate that DCSIS generates higher-quality photorealistic images and obtains better quantitative results. Comparison with state-of-the-art baseline methods demonstrates that the new method is more effective and efficient in terms of qualitative and quantitative results in most cases.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
The

Figure 2. Architecture of the generator network.

Figure 3. Architecture of the RGB semantic encoder.
The Cityscapes dataset is an image set containing a variety of urban street scenes with 35 semantic classes; 3000 images are used for training and 500 images for validation, and the image resolution is set to 256 × 128. The CelebAMask-HQ dataset is a high-resolution face dataset with fine-grained mask annotations, containing 19 semantic classes; the image resolution is set to 128 × 128. The ADE20K dataset is a huge, densely annotated dataset containing 150 semantic classes, with 20,210 images for training and 2000 images for validation; the image resolution is set to 128 × 128.

Table 1. Quantitative comparison with competing methods on different datasets.