The detection of distributional discrepancy for language GANs

ABSTRACT A pre-trained neural language model (LM) is usually used to generate texts. Due to exposure bias, the generated text is not as good as real text. Many researchers claimed that Generative Adversarial Nets (GANs) alleviate this issue by feeding reward signals from a discriminator back to update the LM (the generator). However, other researchers argued that GANs do not work, evaluating the generated texts with two-dimensional quality-diversity metrics such as Bleu versus self-Bleu, or language model score versus reverse language model score. Unfortunately, these two-dimensional metrics are not reliable. Furthermore, the existing methods only assessed the final generated texts, neglecting to evaluate the adversarial learning process dynamically. In contrast to the above-mentioned methods, we adopt recent metric functions that measure the distributional discrepancy between real and generated text. In addition, we design a comprehensive experiment to investigate performance during the learning process. First, we evaluate a language model with the two functions and identify a large discrepancy. Then, we try several methods that use the detected discrepancy signal to improve the generator. Experimenting with two language GANs on two benchmark datasets, we find that the distributional discrepancy increases with more adversarial learning rounds. Our research provides convincing evidence that these language GANs fail.


Introduction
Text generation based on neural language models (LMs), e.g. the LSTM (Hochreiter & Schmidhuber, 1997), has received much attention and has been used for news generation (Zellers et al., 2019), text summarisation (Lin et al., 2022) and image captioning (Xu et al., 2015). However, the generated sentences are still of low quality with regard to semantics and global coherence, and are often grammatically imperfect (Caccia et al., 2020).
These issues give rise to a large discrepancy between generated text and real text. Two underlying causes are the architecture and the number of parameters of the LM itself (Radford et al., 2019; Santoro et al., 2018). Many researchers attribute the discrepancy to exposure bias: an LM is trained with maximum likelihood estimation (MLE) and predicts the next word conditioned on words from the ground truth during training, yet it conditions only on words that it has itself generated during inference.
Statistically, this discrepancy means the two distributional functions of real texts and generated texts are different. Reducing this distributional difference may be a practicable way to improve text generation.
Some researchers try to reduce this difference with GANs (Goodfellow et al., 2014), following their success in image generation (Wu et al., 2021), image classification (Cao et al., 2021) and stock prediction (Wu et al., 2022). They use a discriminator to detect the discrepancy between real samples and generated samples, and feed the signal back to update the generator (an LM). To solve the non-differentiability issue that arises from handling discrete tokens, reinforcement learning (RL) (Williams, 1992) was adopted by SeqGAN (Yu et al., 2017), RankGAN (Lin et al., 2017) and LeakGAN (Guo et al., 2018). The Gumbel-Softmax was also introduced, by GSGAN (Jang et al., 2017) and RelGAN (Nie et al., 2019), to address the same issue. These language GANs pre-train both the generator (G) and the discriminator (D) before adversarial learning.1 During adversarial learning, in each round G is trained for several epochs and D for tens of epochs. The learning process does not stop until the model converges. Furthermore, to consider the quality and diversity of the generated texts simultaneously (Shi et al., 2018), MaskGAN (Fedus et al., 2018), DpGAN (Xu et al., 2018) and FMGAN were proposed. These works evaluate the generated text with Bleu (Papineni et al., 2002) versus self-Bleu (Zhu et al., 2018), or LM score versus reverse LM score (Cífka et al., 2018), and claim that these GANs improve the performance of the generator.
However, questions have recently been raised over these claims. Semeniuta et al. (2019) and Caccia et al. (2020) showed, with more careful experiments and evaluation, that the considered GAN variants are outperformed by a well-adjusted language model. They draw a performance curve in a quality-diversity space by adjusting the softmax temperature. Bleu and language model scores are usually used for measuring local and global quality, respectively; self-Bleu and reverse language model scores measure local and global diversity, respectively. To overcome the limitations of these two-dimensional metrics, de Masson et al. (2019) proposed a single metric, the Fréchet embedding distance (FED), which computes the Fréchet distance between two Gaussian distributions. However, Cai et al. (2021) showed that none of these metrics is appropriate for evaluating an unconditional text generator and proposed a novel metric. In short, whether these language GANs fail or not is still an open problem.
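The temperature sweep works by rescaling the LM's output logits before sampling, then evaluating the model at each temperature. A minimal sketch with made-up logits (the function name and values are illustrative, not taken from the cited papers):

```python
import numpy as np

def temperature_probs(logits, temperature):
    """Softmax of logits / temperature. A low temperature sharpens the
    distribution (quality up, diversity down); a high one flattens it."""
    z = np.asarray(logits, dtype=float) / temperature
    p = np.exp(z - z.max())
    return p / p.sum()

rng = np.random.default_rng(0)
logits = [3.0, 1.0, 0.2, -1.0]   # hypothetical next-token logits

# Sweeping the temperature traces out a quality-diversity curve:
# the most likely token dominates at low temperature only.
freqs = {}
for t in (0.5, 1.0, 2.0):
    draws = rng.choice(len(logits), size=2000, p=temperature_probs(logits, t))
    freqs[t] = np.bincount(draws, minlength=len(logits)) / 2000
    print(t, freqs[t])
```

At t = 0.5 almost all samples are the top token; at t = 2.0 the distribution is visibly flatter, which is exactly the quality-diversity trade-off the sweep exposes.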
We investigate this issue in depth. For language GANs, several critical questions remain open: whether D detects the discrepancy, whether the detected discrepancy is severe, and whether the signals from D can improve the generator. In this paper, we address these questions by investigating GANs during both pre-training and adversarial learning. After theoretically analysing the signal from D, we employ the approximate discrepancy and the absolute discrepancy (Cai et al., 2021) to measure the distributional discrepancy. With these two functions, we first measure the discrepancy between the real text and the fake text generated by an MLE-trained language model (pre-training). Second, we try several methods to update the generator with a feedback signal from D, and use the metric functions to evaluate the updated generator. Finally, we analyse the performance of two typical language GANs during adversarial learning with these two functions across two benchmark datasets.
Our contributions are as follows:
• We are the first to measure the variation of the distributional discrepancy between real text and generated text during the training of language GANs, by designing and implementing two discriminator-based metric functions.
• Although this discrepancy can be detected by a discriminator (D), the feedback signal from D cannot improve G using existing methods. This manifests as an increase in the discrepancy during adversarial learning.
• Experimenting on two existing language GANs, SeqGAN and RelGAN, we find that the distributional discrepancy between real text and generated text increases with more adversarial learning rounds. This demonstrates that existing adversarial learning does not work, so industrial systems need not pursue this approach.
The rest of the paper is organised as follows. Section 2 describes the related work. Section 3 introduces the proposed method to measure the distributional discrepancy. The next section presents the experimental procedure in detail. The experiments and analysis are shown in Section 5. Finally, we give a short summary in Section 6.

Related work
Many GAN-based models have been proposed to improve neural language models. SeqGAN (Yu et al., 2017) attacked the non-differentiability issue by resorting to RL. By applying a policy gradient method (Sutton et al., 2000), it optimises an LSTM generator with rewards obtained through Monte Carlo (MC) sampling. Many later models, such as RankGAN and MaliGAN (Che et al., 2017), also used this technique, although MC sampling is computationally inefficient. RL-free models, e.g. GSGAN, instead apply a continuous approximation of the softmax function and work directly in a continuous latent space. TextGAN (Salimans et al., 2016) added a Maximum Mean Discrepancy term, based on feature matching, to the original GAN objective. Considering the drawbacks of pre-training a neural language model, Nie et al. (2019) proposed RelGAN, which uses relational memory (Santoro et al., 2018) to allow interactions between memory slots through the self-attention mechanism (Vaswani et al., 2017). Gu & Cheung (2018) optimised GANs with evolutionary algorithms. We selected SeqGAN and RelGAN as representatives for this study; our results show that adversarial learning does not work for either model. Caccia et al. (2020) argued that it is treacherous to assume the current evaluation measures correlate with human judgement (Cífka et al., 2018). They further proposed a temperature sweep, which evaluates a model at many temperature settings rather than only one. By drawing curves in a quality-diversity space, such as Bleu versus self-Bleu or language model score versus reverse language model score, they showed that a well-adjusted language model can beat the considered language GANs. Unfortunately, Bleu versus self-Bleu has its own limitations: a trained 5-gram language model obtained scores even better than the training data (de Masson et al., 2019). Cai et al.
(2021) also revealed that these metrics are unreliable and proposed a novel metric that evaluates unconditional text generation by calculating the distributional discrepancy between two text sets. This single metric simultaneously measures both diversity and quality. We adopt it and propose a simpler variant. Semeniuta et al. (2019) and He et al. (2021) also argued that GAN-based models are weaker than LMs, because they observed a less severe impact of exposure bias; the latter further quantified exposure bias using conditional distributions. Clearly, the existing methods only assess the final generated texts, neglecting to evaluate the adversarial learning process dynamically. Different from the above-mentioned methods, we investigate the mechanism of language GANs and quantify the discrepancy between real texts and generated texts both after pre-training and throughout the adversarial learning process.

Method
In a GAN, the generator G_θ implicitly defines a probability distribution p_θ(x) to mimic the real data distribution p_d(x). Here θ denotes the parameters of the language model G_θ and φ denotes the parameters of the discriminator D_φ in the value function V, which is as follows:

min_θ max_φ V(D_φ, G_θ) = E_{x∼p_d}[log D_φ(x)] + E_{x∼p_θ}[log(1 − D_φ(x))].    (1)
Alternating optimisation of G_θ and D_φ is used to solve the above objective. Given θ, to detect the discrepancy between p_θ(x) and p_d(x), we optimise D_φ as follows:

max_φ E_{x∼p_d}[log D_φ(x)] + E_{x∼p_θ}[log(1 − D_φ(x))].    (2)

Let D*_φ(x) be the optimal solution for a given θ. According to Goodfellow et al. (2014), it is D*_φ(x) = p_d(x) / (p_d(x) + p_θ(x)), from which it is straightforward to obtain

p_θ(x) / p_d(x) = (1 − D*_φ(x)) / D*_φ(x).    (3)

Because the real distribution p_d cannot be obtained in practice, it is impossible to measure the discrepancy directly from Equation (3). Fortunately, we have massive numbers of real sentences, each of which can be treated as a sample from p_d. Based on these real samples and the above equations, we obtain a way to estimate the distributional discrepancy.
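Goodfellow et al.'s optimal-discriminator result can be checked numerically at a single point x: fixing hypothetical densities p_d(x) and p_θ(x), a grid search over the pointwise discriminator objective recovers p_d / (p_d + p_θ). A small sketch (the density values are made up for illustration):

```python
import numpy as np

# Hypothetical densities of a single point x under the real and model distributions.
p_d, p_theta = 0.6, 0.2

# Pointwise discriminator objective: p_d(x)*log D(x) + p_theta(x)*log(1 - D(x)).
grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
objective = p_d * np.log(grid) + p_theta * np.log(1 - grid)

d_star = grid[np.argmax(objective)]
print(d_star)                  # ≈ 0.75
print(p_d / (p_d + p_theta))   # the closed-form optimum, 0.75
```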

Approximate discrepancy
Let

q_d(x) = p_d(x) / (p_d(x) + p_θ(x)),    (4)
q_θ(x) = p_θ(x) / (p_d(x) + p_θ(x)).    (5)

From Equations (4) and (5) we can derive a constraint and an approximate measure of the distributional discrepancy. Figure 1(a) illustrates the relationship between q_θ(x) and q_d(x). Let

u_d = E_{x∼p_d}[D*_φ(x)],  u_θ = E_{x∼p_θ}[D*_φ(x)].    (6)

These two statistics are the expectations of D*_φ's predictions on real text and on generated text, respectively. Since D*_φ(x) = q_d(x) and 1 − D*_φ(x) = q_θ(x), it is easy to obtain

u_d + u_θ = 1.    (7)

This result gives a constraint for D_φ converging to D*_φ, which we should take into account when estimating the ideal function D*_φ. By optimising Equation (2), the discriminator increases u_d and decreases u_θ as far as possible, so we can estimate the distributional discrepancy with the following function.
Intuitively, using u_d and u_θ, we obtain a metric for the discrepancy between p_θ(x) and p_d(x):

d_a = u_d − u_θ.    (8)

We call this the approximate discrepancy. It is the difference between the average scores that a well-trained discriminator (denoted D̂_φ) assigns to real samples and to generated samples, and it reflects the discrepancy between the two sets to some degree. From Equations (5), (6) and (8), we get

d_a = 2u_d − 1.    (9)

The range of d_a is 0 to 1: the larger its value, the larger the discrepancy. When p_d(x) = p_θ(x), i.e. there is no discrepancy, d_a = 0. Conversely, d_a = 1 if p_d(x)·p_θ(x) ≡ 0 for all x. Figure 1(a) illustrates the discrepancy between the two distributional functions q_θ(x) and q_d(x); both are symmetric about the line q = 0.5. Cai et al. (2021) proposed a metric that is more complete than ours because it accounts for both a positive and a negative part, as presented in Figure 1(b). It is defined as the absolute discrepancy d_s:

d_s = (1/2) ∫ |p_d(x) − p_θ(x)| dx.    (10)

The range of d_s is also 0 to 1. Its drawback is that it requires more computation than ours. Both metrics are used in this paper.
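Both quantities reduce to simple statistics of discriminator scores. A minimal sketch with synthetic scores (the Beta-distributed scores are a stand-in for a real D̂_φ's outputs, and the 0.5-threshold estimator for d_s is one plausible implementation, using the fact that p_d(x) > p_θ(x) exactly when D*_φ(x) > 0.5):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discriminator scores for 1,000 real and 1,000 generated samples.
scores_real = rng.beta(5, 2, size=1000)   # skewed towards 1 (looks real)
scores_gen  = rng.beta(2, 5, size=1000)   # skewed towards 0 (looks fake)

# Approximate discrepancy, Equation (8): difference of mean scores.
u_d, u_theta = scores_real.mean(), scores_gen.mean()
d_a = u_d - u_theta

# Absolute discrepancy, estimated with the 0.5 decision threshold:
# fraction of real samples classified as real minus fraction of
# generated samples (wrongly) classified as real.
d_s = (scores_real > 0.5).mean() - (scores_gen > 0.5).mean()

print(f"d_a = {d_a:.3f}, d_s = {d_s:.3f}")
```

Both values lie in [0, 1]; identical score distributions would drive both towards 0, while perfectly separated ones would drive both towards 1.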

Using D*_φ(x) to improve G_θ
Given an instance x generated by G_θ, a larger D*_φ(x) means x is more likely under the real data distribution. For example, if D*_φ(x) = 0.8, then p_θ(x) < p_d(x) according to Equation (3). We should therefore update G_θ to increase the probability density p_θ(x), which may improve the performance of G_θ. Based on this, we can select some generated instances by the value of D*_φ(x) and use them to update the generator. In fact, we find this improves performance a little compared with random selection; however, the updated generator is still worse than one not updated this way at all. Experiment 5.3 shows the results.
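The density ratio implied by a discriminator score follows directly from the relation p_θ(x)/p_d(x) = (1 − D*(x))/D*(x); a tiny sketch, assuming the discriminator is close to optimal (the function name is ours):

```python
def implied_density_ratio(d_score: float) -> float:
    """p_theta(x) / p_d(x) implied by an (assumed optimal) discriminator
    score, via D*(x) = p_d(x) / (p_d(x) + p_theta(x))."""
    assert 0.0 < d_score < 1.0
    return (1.0 - d_score) / d_score

# A score of 0.8 implies the model under-weights x: the ratio is ≈ 0.25,
# i.e. p_theta(x) is about a quarter of p_d(x).
print(implied_density_ratio(0.8))
```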

Implementation procedure
The optimal function D*_φ is an ideal function that can only be estimated statistically by an approximating function. We can design a function D_φ, sample from the real and generated data, and then train D_φ according to Equation (2). When training converges, we obtain D̂_φ, the approximation of D*_φ. The degree of approximation is mainly determined by three factors: the structure and number of parameters of D_φ, the volume of training data, and the hyper-parameter settings.
Based on the above analysis, we obtain two metric functions to measure the distributional discrepancy between datasets A and B (for example, A consists of real sentences while B consists of machine-generated sentences). The implementation procedure is as follows. Step 1: Design a discriminator D_φ.
Step 2: Sets A and B are each divided into a training set (D_trainA, D_trainB), a validation set (D_devA, D_devB) and a test set (D_testA, D_testB). The partitions should contain equal numbers of instances for balanced classification training.
Step 3: D_φ is optimised with D_trainA and D_trainB according to Equation (2). Validating with D_devA and D_devB, we judge whether D_φ has converged and thereby obtain D̂_φ.
Step 4: According to Equations (9) and (10), with the two test datasets, we can estimate the discrepancy between the distributional functions of datasets A and B; d̂_a denotes the estimated approximate discrepancy and d̂_s the estimated absolute discrepancy. Algorithm 1 illustrates the procedure. Generally speaking, d̂_s ≤ d_s, since a sub-optimal discriminator separates the two sets no better than D*_φ; because D*_φ cannot be obtained, it is hard to quantify how closely d̂_s approximates d_s. Many research results have shown that discriminators built on deep neural networks are very powerful, some even exceeding human performance on tasks such as image classification (He et al., 2016) and text classification (Kim, 2014). So, if a D_φ with a CNN and an attention mechanism is well trained, D̂_φ will be a meaningful approximation of D*_φ, and we can therefore obtain meaningful approximations of d_s and d_a via D̂_φ.

Algorithm 1: Obtaining D̂_φ and the two discrepancies.
Input: discriminator D_φ; dataset A (real sentences); dataset B (machine-generated sentences), |A| = |B|
Output: D̂_φ; approximate discrepancy d̂_a; absolute discrepancy d̂_s
1: Randomly initialise D_φ with parameters φ
2: Split A = D_trainA ∪ D_devA ∪ D_testA and B = D_trainB ∪ D_devB ∪ D_testB in the ratio 8:1:1; let N = |D_testA|
3: Train D_φ with D_trainA and D_trainB according to Equation (2)
4: Validate D_φ with D_devA and D_devB to obtain D̂_φ
5: S_{≤0.5} ← 0; S_{>0.5} ← 0; û_a ← 0
6: for each minibatch X_a ∈ D_testA do
7:    û_a ← û_a + Σ D̂_φ(X_a); increment S_{>0.5} (resp. S_{≤0.5}) for each score above (resp. at or below) 0.5
8: end for; repeat lines 5-7 symmetrically over D_testB, then compute d̂_a from the accumulated means by Equation (9) and d̂_s from the threshold counts by Equation (10)
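The four-step procedure can be sketched end to end with a toy stand-in: Gaussian feature vectors replace sentences, and a logistic-regression classifier trained by gradient ascent on Equation (2) replaces the CNN/attention discriminator. Everything except the step structure is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def split(data):
    """8:1:1 train/dev/test split, as in Step 2."""
    n = len(data)
    return data[: int(0.8 * n)], data[int(0.8 * n): int(0.9 * n)], data[int(0.9 * n):]

# Toy stand-ins for sentence features: "real" set A and "generated" set B.
A = rng.normal(+1.0, 1.0, size=(2000, 5))
B = rng.normal(-1.0, 1.0, size=(2000, 5))
(trA, devA, teA), (trB, devB, teB) = split(A), split(B)

# Steps 1 and 3: a logistic-regression "discriminator" trained by gradient
# ascent on Equation (2); any classifier could replace it.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w, b = np.zeros(5), 0.0
for _ in range(500):
    pa, pb = sigmoid(trA @ w + b), sigmoid(trB @ w + b)
    grad_w = (trA * (1 - pa)[:, None]).mean(0) - (trB * pb[:, None]).mean(0)
    grad_b = (1 - pa).mean() - pb.mean()
    w, b = w + 0.1 * grad_w, b + 0.1 * grad_b

# Step 4: estimate both discrepancies on the held-out test halves.
sa, sb = sigmoid(teA @ w + b), sigmoid(teB @ w + b)
d_a = sa.mean() - sb.mean()                    # Equation (9)
d_s = (sa > 0.5).mean() - (sb > 0.5).mean()    # threshold-count estimate
print(f"d_a = {d_a:.3f}, d_s = {d_s:.3f}")
```

Because the two toy distributions are well separated, both estimates come out large; identical distributions would drive both towards zero.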

Experiment
We select SeqGAN and RelGAN as representative models for our experiments, with the same benchmark datasets used by these models previously. We first show that a well-trained discriminator D_φ can measure the discrepancy between real and generated texts, and then that the existing GAN-based methods do not work. Finally, a third-party discriminator is used to evaluate the performance of adversarial learning as training iterations increase.

Datasets and model settings
Both SeqGAN and RelGAN used a dataset of relatively short sentences (COCO image captions)2 and one of long sentences (EMNLP2017 WMT news).3 In the former, the average sentence length is about 11 words, there are 4,682 word types in total, and the longest sentence has 37 words; both the training and test sets contain 10,000 sentences. In the latter, the average sentence length is about 20 words, there are 5,255 word types in total, and the longest sentence has 51 words; all of the training data, about 280 thousand sentences, is used, and the test set contains 10,000 sentences. Following Section 3, each test set is divided into two halves: one is the validation set and the other is the test set. We always generate the same number of sentences as in the respective test sets for comparison.
For the two models, all hyper-parameters, including word embedding size, learning rate and dropout, are set as in the original papers. For RelGAN, the standard GAN loss (the non-saturating version) is adopted when training the discrepancy discriminator, because the relativistic standard loss used in Nie et al. (2019) does not satisfy the constraint of Equation (7). However, when measuring RelGAN's discrepancy during the adversarial stage, its own relativistic standard loss is retained. A critical hyper-parameter, the temperature, is set to 100, the best value reported in their paper. When training D_φ, we always train for 10,000 epochs and monitor performance on the validation set.
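The difference between the two losses is easy to state in code. A sketch (NumPy over raw logits; the function names are ours): the standard loss pushes the discriminator's sigmoid outputs towards calibrated probabilities of "real", while the relativistic loss only scores real samples relative to fakes, so its outputs cannot be read as estimates of D*_φ(x) and the constraint of Equation (7) need not hold.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def standard_d_loss(real_logits, fake_logits):
    """Standard (non-saturating) discriminator loss, Equation (2):
    the optimum yields calibrated probabilities of 'real'."""
    return -(np.log(sigmoid(real_logits)).mean()
             + np.log(1.0 - sigmoid(fake_logits)).mean())

def relativistic_d_loss(real_logits, fake_logits):
    """Relativistic standard loss: D only scores real samples
    RELATIVE to fake ones."""
    return -np.log(sigmoid(real_logits - fake_logits)).mean()

real = np.array([2.0, 1.5, 0.5])
fake = np.array([-1.0, -0.5, 0.0])
print(standard_d_loss(real, fake), relativistic_d_loss(real, fake))
```

One consequence: shifting every logit by the same constant leaves the relativistic loss unchanged but changes the standard loss, which is why relativistic scores carry no absolute "probability of real" meaning.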

Distributional differences in pre-training
We estimate the distributional differences caused by the MLE-trained generators. We first train the generator for N epochs and then train D_φ until it converges (this takes 10,000 epochs). For example, following Nie et al. (2019), we train G_θ for 150 epochs and select the checkpoint with the smallest perplexity (PPL) on the validation set. Then, D_φ is trained following the procedure in Section 4. Figure 2 shows the discrepancy between real and generated texts: it increases as the discriminator is trained further, until training stabilises. Figure 3 shows the discriminator's predictions on real and machine-generated texts, respectively; the larger the difference between the scores on the two sets, the larger the distributional discrepancy. From this figure, we can see that D_φ converges after about 3,000 epochs for RelGAN, whereas SeqGAN needs more epochs because it uses an LSTM as the generator.
To avoid relying on the smoothed value from a single batch, we use the converged discriminator to predict on all of the validation and generated data.4 Table 1 summarises the discrepancy across the two models and two datasets. It shows that the difference between real text and generated text does exist and that it is large.

Detected discrepancy by D̂_φ cannot improve the generator
We explore whether G_θ can be improved with the discrepancy detected by D̂_φ at the end of pre-training. We select the best pre-training epoch for G_θ. Note that D̂_φ is well trained with sufficient real sentences and sentences generated by G_θ. Then, G_θ is updated according to the signals from D̂_φ. To verify the effect of the feedback signals, we generate many instances, rather than only a few batches, to adjust θ. Then, fixing G_θ, we re-train D_φ for 10,000 epochs to obtain a newly converged discriminator and compute the two discrepancies according to Equations (9) and (10). Unfortunately, in terms of both the absolute and the approximate discrepancy, the discrepancy always exceeds the original value computed after pre-training, demonstrating that the generator is not further improved. Figure 4 illustrates the comparison.
Besides following Zhu et al. (2018), we also propose a new method for updating G_θ in the adversarial stage. Rather than using all of the generated instances to update G_θ, only those assigned relatively high scores by D_φ are used; we denote this HW. The rationale is that higher-scoring instances may be more informative than lower-scoring ones. We also experiment with the opposite method, in which only the relatively low-scoring samples are used to adjust the generator. Regrettably, all of these variants fail. Table 2 lists the discrepancy across the two datasets under different settings; it is always larger than that of pre-training.
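The HW selection step can be sketched as a simple score-band filter; the band 0.3-0.5 mirrors one of the Table 2 settings, and all names here are illustrative:

```python
import numpy as np

def select_by_score(samples, scores, low=0.3, high=0.5):
    """HW-style selection: keep only generated samples whose discriminator
    score falls inside [low, high]; only these update G_theta."""
    mask = (scores >= low) & (scores <= high)
    return [s for s, keep in zip(samples, mask) if keep]

rng = np.random.default_rng(0)
samples = [f"sent_{i}" for i in range(10)]        # stand-ins for generated sentences
scores = rng.uniform(0.0, 1.0, size=10)           # stand-ins for D_phi scores
kept = select_by_score(samples, scores)
print(kept)
```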

A third-party discriminator evaluates these language GANs
To evaluate the different language GANs' adversarial learning, we use a third-party discriminator D3_φ, which is a clone of the discriminator in its counterpart language GAN except for the parameter values. For each adversarial round, we train D3_φ from scratch for many epochs (verifying its convergence) with real and generated text, and then compute the two discrepancies from its predictions. Figure 5 shows the dynamic evaluation results. In terms of both the approximate and the absolute discrepancy, the distributional difference between real and generated text does not decrease as more adversarial learning rounds are applied. Once again, the results show that the approach of the existing language GANs cannot improve text generation. Notes: a lower discrepancy is better. #samples denotes the amount of generated data used to update G_θ; for example, 2S means the generated instances are twice the amount of training data. Random denotes the existing approach, while the other rows give the results of HW; 0.3-0.5 means that the generated instances whose D_φ scores lie between 0.3 and 0.5 are selected.

Conclusion and future work
Unconditional text generation is the stepping-stone of conditional text generation such as news generation and text summarisation. It has been unclear whether GANs can improve unconditional text generation. We present two metric functions to measure the discrepancy between real text and generated text, and numerous experiments show that this discrepancy does exist. We use various methods to update the generator's parameters according to the detected discrepancy signals; unfortunately, the distributional difference between real and generated data does not decrease, indicating the difficulty of improving the generator with these signals. Finally, we use a third-party discriminator to evaluate the effectiveness of GANs and find that with more adversarial learning epochs, the discrepancy increases rather than decreases. Our study provides valuable information for industry by analysing in depth why the existing language GANs do not work. Much could be done in the future. First, novel methods for using the reward signals to improve the generator are worth further study. Besides constraints from intrinsic language characteristics, common sense and logic should be introduced to improve text generation. Finally, diverse applications, such as conversation generation on chat platforms, should be further investigated.