FragmGAN: Generative Adversarial Nets for Fragmentary Data Imputation and Prediction

Modern scientific research and applications very often encounter"fragmentary data"which brings big challenges to imputation and prediction. By leveraging the structure of response patterns, we propose a unified and flexible framework based on Generative Adversarial Nets (GAN) to deal with fragmentary data imputation and label prediction at the same time. Unlike most of the other generative model based imputation methods that either have no theoretical guarantee or only consider Missing Completed At Random (MCAR), the proposed FragmGAN has theoretical guarantees for imputation with data Missing At Random (MAR) while no hint mechanism is needed. FragmGAN trains a predictor with the generator and discriminator simultaneously. This linkage mechanism shows significant advantages for predictive performances in extensive experiments.


Introduction
Modern scientific research and applications very often encounter data from multiple data sources, and for each data source, various variables can be collected for data analysis. Such increasing data sources bring big opportunities for predicting people's behaviors, with huge potential social and commercial benefits. However, these different data sources usually are not available for every sample, which leads to "fragmentary data" and brings big challenges to data imputation and label prediction. To be more specific, we introduce two motivating examples that represent the most typical practical scenarios for fragmentary data.
Internet Loan: A leading wealth management company is exploring its internet loan business and trying to predict applicants' income for risk management purposes. There are five possibly available data sources (Table 1): (i) Card: the credit card information; (ii) Shopping: the shopping history on the internet; (iii) Mobile: the monthly bill of the mobile phone; (iv) Bureau: the credit report from the Central Bank; (v) Fraud: the information from an anti-fraud platform. However, some applicants are not willing to provide their shopping or mobile information, not all applicants have credit reports, and many of them are not included in the database of the anti-fraud platform. As a result, there are 10 "response patterns" in the Internet Loan data, as shown in Table 1, where "√" means the data source is available for the applicants with the corresponding response pattern.
ADNI: The Alzheimer's Disease Neuroimaging Initiative (http://adni.loni.usc.edu) is a dataset widely used by researchers of Alzheimer's disease, and it has four data sources: (i) CSF: cerebrospinal fluid; (ii) PET: positron emission tomography; (iii) MRI: magnetic resonance imaging; (iv) Gene: the gene expression. As shown in Table 2, it has 8 different response patterns corresponding to the different data availability of each data source.
Such fragmentary data, also known as "block-wise missing data" in the statistics literature, are very common in areas such as risk management, marketing research, social sciences and medical studies. Data imputation and label prediction are the two main goals for the analysis of such data, but the extremely high missing rate and complicated missing patterns make these goals challenging to achieve.
Some work has been done to deal with fragmentary data in both areas of statistics and computer sciences in recent years.
From the statistics perspective, methods based on model averaging (Fang et al., 2019), factor models (Zhang et al., 2020), generalized methods of moments (Xue & Annie, 2021), iterative least squares (Lin et al., 2021) and integrative factor regression (Li & Li, 2021) have been proposed. These statistical methods provide useful theoretical properties but exhibit notable shortcomings: (i) they depend on specific statistical models, for example, linear regression models; (ii) they are not flexible in handling mixed data types that include continuous and categorical variables; (iii) only a couple of methods consider imputation and prediction at the same time.
From the computer science perspective, GAIN (Yoon et al., 2018) first used a Generative Adversarial Net (GAN) to impute data Missing Completely At Random (MCAR), which means the missingness occurs entirely at random without depending on any of the variables. MisGAN (Li et al., 2019) trains a mask generator along with the data generator for imputation. GAMIN (Yoon & Sull, 2020) proposes a generative adversarial multiple imputation network for highly missing data. HexaGAN (Hwang et al., 2019) deals with missing data imputation, conditional generation and semi-supervised learning together. GRAPE (You et al., 2020) proposes a graph-based framework for data imputation and label prediction. MIWAE (Mattei & Frellsen, 2019) and Not-MIWAE (Ipsen et al., 2021) propose imputation methods based on the variational auto-encoder (VAE) framework instead of GAN. However, these generative methods have various drawbacks. Some of them (Yoon & Sull, 2020; You et al., 2020; Mattei & Frellsen, 2019; Ipsen et al., 2021) do not have the theoretical guarantee that the imputed data has the same distribution as the original data. Some of them (Yoon et al., 2018; Li et al., 2019; Hwang et al., 2019) only have theoretical results for data MCAR, which is highly unlikely in practice. Most of them either consider data imputation and label prediction separately or only consider data imputation.
In this paper, by leveraging the structure of response patterns, we propose "FragmGAN" for fragmentary data imputation and prediction. The main contributions are:
• FragmGAN is a unified framework based on GAN that deals with fragmentary data imputation and label prediction at the same time. It is flexible in the sense that (i) it is applicable to both continuous and categorical data and labels, and (ii) users can adjust the relative importance of the imputation task to the prediction task through an "adjusting factor".
• FragmGAN has theoretical guarantees for imputation with data Missing At Random (MAR), which is much more general than MCAR and will be defined in Section 3.2. Moreover, the theoretical results do not need the hint mechanism that is required by GAIN.
• Using similar technical tools, we extend the theoretical results of GAIN to MAR.
• Besides the generator and discriminator, FragmGAN trains a predictor simultaneously. This linkage mechanism shows significant advantages for predictive performance in extensive experiments.
There are several other GAN-based imputation methods. CollaGAN (Lee et al., 2019) proposes a collaborative GAN for missing data imputation, but it focuses on image data. WGAIN (Friedjungová et al., 2020), CGAIN (Awan et al., 2021), PC-GAIN (Wang et al., 2021) and S-GAIN (Neves et al., 2021) extend GAIN in various ways. IFGAN (Qiu et al., 2020) conducts missing data imputation using a feature-specific GAN, and MCFlow (Richardson et al., 2020) proposes a Monte Carlo flow method for data imputation, but neither provides theoretical results. When all the variables are assumed to be categorical, theoretical results of GAN-based methods have been extended to an uncommon concept of Extended Always Missing At Random (Deng et al., 2020).
Although they are not our main interest, we also mention some other VAE-based imputation methods, including VAEAC (Ivanov et al., 2019), variational inference of deep subspaces (Dalca et al., 2019), iterative imputation using AE dynamics (Smieja et al., 2020), VAE using pattern-set mixtures (Ghalebikesabi et al., 2021) and VSAE (Gong et al., 2021). Some of them only focus on image data. A common disadvantage of VAE-based methods is the lack of a theoretical guarantee for imputation. Some empirical comparisons of GAN- and VAE-based methods have been presented (Camino et al., 2019).

GAN-Based Fragmentary Data Imputation
We first formulate the problem and discuss the method and theory of fragmentary data imputation in this section. The problem of label prediction will be addressed in Section 4.
Throughout the paper we use boldface letters to denote vectors and regular letters for scalars. Upper-case letters are used for random variables and the corresponding lower-case letters for their realizations. Abusing notation slightly, we use the generic notation $p(\cdot)$ or $p(\cdot|\cdot)$ to denote the distribution/probability or conditional distribution/probability of various continuous/categorical variables as long as there is no ambiguity.

Imputation Method
Let $\mathbf{X} = (X_1, \dots, X_d)$ be the $d$-dimensional data vector of the variables of interest, which can take continuous or categorical values. Note that $d$ is the number of variables, not the number of data sources, since each data source may contain multiple variables.
Define the mask vector $\mathbf{M} = (M_1, \dots, M_d)$, where $M_i = 1$ if $X_i$ is observed and $M_i = 0$ otherwise. So what we actually observe is $\tilde{\mathbf{X}} = \mathbf{M} \odot \mathbf{X}$, where $\odot$ denotes element-wise multiplication.
Assume there are $K$ possible response patterns overall in the data and define $\mathbf{W} = (W_1, \dots, W_K)$ as the pattern indicator, where $W_k = 1$ if the sample belongs to the $k$th response pattern and $W_k = 0$ otherwise. We write $W = w_k^0$, where $w_k^0$ denotes the $K$-dimensional vector with only the $k$th element being 1, to mean that the sample belongs to the $k$th response pattern. In the fragmentary data setting, $\mathbf{M}$ can actually take only $K$ (rather than $2^d$) different values and there is a one-to-one mapping between $\mathbf{M}$ and $\mathbf{W}$. In the two motivating examples, $K = 10$ and $8$, respectively.
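Below is a minimal sketch (in Python, not taken from the paper's released code) of the one-to-one mapping between masks and pattern indicators; the function name `masks_to_patterns` and the toy mask matrix are illustrative assumptions.

```python
import numpy as np

def masks_to_patterns(mask: np.ndarray):
    """Map each row of an n x d mask matrix to a one-hot pattern indicator W.

    In fragmentary data the mask takes only K (not 2^d) distinct values,
    so the unique rows of `mask` enumerate the K response patterns.
    """
    patterns, pattern_idx = np.unique(mask, axis=0, return_inverse=True)
    K = patterns.shape[0]
    W = np.eye(K)[pattern_idx]  # n x K one-hot pattern indicators
    return W, patterns

# Toy example with d = 4 variables and K = 2 observed patterns.
mask = np.array([[1, 1, 0, 0],
                 [1, 1, 1, 1],
                 [1, 1, 0, 0]])
W, patterns = masks_to_patterns(mask)
```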

Generator
Let $\mathbf{Z} = (Z_1, \dots, Z_d)$ be a $d$-dimensional noise vector that is independent of all other variables; it is typically taken as Gaussian white noise. We then feed $\tilde{\mathbf{X}} = \mathbf{M} \odot \mathbf{X}$, $\mathbf{Z}$ and $\mathbf{W}$ into the generator $G$ and obtain $\bar{\mathbf{X}} = G(\tilde{\mathbf{X}}, (1-\mathbf{M}) \odot \mathbf{Z}, \mathbf{W})$. Here $\bar{\mathbf{X}}$ is the generated data vector, but we are only interested in the missing variables, so the complete data vector after imputation is $\hat{\mathbf{X}} = \mathbf{M} \odot \mathbf{X} + (1-\mathbf{M}) \odot \bar{\mathbf{X}}$. Our target is to make sure that the distribution of $\hat{\mathbf{X}}$ is the same as the distribution of $\mathbf{X}$, i.e., $p(\hat{\mathbf{X}}) = p(\mathbf{X})$. The randomness of $\mathbf{Z}$ makes our method a random imputation method rather than a fixed one. Although we focus on single imputation in this paper, by modeling the distribution of the data we are able to perform multiple imputation to capture the uncertainty of the imputed values (Rubin, 2004; van Buuren & Groothuis-Oudshoorn, 2011).
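The imputation step can be summarized in a short PyTorch sketch; the MLP architecture below is a placeholder assumption, not the paper's exact network design.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps (masked data, masked noise, pattern indicator) to a generated vector."""
    def __init__(self, d: int, K: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * d + K, hidden), nn.ReLU(),
            nn.Linear(hidden, d),
        )

    def forward(self, x_tilde, z_masked, w):
        return self.net(torch.cat([x_tilde, z_masked, w], dim=1))

def impute(G, x, m, w):
    """Single random imputation: keep observed entries, fill in missing ones."""
    z = torch.randn_like(x)             # Gaussian white noise Z
    x_bar = G(m * x, (1 - m) * z, w)    # generated vector X_bar
    return m * x + (1 - m) * x_bar      # complete vector X_hat
```

Drawing several noise vectors `z` and repeating the last two lines yields multiple imputations.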

Discriminator
The discriminator $D$ tries to figure out which part of $\hat{\mathbf{X}}$ comes from the generator. The vanilla GAIN (Yoon et al., 2018) aims to distinguish whether each component of $\hat{\mathbf{X}}$ is real (observed) or fake (imputed). This is a hard task since $d$ is usually large. Consequently, a hint mechanism, which reveals all but one of the components of $\mathbf{M}$ to $D$, is required for GAIN to solve the model identifiability problem and make sure the generated distribution is what we want.
In the fragmentary data setting, each sample belongs to exactly one of the $K$ response patterns. By leveraging this informative structure, our discriminator $D$ just needs to figure out which pattern the sample comes from. Its output $\hat{\mathbf{W}} = D(\hat{\mathbf{X}}) = (\hat{W}_1, \dots, \hat{W}_K)$ is the predicted probability vector for $\mathbf{W}$, where $\hat{W}_k$ is the predicted probability that $\hat{\mathbf{X}}$ is from the $k$th response pattern and $\sum_{k=1}^K \hat{W}_k = 1$. We train the discriminator $D$ to maximize the probability of correctly predicting $\mathbf{W}$; the generator $G$, on the other hand, is trained to minimize the probability of $D$ correctly predicting $\mathbf{W}$. The objective function is defined to be the negative cross-entropy loss
$$V(G, D) = \mathbb{E}\Big[\sum_{k=1}^K W_k \log D_k(\hat{\mathbf{X}})\Big], \quad (1)$$
where $D_k(\hat{\mathbf{X}})$ is just $\hat{W}_k$. Note that the objective function depends on $G$ through $\hat{\mathbf{X}}$. The minimax problem is then given by
$$\min_G \max_D V(G, D). \quad (2)$$
Remark 3.1. The key difference between our imputation method and GAIN is that we use a different objective function that takes the response patterns into consideration. This adjustment makes sure the model is identifiable even though no hint mechanism is used, as we show in the next subsection.
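A sketch of the pattern-classification objective (1) follows; it assumes the hypothetical `Generator`/`impute` helpers above, and the discriminator architecture is again a placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Discriminator(nn.Module):
    """Predicts which of the K response patterns a completed sample comes from."""
    def __init__(self, d: int, K: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, K),   # logits over the K response patterns
        )

    def forward(self, x_hat):
        return self.net(x_hat)

def v_objective(D, x_hat, w):
    """V(G, D) = E[sum_k W_k log D_k(X_hat)]; D ascends this, G descends it."""
    log_probs = F.log_softmax(D(x_hat), dim=1)
    return (w * log_probs).sum(dim=1).mean()
```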

Theoretical Results
Most previous theoretical results for GAN-based imputation methods, including GAIN (Yoon et al., 2018), MisGAN (Li et al., 2019) and HexaGAN (Hwang et al., 2019), are established under the MCAR assumption, which means the missingness occurs entirely at random without depending on any of the variables. This is a very restrictive assumption that is rarely satisfied in the real world. In contrast, our theoretical results will be established under the MAR assumption.
Assume $\mathbf{X}$ can be decomposed into $(\mathbf{X}_o, \mathbf{X}_m)$, where $\mathbf{X}_o$ is an always observed subvector of $\mathbf{X}$ and $\mathbf{X}_m$ could be missing. The missing mechanism is characterized (Little & Rubin, 2014) into three types:
• Missing Completely At Random (MCAR): $p(\mathbf{M}|\mathbf{X}) = p(\mathbf{M})$, i.e., the missingness does not depend on $\mathbf{X}$ at all.
• Missing At Random (MAR): $p(\mathbf{M}|\mathbf{X}) = p(\mathbf{M}|\mathbf{X}_o)$, i.e., the missingness depends only on the always observed subvector.
• Missing Not At Random (MNAR): $p(\mathbf{M}|\mathbf{X})$ depends on $\mathbf{X}_m$.
Remark 3.2. For a random vector $\mathbf{X}$, the definition of MAR can be ambiguous. Another way to define MAR is $p(\mathbf{M}|\mathbf{X}) = p(\mathbf{M}|\mathbf{M} \odot \mathbf{X})$, i.e., the missingness depends only on the observed components. However, since $\mathbf{M}$ appears on both sides of the equation, there is no way to generate a group of independently and identically distributed samples satisfying this equation, unless there exists an always observed subvector $\mathbf{X}_o$ such that $p(\mathbf{M}|\mathbf{X}) = p(\mathbf{M}|\mathbf{X}_o)$. This is why we use the MAR definition above.
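To make the MAR assumption concrete, here is a hypothetical simulation sketch in which the pattern probabilities depend only on an always observed column, so the missingness is MAR but not MCAR; all patterns and coefficients are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 4
X = rng.normal(size=(n, d))

patterns = np.array([[1, 1, 1, 1],   # pattern 1: fully observed
                     [1, 1, 0, 0],   # pattern 2: last two variables missing
                     [1, 0, 1, 0]])  # pattern 3; column 0 is always observed (X_o)

# p(W | X) depends only on X[:, 0], which is observed in every pattern -> MAR.
logits = np.stack([0.5 * X[:, 0], -0.5 * X[:, 0], np.zeros(n)], axis=1)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
pattern_idx = np.array([rng.choice(3, p=p) for p in probs])
M = patterns[pattern_idx]            # mask matrix; the observed data is M * X
```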
The complete data vector $\hat{\mathbf{X}}$ can be decomposed into $(\hat{\mathbf{X}}_o, \hat{\mathbf{X}}_m)$ correspondingly. Note that $\hat{\mathbf{X}}_o = \mathbf{X}_o$. So to verify that the solution to the minimax problem (2) satisfies $p(\hat{\mathbf{X}}) = p(\mathbf{X})$, we just need to show $p(\hat{\mathbf{X}}_m|\mathbf{X}_o) = p(\mathbf{X}_m|\mathbf{X}_o)$. First we present a lemma.
Lemma 3.3. Let $\mathbf{x}$ be a realization of $\hat{\mathbf{X}}$ such that $p(\mathbf{x}) > 0$. For a fixed generator $G$, the $k$th component of the optimal discriminator $D^*(\mathbf{x})$ for the minimax problem (2) is given by
$$D_k^*(\mathbf{x}) = \frac{p(W = w_k^0)\, p(\mathbf{x}\,|\,W = w_k^0)}{p(\mathbf{x})} = p(W = w_k^0\,|\,\mathbf{x}).$$
Proof. All proofs are provided in Appendix A.1.
We now rewrite (1) by substituting $D^*$ to obtain the objective function for $G$ to minimize:
$$C(G) = \mathbb{E}\Big[\sum_{k=1}^K W_k \log D_k^*(\hat{\mathbf{X}})\Big]. \quad (3)$$
Theorem 3.4. The objective (3) is minimized if and only if $p(\hat{\mathbf{x}}_m|\mathbf{x}_o, W = w_k^0) = p(\hat{\mathbf{x}}_m|\mathbf{x}_o)$ for each $k \in \{1, \dots, K\}$ and (almost) every $\mathbf{x}$ such that $p(\mathbf{x}) > 0$ and $p(\mathbf{x}_o|W = w_k^0) > 0$.
It is worth mentioning that Lemma 3.3 and Theorem 3.4 do not depend on the MAR assumption; they hold generally, even under MNAR.
Theorem 3.4 tells us that the optimal generator will generate data such that the conditional distributions of $\hat{\mathbf{X}}_m$ given $\mathbf{X}_o$ across different response patterns are the same, but it does not guarantee $p(\hat{\mathbf{X}}_m|\mathbf{X}_o) = p(\mathbf{X}_m|\mathbf{X}_o)$. To explore further, we assume the first response pattern is the case where all the variables are observed, i.e., $M_i = 1$ for all $i \in \{1, \dots, d\}$. Note that the first response patterns in the two motivating examples are exactly this case. Given $W = w_1^0$, there is no missing variable and we have $\hat{\mathbf{X}}_m = \mathbf{X}_m$. So following (3) and Theorem 3.4, we have
$$p(\hat{\mathbf{X}}_m|\mathbf{X}_o) = p(\hat{\mathbf{X}}_m|\mathbf{X}_o, W = w_1^0) = p(\mathbf{X}_m|\mathbf{X}_o, W = w_1^0). \quad (4)$$
Under the MAR assumption, $\mathbf{M}$ is conditionally independent of $\mathbf{X}_m$ given $\mathbf{X}_o$, and so is $\mathbf{W}$, since there is a one-to-one mapping between $\mathbf{M}$ and $\mathbf{W}$. Therefore
$$p(\mathbf{X}_m|\mathbf{X}_o, W = w_1^0) = p(\mathbf{X}_m|\mathbf{X}_o). \quad (5)$$
Combining (4) and (5) gives us the final theorem, which provides the theoretical guarantee for our proposed imputation method.
Theorem 3.5. Under the MAR assumption, the density solution to (3) is unique and satisfies $p(\hat{\mathbf{X}}_m|\mathbf{X}_o) = p(\mathbf{X}_m|\mathbf{X}_o)$, so the distribution of $\hat{\mathbf{X}}$ is the same as the distribution of $\mathbf{X}$.
Compared to GAIN (Yoon et al., 2018), our method does not need a hint mechanism for model identifiability. An intuitive explanation is that we just need to classify each sample into one of the $K$ response patterns. This requires far fewer model parameters than GAIN, in which $d$ binary classifiers need to be modeled if the hint mechanism is not applied.
Our theoretical results are established under the MAR assumption, while the vanilla GAIN (Yoon et al., 2018) assumes MCAR. However, we find that GAIN (with hint) also guarantees $p(\hat{\mathbf{X}}) = p(\mathbf{X})$ under the MAR assumption, which is consistent with a recent theoretical result (Deng et al., 2020). We provide a direct proof of this conclusion for GAIN in Appendix A.2.

A Unified Framework for Imputation and Prediction
Many previous methods, including GAIN (Yoon et al., 2018), treat label prediction as a post-imputation problem: they first impute the data and then develop a prediction model as if the data were fully observed. The disconnection between imputation and prediction most likely damages the accuracy of prediction. In this section we propose a unified framework that considers data imputation and label prediction together. The key idea is to train a predictor $P$ with the generator and discriminator simultaneously.

Predictor
Let $\mathbf{Y}$ be the $q$-dimensional label of interest, which can be continuous or categorical. Unlike in semi-supervised learning, the label $\mathbf{Y}$ is assumed to be available for all the training samples. A predictor $P$ is a function from $\mathbb{R}^d$ to $\mathbb{R}^q$ such that $\hat{\mathbf{Y}} = P(\hat{\mathbf{X}})$ is a predicted value of $\mathbf{Y}$.
To evaluate the prediction performance of $P$, we define a loss function $L(\mathbf{Y}, P(\hat{\mathbf{X}}))$, where $L: \mathbb{R}^q \times \mathbb{R}^q \to \mathbb{R}$. The explicit form of $L$ depends on the data type of $\mathbf{Y}$ and is very flexible. For example, if $\mathbf{Y}$ is continuous, we may use the squared loss $L(\mathbf{y}, \hat{\mathbf{y}}) = \|\mathbf{y} - \hat{\mathbf{y}}\|^2$; if $Y$ is a binary scalar and the predicted value is the probability of being 1, then we may use the cross-entropy loss $L(y, \hat{y}) = -y \log \hat{y} - (1-y)\log(1-\hat{y})$. To train $G$, $D$ and $P$ together, define the linked objective function as
$$U(G, D, P) = \gamma V(G, D) + (1 - \gamma)\, \mathbb{E}_{(\mathbf{Y}, \hat{\mathbf{X}})}\big[L(\mathbf{Y}, P(\hat{\mathbf{X}}))\big], \quad (6)$$
where $V(G, D)$ is from (1) and $\gamma \in [0, 1]$ is an "adjusting factor" that controls the relative importance of data imputation versus label prediction.
The second part of (6) does not involve $D$, so the target of $D$ is still to maximize $V(G, D)$. The first part of (6) does not involve $P$, so the target of $P$ is to minimize the predictive loss $\mathbb{E}_{(\mathbf{Y}, \hat{\mathbf{X}})}[L(\mathbf{Y}, P(\hat{\mathbf{X}}))]$. Both parts of (6) involve $G$, but fortunately both require $G$ to minimize. So the minimax optimization problem is given by
$$\min_{G, P} \max_D U(G, D, P). \quad (7)$$
The choice of $\gamma$ is quite flexible. If the user is only interested in data imputation, he can take $\gamma = 1$, and $U(G, D, P)$ reduces to $V(G, D)$. If the user is mainly interested in label prediction, he may use a cross-validation procedure to choose an appropriate $\gamma$, or simply take $\gamma = 0.5$, which works quite well as shown in the experiments. Note that $\gamma = 0$ is not a good choice since it leads to overfitting. If the user cares about both imputation and prediction, he may decide $\gamma$ by the relative importance of the two tasks in his mind.
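A minimal sketch of the linked objective (6); the predictor head and the squared-error loss are illustrative choices for a continuous label, and `v_value` is assumed to be the output of the `v_objective` helper above.

```python
import torch
import torch.nn as nn

class Predictor(nn.Module):
    """Maps a completed data vector to a q-dimensional label prediction."""
    def __init__(self, d: int, q: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, q),
        )

    def forward(self, x_hat):
        return self.net(x_hat)

def u_objective(v_value, y, y_hat, gamma: float = 0.5):
    """U = gamma * V + (1 - gamma) * E[L]; gamma = 1 recovers pure imputation."""
    pred_loss = ((y - y_hat) ** 2).sum(dim=1).mean()  # squared loss for continuous Y
    return gamma * v_value + (1.0 - gamma) * pred_loss
```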
The pseudo code to implement (7) is given in Algorithm 1. Several issues are discussed as follows.
First, although the hint mechanism is not required for our theoretical results, it is still empirically helpful. So we also use the hint mechanism (Yoon et al., 2018) in our implementation. The impact of including the hint mechanism or not is checked in the experiments.
Second, the generator also generates values for the observed variables, which can be used to check the generation performance. We therefore add an extra loss function for training $G$:
$$L_R(G) = \alpha\, \mathbb{E}\Big[\sum_{i=1}^d M_i\, L_M(X_i, \bar{X}_i)\Big],$$
where $L_M: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ is a user-specified loss function depending on the variable type of $X_i$. The algorithm result is not sensitive to the choice of the hyper-parameter $\alpha$. Actually, as long as $\alpha$ is relatively large ($\alpha = 10$ in the experiments), its main effect is to force $\bar{X}_i = X_i$ for the variables with $M_i = 1$.
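A sketch of this extra loss for continuous variables, with a squared-error $L_M$ as one natural (assumed) choice:

```python
import torch

def reconstruction_loss(x, x_bar, m, alpha: float = 10.0):
    """alpha-weighted squared error on the observed entries (M_i = 1) only."""
    per_entry = (x - x_bar) ** 2                      # L_M for continuous X_i
    return alpha * (m * per_entry).sum() / m.sum().clamp(min=1.0)
```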

Experiments
In this section we check the imputation and prediction performance of FragmGAN on multiple datasets. First we consider five UCI datasets (Lichman, 2013) used in GAIN (Yoon et al., 2018): Breast, Spam, Letter, Credit and News. Since the original datasets do not have any missing values, we randomly remove part of the data by variable groups to make it fragmentary. Unless otherwise stated, the miss rate is 20%. By designing the removal strategy, we can make the missingness MCAR or MAR. For this group of datasets, we are able to check the performance of data imputation along with label prediction since the true data values are known. Then we consider the two datasets Internet Loan and ADNI from the motivating examples introduced in Section 1. Their miss rates are 46.6% and 22.3%, respectively. More details of these two datasets are provided in Appendix A.3 and the data are available in the Supplementary Material. Since the missing values are unknown, we can only check the label prediction performance for these two datasets.

Algorithm 1 Pseudo Code for FragmGAN
repeat
  (1) Discriminator optimization.
  Draw $k_D$ samples $\{(\tilde{\mathbf{x}}(j), \mathbf{m}(j), \mathbf{w}(j))\}_{j=1}^{k_D}$ and $k_D$ samples of random noise $\{\mathbf{z}(j)\}_{j=1}^{k_D}$
  for $j = 1$ to $k_D$ do
    $\bar{\mathbf{x}}(j) \leftarrow G(\tilde{\mathbf{x}}(j), (1 - \mathbf{m}(j)) \odot \mathbf{z}(j), \mathbf{w}(j))$
    $\hat{\mathbf{x}}(j) \leftarrow \mathbf{m}(j) \odot \tilde{\mathbf{x}}(j) + (1 - \mathbf{m}(j)) \odot \bar{\mathbf{x}}(j)$
    Generate hint $\mathbf{h}(j)$
  end for
  Update $D$ using stochastic gradient ascent
  (2) Generator optimization.
  Update $G$ using SGD ($D$ and $P$ are fixed)
  (3) Predictor optimization.
  Draw $k_P$ samples $\{(\tilde{\mathbf{x}}(j), \mathbf{m}(j), \mathbf{w}(j), \mathbf{y}(j))\}_{j=1}^{k_P}$ and $k_P$ samples of random noise $\{\mathbf{z}(j)\}_{j=1}^{k_P}$
  for $j = 1$ to $k_P$ do
    $\bar{\mathbf{x}}(j) \leftarrow G(\tilde{\mathbf{x}}(j), (1 - \mathbf{m}(j)) \odot \mathbf{z}(j), \mathbf{w}(j))$
    $\hat{\mathbf{x}}(j) \leftarrow \mathbf{m}(j) \odot \tilde{\mathbf{x}}(j) + (1 - \mathbf{m}(j)) \odot \bar{\mathbf{x}}(j)$
  end for
  Update $P$ using SGD ($G$ is fixed): $\nabla_P \sum_{j=1}^{k_P} L(\mathbf{y}(j), P(\hat{\mathbf{x}}(j)))$
until training loss has converged
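A condensed, hypothetical PyTorch sketch of one pass of Algorithm 1, assuming the `Generator`, `Discriminator`, `Predictor`, `v_objective` and `reconstruction_loss` helpers sketched earlier; the hint mechanism is omitted for brevity.

```python
import torch

def train_step(G, D, P, opt_G, opt_D, opt_P, x, m, w, y, gamma=0.5, alpha=10.0):
    z = torch.randn_like(x)

    # (1) Discriminator ascends V with G fixed.
    x_hat = m * x + (1 - m) * G(m * x, (1 - m) * z, w)
    opt_D.zero_grad()
    (-v_objective(D, x_hat.detach(), w)).backward()   # ascent via negated loss
    opt_D.step()

    # (2) Generator descends gamma*V + (1-gamma)*L plus the reconstruction loss.
    x_bar = G(m * x, (1 - m) * z, w)
    x_hat = m * x + (1 - m) * x_bar
    loss_G = (gamma * v_objective(D, x_hat, w)
              + (1 - gamma) * ((y - P(x_hat)) ** 2).sum(dim=1).mean()
              + reconstruction_loss(x, x_bar, m, alpha))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

    # (3) Predictor descends the prediction loss with G fixed.
    with torch.no_grad():
        x_hat = m * x + (1 - m) * G(m * x, (1 - m) * z, w)
    opt_P.zero_grad()
    ((y - P(x_hat)) ** 2).sum(dim=1).mean().backward()
    opt_P.step()
```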
For the purpose of comparison, we consider MICE, MissForest, matrix completion (Matrix), Auto-Encoder (AE), Expectation Maximization (EM) and MisGAN, mentioned in Section 1. For the prediction task on Internet Loan and ADNI, we also consider two statistical methods: Model Averaging (Fang et al., 2019) and FR-FI (Zhang et al., 2020).
The hyperparameters of FragmGAN and some implementation details are provided in Appendix A.3.More details can be found in the implementation code of FragmGAN that is available in the Supplementary Material.
For each dataset, we randomly split it into a training set (80%) and a test set (20%) by response patterns. All the methods are fitted on the training set and then applied to the test set. The imputation and prediction performances are evaluated on the test set. We repeat this experiment 10 times and report the averages and standard deviations of the evaluation criteria (RMSE or AUC). In each table, the best result for each dataset is marked in bold.
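The evaluation protocol can be sketched as follows; `pattern_stratified_split` and `imputation_rmse` are hypothetical helper names, and the RMSE is computed only on the entries held out as missing.

```python
import numpy as np

def pattern_stratified_split(pattern_idx, test_frac=0.2, seed=0):
    """80/20 split performed within each response pattern."""
    rng = np.random.default_rng(seed)
    test = np.zeros(len(pattern_idx), dtype=bool)
    for k in np.unique(pattern_idx):
        idx = np.where(pattern_idx == k)[0]
        rng.shuffle(idx)
        test[idx[: int(len(idx) * test_frac)]] = True
    return ~test, test

def imputation_rmse(x_true, x_imputed, m):
    """RMSE over the entries that were actually missing (M_i = 0)."""
    miss = (m == 0)
    return np.sqrt(((x_true[miss] - x_imputed[miss]) ** 2).mean())
```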

Results for the UCI Datasets
Imputation Performance. Table 3 reports the RMSEs of the imputation errors for the UCI datasets. We take $\gamma = 1$ for FragmGAN since imputation is the focus here. For both FragmGAN and GAIN, we consider two versions, with and without the hint mechanism.
As we can see from Table 3, FragmGAN outperforms all the other methods in most cases. For the two cases where FragmGAN is not the best (Breast and Letter with MAR, where MissForest performs best), it performs second best. Both FragmGAN and GAIN perform better than their corresponding versions without hint, indicating that the hint mechanism really helps empirically. This is expected since the hint mechanism provides useful information to the discriminator. Note that the results here cannot be directly compared to the results in the GAIN paper (Yoon et al., 2018), since here we consider fragmentary data with certain response patterns while the missing data in GAIN is generated completely at random.

Figure 1. RMSE of imputation error on the Credit data under different miss rates (10% to 80%), comparing GAIN and FragmGAN with and without hint. Left: MCAR. Right: MAR.
To check the imputation performance under different miss rates, we take the Credit dataset and generate missing data with miss rates from 10% to 80%. Figure 1 presents the RMSEs of the imputation errors under different miss rates. We can see that FragmGAN consistently performs the best. Again, both FragmGAN and GAIN perform better than their corresponding versions without hint, and FragmGAN outperforms GAIN both with and without hint.
Overall, FragmGAN performs quite well in data imputation in the sense that it has a smaller RMSE of imputation error than the competitors. In particular, it is better than GAIN, indicating that considering the structure of response patterns in the algorithm is really useful.
Prediction Performance. Table 4 reports the AUCs for the prediction performance on the datasets Breast, Spam, Credit and News. The Letter dataset is not considered here since it does not have a binary label. We include the hint for both FragmGAN and GAIN. The adjusting factor $\gamma$ is taken as 1 or 0.5 for FragmGAN. Note that when $\gamma = 1$, FragmGAN first imputes the data and then makes the prediction as if the data were fully observed; when $\gamma = 0.5$, imputation and prediction are considered simultaneously.
As we can see, FragmGAN with $\gamma = 0.5$ outperforms the other methods in all cases. This result shows that the linkage mechanism of training the generator and predictor together can improve the prediction performance, as we expected. Also note that although FragmGAN with $\gamma = 1$ performs worse than FragmGAN with $\gamma = 0.5$, it still performs better than all the other methods.

Results for the Motivating Examples
For the Internet Loan dataset, the original label is the applicant's income, which is a continuous variable; in the analysis we use log(income) as the label $Y$. For the ADNI dataset, the original label $Y$ is the score of the Mini-Mental State Examination (MMSE), taking values from 0 to 30, where a higher score means better cognitive function. In the analysis, we consider two labels: (i) the normalized MMSE, which can be treated as a continuous variable; (ii) a binary label with $Y = 1$ if MMSE ≥ 28 and $Y = 0$ otherwise.
Prediction Performance. Table 5 reports the RMSEs for the continuous label prediction and the AUCs for the binary label prediction. The last two methods (Model Averaging and FR-FI) rely on linear regression models, so they are not applicable to the binary label prediction. For the proposed FragmGAN, we take $\gamma$ = 1, 0.75, 0.5 and 0.25, indicating different relative importance of imputation to prediction. We also use 5-fold cross-validation to select the best $\gamma$ for label prediction, where the CV criterion is defined as the averaged prediction performance on the leave-out samples.
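A hypothetical sketch of the cross-validation described above; `fit_fragmgan` and `predict_score` stand in for the full training and evaluation routines and are not real API calls.

```python
import numpy as np

def select_gamma(data, gammas=(1.0, 0.75, 0.5, 0.25), n_folds=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(data))
    scores = {}
    for gamma in gammas:
        fold_scores = []
        for f in range(n_folds):
            model = fit_fragmgan(data[folds != f], gamma=gamma)         # hypothetical
            fold_scores.append(predict_score(model, data[folds == f]))  # hypothetical
        scores[gamma] = np.mean(fold_scores)  # averaged leave-out performance
    return max(scores, key=scores.get)        # gamma with the best CV score
```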

A. Appendix.
A.1. Proofs of the Proposed FragmGAN
A.1.1. Proof of Lemma 3.3
Proof. For a fixed $G$, the discriminator maximizes
$$V(G, D) = \sum_{k=1}^K p(W = w_k^0) \int_{\mathbf{x}} p(\mathbf{x}\,|\,W = w_k^0) \log D_k(\mathbf{x})\, d\mathbf{x}$$
subject to $\sum_{k=1}^K D_k(\mathbf{x}) = 1$. Maximizing the integrand pointwise with a Lagrange multiplier gives, for the $\mathbf{x}$ such that $p(\mathbf{x}) > 0$,
$$D_k^*(\mathbf{x}) = \frac{p(W = w_k^0)\, p(\mathbf{x}\,|\,W = w_k^0)}{\sum_{l=1}^K p(W = w_l^0)\, p(\mathbf{x}\,|\,W = w_l^0)} = p(W = w_k^0\,|\,\mathbf{x}).$$
where "∝" means equation holds by ignoring terms unrelated to G, and (9) holds since x p(x|W = w 0 k )dx = 1 is a constant and x = (x o , xm ).Note that log is unrelated to xm and xm p(x m |x o , W = w 0 k )dx m = 1, so the second term of ( 10) is unrelated to G. Following (10), we have where KL(•|•) denotes the KL divergence.Its minimum is achieved when p(x m |x o , W = w 0 k ) = p(x m |x o ) for each k ∈ {1, • • • , K} and (almost) every x such that p(x) > 0 and p(x o |W = w 0 k ) > 0.
A.1.3. Proof of Theorem 3.5
Proof. This theorem was actually proved in the statements between Theorem 3.4 and Theorem 3.5 in Section 3.2.
A.2. Extending the Theoretical Results of GAIN (Yoon et al., 2018) to Missing At Random
We first rewrite the formulation of GAIN under MAR in our notation (slightly different from the original GAIN paper).
The original data $\mathbf{X}$ is decomposed into $(\mathbf{X}_o, \mathbf{X}_m)$, where $\mathbf{X}_o$ is always observed and the $d_m$-dimensional $\mathbf{X}_m$ could be missing; define $\mathbf{M} = (M_1, \dots, M_{d_m})$ as the response indicator for $\mathbf{X}_m$. We assume $\mathbf{X}_m$ is missing at random, i.e., $p(\mathbf{M}\,|\,\mathbf{X}) = p(\mathbf{M}\,|\,\mathbf{X}_o)$. Let $\mathbf{H}$ be the hint vector and $D(\hat{\mathbf{X}}, \mathbf{H}) = \hat{\mathbf{M}}$ be the predicted probability vector for $\mathbf{M}$. The minimax problem is:
$$\min_G \max_D \mathbb{E}\big[\mathbf{M}^\top \log D(\hat{\mathbf{X}}, \mathbf{H}) + (1 - \mathbf{M})^\top \log\big(1 - D(\hat{\mathbf{X}}, \mathbf{H})\big)\big], \quad (11)$$
where $\log$ is the element-wise logarithm and the dependence on $G$ is through $\hat{\mathbf{X}}$.
The proof of Lemma 1 in GAIN (Yoon et al., 2018) does not depend on the decomposition of $\mathbf{X}$, so the result still holds: for a given $G$, the $i$th component of the optimal discriminator is
$$D_i^*(\mathbf{x}, \mathbf{h}) = \frac{p(\mathbf{x}, \mathbf{h}, m_i = 1)}{p(\mathbf{x}, \mathbf{h}, m_i = 0) + p(\mathbf{x}, \mathbf{h}, m_i = 1)} = p(m_i = 1\,|\,\mathbf{x}, \mathbf{h}). \quad (12)$$
Substituting $D^*$ into (11), the objective for $G$ is, up to an additive constant, $\sum_{i=1}^{d_m} \sum_{t \in \{0,1\}} \int_{\mathbf{h}} p(\mathbf{h}, m_i = t) \int_{\mathbf{x}} p(\mathbf{x}\,|\,\mathbf{h}, m_i = t) \log p(m_i = t\,|\,\mathbf{x}, \mathbf{h})\, d\mathbf{x}\, d\mathbf{h}$. Note that $\log p(m_i = t\,|\,\mathbf{h})$ and $\int_{\mathbf{x}} p(\mathbf{x}, \mathbf{h}, m_i = t)\, d\mathbf{x} = p(\mathbf{h}, m_i = t)$ are not related to $\mathbf{x}$, so writing $p(\mathbf{x}\,|\,\mathbf{h}, m_i = t) = p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h}, m_i = t)\, p(\mathbf{x}_o\,|\,\mathbf{h}, m_i = t)$ and dropping the terms unrelated to $G$, the objective becomes
$$\sum_{i=1}^{d_m} \sum_{t \in \{0,1\}} \int_{\mathbf{h}} p(\mathbf{h}, m_i = t) \int_{\mathbf{x}_o} p(\mathbf{x}_o\,|\,\mathbf{h}, m_i = t)\, \mathrm{KL}\big(p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h}, m_i = t)\,\big\|\,p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h})\big)\, d\mathbf{x}_o\, d\mathbf{h},$$
which achieves its minimum when
$$p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h}, m_i = t) = p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h}) \quad (13)$$
for $t \in \{0, 1\}$ and $i \in \{1, \dots, d_m\}$.
Note that $H_i = t$ means $M_i = t$ for $t \in \{0, 1\}$, and $H_i = 0.5$ implies nothing about $M_i$. With this $\mathbf{H}$, $D_i^*(\mathbf{x}, \mathbf{h}) = t$ for $\mathbf{h}$ such that $h_i = t$ and $t \in \{0, 1\}$. For any $\mathbf{m} = (m_1, \dots, m_{d_m}) \in \{0, 1\}^{d_m}$ and $i \in \{1, \dots, d_m\}$, let $\mathbf{m}^0, \mathbf{m}^1 \in \{0, 1\}^{d_m}$ be the two vectors that agree with $\mathbf{m}$ on the $j$th element for $j \neq i$ and whose $i$th components are 0 and 1, respectively. So $\mathbf{m} = \mathbf{m}^0$ if $m_i = 0$ and $\mathbf{m} = \mathbf{m}^1$ if $m_i = 1$. Define a realization of the hint vector $\mathbf{H}$ as $\mathbf{h}$ such that $h_j = m_j$ if $j \neq i$ and $h_j = 0.5$ if $j = i$. Since $p(\mathbf{h}\,|\,m_i = t) > 0$, by (13) we have
$$p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h}, m_i = 0) = p(\hat{\mathbf{x}}_m\,|\,\mathbf{x}_o, \mathbf{h}, m_i = 1). \quad (14)$$

Table 1. The response patterns of the Internet Loan data.

Table 2. The response patterns of the ADNI data.

Table 5. Prediction performance for the two motivating examples (Average ± Std).