Survival stacking with multiple data types using pseudo-observation-based-AUC loss

ABSTRACT There have been many strategies to adapt machine learning algorithms to account for right censored observations in survival data in order to build more accurate risk prediction models. These adaptions have included pre-processing steps such as pseudo-observation transformation of the survival outcome or inverse probability of censoring weighted (IPCW) bootstrapping of the observed binary indicator of an event prior to a time point of interest. These pre-processing steps allow existing or newly developed machine learning methods, which were not specifically developed with time-to-event data in mind, to be applied to right censored survival data for predicting the risk of experiencing an event. Stacking or ensemble methods can improve on risk predictions, but in general, the combination of pseudo-observation-based algorithms, IPCW bootstrapping, IPC weighting of the methods directly, and methods developed specifically for survival has not been considered in the same ensemble. In this paper, we propose an ensemble procedure based on the area under the pseudo-observation-based-time-dependent ROC curve to optimally stack predictions from any survival or survival adapted algorithm. The real application results show that our proposed method can improve on single survival based methods such as survival random forest or on other strategies that use a pre-processing step such as inverse probability of censoring weighted bagging or pseudo-observations alone.


Introduction
Supervised machine learning (ML) algorithms that target a particular outcome are now a standard tool available for building risk prediction models (Ambale-Venkatesh et al. 2017;Corey et al. 2018;Gulshan et al. 2016;Weng et al. 2017). Particularly when the risk under consideration is that of death, censoring due to the end of follow-up or dropout from a cohort or observational study can complicate the analysis of the value of interest, the risk of an event occurring prior to a given time point. Although many standard methods exist for performing inference on right censored data, such as Cox proportional hazards regression or parametric survival models (Cox 1972), they are not as flexible as ML methods, and fewer machine learning methods that can accommodate right censored data have been developed in comparison to methods for classification. Ensemble methods like bagging, boosting, and stacking that combine predictions from several machine learning algorithms can provide additional improvements in predictive accuracy, but this requires a greater number of ML methods and classes of methods that account for right censoring.
In comparison to 10 years ago, there are now a large number of ML methods that allow for the use of right censored data for the goal of risk prediction via supervised learning. However, most methods have been adaptations of specific ML algorithms or models or within specific classes of methods, such as the adaptation of trees to survival-based trees and survival random forests (Ishwaran et al. 2014(Ishwaran et al. , 2008, which are ensemble of trees. In addition, there have also recently been some pre-processing methods proposed using inverse probability of censoring weighted bootstrapping, called IPCW bagging, which allows for the adaption of all classification ML methods to right censored and competing risks data to predict the risk of an event prior to some fixed time (Gonzalez Ginestet et al. 2020). Sachs et al. (2019) proposed the use of pseudo-observations as the outcome of stacked ensembles of ML methods for continuous outcomes, as pseudo-observations are continuous. Both of these pre-processing methods allow for a large number of ML methods to be used in right censored data, without ignoring censored observations. However, even the pre-processing methods suggested in Gonzalez Ginestet et al. (2020) and Sachs et al. (2019), as well as the single algorithm methods (Ishwaran et al. 2014(Ishwaran et al. , 2008, are limited to the class of algorithms or outcome types. When using right censored survival outcomes, this may limit the available predictive methods. Each algorithm may have its own merits and estimation methods, yet in practice, it is not possible to know which algorithm should perform best on a given dataset. Said another way, for each ML algorithm that is theoretically justified and shown to work well practically, there exists a dataset in which it will perform poorly. Thus, we propose to allow for ensembles of standard survival methods such as Cox and parametric survival models, inverse probability of censoring weighted (IPCW) binary outcome methods, including methods that allow weighting or those using the IPCW bagging methods of Gonzalez Ginestet et al. (2020) or simple binary methods ignoring censoring, as well as pseudo-observation based methods all in the same stack. We propose to optimally stack these methods using as the loss function the pseudo-observation-based-AUC, which allows for the use of all data, censored or not, and the responses from all ML methods to be stacked.
We find that by including multiple ML types, which consider the survival outcome in different ways, we can improve upon even the stacked methods that use only one ML algorithm type or type of outcome, such as those for binary or continuous outcomes alone. We demonstrate this using two breast cancer datasets: one from the survival package (Therneau 2020;Therneau and Grambsch 2000) in R (R Core Team 2020) and the other from the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (Curtis et al. 2012). In Section 2, we outline our notation and method before demonstrating it in the breast cancer datasets, in Section 3. Finally, Section 4 outlines the limitations of our method and suggests some possible areas of future research.

Notation and preliminaries
Let Y i be the right censored event time, T i be the true event time, and Δ i 2 f0; 1g be the event indicator where 0 indicates censoring and 1 indicates failure, for subjects i ¼ 1; . . . ; n. Let X i ¼ ðX i1 ; . . . X ip Þ denotes the vector of covariates for subject i where X ij denotes the biomarker measurement j (or covariate j) measured at baseline for subject i. We assume that the censoring time is independent of the failure-event time and the covariates.
We are interested in predicting the risk of failure within t time units for subject i. This quantity corresponds to the cumulative incidence of failure up to time t: We summarize the survival data as the counting process: NðtÞ ¼ P i IðY i � t; Δ i ¼ 1Þ; giving the number of observed failures on or before time t, and RðtÞ ¼ P i IðY i � tÞ gives the number of subjects still at risk just before time t. Our estimator of the marginal quantity CðtÞ of the cumulative incidence function is the Aalen-Johansen estimator (Aalen and Johansen 1978 is the Nelson-Aalen estimator for the cumulative hazard for failures, and Ŝ is the Kaplan-Meier estimator of the overall survival. This is equivalent to one minus the Kaplan-Meier estimator of the survivor function, but the theoretical development was done to accommodate competing risks (Graw et al. 2009), which we do not consider here.
The goal then is to predict an individual's risk based on their covariate vector X i by flexibly fitting a model for where f denotes a generic prediction function that takes as input the observed covariate vector and which may also depend on some parameters. For example, f may represent a parametric regression model, a tree-based model, or the result of a k nearest neighbor algorithm. We let f denote the estimated prediction function, and we will distinguish between prediction functions estimated by different algorithms or on different subsets of the data using subscripts.

Cross-validation of prediction functions
Consider a set of K prediction algorithms, which we will call the library of prediction algorithms. In general, a prediction algorithm with index k for k ¼ 1; . . . ; K, could be any method that takes training data (including the outcome and covariates) as input, and outputs a prediction function f k . For example, regression methods, penalized regression methods, tree-based methods, and nearest neighbor methods, to name a few. This library of methods should be pre-specified by the analyst, and our key contribution is that the library may contain a mix of IPCW classification, pseudo-observation based regression, and survival-specific methods. Each method may require pre-processing of the censored time-to-event outcome, estimation of weights, or specification of tuning parameters.
We proceed by constructing an n � K matrix Z of cross-validated or "pre-validated" (Tibshirani and Efron 2002) predictions for each subject and of the prediction algorithms under consideration as follows. Split the dataset into a partition, according to a V-fold cross validation scheme: split the n observations into V approximately equal size groups. Let the ν-th group be the validation sample, and the remaining group the training sample, ν ¼ 1; . . . ; V. Define X to be the full dataset, VðνÞ to be the ν th validation data split, TðνÞ ¼ X nVðνÞ to be the ν th training data split for ν ¼ 1; . . . ; V.
ii. Output predictions f ν;k ðVðνÞÞ ¼ Z ν;k into the k th column and appropriate rows of Z.
Each algorithm in the library can use whichever estimation method is appropriate for that type of model. For instance, a typical choice for survival data would be a Cox model (Cox 1972), in which case the algorithm would maximize the partial likelihood. Another sensible option, given that we are trying to predict the risk at time t, is to use pseudo-observation regression for the cumulative incidence at that time point (Andersen and Pohar Perme 2010). Next, we will create an ensemble over Z by optimizing a loss function that targets our prediction quantity of interest. Specifically, we will seek to find a coefficient vector α (a column vector of length K) such that Zα is optimal in the sense that it optimizes our loss function of interest.

AUC loss using pseudo-observations
Following Andersen and Pohar Perme (2010), the i th pseudo-observation time t is defined as where Ĉ À i ðtÞ is the Aalen-Johansen estimate of the cumulative incidence function that is computed by using the sample excluding the i th observation from the full sample of size n. By construction and the unbiasedness of the Aalen-Johansen estimator, the pseudo-observations are unbiased for the cumulative incidence: EfĈ i ðtÞg ¼ CðtÞ: Moreover, observe that the survival can be related to the pseudoobservations by Ef1 ÀĈ i ðtÞg ¼ PðY i > tÞ.
An important property of pseudo-observations that we will exploit is asymptotic unbiasedness conditional on covariates, i.e., in large samples. Graw et al. (2009) proved that (1) holds for the Aalen-Johansen estimator when the censoring is independent of the event time and of all covariates in the model. This can be relaxed by using inverse probability of censoring weighted estimators instead of the Aalen-Johansen (Binder and Schumacher 2008). For some coefficient vector α, let B i ¼ Z T i α denotes the ensemble predicted outcome for subject i and let c 1 ; . . . ; c m denote the unique values or a fixed grid of the B i values sorted in descending order. Then, the time dependent AUC at time t for failure, calculated by the trapezoidal rule, can be written as where ϕ is a ramp function. A ramp function in this context measures to what degree the first argument is greater than the second argument. One example is the indicator step function: ϕ � ða; bÞ ¼ 1 if a > b and is 0 otherwise, which yields the empirical AUC. To see this relationship to the receiver operating characteristic (ROC) curve, note that the numerator of y j is an estimate of and the denominator is an estimate of PðT i � t ? Þ. Since the numerator and the denominator are asymptotically unbiased under conditions (A1) and (A2) of (Graw et al. 2009) which state that the censoring time is stochastically independent of the event time, the event indicator and the covariates, and that the time point of interest are chosen such that the probability of remaining uncensored is bounded away from 0 at that time point, by the continuous mapping theorem, their ratio is a reasonable estimator of PðB i > c j jT i � tÞ, which is the time-dependent true positive fraction (Heagerty et al. 2000) at c j . Similarly, x j is an estimator of the time-dependent false positive fraction at c j . These two quantities estimated at all c 1 ; . . . ; c m yield the ROC curve, and the trapezoidal rule calculates the area under it. Since our goal is to find the α that maximizes the AUC, the step ramp function is not ideal as it includes non-differentiable points. Instead, we use a smoothed ramp function as suggested by Fong et al. (2016), the sigmoid ramp: for a fixed, small value of σ. We set the parameter σ equal to 0:01 in our data analysis. See Fong et al. (2016) for other ways to select this parameter. Our goal is to optimize d AUCðα; tÞ with respect to α for a fixed t. Maximization of the AUC is difficult, in general, because the ROC curve is invariant to monotone transformations of the predictor, regardless of how the ROC curve is estimated (Fong et al. 2016;Pepe 2000). Multiplication of the coefficient vector by a constant does not change the AUC and hence, without any restrictions, the solution α is not unique. Adding a penalty term to the objective function, denoted by λ, is one way to solve this problem and obtain a unique solution for α. Following Fong et al. (2016), we select the normalized vector Using the smooth ramp function, the gradient of this objective with respect to α exists at all points, which will give us better optimization properties. Regardless of how the predictors in Z are created, this ensemble technique will ensure that the final product targets the quantity of interest and will borrow strength from the different algorithms in a data-driven manner. It has been suggested to normalize the coefficients by their sum and to restrict coefficients to being greater than zero to improve calibration. Although this is clearly useful when the predictions from the individual ML methods are all probabilities, when the predictions are pseudo-observation based or simply the linear predictor from a Cox model it is less clear what, if any, advantages this offers. In our example, we found that restricting coefficients to be greater than zero improved performance slightly in crossvalidation. This can therefore be considered something to optimize a further tuning parameter of the stacking procedure.

Creating the final ensemble predictor and evaluating performance
For k ¼ 1; . . . ; K, fit each algorithm on the full dataset to obtain f k ðX i ; tÞ, for i ¼ 1; . . . ; n. This forms an n by K matrix where the rows are predicted outcomes for each subject, and the columns correspond to different prediction methods. The final predictor is the linear combination α �Tf k according to the optimized coefficient vector α obtained in the previous step. This also gives us the final prediction function f that can be applied to new observations.
It is recommended to comprehensively evaluate the performance of this predictor in an independent validation sample, including discrimination, calibration, and measures of explained variation (Royston and Altman 2013;Steyerberg et al. 2010). In our illustrative examples, as a measure of discrimination, we report the time dependent receiver operating characteristic curve (Heagerty et al. 2000) where the true and false positive fractions are estimates as detailed above. Additionally, we construct a predictiveness curve (Pepe et al. 2007) for the risk outcome using smooth regression of the pseudo-observations conditional on the predicted risk for calibration. The fitted values from this model provide an estimate of the predictiveness curve (Sachs et al. 2019). Specifically, the function E½C i ðtÞjf ðX i Þ ¼ x� is specified as a generalized additive model (Hastie and Tibshirani 1990) as an estimate of the conditional risk PðT i � tjf ðX i Þ ¼ xÞ: This estimation procedure works for the same reason that pseudo-observation regression does, namely, that conditional means of pseudo-observations are asymptotically unbiased (Graw et al. 2009).
Finally, measures of explained variation and associations between the predicted risk and the observed outcome are estimated using the explained variation statistic described by Royston and Sauerbrei (2004), by fitting Cox regression models with continuous and categorized risk as the predictors, and the Kaplan-Meier curves stratified by the categorized risk, as recommended by Royston and Altman (2013).

Illustrative examples
We illustrate our proposed method using the two breast cancer datasets: the ones used in Royston and Altman (2013) (Royston-Altman) and the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) (Curtis et al. 2012).

Royston-Altman dataset
Following Royston and Altman (2013), we used the Rotterdam data as our training data and the German Breast Cancer Study Group (GBSG) as our external validation set. These dataset are available in the survival package (Therneau 2020;Therneau and Grambsch 2000) in R (R Core Team 2020) under the name Rotterdam and GBSG, respectively. The Rotterdam tumor bank includes 2982 primary breast cancers patients. Following Royston and Altman (2013), we restrict to the 1546 node positive patients because the validation data includes only node positive patients. However, we do not use the same outcome or target of prediction. We consider the outcome of progression or death within 5 years of primary surgery, rather than the outcome of risk of recurrence or death at anytime, nor do we limit the follow-up of patients included in the data, as was done in Royston and Altman (2013).
The GBSG validation data contain 720 patients with primary node positive breast cancer who were enrolled in a clinical trial from July 1984 to December 1989, see Schumacher et al. (1994) for more detail. The maximum follow-up time available was 7 years. We used the same prognostic factors that in Royston and Altman (2013): age in years at primary surgery, menopausal status, tumour size, number of positive lymph nodes, progesterone receptors, oestrogen receptors, and hormonal treatment. Just as in Royston and Altman (2013), we categorized tumor size as � 20 mm, between 20 to 50 mm and > 50 mm. All methods run on these data used all covariates. Based on the model build in the Rotterdam data, we predict death or recurrence within 5 years of primary surgery in the GBSG data as a validation of the model.

Metabric
METABRIC database is a Canada-UK project which contains targeted sequencing data of breast cancer samples. METABRIC cohort were collected between 1977 and 2005 from five centers in the UK and Canada (Curtis et al. 2012;Mukherjee et al. 2018). The clinical data were downloaded from cBioportal (http://www.cbioportal.org) on October 2021 and contains 1904 patients. We considered the following prognostic factors: age at diagnosis, menopausal status, tumour size, number of positive lymph nodes examined, tumor grade, progesterone and oestrogen receptors status, hormonal and radiotherapy treatment, and the Nottingham prognostic index (NPI). The number of positive lymph nodes examined was categorized as greater than and equal to 0. The only continuous variables are age at diagnosis, tumor size, and NPI. The rest of the variables are binary. All methods run on these data used all covariates. We split the patients randomly into training (70%) and validation (30%) set, respectively. We considered two endpoints: overall survival (OS) and recurrence-free survival (RFS) in months. Based on the model build in the training data, we predict the risk of dying and of breast cancer recurrence within 5 years in the validation dataset.

Results
To generate predictions, we fit Cox regression with stepwise selection, a survival random forest (Ishwaran et al. 2008), and a Cox Boost model (Binder et al. 2009). These are our survival based methods of comparison. We also fit a inverse probability of censoring weighted (IPCW) logistic regression and an IPCW bagged support vector machine and an IPCW bagged k-nearest neighbors (KNN) algorithm for the binary outcome of observed death or recurrence within 5 years. The IPCW and the IPCW bagging are implemented as outlined in Gonzalez Ginestet et al. (2020). These three methods serve as our comparators in the classification class of ML methods. Finally, we fit pseudo-observation based linear models, applying a LASSO penalty (Friedman et al. 2010), and non-penalized GLM models using an identity link for five different time-points separately (Sachs and Gabriel 2021). We denote these models as LASSO eventglm and eventglm, respectively. In each of these 10 models, we also included B-splines with three knots for all continuous predictors. We used the grid of time points f2; 4; 5; 6; 7g years for Royston-Altman dataset and a grid of time points f3; 4; 5; 6; 7g years for the METABRIC dataset. We selected appropriate tuning parameters, when needed, via internal cross-validation. The specification for each method as well as the code for reproducing our results can be found at https://github.com/pablogonzalezginestet/megalearner.
We additionally combine all methods into a single ensemble using 10-fold cross-validation and the pseudo-observation-based-AUC loss function at 5 years for a linear function of the predictions of each method. The results in the validation set for each respective dataset are illustrated in Table 1. As can be seen in Table 1, the optimized stack AUC for the 5 year risk in the validation sample in the GBSG and METABRIC using the OS is 0.718 and 0.745, respectively. Both are superior to all other individual methods, with one exception: in the validation sample of the METABRIC using the RFS, the optimized stack AUC performed similarly to best individual method, which is LASSO eventglm at t 5 .
Although it may seem counterintuitive, the pseudo-observation-based AUC at 5 years is not always highest for the pseudo-observation based methods. Furthermore, it will not always be the case that pseudo-observation-based methods will seem to have superior performance to other outcome based methods, as can be seen by comparing the glmnet pseudo-observation-based methods to the IPCW Logistic stepwise method. Figure 1 depicts the ROC curve and the predictiveness curve for the optimized stack method for the 5 year risk. For visual representation, the ROC curve for the optimized stack is displayed with three individual methods that were chosen depending on its contribution to the final model. We chose the Table 1. Summary of the models used in the ensemble. AUC column is the validation sample AUC for the individual model at 5 years and the normalized coefficient is the estimated α � . METABRIC RFS and METABRIC OS refer to the analysis that uses recurrence-free survival and overall survival as endpoints in the dataset METABRIC, respectively. SRF refers to survival random forests. The grid of time points t 1 ; . . . ; t 5 is f2; 4; 5; 6; 7g years for Royston-Altman and f3; 4; 5; 6; 7g years for METABRIC. individual methods that resulted in having the largest contribution and one that had a modest contribution and one that had practically no contribution. We also displayed in the ROC curve the risk threshold at or above 0.30 for each method. In the Royston-Altman example, the optimized stack had a 0.825 sensitivity and 0.474 specificity at 0.30 risk threshold. In the METABRIC dataset, the optimized stack had a lower sensitivity (0.520 and 0.373 in the OS and RFS as endpoints respectively) but higher specificity (0.803 and 0.851 in the OS and RFS as endpoints, respectively). Since the ROC curves did not align by risk thresholds, they do not allow direct comparison by thresholds. Figure 2 depicts the KM curves for the predicted risk categorized into quartiles for each example. The example of Royston-Altman demonstrated good discrimination over the entire time range. Instead, METABRIC's example did not show good separation between the second (Q2) and third quartile (Q3), particularly the example with RFS as endpoint. This can also be seen in Table 2 that shows the survival estimates in addition to the number of events and number of individuals at risk per each time point corresponding to Figure 2.

Royston-Altman
Finally, Table 3 shows discrimination measures and hazard ratios evaluated in the validation datasets that complement the information shown in the KM curves. Royston-Altman's example showed better discrimination performance than both METABRIC's examples: hazard ratios were more separated between risk groups and it had larger survival concordance and explained variation. As we discussed in the KM curves, the poor discrimination between Q2 and Q3 in METABRIC is reflected in a close hazard ratio between these two risk groups.

Discussion
We have demonstrated that the pseudo-observation-based AUC can be used to optimize a survivalbased stack of predictions from ML methods using various survival outcomes or methods to account for right censoring. We have illustrated our proposed method by re-analyzing the datasets used in Royston and Altman (2013), the Rotterdam as training data and German Breast Cancer Study Group (GBSG) as an external validation set, and also analyzing data from the METABRIC study using two different outcomes (Curtis et al. 2012;Mukherjee et al. 2018). Based on these two real datasets, we have shown that our proposed method can improve, marginally, the performance of predictions based on a single ML method, for instance, survival random forest (Ishwaran et al. 2008), or a classical survival methods such as Cox proportional hazard model. Our approach seems to be a better alternative to methods that only use ensembles of the same class of ML methods such as IPCW Bagging for binary outcomes (Gonzalez Ginestet et al. 2020) and the method developed by (Sachs et al. 2019) for continuous outcomes. Moreover, the optimized stack outperformed any single criteria of discrimination, calibration and measured of explained variance obtained in Royston and Altman (2013). We speculate that the strength of our approach comes from the combination of methods that consider single time points (binary and pseudo-observation based) with methods that consider all times simultaneously (Cox based). Though we focused on the 5 years risk of recurrence or death illustrated in the real data, our approach can be applied at several prediction horizons.
We show that our proposed method can include in the same stack the methods suggested in Gonzalez Ginestet et al. (2020) and Sachs et al. (2019), which allow for the adaption of all classification methods and all ML methods for continuous outcomes, respectively, to be adapted to work with right censored data. In addition, we show that classical survival methods and methods that have already been adapted to survival can be included, too. By combining the methods of Gonzalez Ginestet et al. (2020) and Sachs et al. (2019) and all existing survival-based methods, we have shown that now all existing ML methods can be used for risk prediction within a given time-point accounting for right censoring via the pseudoobservation based AUC. Our method as well as the ensemble methods suggested in Gonzalez Ginestet et al. (2020) and Sachs et al. (2019) are related to the Super Learner approach Van der Laan et al. 2007;) and share the property by the 'Oracle result' where the optimally stacked prediction is guaranteed to, on average, perform at least as well as the best single method in the ensemble. Since it is not possible to know which single method will perform best in advance, the advantage of optimal stacking is clear. However, the types of ML procedures that have generally been used in stacks of this nature have been limited to one outcome type. The limitations of this method are the use of the pseudo-observation-based AUC as the loss function for optimizing the stack, which may not be the ideal target of interest in all settings. We use unweighted, nonparametric pseudo-observations throughout. Although other pseudo-observation -based loss functions could be used, the use of nonparametric pseudo-observations in general may not be ideal in all settings. For example, when censoring is highly dependent on covariates, the IPC weighting or stratification of the pseudo-observations should be considered. In these settings the timevarying AUC or concordance index, using a binary outcome and IPCW weighting, may be better targets. It is not immediately clear how this would be used with classical survival models which use all times, although the adaption is likely straightforward. We focus on simple right censoring here, unlike Gonzalez Ginestet et al. (2020) and Sachs et al. (2019) which both also consider competing risks. We believe this method can easily be extended to allow for competing risks data, using the AUC based loss suggested in Sachs et al. (2019), which accounts for competing risk, but this is an area of future research. Other areas of future research include weighted versions of the pseudo-observation-based AUC and other loss functions following Binder et al. (2014) to deal with covariate-dependent censoring, and parametric pseudo-observations to deal with interval censoring following Sabathé et al. (2020) and Nygård Johansen et al. (2020).