Capturing simple and complex time-dependent effects using flexible parametric survival models: A simulation study

Abstract Non-proportional hazards are common within time-to-event data and can be modeled using restricted cubic splines in flexible parametric survival models. This simulation study assesses the ability of these models in capturing non-proportional hazards, and the ability of the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) in selecting degrees of freedom. The simulation results for scenarios with differing complexities showed little bias in the survival and hazard functions for simple scenarios; bias increased in complex scenarios when fewer degrees of freedom were modeled. Neither AIC nor BIC consistently performed better and both generally selected models with little bias.


Introduction
Time-to-event data is commonly analyzed using the Cox proportional hazards model (Cox 1972). This model has the advantage of making no assumptions about the shape of the underlying hazard function and can directly estimate relative effects, i.e., hazard ratios. By fitting a proportional hazards model one assumes that the hazard ratio of a covariate is constant over time (Hern an 2010; Jatoi et al. 2011). This assumption is often violated, especially when follow-up time is long. If present, it is important to account for non-proportionality in order to understand the natural history of disease and biological process, and to make appropriate model-based predictions.
As an alternative to the Cox model, one can make assumptions about the shape of the underlying hazard function by using a parametric model; parametric models will directly estimate absolute effects in addition to relative effects. However, standard parametric models, such as the Weibull model, can sometimes struggle to capture true effects when the underlying hazard is complex (Rutherford, Crowther, and Lambert 2015). Using a spline function to model the underlying hazard function, as in the flexible parametric survival model of Royston and Parmar (Royston and Parmar 2002;Royston and Lambert 2011), is more flexible than standard parametric models and allows for the direct estimation of relative and absolute effects in addition to other useful measures, such as differences between survival and hazard functions, to be estimated over the time scale of interest; the benefits of such measures are becoming more recognized in applied research (Colzani et al. 2011;King, Harper, and Young 2012;McMinn et al. 2012;Smith et al. 2012;Kandala et al. 2015).
Although alternative methods exist for analyzing non-proportional time-to-event data, much of which has been done in a Cox setting (Gray 1992;Hastie and Tibshirani 1993;Grambsch and Therneau 1994;Abrahamowicz, MacKenzie, and Esdaile 1996;Abrahamowicz and MacKenzie 2007;Wynant and Abrahamowicz 2015), here we concentrate on assessing the ability of flexible parametric survival models in capturing time-dependent effects. Under proportional hazards, flexible parametric survival models have been shown to be able to more accurately capture complex hazard functions than standard parametric models and provide unbiased estimates of hazard ratios (Rutherford, Crowther, and Lambert 2015), but no assessment under non-proportionality has been done. Since, 1) flexible parametric models are becoming more popular, 2) non-proportionality is common, and 3) time-dependent effects are easily implemented in these models by including an interaction between the covariate of interest and time, it is important to assess how well the restricted cubic splines within flexible parametric models capture different time-dependent effects.
In order for users to fit flexible parametric survival models, the user needs to decide on an appropriate number of degrees of freedom for the restricted cubic splines used to model both the baseline and the time-dependent effects. The larger the degrees of freedom specified, the more knots there are, which results in a more complex function. The degrees of freedom are specified for the underlying baseline hazard function, and if a time-dependent effect is specified, the deviation of the time-dependent effect from the baseline hazard function We aim to assess the ability of restricted cubic splines in flexible parametric survival models in capturing time-dependent effects through simulating biologically-plausible and complex hazard functions with time-dependent effects.

Flexible parametric survival models
Flexible parametric survival models were first proposed by Royston and Parmar (2002) and have been extended in a variety of settings including relative survival (Nelson et al. 2007), cure models (Andersson et al. 2011), and random effects (Crowther, Look, and Riley 2014). These models use restricted cubic splines to model a transformation of the survival function. Restricted cubic splines are piecewise cubic polynomials joined together at points called knots (Durrleman and Simon 1989) with additional constraints that the function is linear beyond the boundary knots to ensure a smooth function which is not affected by sparse data in the tails. Restricted cubic splines are able to capture most shapes, provided enough knots are used, without the added complexity involved in a higher order polynomials. The number and position of knots are predefined when using restricted cubic splines. The positions of knots can be user-defined but are commonly placed at equally distributed centiles of the event times; predicted hazard and survival functions from these models have been shown to be insensitive to the position of knots (Rutherford, Crowther, and Lambert 2015).
A restricted cubic spline function, sða; c 0 Þ; with covariate a ¼ ln ðtÞ where t denotes time, and knots k 1 ; :::; k K where K denotes the number of knots, is defined as where the r th basis function v 0r ðaÞ is defined as where k 1 and k K are the boundary knots, k r ¼ k K Àk r k K Àk 1 and u þ ¼ u if u > 0 and u þ ¼ 0 if u 0: Although flexible parametric survival models proposed by Royston and Parmar can be used to model many different link functions, the most common is log ðÀ log ðSðtÞÞÞ which models on the log cumulative hazard scale; we concentrate on this scale in this paper. A flexible parametric survival model as proposed by Royston and Parmar which incorporates time-dependent effects of covariates x; is expressed as: where H is the cumulative hazard, t is time, sðlnðtÞ; c 0 Þ is a restricted cubic spline function of log time, D is the number of time-dependent covariate effects and sðlnðtÞ; c d Þ is the spline function for the d th time-dependent effect.
(3), the survival function, Sðt; xÞ; and the hazard function, hðt; xÞ; can be expressed as: Modeling the hazard in this way may not be optimal in certain situations. In particular, when modeling multiple time-dependent effects on the log cumulative hazard scale, the time-dependent hazard ratio for one covariate is not independent of another covariate with a time-dependent effect (Royston and Lambert 2011;Lambert and Royston 2009), even with no interaction between the covariates; this can be problematic when interpreting and presenting results. An alternative method which overcomes this problem is to fit flexible parametric survival models on the log hazard scale Crowther and Lambert 2014;Bower, Crowther, and Lambert 2016). Maximum likelihood is used to fit models on both scales, however, numerical integration techniques are required when modeling the log hazard function which can be computationally intensive (Crowther and Lambert 2014;Bower, Crowther, and Lambert 2016) A flexible parametric survival model with a time-dependent effect of covariates x on the log hazard scale can be written as

Model selection criteria
The Akaike Information Criterion (AIC) (Akaike 1973) and Bayesian Information Criterion (BIC) (Schwarz 1978) are commonly calculated and compared separately between models in order to determine the best fitting model. However, it is possible that the best fitting model according to the AIC is different to that suggested by the BIC; this can leave the user unsure about the best model to fit and neither may lead to the most appropriate model. In a simulation setting the true model is known, thus it is possible to assess the ability of the AIC and the BIC in suggesting the models closest to the true model. Here this is performed by comparing the models suggested by the AIC and BIC to that chosen by the Symmetrized Kullback-Leibler distance (SKL) (Pawitan 2001;Kullback and Leibler 1951;Emiliano, Vivanco, and de Menezes 2014) which measures the difference between the fitted model and the true model.

Motivating example
Data from patients diagnosed with breast cancer in England and Wales were used to motivate the use of restricted cubic splines for capturing time-dependent effects. For the purpose of this study, we considered patients younger than 50 years at diagnosis between 1986 and 1990 who were categorized as being in either the most or least deprived quintile according to the Carstairs deprivation index (Coleman et al. 1999;Carstairs and Morris 1991). Thus, we analyzed the all-cause survival of 9 721 women up to five years post-diagnosis and considered the log hazard ratio of the most deprived women versus the least deprived women. Models on both the log cumulative hazard and log hazard scale were fitted including a covariate for deprivation status. The time-dependent effect of deprivation group was incorporated by fitting a flexible parametric survival model with 5 degrees of freedom to model the baseline hazard and 3 degrees of freedom to model the deviation from the baseline in the time-dependent effect; Figure 1a and c shows the hazard ratios (most deprived versus least deprived) from proportional hazards models alongside the hazard ratio from a flexible parametric survival model which allows for time-dependent effects on the log cumulative hazard and log hazard scale respectively. Standard errors and confidence intervals for the hazard ratios and survival proportions from these different models are presented in Table 1 of the Supplementary Materials. Standard errors increased for hazard ratios with increasing degrees of freedom for the time-dependent effects. For example, the hazard ratio at three years estimated from a model on the log cumulative hazard scale with 7 degrees of freedom for the baseline and 1 degree of freedom for the time-dependent effect was 0.04, whereas this was 0.08 when using 5 degrees of freedom for the time-dependent effect. The standard errors of the survival estimates predicted from model with different degrees of freedom were approximately equal. A Likelihood ratio test implied that non-proportional hazards models are more appropriate than proportional hazards models in this situation (p < 0.0001 on both modeling scales).
It is not clear how many degrees of freedom are required to capture the shape of this timedependent effect, or the baseline function in this non-proportional hazards model. Note that we write a, b degrees of freedom to denote a degrees of freedom to capture the baseline hazard and b degrees of freedom to model the time-dependent effect. Figure 1b and d presents the predicted hazard ratios (most deprived versus least deprived) from flexible parametric survival models on the log cumulative hazard scale and log hazard scale respectively with different degrees of freedom; see the figure legends for the different models fitted. Note that a spline function with one degree of freedom is simply a linear function of log time. Although all of the predicted hazard ratios capture a similar shape, a general increase over time, there are some differences between the curves. It is not clear which of the models better captures the true hazard ratio. To aid selection of the degrees of freedom, the AIC and BIC can be calculated. On the log hazard scale both criteria suggested the model with 7 and 1 degrees of freedom provided the better fit. However, on the log cumulative hazard scale the AIC suggested that the model with 5 degrees of freedom to model the baseline and 1 degree of freedom for the deviation in the time-dependent effect respectively had the better fit, whereas the BIC suggested that the model with 3 and 1 degrees of freedom was the better model. This inconsistency on the log cumulative hazard scale highlights a problem with using these model selection criteria; it is of interest to determine whether one criterion outperforms the other. A simulation setting allows us to know the true function, thus we use this approach to assess restricted cubic splines in capturing biologically plausible time-dependent effects across a range of scenarios.

Simulation strategy
A simulation study was undertaken where the true non-proportional hazard functions were known in order to assess the fit of the restricted cubic splines implemented within non-proportional flexible parametric survival models on the log cumulative hazard scale and on the log hazard scale, and to assess the performance of model selection using the AIC and the BIC.

Simulation design and data generation
Five hundred datasets were simulated using both sample sizes of 1 000 and 10 000; results are only presented here for sample size 1 000. Survival times in years were simulated on the hazard scale from a mixture Weibull distributions alongside a time-dependent effect of covariate x. The covariate x was binary and generated from a Bernoulli(0.5) distribution. Simulated survival times were implemented using software available in Stata (Crowther and Lambert 2012) using the method proposed by Bender et al. (Bender, Augustin, and Blettner 2005) and extended by Crowther and Lambert . Administrative censoring was applied at five years.

Simulation scenarios
Nine scenarios were simulated using combinations from three baseline Weibull distributions and three time-dependent effect functions; three out of the nine scenarios are presented here. The baseline distributions were similar to Weibull and mixture Weibull distributions used by Rutherford et al. (Rutherford, Crowther, and Lambert 2015). The time-dependent functions were based upon time-dependent effects found in datasets and previous studies to ensure that both complex and biologically plausible effects were simulated; the survival, hazard and log hazard ratio of the simulation scenarios are shown in Figure 2. Note that spline functions were not used to create the underlying true model. The time-dependent effect used within scenario 1 was based upon the timedependent effect seen in England and Wales breast cancer data described previously. Thus, scenario 1 had a log hazard ratio which was a function of log time. The timedependent effect used within scenario 2 had a log hazard ratio with a turning point. This pattern was based upon a hazard ratio seen between stages of colon carcinoma patients diagnosed in Finland (Lambert et al. 2007). The time-dependent effect used within scenario 3 was based upon a time-dependent effect used within a simulation study by Abrahamowicz et al. (Abrahamowicz, MacKenzie, and Esdaile 1996); in this study quadratic splines struggled to capture the shape of this time-dependent effect. Further details on the simulated scenarios are available in Table 2 of the Supplementary Materials.

Analyses of simulated data
Flexible parametric survival models were fitted to each simulated dataset both on the log cumulative hazard and log hazard scale with varying degrees of freedom. Three, five and seven degrees of freedom were used to model the baseline and one, three, five and seven degrees of freedom to model the time-dependent effect. The degrees of freedom for the time-dependent effect were limited to less than or equal to the number of degrees of freedom for the baseline effect since in practice one would rarely fit a model with more parameters for the time-dependent effect than the baseline. The timedependent log hazard ratio, and hazard and survival functions for both x ¼ 0 and x ¼ 1 were predicted from each model and compared to their respective true value in order to calculate the bias and coverage of each estimate; these were evaluated at one, three and five years. Coverage was calculated as the percentage of the confidence intervals, estimated from each model in every simulated dataset, that contained the true value. Root mean squared error (rMSE) estimates were also calculated in order to give a reflection of the variance of the estimates. The AIC, the BIC and the SKL were calculated for each fitted model and the optimal degrees of freedom suggested by each criteria compared.

Results
The average number of events were 803.2, 722.1 and 567.8 for scenarios 1, 2 and 3 respectively for sample size 1 000 over 5 years of follow-up. Convergence of models was high; 100% of all models on the log cumulative hazard scale converged and at least 98.9% of models on the log hazard scale converged in each scenario.  In scenario 1, the simplest of all the scenarios, negligible bias was found in the survival estimates over all degrees of freedom; absolute bias in the survival proportion was below 0.002. Very little bias was seen in the hazard rates and log hazard ratios. Larger bias was seen in models on the log hazard scale at time five as the time-dependent degrees of freedom increased, indicating potential over-modeling on this scale with larger degrees of freedom. Similar patterns were seen in the rMSE estimates. Very low values were seen in models with fewer degrees of freedom and increased rMSE values were seen in more complex models where it is possible that over-modeling had occurred.
Scenario 2 was more complex and had larger amounts of bias for low degrees of freedom. Bias was particularly large in hazard estimates and log hazard ratio estimates; this noticeably decreased as more degrees of freedom were chosen to model the deviation in the time-dependent effect. This pattern was also evident in the survival estimates. Absolute bias in the survival estimates was lower than 0.007 for models on the log cumulative hazard scale with at least 5 degrees of freedom to model the baseline hazard and at least 3 degrees of freedom to model the deviation from the baseline in the timedependent effect. Absolute bias in survival estimates was generally higher when modeling on the log hazard scale, although this was approximately 0.01 or less if at least 7,5 degrees of freedom were specified. The rMSE estimates indicated that more variation was seen in the estimates for scenario 2 than 1 and 3. The rMSE values were similar in models with more than 5,5 degrees of freedom; lower rMSE estimates were generally seen on the log cumulative hazard scale. The bias in scenario 3 was low for all survival estimates and approximately zero for some. The bias for the hazard and log hazard ratio estimates decreased as larger degrees of freedom were specified. Absolute bias in the survival estimates was less than or equal to 0.001 for models on the log cumulative hazard scale with at least 5 degrees of freedom to model the baseline hazard and to model the deviation from the baseline in the time-dependent effect. Absolute bias in survival estimates was less than 0.003 when modeling on the log hazard scale provided 5 degrees of freedom or more were specified to model the deviation in the time-dependent effect. The rMSE estimates suggested that variation was low in the survival and hazard estimates; larger rMSE estimates were seen in the log hazard ratio estimates especially in models with 1 degree of freedom to model the time-dependent effect. Bias was marginally higher when modeling on the log hazard scale. Coverage for all scenarios was close to 95%, the more complex scenarios required larger degrees of freedom to obtain this coverage. Table 1 shows the agreement between the optimal models selected by the SKL and the AIC, and the SKL and the BIC for models on the log cumulative hazard scale. In scenario 1 with sample size 1 000, the SKL selected the model with 3,1 degrees of freedom 481 times out of 500 replications as being the optimal model. The AIC and the BIC suggested the same model 356 and 499 times respectively. The SKL selected a larger variety of optimal models in scenario 2; the model with 5,5 degrees of freedom was chosen as optimal in approximately half of the datasets. The AIC generally suggested more complicated models and BIC suggested simpler models. The optimal model selected most frequently by the SKL in scenario 3 was the model with 5,5 degrees of freedom. The AIC suggested this as the optimal model the majority of the time whilst the BIC suggested models which were too simple in the majority of replications. Results for all three scenarios were similar when modeling the log hazard, see Table 2, or the log cumulative hazard and were in agreement with the models which showed the lowest bias.

Discussion
We undertook this simulation study to better understand the impact of user-selected degrees of freedom in flexible parametric survival models implemented in applied studies. The results presented indicate that restricted cubic splines accurately capture timedependent effects if appropriate degrees of freedom are selected; these results are consistent with the findings of Rutherford et al. (Rutherford, Crowther, and Lambert 2015) for proportional-hazards models. Bias was higher in models with lower degrees of freedom since these models were unable to capture the underlying complexity of the functions. However, survival estimates were fairly insensitive to an inappropriate number of degrees of freedom specified for the baseline hazard and the time-dependent effect. Models with many degrees of freedom produced slight biases due to over-modeling by capturing too much random variation in the observed data, however, these effects were smaller than those found when under-modeling since the inclusion of time-dependent effects increases the complexity of the data. Models on the log cumulative hazard scale and the log hazard scale captured hazard functions similarly well, although different degrees of freedom were optimal in some scenarios. A well fitting model was found for all scenarios on both scales. Neither the AIC nor the BIC consistently performed better than the other, however, the BIC selected too simple models in scenario 3; it is important to remember that the timedependent effect selected for scenario 3 was artificial and thus the largest challenge for the splines. Other than this, both the AIC and BIC suggested optimal models with low bias which is reassuring when conducting applied research.
Since the hazard function is inherently more complex than the cumulative hazard function, more degrees of freedom may be required when modeling on the log hazard scale. Also, over-modeling was more apparent on the log hazard scale in the simplest scenario, 1. This scenario, due to its simplicity, did not require many degrees of freedom and bias was larger when a model was fitted with a large number of degrees of freedom to model the deviation in the time-dependent effect. The computation time was also different due to the need for numerical integration when modeling on the log hazard scale. However, the computation time taken to model on the log hazard scale in a single dataset is less problematic than the many models fitted in a simulation study.
Due to the complexity of some of the scenarios, accurately capturing the underlying hazards were problematic due to a small sample size. This was particularly clear when considering the Table 1. Agreement of AIC and BIC with SKL for different models on the log cumulative hazard scale.
AIC BIC 3,1 3,3 5,1 5,3 5,5 7,1 7,3 7,5 7,7 Total 3,1 3,3 5,1 5,3 5,5 7,1 7,3 7,5 7,7 Total BIC in scenario 3. The BIC suggested the two simplest models as optimal the majority of the time where bias was large. When the sample size was increased to 10 000 (results not shown) the BIC suggested more complex models which resulted in lower bias. The results in this paper indicate that neither the AIC nor BIC out-perform the other in all situations; unfortunately this conclusion is not helpful for real data analyses. There has been evidence that one or the other criterion perform better in different situations (Emiliano, Vivanco, and de Menezes 2014;Burnham and Anderson 2004). Emphasis is put on the AIC and BIC here since these are the most widely used in practice, and thus we were interested in determining whether there was a clear advantage in using one over the other, however, there are other information criteria which may be useful in other situations (Spiegelhalter et al. 2002;Ishiguro, Sakamoto, and Kitagawa 1997;Wei 1992;Konishi and Kitagawa 1996). Our results suggest that relying on one criterion in this context is not advisable, instead both should be used for guidance. One possible alternative to selecting a model by use of information criteria is to make inferences based on an averaged model (Hoeting et al. 1999;Burnham and Anderson 2002).
Although the functions for the time-dependent effect considered in the simulations were of varying complexity, reality may still be more complex than considered in this study due to the consideration of only one binary time-dependent effect. This simplistic analysis is a useful starting point, but further assessment under more complex multivariate analyses should be an extension of this work. An additional extension would be to also consider non-linear effects of predictors on the log hazard since there is evidence AIC BIC 3,1 3,3 5,1 5,3 5,5 7,1 7,3 7,5 7,7 Total 3,1 3,3 5,1 5,3 5,5 7,1 7,3 7,5 7,7 Total that not allowing for non-linear effects can result in bias under non-proportionality (Abrahamowicz and MacKenzie 2007;Sauerbrei, Royston, and Look 2007); these can easily be introduced by modeling covariates continuously using restricted cubic splines. These additional complexities would contribute to a more complete assessment since they may be more similar to those seen in practice. We also only consider flexible parametric survival models to capture time-dependent effects. Outside of the fully parametric framework, there are several alternative approaches to modeling non-proportional hazards, including estimation using penalized or restricted maximum likelihood (Gu 2013). These alternatives may be advantageous when modeling for multiple smooth terms. A comparison of these methods with the flexible parametric survival approach would make for interesting further research.
The greatest advantage of flexible parametric survival models is that their use of restricted cubic splines make them flexible and one can include time-dependent effects simply in the models. They are also easily implemented for relative survival (Nelson et al. 2007) and for modeling cure (Andersson et al. 2014); the results in this study also apply within these frameworks.
Our motivating example illustrates one of the problems with using the AIC and BIC, that each information criterion can select a different model. Although one final model should be chosen in practice, we believe it is a strength to be able to show results are not sensitive to model choices via sensitive analyses. We present such a sensitivity analysis in Figures 1 and 2 of the Supplementary Materials where we compare the survival proportions and hazard ratios from different models fitted to the motivating example dataset. Different degrees of freedom resulted in small changes on the hazard ratio scale, but made very little difference to the survival estimates. The hazard ratio estimated from more complex models fluctuated a little over the five years of follow-up. It seems unlikely that hazard ratio predicted from the model with, for example, 7,5 degrees of freedom on the log hazard scale, is the true hazard ratio due to these fluctuations, and is more likely an artifact of the complex model. In this motivating example, we would have suggested to select the model with 7,1 degrees of freedom on the log hazard scale (where the AIC and BIC agreed). On the log cumulative hazard scale we would have selected the model with 5,1 degrees of freedom (chosen by the AIC) and been happy that 1) 5 degrees of freedom is most likely flexible enough to accurately capture the effect (Syriopoulou et al. 2019), and 2) that both point estimates and uncertainty were almost identical to the model chosen by the BIC (3,1 degrees of freedom).

Advice for users of flexible parametric survival models
We have presented a simulation study where it has been possible to identify the best model in terms of the lowest SKL. However, when faced with a single data set such an approach cannot be used as the true function is unknown. A key result from our simulation study is that provided there are sufficient degrees of freedom the bias in the estimated survival, hazard functions and time-dependent log hazard ratios is small. We tend to deal with fairly large datasets with several thousand observations. Our usual starting point is to use 5 degrees of freedom for the baseline and 3 degrees of freedom for any time-dependent effects. This is likely to capture most complex shapes. We would then inspect the AIC and BIC for a selection of models with more and less degrees of freedom to see if we had strong evidence of needing a more simple or more complex model. However, when comparing models we would always compare the key quantity we were interested in estimating. For example, by comparing the hazard ratio as shown in the motivating example, or other measures such as the age standardized survival (Lambert, Dickman, and Rutherford 2015), a time-dependent hazard ratio  or the loss in expectation of life (Andersson et al. 2015;Bower, Crowther, and Lambert 2016). We see these comparisons as more important than just a comparison of the AIC or BIC since it is a strength to be able to state that conclusions of the analysis do not change when different reasonable models are chosen.

Conclusions
Restricted cubic splines accurately capture simple and complex time-dependent effects provided suitable degrees of freedom are chosen to allow an appropriate flexibility. If interest lies in survival estimates then model misspecification should not be too problematic. It should always be possible for an analyst to identify a well fitting model when modeling on the log cumulative hazard scale or on the log hazard scale. The AIC and BIC should be calculated to guide users towards appropriate degrees of freedom but one or the other should not be relied upon to select the 'correct' model. Instead, users should perform sensitivity analyses, such as those conducted in Figure 1b and d, and remember that many models will likely fit the data well.