Potential bias introduced by not including multiple time-scales in survival analysis: a simulation study

Abstract When analyzing cohort data, we rarely consider potential simultaneous effects of multiple time-scales on estimated survival proportions and hazard ratios. Instead, often, only one time-scale, such as time-on-study or attained age, is taken as the main time-scale and the other time-scale(s) included in the model as time-fixed covariate(s). In this simulation study, we investigate the potential bias in estimated survival proportions and hazard ratios by not modeling multiple time-scales. We simulate data with two time-scales, under various scenarios, and a binary covariate as the exposure of interest. Flexible parametric models with one time-scale were fitted to the simulated data, including the exposure variable of interest and the other time-scale as a fixed covariate. For many of our simulated scenarios, the models with one time-scale showed small bias for the survival proportions. Bias in the log hazard ratios (for the covariate of interest) was observed when there were non-proportional hazards on the second time-scale. If one is modeling only one time-scale it is advisable to include non-proportional hazards and interactions with the time-scale. Exploration of approaches to modeling multiple time-scales is outside the scope of this study.


Introduction
Cohort data can include several time-scales, such as time since diagnosis, treatment duration, attained age or calendar time.Multiple time-scales can be seen in situations, where the event rate may vary along more than one time-scale.For example, breast cancer incidence depends on attained age as well as time since first childbirth; in many chronic diseases such as diabetes, the risk of complications is contingent on time since diagnosis and on attained age; the risk of getting an infection in intensive care unit may depend on length of stay and calendar time (Albrektsen et al. 2005;Wolkewitz et al. 2016).
When analyzing cohort data, we rarely model simultaneous effects of multiple time-scales (time-on-study, attained age, calendar year, others) on the hazard function even when we are aware of their presence.We often choose to model only one time-scale, such as time-on-study, as the main time-scale for the baseline hazard.The effect of the other time-scale(s) is then modeled, through a proxy of a time-fixed covariate, such as age at entry or calendar year of entry in the model.
It is possible to model multiple time-scales by splitting the data along the time-scales and include the time-intervals in the model.However, this can result in very large datasets.The absence of a well documented standard software to jointly model multiple time-scales can be an additional deterrent to fitting multiple time-scales models.These limitations play a role in our preference to model only one time-scale.Unfortunately, we rarely consider the potential bias in estimates from a misspecified model that ignores joint effects of multiple time-scale.
In this study, within the framework of flexible parametric survival models, we aim to investigate whether there is bias in estimates of survival proportions and hazard ratios when we underfit the survival model by only including one time-scale and a fixed proxy for the second time-scale.Assessment of choosing an optimal time-scale is out of scope of this study as there has been a considerable attention paid to the choice of a time-scale when analyzing cohort data (Korn, Graubard, and Midthune 1997;Thi ebaut and B enichou 2004;Pencina, Larson, and D'Agostino 2007;Chalise, Chicken, and McGee 2013), as well as combining time-scales into a unique optimal single time-scale (Farewell and Cox 1979;Oakes 1995;Duchesne and Lawless 2000).

Models with one or more time-scales
At least one time origin is inherent in the analysis of time-to-event data.Examples include the start of study, onset of disease or date of birth.The frequency of an event of interest from the time origin can dictate which time-scale(s) to use when modeling the baseline hazard function.For example, in cancer patients, the rate of deaths increases from onset of the disease, making the time from onset, or rather its proxy, the time from diagnosis, as the time-scale of interest.However, attained age, which is the time from birth, also plays an important role in the rates of death.
We cannot know for certain whether one time-scale should be preferred over the other in the analysis of time-to-event data as there are many other risk factors that can influence the time at risk.Conventionally, one of the two time-scales is chosen as the primary time-scale in the baseline hazard function without further consideration that modeling both time-scales may help in capturing the true hazard function.
A general model for the log hazard function with only one time-scale, t 1 , such as time from diagnosis, can be expressed as: where h 0 is the baseline hazard function, c is the corresponding parameter vector and x is a vector of covariates with associated log hazard ratios b: In the Cox model, the baseline hazard is not modeled directly due to the use of the partial likelihood (Cox 1972), whereas in the Poisson regression it is common to model the behavior of the baseline hazard by categorizing t 1 through function g into time-intervals, say monthly or yearly intervals.If a continuous format of the time-scale is preferred then g can be any polynomial function or a spline function, which allows for a smooth non-linear function.
In situations where it is known that the hazard rate may depend on two time-scales, such as time from diagnosis and attained age, the usual approach is to include the time from diagnosis, t 1 , in the baseline hazard and use age at entry, a 0 , as a constant proxy for the second time-scale, as shown in the following model: In this model a 0 is part of the covariates x Ã , and the relative effect of a 0 , modeled by function f, is part of the vector of the log hazard ratios b Ã : Similar to g, function f is usually specified as a categorical function (such as five-yearly intervals), or as a smoothing spline function.To allow for time-dependent effects of age at diagnosis and of any x p (p ¼ 1, :::P) covariate in x, model (2) can include interaction terms with t 1 : where P is the number of covariates, excluding age, with time-dependent effects.Now, a model that includes both time from diagnosis, t 1 , and attained age, t 2 , in the baseline hazard function, h 0 , assuming proportional hazards, can be written as: log ðhðt 1 , t 2 ; c, w, bÞÞ Furthermore, model (4) can also incorporate interaction between two time-scales t 1 and t 2 : log ðhðt 1 , t 2 ; c, w, s, bÞÞ If necessary, this model can be extended to include more time-scales, and interaction between time-scales and other covariates.
Depending on the origin of the time-scales, often one time-scale can be expressed in terms of the other.This may not always be true if the time-scale is defined in different units to time units.For example, failure rates in vehicles may be primarily measured in terms of the mileage rather than age, whereas for aircraft the time-scales can include the number of flights, the time in air and calendar time.However, in the medical setting, such as our example, using the time of diagnosis as the time origin, and a 0 as an offset term, we can specify attained age, t 2 , in terms of t 1 and a 0 , therefore transforming model (4) into log ðhðt 1 , t 2 ; c, w, bÞÞ Non-proportional hazards as well as interactions between time-scales can be accommodated by including interactions between the covariates and the time-scales: In this study, within the framework of flexible parametric models, our focus is on errors in predictions when fitting models as shown in equations ( 2), (3) with one time-scale while controlling for the second time-scale as a constant covariate, when the true underlying rates depend simultaneously on both time-scales as shown in equations ( 6), (7).

Flexible parametric survival models
Flexible parametric survival models (FPMs) are parametric models that use restricted cubic splines to model the baseline hazard function as well as time-dependent effects on the log hazard scale or log cumulative hazard scale (Royston and Parmar 2002;Royston and Lambert 2011;Crowther and Lambert 2014).A flexible parametric model on the log hazard scale with one timescale that includes interactions with time, is written as: where h is a hazard function over time-scale t, x is a vector of covariates with associated log hazard ratios b, with additional p th time-dependent effect for covariate x p (p ¼ 1, :::, P).The baseline hazard is represented by the restricted cubic spline function, sð log ðtÞ; c 0 , k 0 Þ, with knot location vector, k 0 , and associated coefficient vector c 0 from the spline expansion (Crowther and Lambert 2014).Users need to determine the number of knots for the spline expansion of log ðtÞ and for each of the spline expansions for the time-dependent effects.Interactions between the covariates and the restricted cubic spline function of log ðtÞ is omitted to fit a model with proportional hazards.It should be noted that it is common to model log ðtÞ in FPMs on the log hazard scale, but other functional forms of time t can be used as well.

Simulation strategy
This simulation study was conducted to evaluate bias in estimates of survival functions and hazard ratios when modeling hazard rates using FPMs with one time-scale given that the true hazard rates depend simultaneously on two time-scales.For this purpose, we simulated cohort data loosely based on haematological cancer patients who relapsed after transplantation, and were followed from relapse until death.
We simulated jointly two time-scalestime since relapse and time since initial transplantation.In the interest of parsimony, we chose to exclude calendar time and attained age, which should not be ignored in the clinical study of death rates in these patients.To the simulated data we fitted FPMs with time since relapse as the time-scale, and used the time since transplantation until relapse as a constant proxy for the time since transplantation, while also allowing for possible non-proportional hazards.

Simulation design and data generation
We simulated 1000 datasets with sample sizes of 100 and 1000 observations from each of the seven data-generating mechanisms (DGMs) with and without interaction terms with time-scales presented in Table 1.Each data-generating mechanism simulated time-to-event data based on two time-scales: the first time-scale, t, time (up to 10 years) from relapse until death or censoring, and the second time-scale, t þ c 0 , time from transplantation, where c 0 represents the time from transplantation until relapse, and was generated from a uniform distribution U(0, 5).Notation c 0 for offset is used instead of a 0 from equations ( 2)-( 7) to make it clear that these offsets are from different time-scales.We used Bernoullið0:8Þ for simulation of the single binary covariate x.The chosen probability distributions may not be realistic within the context of relapsed patients after transplantation, but they provide a useful foundation for the purpose of this study.Different DGMs with and without interactions with time-scales were chosen to reflect the wide spectrum of situations that we usually encounter when analyzing cohort data.Scenarios S 1 , S 2 generated data with proportional hazards for x; S 3 , S 4 simulated interaction between x and t, while S 5 , S 6 simulated interaction between x and t þ c 0 ; S 7 generated data with both interaction terms: between x and t, and between x and t þ c 0 : Of the seven data-generating mechanisms, three scenarios (S 2 , S 4 , S 6 ) also simulated interactions between the two time-scales.
x log ðhðt; xÞÞ ¼ À0:5 þ 0:1 log ðtÞ À 0:03 log x log ðhðt; xÞÞ ¼ À0:5 þ 0:1 log ðtÞ À 0:03 log For illustration of the DGMs, Figure 1 displays the survival proportions and hazard rates for x ¼ 0, 1 as well as the log hazard ratios (x ¼ 1 versus x ¼ 0) during follow-up time t, which is the time from relapse until death or censoring, given relapses occurred at c 0 ¼ 0:5, 4 years after transplantation, respectively.For scenarios S 5 , S 6 , it should be noted that the log hazard ratios are not technically independent of the first time-scale t due to the mathematical relationship between the first time-scale and the second time-scale (t 2 À t 1 ¼ ðt þ c 0 Þ À t ¼ c 0 ).This can be seen in declining log hazard ratios over follow-up time t.

Analyses of simulated data
To each of the simulated datasets, we fitted flexible parametric models on the log hazard scale with one time-scale, t, while adjusting for a constant c 0 .For the restricted cubic spline expansion of log ðtÞ, we chose six knots (five degrees of freedom), and three knots (two degrees of freedom) for the restricted cubic splines of the time-dependent effects.The choice of the number of knots was not of interest in our analyses, and according to previous studies, estimates from FPMs are not sensitive to the number of knots as long as sufficient number of knots is selected (Rutherford, Crowther, and Lambert 2015;Syriopoulou et al. 2019;Bower et al. 2021).Furthermore, the knots positions, as is most common, were chosen according to the equallyspaced centiles of the distribution of event times, with the boundary knots at the first and last event times.The effect of c 0 was also modeled with restricted cubic splines, with three degrees of freedom.
For each two-time-scale DGM we aimed to have a conceptually corresponding model with one time-scale that might be a sufficient mode to capture the effects of both time-scales.Table 2 provides formulae for the fitted models and indication of correspondence with the DGMs.
The survival proportions for x and log hazard ratios were predicted from each fitted model over time-scale t for specific values of c 0 .Average values of bias (the mean difference between the predicted values and true values) and 95% Monte-Carlo confidence intervals (MC CI) were evaluated at t ¼ 1, 3, 5 years and c 0 ¼ 0:5, 4 years.
When analyzing real-world data, the underlying data generating mechanism is unknown, and the AIC (Akaike 1974) and the BIC (Schwarz 1978) are often used to guide in model selection.
In order to mimic this model fitting procedure we also estimated AIC and BIC for each fitted model.We calculated the percentage of times the fitted models showed lowest AIC or BIC relative to other models when fitted to the same dataset from each DGM.We used the number of events in calculation of the BIC (Volinsky and Raftery 2000).
All the results from the analyses including interactive figures with average survival predictions and average log hazard ratios for all follow-up time t for each fitted model for each simulating scenario can be found in the Supplementary Material 1 : https://github.com/nurbatyr/Suppl-material-fitting-1ts-to-mult-ts.
For reference, to mirror the true models used for simulating data, we fitted flexible parametric models that jointly modeled both time-scales using three degrees of freedom for log ðtÞ and for t þ c 0 : Since these models are close approximations of the true models, they are only presented in the Supplementary Material.

Results
Tables and figures from fitting models to simulated samples of size 1000 observations are presented in this article.The Supplementary Material provides further details from fitting models to samples of size 100 observations.
Figures 2-4 present the bias in predicted survival functions and the log hazard ratios at follow-up times t ¼ 1, 3, 5 (years since relapse) and c 0 ¼ 0:5, 4 (time of relapse after transplantation) for the fitted models for the DGMs.

Scenarios with proportional hazards
For two simpler scenarios S 1 , S 2 , which simulated data with proportional hazards for x, the absolute bias was low (<0.06) in the survival estimates from all fitted models when sample size was 1000 observations.In particular, very low absolute values (<0.005) were seen in fitted models 1 Supplementary material may also be found in the online version of this article.

COMMUNICATIONS IN STATISTICS -SIMULATION AND COMPUTATION
Flexible parametric models with one time-scale used to fit the simulated data with two time-scales.

Fitted model
M2, M4 and M6, which included interaction between t and c 0 .For samples of size 100, the absolute bias was still low (<0.08) in the survival estimates but only M2, M4 were seen to have lowest bias.For both sample sizes, larger absolute bias was observed in the log hazard ratios, where only models M1 and M2, with proportional hazards for x, showed close estimates to the true values (absolute bias <0.05).These models were aimed to correspond to S 1 and S 2 , respectively.The AIC and BIC were both seen to also support models M1, M2 most of the time in respect to these scenarios, as shown in Table 3.

Scenarios with interactions with the first time-scale
For scenarios S 3 , S 4 , which generated data with interaction between x and the first time-scale t, the largest bias in survival proportions and the log hazard ratios was seen in the proportional hazards models for both sample sizes.For example, for samples of size 1000, model M1 in scenario S 4 showed a bias of À0.08 (MCSE 0.001) in survival proportions for x ¼ 0 at t ¼ 1, c 0 ¼ 0:5 Figure 2. Bias in survival proportions from models fitted to the simulated samples of size 1000 observations for x ¼ 0 for t ¼ 1, 3, 5 years since relapse given relapse times at c 0 ¼ 0:5, 4 years after transplantation.Note that the y-scales differ across scenarios.and 0.39 (MCSE 0.0026) for the log hazard ratio at t ¼ 5.In contrast, model M4, which conceptually corresponds to S 4 , showed the lowest absolute bias in survival proportions in both S 3 and S 4 (<0.003,<0.006, respectively).Similarly, for samples of size 100, M4 was also seen to have lowest bias in survival estimates.These were closely followed by estimates from models M3 and M7.Very low bias (<0.05) was also observed in the log hazard ratios from models M3, M4 and M7 for samples of size 1000 but bias increased for these models for samples of size 100.Furthermore, model M4 showed the lowest AIC 92.2% of the time compared to other models when fitting data of size 1000 generated by S 4 , and model M3 had the lowest AIC in over 67% of the fitted samples from S 3 .

Scenarios with interactions with the second time-scale
In comparison to other scenarios, larger absolute bias was seen in both survival proportions and the log hazard ratios from all fitted models for scenarios S 5 , S 6 which simulated interactions with the second time-scale t þ c 0 : Models M5, M6 that included interaction x Á c 0 , and conceptually   corresponded to scenarios S 5 , S 6 , respectively, showed low bias (<0.1) in survival predictions for both sample sizes.However, model M7 that included both interaction x Á log ðtÞ and interaction x Á c 0 , showed the lowest absolute bias (<0.05) in survival predictions.Furthermore, only model M7 was seen to have the closest estimates to the true log hazard ratios, particularly for t < 5.However, bias increased with increasing follow-up time, as shown in Figures 10.2,10.3,11.3,11.4 in the Supplementary Material.In addition, model M7 showed the lowest AIC in 83.7% of the fitted samples of size 1000 from S 5 , and only in 19.4% of the fitted samples of size 1000 from S 6 .
3.1.4.Scenario with interactions with both time-scales For scenario S 7 , which included both interaction terms between x and the two time-scales, model M7 was conceptually the closest model.This model had the lowest absolute bias (<0.02) in survival predictions compared to other models for both sample sizes.It also showed the closest log hazard ratios to the true values within the first five years of follow-up time for sample size 1000, but bias increased with increasing follow-up time beyond five years as shown in Figure 12.3 in the Supplementary Material, and was seen to be large for sample size 100.The remaining fitted models had large absolute bias in both survival proportions and the log hazard ratios (>0.1, >1.1, respectively) for both sample sizes.The lowest AIC was seen for model M7 in 835 of the fitted 1000 datasets of size 1000 observations generated by scenario S 7 .

Discussion
The purpose of this simulation study was to assess the performance of one time-scale models fitted to data generated using two simultaneous time-scales.To this end, we simulated data from seven scenarios with two time-scales, and for each scenario we evaluated bias in seven fitted flexible parametric survival models with one time-scale.Although we focused on fitting flexible parametric models, we expect similar results from fitting Cox regression models or Poisson regression models provided the data are split into very short intervals and a smooth function fitted to these intervals.The majority of the one time-scale fitted models that approximately corresponded to the datagenerating mechanisms, showed on average very close estimates to the true survival proportions and hazard ratio estimates, and most of the time were seen to have lowest AIC provided the sample size was large.Fitted models that included an interaction between the main time-scale and the constant from the second time-scale (c 0 Á log ðtÞ), while controlling for proportional hazards or non-proportional hazards for x showed most robust results if data had proportional or non-proportional hazards for x on the first time-scale, respectively.However, if sample size was small (100 observations), then bias in estimates increased compared to when fitting to samples of size 1000.
All models underperformed when estimating the log hazard ratios from data with interaction between x and the second time-scale t þ c 0 : It is highly likely that a researcher encountering data similar to data from S 5 , S 6 , may choose models that include an interaction term between x and the main time-scale, such as models M3, M4 or M7, which include such an interaction.However, this can lead to misleading interpretation that the variation in the effect of x on the hazard rates is solely due to the main time-scale (t) rather than the underlying second time-scale (t þ c 0 ).Even though model M7 that includes both interaction x Á log ðtÞ and interaction x Á c 0 was seen to have very close survival predictions to the true values, the bias in the log hazard ratios was still visibly large, especially for follow-up time beyond five years.We can never be safe from such pitfalls in the analysis of real-world data.With our simulation study, we hope to raise awareness of such pitfalls and direct attention toward more nuanced analysis with cautious approach to modeling time-scales.It should be noted that the simulated scenarios were designed to have joint strong effects from both time-scales, and therefore we do not expect to observe as severe bias in situations with less strong effects.
As in the majority of simulation studies, we only looked at limited number of scenarios, and made simplifying assumptions.In particular, we simulated data using only two time-scales with one type of functional form for both time-scales and interactions with time-scales.It would be of interest to investigate the performance of models fitted to the data generated using different time-scales with different functional forms as well as different lengths of follow-up time.In this study we focused on the estimation of the effect of x, and did not explore the estimation of the effect of c 0 , which is of great interest in clinical practice.Furthermore, we did not investigate possible situations of smaller population size.
The growing complexity of disease progression and increasing length of follow-up time in chronic disease and cancer patients require more advanced models that account for outcomes on multiple time-scales and can provide more reliable estimates with respect to different time-points on either of the time-scales.Unfortunately, it is rarely possible to know with confidence whether the hazard rates depend on more than one time-scale even if the cohort data may contain other variables measured in units of time.If empirical evidence suggests that the hazard rates are dependent on multiple time-scales, then users are encouraged to use models with multiples timescales and, if necessary, allow for non-proportional hazards on either of the time-scales.Using the AIC or the BIC may be helpful in selecting the most appropriate model.

Conclusion
Through this study, we aimed to demonstrate the importance of careful consideration of potential bias in estimates when fitting one time-scale models to data dependent simultaneously on multiple time-scales.
When modeling hazard rates dependent on two time-scales, where one time-scale can be expressed as a function of the other time-scale and a constant offset, the models with one time-scale appear to approximate well the survival proportions and log hazard ratios.We observed this in situations where the models correctly accommodate the proportional or non-proportional hazards on the first time-scale, and include interaction between the first time-scale and the constant c 0 (time-ofentry on the second time-scale).On the other hand, if the cohort data have interactions between the covariates and the second time-scale or both time-scales then caution should be exercised.Using models with one time-scale with non-proportional hazards for the main covariate x and interaction x Á c 0 may provide good approximation of survival proportions but it may still lead to bias in the log hazard ratios.Users may find the AIC useful when making model choices in practice, but this is not guaranteed to select the most appropriate model (Bower et al. 2021).

Disclosure statement
The authors declare that they have no conflict of interest.NB and RS are employed by SDS Life Science and their time within this project was funded by SDS Life Science, however, SDS Life Science had no involvement in the study.

Figure 1 .
Figure 1.Survival proportions, hazard rates and log hazard ratios of the seven data generating mechanisms over time t since relapse given relapses occurred at c 0 ¼ 0:5, 4 years after transplantation.Note that the y-scales differ across scenarios.

Figure 3 .
Figure3.Bias in survival proportions from models fitted to the simulated samples of size 1000 observations for x ¼ 1 for t ¼ 1, 3, 5 years since relapse given relapse times at c 0 ¼ 0:5, 4 years after transplantation.Note that the y-scales differ across scenarios.

Figure 4 .
Figure4.Bias in log hazard ratios from models for t ¼ 1, 3, 5 years since relapse given relapse times at c 0 ¼ 0:5, 4 years after transplantation for sample size 1000 observations.Note that the y-scales differ across scenarios.

Table 1 .
Data-generating mechanisms for 1000 simulated datasets with 100 and 1000 observations with two time-scales: first time-scale,

Table 3 .
Percentage of times the fitted models showed lowest AIC or BIC relative to other models when fitted to the same dataset of size 1000 observations from each data-generating mechanism.Cells in bold represent the highest percentage.