Weighted hazard ratio estimation for delayed and diminishing treatment effect

,


Introduction
Randomised clinical trials (RCTs) are regarded as the gold standard for evaluating the effectiveness of a new treatment.In RCTs with time to event outcomes, the primary outcome is to measure the differences in the survival curves of the different arms and report the treatment effect to quantify treatment differences.The analysis of such trials are routinely accomplished by the log rank test and the Cox proportional hazard (PH) model.The hazard ratio is obtained from the Cox PH model to measure the effectiveness of the new treatment compared to the control group.The Cox PH model assumes that the hazard ratio between the treatment and control group stays constant from the start of study period until the end of the follow up period.Under PH, the log rank test is the most powerful test to detect differences in the survival curves [1].However, non proportional hazards (NPH) are commonly observed in RCTs and have been reported in many published trials e.g. the IPASS trial in lung cancer [2] and the ICON7 trial in ovarian cancer [3] .When the PH assumption does not hold, the log rank test is still valid but may suffer substantial power loss and the interpretation of the hazard ratio becomes invalid and challenging [4].There may be many reasons for non PH to exist in clinical trials, for instance due to delayed clinical effect in the case of immuno-oncology trials.
Various statistical methods have been proposed to analyse time to event outcomes in clinical trials under non proportional hazards scenarios including restricted mean survival time (RMST) [5], weighted log rank tests [6] and maximum combination test (MaxCombo) [7].The RMST measures the average survival time from time t = 0 to some pre-specified time horizon t = τ and it may be estimated as the area under the survival curve up to that time point.The weighted log rank test based on the Fleming-Harrington class of weights G ρ,γ with weight S(t) ρ (1 − S(t)) γ where S(t) is the survival function of the pooled patients from control and treatment group; the parameters ρ, γ allows one to down weight early, late or middle events.The MaxCombo test uses the Fleming-Harrington weight function to analyse time to event outcomes in the presence of non PH.The test avoids having to pre-specify the weight function and considers a procedure which selects the best test from the set of test statistics.The procedure consist of selecting the best combination of weighted log rank test statistics with different choices of ρ and γ.When the MaxCombo approach is used, Lin et al [7] suggest using the weighted hazard ratio for effect estimation.The estimated effect is obtained from a Cox model with a time-varying effect, where the weights are those associated with the weighted log rank test [8].
This paper focuses on the estimation of treatment effect under non proportional hazards, particularly the weighted hazard ratio method proposed by Lin and León [8].They propose fitting a particular Cox model with a time-varying covariate/effect for which the score test corresponds to a weighted log rank test.The model proposes an effect adjustment factor A(t) = w(t) ma x(w(t)) where w(t) is the weight function from the chosen weighted log rank test.This is incorporated in to the Cox PH model to provide time varying effect which can be viewed as the treatment coefficient β weighted by the adjustment factor A(t).The hazard ratio obtained from the model is expressed as HR(t) = e βA(t) = [HR F ] A (t) where HR F represents the maximal effect at time t with A(t) = 1.In this paper, we are interested in investigating whether the estimated treatment effect is unbiased under scenarios in which the associated weighted log rank test is the most powerful.
In Section 2, we describe the standard weighted log rank test and the proposed weighted hazard ratio estimation [8] and its estimation of the time-varying HR.In Section 3, we compare the hazard ratio functions obtained from the proposed model with the true hazard ratio functions analytically.In Section 4, we assess, through simulations, the performance of the proposed model under the two non PH scenario with different choices of ρ and γ.Section 5 is a discussion.

Weighted log rank tests and effect estimation
In this section, we will provide a brief overview of the log rank test and the weighted log rank test, as well as introduce weighted hazard ratio for effect estimation proposed by Lin and León [8].

Weighted log rank test
Let t 1 ,t 2 ,. . .,t r be the ordered failure times across both treatment arms, d 1 j and d 2 j be the number of deaths at t j in treatment and control arm respectively with d j = d 1 j +d 2 j and n 1 j and n 2 j be the number of individuals at risk before t j with n j = n 1 j + n 2 j .Let X 1 , X 2 , . . ., X r denote the treatment assignment where X j =1 if the j-th subject is assigned to treatment arm and X j =0 if assigned to control arm.The log-rank test statistic is defined as, , where e 1 j = n 1 j d j n j is the expected number of failure at time t ( j) under the null hypothesis of equal survival curves.V L is the variance of the observed number of failures, .
Under the null hypothesis of equal survival functions in the two treatment groups, asymptotically the test statistic has a chi-squared distribution on one degree of freedom.Although the log rank test is still a valid test under non proportional hazards it is not the most powerful test [5].Therefore, the weighted version of log rank test is sometimes utilised when the ratio of the two hazard functions are not constant over time [6].
The weighted log rank test (WLRT) assigns a weight function to different time points depending on the expected type of non-proportional hazard scenario.For instance, in immunooncology trials there may be a delayed treatment effect and therefore the survival curve for the treatment group will only emerge to separate from the control survival curve after a certain period of time.In this case, higher weights can be allocated to later time points [7].Formally, the WLRT is defined as, with variance where w(•) is the non-negative weight function.We will consider the Fleming-Harrington family of weighted log rank test, commonly denoted as G ρ,γ with weight, where Ŝ(t) is the Kaplan-Meier estimate of the survival function of the pooled patients from the treatment and control arm.The parameters ρ and γ control the shape of the weight function.G 0,0 represents the log-rank test with w(t) = 1 that has constant weight over time; delayed treatment effects can be tested using G 0,γ with w(t) = (1 − Ŝ(t)) γ that allocates higher weight at later time points to detect late survival difference, G 1,0 represents diminishing effect with w(t) = Ŝ(t) that gives more weight the earlier time points to detect early separation; G 1,1 represents mid separation with w(t) = Ŝ(t)(1 − Ŝ(t)) that puts more weight at the middle of the follow-up period than at the ends.

Weighted hazard ratios
Lin and León [8] proposed fitting a Cox model to provide a time varying treatment effect estimate to complement the weighted log rank test.Lin and León propose fitting a Cox model with, λ(t; X ) = λ 0 (t)e A(t)β X where A(t) is defined by, max(w(s)) is evaluated over the values of t in the observed dataset and A(t) thus has a maximum value of 1.The time varying covariate X * (t) = A(t)X represents the treatment assignment weighted by the adjustment factor.Once X * (t) has been determined, the constant coefficient β can be estimated.It is shown in the paper that the score test of the null hypothesis that β = 0 in the proposed model above is equivalent to the weighted log rank test with the corresponding choice of weight function [8].Lin and León used the fact that the score test from this Cox model corresponds to a particular weighted log rank to motivate estimating the time-varying treatment effect using this Cox model.The coefficient β can be interpreted as the maximal effect where the model assumes the patients experience the most benefit (A(t) = 1).To differentiate the time varying hazard ratio derived from the proposed method, we will denote this as HR L L and it can be expressed as, where HR F =e β represents the maximal effect (i.e., at time t with A(t)=1).The time profile of the treatment effect can be observed as the treatment coefficient β weighted by the adjustment factor A(t).The estimate derived from the model provides a consistent time profile of the hazard ratio over time given that the true hazard ratio also varies over time as specified by A(t).If this does not hold then the treatment effect measure will in general no longer be unbiased for the true value of the hazard ratio at a given time.

Evaluation of weighted hazard ratio estimation method
In this section we give expressions for the hazard ratio function for which the Fleming-Harrington [1] G ρ,γ test statistics are most efficient.We will also be exploring whether, for a given weight function, the Lin and León estimation method [8] gives you the correct treatment effect profile if the true data generating mechanism is such that the corresponding weighted log rank test is the most powerful.

Diminishing effect
Fleming and Harrington [1] showed that the G ρ test statistic is the most powerful test statistic when the hazard ratio at time t is of the following form, where S 1 (t) = exp − t 0 λ 1 (x)dx is the survival function for the control arm.The parameter e ∆ represents the hazard ratio at t = 0.The parameter ρ controls how quickly the effect diminishes.The above hazard ratio function (4) has its full treatment effect initially e β and decreases monotonically to 1. Later, we will demonstrate various examples with ρ = 0.5, 1, 2 to further develop our understanding about the behaviour of the weighted logrank test and the proposed model.We now calculate the weighted hazard ratio function proposed by Lin and León [8] under the set up where the G ρ test statistic would be optimal.In order to calculate the proposed weighted hazard ratio function under the set up where WLRT is optimal, we will need to derive the true pooled survival function which is determined by finding the survival function of the two treatment arms.The hazard function for the treatment arm λ 2 (t) can be derived, assuming that the true data generating mechanism is such that that WLRT is optimal, by using equation ( 4) Thus, the survival function for the treatment arm is, We can replace the adjustment factor A(t) with the weight function that the Lin and León method converges to (as n tends to ∞) (3), in this case S(t).For the Lin and León [8] method, we estimate the pooled survival function S(t) ρ by the Kaplan-Meier estimator for the pooled sample which is consistent for the true pooled survival function.Therefore the Lin and León method consistently estimates, Substituting S 2 (t) in the hazard ratio function (6) for which this WLRT is optimal for gives, When fitting the proposed model, the β is estimated from the Cox PH model once the adjustment factor A(t) is determined.Comparing the true hazard ratio function ( 4) with the proposed hazard ratio function (7), we see that the two expressions do not appear to be equal to each other.In Appendix B, we demonstrate that the two hazard ratio functions are approximately equal to each other when ∆ is close to 0. We also demonstrate this equality graphically in the next section where we choose the value of e ∆ to compare the hazard ratio functions.We also compare the hazard ratio expressions analytically for some particular choices of λ 1 (t) and e ∆ to illustrate in Appendix A.

Delayed treatment effect
The paper by Garès [9] has provided an expression for the hazard function for which the Fleming-Harrington test G γ for detecting late effects is optimal.Assume, the survival time in the control arm is exponentially distributed with some constant hazard λ 1 .Then Garès et al [9] show that the G γ test is optimal for testing the hypothesis, where The expression for the hazard ratio has no closed form and needs numerical integration techniques to compute.Garès et al [9] has provided the definition for the parameter ϕ which is evaluated at some chosen time point τ and is of the following form, where r is the discrepancy rate, r = S 2 (τ)−S 1 (τ) 1−S 1 (τ) .Similar to the diminishing effect scenario, we can check whether the true hazard ratio for which the G γ test is optimal, as given in equation ( 8), agrees with the hazard ratio estimation proposed by Lin and León [8].The true hazard ratio under which the G γ test statistic is optimal is shown to be the second equation in (8).Now the the weighted hazard ratio model proposed by Lin and León corresponds to, From Garès et al, the survival in the treatment and control arm is defined as . We now substitute in for S 1 (t) and S 2 (t), We assume that the Lin and León method estimates the correct hazard ratio (e β ) at the time point in the study where it is the biggest effect i.e. at t = τ.Like in Section 3.1, we show that if the Lin and León method estimates the correct hazard ratio at the time point when it is largest, then it will estimate the HR at all other times correctly too.In Appendix C, we see that the expression for the hazard ratio function from Lin and León model (10) does approximately match the true HR assumed as per equation ( 8) under the condition that ϕ is small.

Simulation studies
This section investigates the weighted hazard ratio method under diminishing and delayed effect using simulations [8].We will focus on investigating the method proposed by Lin and León and its properties when non proportional hazards are present in the data.Specifically, we will empirically check the analytical results of Section 3 that showed that the Lin and León approach is approximately unbiased when the treatment effect is small, and how large the bias might be when the treatment effect is larger.

Scenario I: Diminishing effect
We simulated the survival data where the treatment is assumed to have its maximal effect initially and decrease monotonically to 1.The survival function for the treatment arm is given in equation (5).Three different scenarios ρ = 0.5, 1, 2 were considered to understand the behaviour of the proposed model under various rates at which the effect diminishes.We also chose three different values of the maximum effect e ∆ = 1.4,4, 8 to investigate whether substantial bias occurs when the treatment effect is larger.We fitted the Cox model proposed by [8] where the weight function w(t) = Ŝ(t) ρ for the non PH data we simulated is specified for three scenarios ρ = 0.5, 1, 2. The study enrolls 200 patients, equally allocated in the two arms.The survival of the control arm follows an exponential distribution with hazard rate λ 1 = 0.5.The survival in the treatment arm is as per equation (5) .The follow up time is assumed to be 3 years; patients whose event time exceeds 3 years are censored.We conduced 5000 runs for each scenario.
Figure 1 shows the true hazard ratio over time for which the G ρ statistic is optimal and the average hazard ratio profile over time estimated by the Lin and Leòn method.The plot displays e βS(t) ρ where β is the average of the hazard ratio at t = 0 estimated from the Lin and Leòn method and S(t) is the true pooled survival function.The proposed model overestimates the maximal effect when the true HR at t = 0 i.e. e ∆ is large, and slightly underestimates immediately after 1 month of follow up.Note that when the treatment effect is small, the proposed model estimates the maximal effect very close to the true one, in line with our analytical results in Appendix B. We observe through our work that when the treatment effect is large, the proposed model is not able to estimate the maximal effect at t = 0 correctly which has an impact on the treatment effect profile over time, in that case the implied profile of treatment effect over time would not match the true one (as given by equation ( 4)).

Time
Hazard ratio

Scenario II: Delayed treatment effect
In this scenario, the data were simulated based on the assumption that the treatment will have no effect at the start of the follow up and gradually increases to have its maximal effect at the final time point of follow up period.The study enrolls the same number of patients as in the diminishing effect scenario i.e. 200.The maximal treatment effect occurs at the final time point τ = 2.The survival times in the control arm are exponentially distributed with hazard rate 0.5.The survival times in the active arm can then be simulated using the expression for the hazard function given in (8).To simulate the survival times in the active arm, we calculated the value of S 2 (t) for t from 0 to τ in increments of 0.0005.We then simulated U ∼ U(0, 1) and found the value of t such that S 2 (t) was closest to U. We consider two scenarios of S 2 (τ) = 0.25 and 0.1 which provides the values of discrepancy rate r = −0.19 and −0.42 for γ = 0.5, 1, 2 to verify our results obtained in Section 3. The mechanism behind simulating the non proportional data in this way would ensure that the weight function (1−S(t)) γ would provide the most powerful test statistic.Similar to the previous scenario, we applied the Lin and León Cox PH model to our simulated datasets and compared the estimated HR profile with the true HR profile over time.We again considered three scenarios of γ where it is varied to control the rate of how quickly the treatment effect reaches its maximum effect.The true hazard ratio over time for which the G γ statistic is optimal and the average hazard ratio profile over time estimated by the Lin and Leòn method is displayed in Figure 2. The plot displays the three HR profiles over time for γ = 0.5, 1, 2 for two different values of S 2 (τ).The solid lines represents the true HR profile and the dashed lines represents e β(1−S(t) γ .Note that for both values of S 2 (τ), the estimated average hazard ratio is overestimated for γ = 1, 2 and this deviation is reflected on the HR time profile.Though for other scenarios, we see that the Lin and León method estimates the treatment effect profile over time with minimal bias, in line with our analytical results in Appendix C.The proposed model closely estimates the maximal treatment effect for both scenarios where γ = 0.5 and r is relatively small.Figure 2: Hazard ratio over time targeted by weighted hazard ratio model superimposed true hazard ratio over time under a data generating mechanism where the G γ test is optimal.The solid line shows the true hazard ratio over time for γ = 0.5, 1, 2. The true HR for r = −0.19 at τ = 2 is e β = 1.555, 1.73, 2.112 and for r = −0.42 is e β = 2.999, 3.813, 5.935 for γ = 0.5, 1, 2 respectively.The dashed lines show the proposed hazard ratio targeted by the weighted hazard ratio model.The hazard ratio for the dashed lines is the average of the hazard ratios across 5000 simulations estimated from fitting the Lin and León method and is as follows; e β = 1.558, 1.732, 2.138, e β = 2.998, 3.803, 6.070 for r = −0.19,−0.42 and γ = 0.5, 1, 2 respectively

Discussion
In this paper, we considered the Fleming-Harrington class of weights to simulate non-proportional hazard data for the which the same weight function would give the most optimal test.The Lin and León method consists of fitting a Cox model based on a time-varying treatment covariate.The score test from the proposed model gives a p-value which is equivalent to the weighted log rank test and the estimate derived from the model provides a time-profile of the treatment effect.We considered two non proportional hazard scenarios to investigate its ability to correctly estimate the maximal treatment effect and the corresponding treatment effect profile over time when the data were simulated such that the corresponding weighted log rank test would be most powerful.
We showed analytically that the Lin and Leòn method is approximately unbiased in both scenarios when the treatment effect is small.Our simulation results for the diminishing treatment effect scenario showed that the Lin and León method overestimates the maximal treatment effect when the true treatment effect at t = 0 is large as well as when ρ is large.For scenarios where the treatment effect is small, the proposed model estimates the maximal effect very close to the true one.For the delayed treatment effect scenario, the Lin and León method overestimates the maximal effect for γ = 1, 2 but in our simulations not by very much.The Lin and León method gave essentially unbiased estimates of the full effect and treatment effect profile over time for γ = 0.5.
We have shown through simulations that under the delayed treatment effect scenarios, the proposed model is giving close to the correct profile over time when treatment effect is small.However, this does not guarantee that the proposed model would provide unbiased results had the data generating mechanism been different, for example if the the hazard in the control arm was not constant and changes over time.
The choice of ρ and γ in estimating the treatment effect requires extensive knowledge of how the true hazard ratio changes over time.From a testing perspective, choosing the correct value of ρ or γ is less crucial since the test still controls the type I error rate if the incorrect value is used.However, for estimation of the treatment effect profile over time, use of the incorrect value would lead to biased estimates.Given the difficulty of knowing the correct value of ρ and γ, it may be preferable to use methods such as restricted mean survival time which do not rely on how quickly or slowly the treatment effect changes over time, and for which the interpretation of the treatment effect is clinically meaningful to the clinicians and patients [5].

Appendix B: Taylor series expansion of the true hazard ratio function and the proposed hazard ratio function for the diminishing effect
In this section, we demonstrate that the true hazard ratio function and the proposed hazard ratio functions are approximately equal when e ∆ is close to 1 i.e., when ∆ is small.The evaluation of these results are executed generally for ρ.Recall the true hazard ratio function and the proposed hazard ratio function are of the following form, We first assume β = ∆ in order to check whether the Lin and León model estimates the treatment effect over time correctly if it estimates the initial effect correctly.

HR(t,∆)
HR L L (t,∆) .We now proceed with a Taylor series expansion for f (∆), We shall expand the first two terms of the series about ∆ = 0, HR L L (t,0) =1.We now find the first order derivative of f (∆) with respect to ∆ using the quotient rule, so we have We derive the following derivatives for the true hazard ratio function using the product and chain rule, Now, we find the derivative for the proposed hazard ratio function, where we now emphasise the dependence of S 2 (t, ∆) on ∆.
We now evaluate the derivative of the proposed hazard ratio function at ∆ = 0, using the fact that S 2 (t, 0) = S 1 (t).We can substitute the results in the Taylor series expansion, The results confirms that the hazard ratio functions are approximately equal to each other when the treatment effect e ∆ is small.

Appendix C: Taylor series expansion of the true hazard ratio function and the proposed hazard ratio function for the delayed effect effect
We demonstrate the approximate equality of the two hazard ratio functions for the delayed effect when ϕ is small.We show the results for a general γ.Recall, the true hazard ratio and the proposed hazard ratio are of the following form, s ds and γ (x) = x 0.5 1 s L γ (s) ds.Recall S 2 (t) = ( ) −1 ( (e −λ 1 t ) + ϕ) and S 1 (t) = e −λ 1 t.We again assume that the Lin and León model estimates the maximal treatment effect β correctly.For simplicity we write β as some function of ϕ and note that the maximal treatment effect occurs at t = τ, so we have g(ϕ) = log We let f (ϕ) = HR(t,ϕ) HR L L (t,ϕ) and proceed with the first order Taylor series expansion for the ratio of the hazard ratio functions, i.e. f (ϕ) ≈ f (0) + f ′ (0)(ϕ − 0) for small ϕ.We check for ϕ = 0, HR(t, 0) = L γ (S 2 (t,0)) L γ (S 1 (t)) = 1 and HR L L (t, 0) = 1, hence f (0) = 1 1 = 1.We now find the first order derivative of f (ϕ) with respect to ϕ and we get the same expression as for the diminishing effect, i.e.

(a) e ∆ = 1 4 ( 8 ! 8 (c) e ∆ = 8 Figure 1 :
Figure1: Hazard ratio over time targeted by weighted hazard ratio model superimposed true hazard ratio over time under a data generating mechanism where the G ρ test is optimal.The solid line shows the true hazard ratio over time for ρ = 0.5, 1, 2. The true hazard ratio for the three scenarios ρ = 0.5, 1, 2 at t = 0 are e ∆ = 1.4,4, 8 respectively and the hazard rate in the control arm stays the same in all the scenarios λ 1 (t) = 0.5.The dotted lines show the proposed hazard ratio targeted by the weighted hazard ratio model.The hazard ratio for the dotted lines is the average of the hazard ratios estimated from fitting the Lin and León method and is as follows; the average of the log hazard ratio across 5000 simulations is e β = 1.41, 4.07, 8.15,e β = 1.41, 4.07, 8.40, e β = 1.39, 4.16, 9.06 for ρ = 0.5, 1, 2.