Log-rank and stratified log-rank tests

In randomized clinical trials with right-censored time-to-event outcomes, the popular log-rank test, which does not adjust for baseline covariates, is asymptotically valid for testing the treatment effect under simple randomization of treatments but is conservative under covariate-adaptive randomization. The stratified log-rank test, which adjusts for baseline covariates by stratification, is asymptotically valid regardless of which treatment randomization scheme is applied. In the literature, however, there is no affirmative conclusion about whether, under simple randomization, the stratified log-rank test is asymptotically more powerful than the unstratified log-rank test. In this article we show when the stratified and unstratified log-rank tests aim for the same null hypothesis and that, under simple randomization, the stratified log-rank test is asymptotically more powerful than the unstratified log-rank test in the region of the alternative hypothesis specified by a Cox proportional hazards model. We also discuss why an affirmative conclusion is not available in general.


Introduction
The log-rank test (Mantel, 1966) and the stratified log-rank test (Peto et al., 1976) are the two longstanding and most popular nonparametric tests for treatment effect in randomized clinical trials with two treatment arms and right-censored time-to-event outcomes. What motivates the stratified version of the log-rank test is that baseline prognostic factors (covariates), measured prior to treatment assignment and thus not affected by treatment, are adjusted for through stratification for efficiency gain.
Adjusting for baseline covariates has been widely advocated to improve the efficiency of tests and other analyses, in the following two respects. (i) In the design stage, covariate-adaptive randomization can be used to enforce the balance of treatment assignments across baseline prognostic factors, which results in more efficient tests (EMA, 2015). More details about covariate-adaptive randomization are given in Section 2. (ii) In the analysis stage, 'incorporating prognostic baseline factors in the primary statistical analysis of clinical trial data can result in a more efficient use of data to demonstrate and quantify the effects of treatment' (FDA, 2021), 'under approximately the same minimal statistical assumptions that would be needed for unadjusted' (EMA, 2015; FDA, 2021; ICH E9, 1998).
If the log-rank test is considered an 'unadjusted test', then the stratified log-rank test qualifies as an adjusted test under the same minimal assumptions because it is still a nonparametric test that does not use any model. Tests using the Cox proportional hazards model as a working model also qualify (DiRienzo & Lagakos, 2002; Kong & Slud, 1997; Lin & Wei, 1989), but the resulting tests can be less efficient than the unadjusted log-rank test when the working model is wrong (Kong & Slud, 1997). In this paper we focus on the stratified and unstratified log-rank tests.
Although the stratified log-rank test uses information from baseline prognostic factors and thus is expected to be more efficient, no affirmative conclusion is available about whether it is asymptotically more efficient than the unstratified log-rank test under simple randomization, in which patients are assigned to treatments completely at random. Another issue is that the stratified log-rank test actually tests a null hypothesis stronger than that of the log-rank test and, hence, a prerequisite for their comparison is to investigate when the two null hypotheses are the same.
The purpose of this paper is to establish some affirmative conclusions about the stratified and unstratified log-rank tests, in terms of null hypothesis, asymptotic validity of tests, and Pitman's asymptotic relative efficiency. The research is important because these two longstanding tests are widely used in applications without guidance on which one should be used.
Section 2 describes data, design, and log-rank test statistics. Section 3 introduces hypotheses, assumptions, and the concept of validity for log-rank tests. Some theoretical results for stratified and unstratified log-rank tests are given in Section 4, where we show that, under simple randomization, the stratified log-rank test is asymptotically more powerful in the region of the alternative hypothesis specified by a Cox proportional hazards model. Section 5 contains conclusions and the Appendix provides technical proofs.

Data, design and test statistics
For a patient from the population under investigation, let $T_j$ and $C_j$ be the potential lifetime and right-censoring time, respectively, under treatment $j = 0$ or $1$, and let $W$ be the vector of all baseline covariates and other time-varying covariates, observed or unobserved. Suppose that a random sample of $n$ patients is obtained from the population with independent $(T_{i0}, C_{i0}, T_{i1}, C_{i1}, W_i)$, $i = 1, \ldots, n$, identically distributed as $(T_0, C_0, T_1, C_1, W)$. For each patient, only one of the two treatments is assigned and received.
Let $I_i$ be a binary treatment indicator for patient $i$ and $0 < \pi < 1$ be the pre-specified treatment assignment proportion for treatment 1. Consider the design, i.e., the generation of the $I_i$'s for $n$ sequentially arriving patients. Simple randomization assigns patients to treatments completely at random with $P(I_i = 1) = \pi$ for all $i$, which may yield treatment proportions that deviate substantially from the target $\pi$ across levels of some baseline prognostic factors. Because of this, covariate-adaptive randomization using $Z$, a sub-vector of $W$ containing observed baseline prognostic factors with finitely many joint levels, is widely applied. When patient $i$ with baseline $Z_i = z$ arrives, a treatment is assigned using a mechanism that depends on all previously assigned treatments for patients in the same stratum $z$. For example, the most popular covariate-adaptive randomization scheme, the stratified permuted block design (Zelen, 1974), randomly assigns sequentially arriving patients with $Z_i = z$ in blocks of size $B$, each having $B\pi$ patients in treatment 1, where $B$ is chosen so that $B\pi$ is an integer and the last block is allowed to be incomplete. Another popular covariate-adaptive randomization scheme is Pocock-Simon's minimization (Pocock & Simon, 1975; Taves, 1974). Other schemes can be found in two reviews, Schulz and Grimes (2002) and Shao (2021). Covariate-adaptive randomization is widely used: it was used in more than 500 clinical trials between 1989 and 2008 (Taves, 2010) and in 237 of nearly 300 trials published in two years, 2009 and 2014 (Ciolino et al., 2019). All commonly used covariate-adaptive randomization schemes satisfy the following mild condition (Antognini & Zagoraiou, 2015).
(D1) [...] for all $i$; and for every level $z$ of $Z$, $n_{z1}/n_z \to \pi$ in probability as $n \to \infty$, where $n_z$ is the number of patients with $Z_i = z$ and $n_{z1}$ is the number of patients with $Z_i = z$ and $I_i = 1$.
Most commonly used covariate-adaptive randomization schemes except Pocock-Simon's minimization also satisfy the next condition.
(D2) Conditional on $Z_1, \ldots, Z_n$, the vector whose $z$th component is $\sqrt{n}\,(n_{z1}/n_z - \pi)$, with $z$ ranging over all levels of $Z$, converges in distribution to $N(0, \Sigma)$, where $\Sigma$ is the diagonal matrix whose $z$th diagonal entry is $\nu/P(Z = z)$ and $\nu \le \pi(1 - \pi)$ is a known constant depending on the randomization scheme.
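The two designs described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any trial: the function names, the `seed` argument, and the representation of strata as hashable labels are all assumptions made here. The sketch shows how a stratified permuted block design keeps the treated fraction $n_{z1}/n_z$ close to $\pi$ within every stratum, whereas simple randomization does not enforce this.

```python
import random
from collections import defaultdict


def simple_randomization(strata, pi=0.5, seed=0):
    """Assign each patient to treatment 1 with probability pi, ignoring strata."""
    rng = random.Random(seed)
    return [1 if rng.random() < pi else 0 for _ in strata]


def stratified_permuted_block(strata, block_size=4, pi=0.5, seed=0):
    """Assign sequentially arriving patients using permuted blocks within strata.

    Each block holds block_size * pi ones (treatment 1) in random order;
    the last block of a stratum may stay incomplete.
    """
    n_treated = block_size * pi
    if abs(n_treated - round(n_treated)) > 1e-9:
        raise ValueError("block_size * pi must be an integer")
    n_treated = round(n_treated)
    rng = random.Random(seed)
    pending = defaultdict(list)  # unused assignments of the current block, per stratum
    assignments = []
    for z in strata:
        if not pending[z]:  # start a new block for this stratum
            block = [1] * n_treated + [0] * (block_size - n_treated)
            rng.shuffle(block)
            pending[z] = block
        assignments.append(pending[z].pop())
    return assignments


def treated_fraction_by_stratum(strata, assignments):
    """Return n_z1 / n_z for every stratum z."""
    total, treated = defaultdict(int), defaultdict(int)
    for z, a in zip(strata, assignments):
        total[z] += 1
        treated[z] += a
    return {z: treated[z] / total[z] for z in total}
```

With `block_size=4` and `pi=0.5`, the number of treated patients in a stratum can differ from half the stratum size by at most one, which is the kind of balance that conditions such as (D1) and (D2) formalize asymptotically.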
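Although simple randomization is not counted as covariate-adaptive randomization, it satisfies (D1) and (D2) with $\nu = \pi(1 - \pi)$.
After $I_i$ is assigned, the observed outcome from patient $i$ is $\{\min(T_{i1}, C_{i1}), \delta_{i1}\}$ if $I_i = 1$ and $\{\min(T_{i0}, C_{i0}), \delta_{i0}\}$ if $I_i = 0$, where $\delta_{ij}$ is the indicator of $T_{ij} \le C_{ij}$. The log-rank test statistic is
$$\mathcal{T}_L = \frac{\sqrt{n}\,\hat U_L}{\hat\sigma_L}, \qquad \hat U_L = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \left\{ I_i - \frac{\bar Y_1(t)}{\bar Y(t)} \right\} dN_i(t), \qquad \hat\sigma_L^2 = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \frac{\bar Y_1(t)\,\bar Y_0(t)}{\bar Y(t)^2}\, dN_i(t), \tag{1}$$
where $\bar Y_1(t) = n^{-1}\sum_{i=1}^n I_i Y_i(t)$, $\bar Y_0(t) = n^{-1}\sum_{i=1}^n (1 - I_i) Y_i(t)$, $\bar Y(t) = \bar Y_1(t) + \bar Y_0(t)$, $Y_i(t) = I_i Y_{i1}(t) + (1 - I_i) Y_{i0}(t)$, $N_i(t) = I_i N_{i1}(t) + (1 - I_i) N_{i0}(t)$, $Y_{ij}(t)$ is the indicator of $\min(T_{ij}, C_{ij}) \ge t$, $N_{ij}(t)$ is the indicator of the event $T_{ij} \le \min(t, C_{ij})$, and the upper limit $\tau$ in the integral is a point satisfying $P(\min(T_{ij}, C_{ij}) \ge \tau) > 0$ for $j = 0, 1$.
The stratified log-rank test statistic is a weighted average of the stratum-specific log-rank test statistics with strata constructed using $Z$,
$$\mathcal{T}_{SL} = \frac{\sqrt{n}\,\hat U_{SL}}{\hat\sigma_{SL}}, \qquad \hat U_{SL} = \frac{1}{n}\sum_{z}\sum_{i:\,Z_i = z} \int_0^\tau \left\{ I_i - \frac{\bar Y_{z1}(t)}{\bar Y_z(t)} \right\} dN_i(t), \qquad \hat\sigma_{SL}^2 = \frac{1}{n}\sum_{z}\sum_{i:\,Z_i = z} \int_0^\tau \frac{\bar Y_{z1}(t)\,\bar Y_{z0}(t)}{\bar Y_z(t)^2}\, dN_i(t), \tag{2}$$
where $\bar Y_{z1}(t) = n^{-1}\sum_{i:\,Z_i = z} I_i Y_i(t)$, $\bar Y_{z0}(t) = n^{-1}\sum_{i:\,Z_i = z} (1 - I_i) Y_i(t)$, and $\bar Y_z(t) = \bar Y_{z1}(t) + \bar Y_{z0}(t)$.
It is clear that, in terms of test statistics, the stratified $\mathcal{T}_{SL}$ in (2) utilizes $Z$ values whereas the unstratified $\mathcal{T}_L$ in (1) is unadjusted. Under covariate-adaptive randomization, $\mathcal{T}_L$ is not completely unadjusted since it uses $Z$-information through the assignments $I_i$, although it does not adjust for covariate-adaptive randomization in a correct way. On the other hand, the stratified $\mathcal{T}_{SL}$ uses $Z$-information in both the design and analysis stages.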
We consider stratification with all levels of $Z$. In applications, it is allowed to use more covariates to form strata; the conclusions in what follows remain the same. However, it is not a good idea to use fewer levels of $Z$ for stratification, because doing so may result in a test that is not asymptotically valid.
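To make the two statistics concrete, the following Python sketch computes them in their standard textbook form: observed minus expected events in treatment 1, divided by the square root of the hypergeometric variance, with the stratified version pooling numerators and variances across strata. This is an assumption-laden illustration rather than the exact expressions (1) and (2): the function names are hypothetical, the normalization may differ from the paper's by a factor that cancels in the ratio, and the tie-corrected variance is used here for convenience.

```python
import math
from collections import defaultdict


def logrank_components(time, event, arm):
    """Return (O - E, V) for arm 1, summed over distinct observed times.

    time: observed times min(T, C); event: 1 if failure observed, else 0;
    arm: 1 for treatment 1, 0 for treatment 0.
    """
    data = sorted(zip(time, event, arm))
    n_at_risk = len(data)
    n1_at_risk = sum(a for _, _, a in data)
    score, var = 0.0, 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        d = d1 = leave = leave1 = 0
        # Everyone with observed time t is still at risk at t.
        while i < len(data) and data[i][0] == t:
            _, e, a = data[i]
            d += e          # failures at t
            d1 += e * a     # failures at t in arm 1
            leave += 1
            leave1 += a
            i += 1
        if d > 0:
            p1 = n1_at_risk / n_at_risk
            score += d1 - d * p1
            if n_at_risk > 1:
                var += d * p1 * (1 - p1) * (n_at_risk - d) / (n_at_risk - 1)
        n_at_risk -= leave
        n1_at_risk -= leave1
    return score, var


def logrank_statistic(time, event, arm):
    """Unstratified log-rank statistic (undefined if the variance is zero)."""
    score, var = logrank_components(time, event, arm)
    return score / math.sqrt(var)


def stratified_logrank_statistic(time, event, arm, stratum):
    """Stratified log-rank statistic: pool O - E and V over strata."""
    groups = defaultdict(list)
    for t, e, a, z in zip(time, event, arm, stratum):
        groups[z].append((t, e, a))
    score = var = 0.0
    for rows in groups.values():
        ts, es, arms = zip(*rows)
        s, v = logrank_components(ts, es, arms)
        score += s
        var += v
    return score / math.sqrt(var)
```

When all patients share one stratum, the two functions return the same value; they differ once risk sets are formed separately within levels of $Z$.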

Null hypothesis, assumption and validity
Throughout, $\alpha \in (0, 1)$ denotes a given significance level and $z_{\alpha/2}$ is the $(1 - \alpha/2)$th quantile of the standard normal distribution. When $|\mathcal{T}_L| > z_{\alpha/2}$, the log-rank test rejects the following null hypothesis $H_0$ of no treatment effect,
$$H_0: \lambda_1(t) = \lambda_0(t) \ \text{ for all } t, \tag{3}$$
where $\lambda_j(t)$ is the unconditional hazard function of $T_j$, $j = 0, 1$. $H_0$ in (3) is a commonly adopted null hypothesis of no treatment effect unconditional on covariates. The log-rank test is nonparametric. Its validity requires non-informative censoring (DiRienzo & Lagakos, 2002; Kong & Slud, 1997), i.e.,
(C) $C_j$ is independent of $T_j$ given $j$.
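The rejection rule above is simple to state in code. The sketch below, using only the Python standard library, computes the critical value $z_{\alpha/2}$ and applies the two-sided rule to an already computed statistic; the function names are hypothetical and the p-value helper is an addition for illustration.

```python
from statistics import NormalDist


def two_sided_critical_value(alpha=0.05):
    """Return z_{alpha/2}, the (1 - alpha/2)th standard normal quantile."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def rejects(statistic, alpha=0.05):
    """Two-sided rule: reject when |statistic| exceeds z_{alpha/2}."""
    return abs(statistic) > two_sided_critical_value(alpha)


def two_sided_p_value(statistic):
    """Two-sided p-value under a standard normal reference distribution."""
    return 2 * (1 - NormalDist().cdf(abs(statistic)))
```

The normal reference distribution is justified only asymptotically and only when the test is valid in the sense discussed in this section.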
Under simple randomization, it is well known (Kalbfleisch & Prentice, 2011) that the log-rank test is asymptotically valid in the sense that, under $H_0$,
$$\lim_{n \to \infty} P(|\mathcal{T}_L| > z_{\alpha/2}) \le \alpha, \tag{4}$$
with equality holding for at least one population $P$ under $H_0$. Unlike simple randomization, covariate-adaptive randomization generates a dependent sequence of treatment assignments, which may render conventional methods developed under simple randomization, such as the log-rank test, invalid under covariate-adaptive randomization (EMA, 2015; FDA, 2021). It is shown in Ye and Shao (2020) that, under covariate-adaptive randomization with $\nu$ in (D2) strictly smaller than $\pi(1 - \pi)$, the log-rank test is asymptotically conservative in the sense that
$$\lim_{n \to \infty} P(|\mathcal{T}_L| > z_{\alpha/2}) < \alpha \tag{5}$$
for all $P$ under $H_0$.
The stratified log-rank statistic $\mathcal{T}_{SL}$ in (2) actually tests the null hypothesis
$$\lambda_1(t \mid z) = \lambda_0(t \mid z) \ \text{ for all } t \text{ and } z, \tag{6}$$
where $\lambda_j(t \mid z)$ is the hazard function of $T_j$ conditional on $Z = z$, $j = 0, 1$. Note that the null hypothesis in (6) holds if and only if the hazard functions are the same in every stratum $z$ and, thus, is stronger than $H_0$ in (3). The validity of the stratified log-rank test requires the following assumption on censoring:
(CZ) $C_j$ is independent of $T_j$ given $j$ and $Z$.
Conditions (C) and (CZ) are not comparable, although both are implied by the condition that $C_j$ is independent of $(T_j, Z)$ given $j$, a reasonable condition for non-informative censoring.
Under simple randomization and covariate-adaptive randomization satisfying (D1) in Section 2, (4) holds with $\mathcal{T}_L$ replaced by $\mathcal{T}_{SL}$ and $H_0$ in (3) replaced by the null hypothesis in (6) (Ye & Shao, 2020), provided that all levels of $Z$ are used in stratification. Since the null hypothesis in (6) is stronger than $H_0$ in (3), the stratified and unstratified log-rank tests are not directly comparable. Thus, a prerequisite for comparing the efficiency of the two log-rank tests is that the two null hypotheses coincide. Is there a scenario under which they do? Consider the following transformation model assumption.
(TR) There is an increasing function $h$ such that $h(P(T_0 \ge t \mid V)) = \theta + h(P(T_1 \ge t \mid V))$ for all $(t, V)$ and a constant $\theta$, where $V$ is a vector of covariates, $Z \subset V \subset W$, and both $h$ and $\theta$ can be unknown.
Assumption (TR), discussed in Cheng et al. (1995), includes many commonly used semiparametric models as special cases, for example, the Cox proportional hazards model (see formula (7) in Section 4). It is a mild assumption since $h$ may be unknown; we only need to know that it exists.
The proof of the following result is in the Appendix.

Comparison of two log-rank tests
When the null hypotheses in (3) and (6) coincide, is the stratified log-rank test $\mathcal{T}_{SL}$ more efficient than the unstratified log-rank test $\mathcal{T}_L$ under simple randomization, when both tests are asymptotically valid? Intuitively this sounds correct since $\mathcal{T}_L$ does not adjust for covariates.
Unfortunately, there is no result on this in the literature. In this section we try to fill this gap to some extent and explain why the two log-rank tests are not comparable in terms of efficiency in general. This is important because both stratified and unstratified log-rank tests are widely used in applications.
To this end, we first state the following result (whose proof is given in the Appendix) on the asymptotic distributions of the stratified and unstratified log-rank test statistics under local alternatives. Define [...]. Also, we use $O_j$ to denote $O_{ij}$ for any $i$ and $O_{zj}$ to denote $O_{zij}$ for any $i$ and $z$. Note that, under the null hypothesis $H_0$ in (3), $E(O_j) = 0$ for $j = 0, 1$, and under the null hypothesis in (6), $E(O_{zj} \mid Z = z) = 0$ for all $z$ and $j = 0, 1$.
Theorem 4.1: (a) Assume (CZ) and (D1). Under the local alternative hypothesis that $E(O_{zj} \mid Z = z) = c_{zj} n^{-1/2}$ with $c_{zj}$'s not depending on $n$ and that $\lambda_1(t \mid z)/\lambda_0(t \mid z)$ is bounded and $\to 1$ for every $t$ and $z$, $\mathcal{T}_{SL}$ [...], where $E_{H_0}$ and $\mathrm{var}_{H_0}$ denote expectation and variance under the null hypothesis, respectively.
(b) Assume (C), (D1), and (D2). Under the local alternative hypothesis that $E(O_j) = c_j n^{-1/2}$ with $c_j$'s not depending on $n$ and that $\lambda_1(t)/\lambda_0(t)$ is bounded and $\to 1$ for every $t$, $\mathcal{T}_L$ [...].
Because the local alternative hypotheses specified in (a) and (b) of Theorem 4.1 do not follow any model, $\delta_{SL}^2$ and $\delta_L^2$ can be arbitrarily different and, thus, $\mathcal{T}_{SL}$ and $\mathcal{T}_L$ may not be comparable in terms of asymptotic efficiency. In other words, the space of alternative hypotheses is too large to compare the efficiency of $\mathcal{T}_{SL}$ and $\mathcal{T}_L$, as there is no model at all. A semiparametric model that narrows down the space of alternative hypotheses may lead to affirmative results on the comparison of efficiency. We derive a result under the Cox proportional hazards model to highlight this.
Suppose that the true hazard follows a Cox proportional hazards model,
$$\lambda_j(t \mid V) = \lambda(t) \exp(\theta j + \eta^\top V), \quad j = 0, 1, \tag{7}$$
where $\lambda_j(t \mid V)$ is the hazard of $T_j$ conditional on covariate $V$, $\theta$ is an unknown parameter, $\eta$ is an unknown parameter vector, and $\lambda(t)$ is an unspecified function. Under model (7), (TR) holds with $h(s) = -\log(-\log(s))$.
Corollary 4.1: Assume that model (7) holds, $C_j$ is independent of $(T_j, Z)$ given $j$, and $P(C_1 \ge t \mid V) = P(C_0 \ge t \mid V)$ for all $t$. Then, under simple randomization, the stratified log-rank test $\mathcal{T}_{SL}$ is always more efficient than the unstratified log-rank test $\mathcal{T}_L$ in terms of Pitman's asymptotic relative efficiency.
The proof is given in the Appendix. A key to the proof is that the local alternative hypotheses in (a) and (b) of Theorem 4.1 can be unified into $\theta = c/\sqrt{n}$ with the help of model (7). As both log-rank tests are nonparametric and do not need model (7), what does Corollary 4.1 tell us? It says that, under simple randomization, the stratified log-rank test $\mathcal{T}_{SL}$ is more efficient in the region of the alternative hypothesis specified by model (7), although we cannot claim that $\mathcal{T}_{SL}$ is more efficient in the entire alternative hypothesis space.
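The claim that the Cox model is a special case of (TR) with $h(s) = -\log(-\log(s))$ can be checked directly. The short derivation below assumes that model (7) has the usual form $\lambda_j(t \mid V) = \lambda(t)\exp(\theta j + \eta^\top V)$ and writes $\Lambda(t)$ for the cumulative baseline hazard, a symbol introduced here only for this sketch.

```latex
% Assumption: model (7) reads \lambda_j(t | V) = \lambda(t) \exp(\theta j + \eta^\top V).
\begin{align*}
\Lambda(t) &= \int_0^t \lambda(s)\,ds, \\
P(T_j \ge t \mid V) &= \exp\{-\Lambda(t)\, e^{\theta j + \eta^\top V}\}, \qquad j = 0, 1, \\
\log\{-\log P(T_j \ge t \mid V)\} &= \log \Lambda(t) + \theta j + \eta^\top V, \\
h(P(T_0 \ge t \mid V)) - h(P(T_1 \ge t \mid V))
  &= -\{\log\Lambda(t) + \eta^\top V\} + \{\log\Lambda(t) + \theta + \eta^\top V\} = \theta .
\end{align*}
% h is increasing on (0, 1): h'(s) = -1 / (s \log s) > 0 because \log s < 0.
```

Hence $h(P(T_0 \ge t \mid V)) = \theta + h(P(T_1 \ge t \mid V))$ for all $(t, V)$, which is exactly the form required by (TR).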
We now turn to covariate-adaptive randomization, under which the unstratified log-rank test $\mathcal{T}_L$ is not valid but conservative, as discussed in Section 3. On the other hand, by Theorem 4.1(a), the stratified log-rank test $\mathcal{T}_{SL}$ is valid for testing the null hypothesis in (6) regardless of which covariate-adaptive randomization scheme is applied. Therefore, the stratified log-rank test is a clear winner when covariate-adaptive randomization is applied.
Another way to adjust for covariates used in randomization is the modified (unstratified) log-rank test $\mathcal{T}_{RL} = \hat\sigma_L \mathcal{T}_L / \hat\sigma_L(\nu)$ proposed by Ye and Shao (2020), where $\hat\sigma_L(\nu)$ is a consistent estimator of $\sigma_L(\nu)$ (see Section 3.2 of Ye & Shao, 2020). $\mathcal{T}_{RL}$ removes the conservativeness of $\mathcal{T}_L$ and is valid for testing $H_0$ in (3) under covariate-adaptive randomization.
Even if model (7) holds, $\mathcal{T}_{SL}$ and $\mathcal{T}_{RL}$ are not comparable in terms of asymptotic efficiency. We provide two simulation examples to demonstrate that $\mathcal{T}_{SL}$ is more efficient than $\mathcal{T}_{RL}$ in one scenario but less efficient in another. The simulation setting is model (7) with $\lambda(t) = 12^{-1}\log 2$ for all $t$ and $\eta^\top V = -1.5 Z_1 + 0.5 Z_2^2$, where $Z_1$ is binary with $P(Z_1 = 1) = 0.5$, $Z_2 \sim N(0, 1)$, and $Z_1$ and $Z_2$ are independent. $Z_1$ and $Z_2$ discretized into four equal-probability categories are used for stratified permuted block randomization with block size 4. In scenario 1, censoring is independent of treatment and $(Z_1, Z_2)$ and is uniformly distributed on $(10, 40)$. In scenario 2, censoring is independent of treatment and $Z_2$ but depends on $Z_1$: given $Z_1$, censoring is distributed as 10 plus an exponential random variable with mean $2Z_1$. The power curves over $\theta$ with $\alpha = 0.05$ and $n = 500$, based on 2000 simulations, are given in Figure 1. Note that $\mathcal{T}_{SL}$ is more powerful than $\mathcal{T}_{RL}$ under scenario 1 but less powerful under scenario 2. Both $\mathcal{T}_{SL}$ and $\mathcal{T}_{RL}$ are more powerful than the conservative $\mathcal{T}_L$ in every case.
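For readers who want to experiment with this setting, the following Python sketch generates one data set under scenario 1. It is an illustration under stated assumptions, not the authors' simulation code: it assumes that model (7) has the form $\lambda_j(t \mid V) = \lambda(t)\exp(\theta j + \eta^\top V)$, that the assignment proportion is $\pi = 1/2$ (the text does not state $\pi$), and that strata are the eight joint levels of $Z_1$ and the quartile category of $Z_2$. The function name and the dictionary layout of each record are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_scenario1(n=500, theta=0.0, seed=0, block_size=4):
    """Generate one trial under scenario 1 (uniform censoring on (10, 40)).

    Assumptions: pi = 1/2, constant baseline hazard log(2)/12, and
    stratified permuted blocks over (Z1, quartile category of Z2).
    """
    rng = random.Random(seed)
    base_hazard = math.log(2) / 12
    # Quartile cut points of N(0, 1) give four equal-probability categories.
    cuts = [NormalDist().inv_cdf(q) for q in (0.25, 0.5, 0.75)]
    pending = {}  # unused assignments of the current block, per stratum
    rows = []
    for _ in range(n):
        z1 = 1 if rng.random() < 0.5 else 0
        z2 = rng.gauss(0.0, 1.0)
        stratum = (z1, sum(z2 > c for c in cuts))
        if not pending.get(stratum):  # start a new permuted block
            block = [1] * (block_size // 2) + [0] * (block_size // 2)
            rng.shuffle(block)
            pending[stratum] = block
        arm = pending[stratum].pop()
        # Constant hazard means an exponential survival time.
        rate = base_hazard * math.exp(theta * arm - 1.5 * z1 + 0.5 * z2 ** 2)
        t = rng.expovariate(rate)
        c = rng.uniform(10, 40)
        rows.append({"time": min(t, c), "event": int(t <= c),
                     "arm": arm, "stratum": stratum})
    return rows
```

Feeding such data sets, over a grid of `theta` values and many seeds, into implementations of the competing test statistics would give empirical power curves of the kind summarized in Figure 1.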
The reason why the stratified $\mathcal{T}_{SL}$ and the modified unstratified $\mathcal{T}_{RL}$ are not comparable in asymptotic efficiency is that the two tests adopt different approaches to utilizing baseline covariates: the former adjusts for baseline covariates by stratification, whereas the latter utilizes baseline covariates by modifying the unstratified $\mathcal{T}_L$, whose performance is affected by covariate-adaptive randomization.

Conclusion and discussion
(1) Under some semiparametric models for survival time, such as the transformation model (TR) described in Section 3, the null hypotheses of the stratified and unstratified log-rank tests are the same.
(2) Under simple randomization of treatment assignments, the stratified log-rank test is asymptotically more efficient than the unstratified log-rank test in terms of Pitman's asymptotic relative efficiency in the region of the alternative hypothesis specified by the Cox proportional hazards model given by (7). It is of interest to derive more affirmative results using assumptions or models other than the Cox model to narrow down the space of alternative hypotheses.
(3) Under covariate-adaptive randomization of treatment assignments, the unstratified log-rank test is not asymptotically valid but conservative, whereas the stratified log-rank test is asymptotically valid as long as the covariates used in randomization are all included in stratification. Thus, the stratified log-rank test is a clear winner. A modified unstratified log-rank test removes the conservativeness and is valid, but there is no definite conclusion about its efficiency relative to the stratified log-rank test, because the two tests apply different approaches to utilizing covariates.
(4) Because the region specified by the Cox model is quite large and the stratified log-rank test is a clear winner under covariate-adaptive randomization, we recommend the stratified log-rank test over the unstratified log-rank test.
Appendix
Proof of Corollary 4.1. Under the local alternative hypothesis $\theta = c/\sqrt{n}$ with a fixed constant $c \neq 0$, by Theorem 4.1, [...] and $\mathcal{T}_{SL} \xrightarrow{d} N(c\,\theta_{SL}/\sigma_L, 1)$, where [...]. Pitman's asymptotic relative efficiency of $\mathcal{T}_{SL}$ with respect to $\mathcal{T}_L$ is $\theta_{SL}^2/\theta_L^2$. Applying Jensen's inequality $\varphi\{E(M)\} \le E\{\varphi(M)\}$ with the convex function $\varphi(t_1, t_2) = t_1^2/t_2$ and $M = (E_{H_0}\{Y_i(t)\exp(\eta^\top V_i) \mid Z_i\},\ E_{H_0}\{Y_i(t) \mid Z_i\})$, we obtain that $\theta_L \le \theta_{SL}$. To reach the conclusion $\theta_{SL}^2/\theta_L^2 \ge 1$, it remains to show that $\theta_L \ge 0$. The condition $P(C_1 \ge t \mid V) = P(C_0 \ge t \mid V)$ for all $t$ implies that $v(t) = \pi(1 - \pi)$ and, hence, [...]. Thus, it suffices to show [...] (A3), where $E_V$ is the expectation with respect to covariate $V_i$ and [...] does not depend on $\theta$. Taking the derivative with respect to $\theta$, we obtain that [...]. Then, [...], which is the same as the left-hand side of (A3). As $E\{N_{i1}(\tau)\}$ is the probability of having an observed failure before time $\tau$, it is a non-decreasing function of $\theta$. This implies that (A3) holds.
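The Jensen step in the proof relies on the convexity of $\varphi(t_1, t_2) = t_1^2/t_2$ on the region $t_2 > 0$. A brief check, added here as a sketch rather than as part of the original proof, is that the Hessian of $\varphi$ is a non-negative multiple of a rank-one outer product.

```latex
\[
\nabla^2 \varphi(t_1, t_2)
= \begin{pmatrix} 2/t_2 & -2 t_1/t_2^2 \\ -2 t_1/t_2^2 & 2 t_1^2/t_2^3 \end{pmatrix}
= \frac{2}{t_2^3}
  \begin{pmatrix} t_2 \\ -t_1 \end{pmatrix}
  \begin{pmatrix} t_2 & -t_1 \end{pmatrix}.
\]
% For t_2 > 0 the factor 2 / t_2^3 is positive and the outer product is
% positive semi-definite, so \varphi is convex on \{(t_1, t_2) : t_2 > 0\}.
\[
\varphi\{E(M)\} \le E\{\varphi(M)\}
\quad \text{for any integrable } M = (M_1, M_2) \text{ with } M_2 > 0 \text{ almost surely.}
\]
```

In the proof, the second component of $M$ is a conditional expectation of the at-risk indicator $Y_i(t)$, which is positive for $t \le \tau$ under the stated condition on $\tau$, so the inequality applies.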