Bayesian multivariate logistic regression for superiority and inferiority decision-making under observable treatment heterogeneity

The effects of treatments may differ between persons with different characteristics. Addressing such treatment heterogeneity is crucial to investigate whether patients with specific characteristics are likely to benefit from a new treatment. The current paper presents a novel Bayesian method for superiority decision-making in the context of randomized controlled trials with multivariate binary responses and heterogeneous treatment effects. The framework is based on three elements: a) Bayesian multivariate logistic regression analysis with a Pólya-Gamma expansion; b) a transformation procedure to map the obtained regression coefficients onto a more intuitive multivariate probability scale (i.e., success probabilities and the differences between them); and c) a compatible decision procedure for treatment comparison with prespecified decision error rates. Procedures for a priori sample size estimation under a non-informative prior distribution are included. A numerical evaluation demonstrated that decisions based on a priori sample size estimation resulted in the anticipated error rates in the trial population as well as in subpopulations. Further, average and conditional treatment effect parameters could be estimated without bias when the sample was large enough. An illustration with the International Stroke Trial dataset revealed a trend towards heterogeneous effects among stroke patients: something that would have remained undetected had the analyses been limited to average treatment effects.


Introduction
The current paper focuses on estimating heterogeneous treatment effects based on covariates in the context of two-arm randomized controlled trials (RCTs) with multiple (correlated) binary outcome variables. Such RCTs are randomized experiments with subjects assigned at random to either an experimental or a control group, often with the objectives a) to evaluate whether an experimental treatment is superior or inferior to a control condition; and b) to inform assignment to eligible subjects in practice (Food and Drug Administration, 2016). Although RCTs are broadly applicable to experimental research in general, we focus on the health domain and use the word treatment to refer to psychological and medical interventions in the broad sense. These interventions include, but are not limited to, behavioral therapies, pharmacological support, and other experimental types of care.
These trials often assess multiple types of (clinical) events (e.g. quitting substance abuse, death), functional measures (e.g. memory decline, ability to walk), and disease symptoms (e.g. fatigue, anxiety) (Food and Drug Administration, 2017), which can provide multidimensional insights into the effects of a treatment. Including such comprehensive insights can improve the correspondence between statistical and clinical decision-making, since multiple effects of the intervention can be combined and weighted in various ways to provide a single statistical decision regarding superiority or inferiority (e.g. Pocock et al., 1987; O'Brien, 1984; Murray et al., 2016). Whereas performing multiple univariate analyses on individual outcomes is a common strategy, a single multivariate analysis takes correlations into account and can be statistically preferable (Senn and Bretz, 2007; Ristl et al., 2018; Food and Drug Administration, 2017; Murray et al., 2016). Multivariate analysis has the potential to reduce decision errors: correlations influence the sample sizes required for decision-making with prespecified error rates and lead to under- or overpowered trials when wrongly omitted (Chow et al., 2017; Kavelaars et al., 2020; Sozu et al., 2010; Xiong et al., 2005).
RCTs often focus on average treatment effects (ATEs) among the study population when comparing interventions (Thall, 2020). Average treatment effects can be sufficiently insightful when the effects of a treatment are relatively homogeneous over the trial population. However, average effects may give a limited, or even erroneous, impression when the effects of a treatment are heterogeneous and thus interact with characteristics of patients. In that case, treatment effects conditional on a subpopulation contribute to a better understanding of the treatment's potential and are more informative for clinicians advising treatments to patients with specific characteristics. Despite efforts to provide statistical methodology to identify heterogeneous treatment effects (e.g. Wang et al., 2015; Yang et al., 2021; Jones et al., 2011), investigating these effects is not yet standard practice: Thall notes that "the great majority of clinical trial designs ignore the possibility of treatment-covariate interactions, and often ignore patient heterogeneity entirely" (Thall, 2020, p. 1). This is unfortunate, as addressing potential treatment heterogeneity in the evaluation of treatments is crucial to a) identify which patients are likely to benefit from a treatment; and b) optimize treatment results of individual patients via personalized treatment assignment (Goldberger and Buxton, 2013; Hamburg and Collins, 2010; Wang et al., 2015; Simon, 2010). In sum, statistical analysis based on the combination of multiple outcome variables and treatment heterogeneity has the potential to reveal different outcome patterns for different patient profiles, thereby contributing to the personalization of treatment assignment.
An example of a trial with multiple outcomes and potential treatment heterogeneity is the International Stroke Trial (IST; Sandercock et al., 2011; International Stroke Trial Collaborative Group, 1997). Strokes may have far-reaching implications for the quality of life, as they may be recurring and/or lead to long-term impaired (daily) functioning. The IST investigated whether the short-term and long-term perspective of stroke patients can be improved with anti-thrombotic drug therapy. The average treatment differences in the IST were small, so one might conclude that treatment with one of these drugs was marginally effective.
However, these overall findings did not show how specific characteristics of patients (e.g. sex or age) and/or disease (e.g. type of stroke or functional status after stroke) potentially interacted with the treatment to produce different perspectives for patients with different profiles. Average treatment effects do not, for example, reveal whether older patients have better prospects in terms of short-term damage risk and/or long-term recovery potential than younger patients. Clearly, heterogeneous effects such as these would have clinically and psychologically relevant implications and advocate the development of more personalized treatment policies.
Although theoretically relevant in many contemporary RCTs, decision-making under treatment heterogeneity in the multivariate context is considerably more complex than in the non-heterogeneous and/or univariate setting. Generalizations are subject to assumptions that need to be carefully evaluated in light of the research problem at hand. First, the multivariate setting demands an analysis method that incorporates the correlation between outcome variables (i.e. a multivariate analysis method) to obtain accurate decision error rates (e.g. Kavelaars et al., 2020). For accurate inference regarding conditional treatment effects, the analysis should not only include the overall correlation among the trial population, but should also be flexible enough to deal with correlations that differ over subpopulations. The latter is not evident in existing multivariate analysis methods for binary outcome variables: some methods impose the marginal correlation structure of the trial population on subpopulations (e.g. multivariate probit models by Chib (1995) or Rossi et al. (2005) and multivariate logit models by Malik and Abraham (1973) and O'Brien and Dunson (2004)). Second, the interpretation of treatment effects can be complex in multivariate non-linear models.
Insight into so-called marginal effects is strongly recommended in treatment comparison, demanding that any multivariate method return interpretable univariate effects (Food and Drug Administration, 2017; O'Brien and Dunson, 2004). Several existing multivariate models lack insight into marginal distributions (e.g. Malik and Abraham, 1973). Third, multivariate methods may estimate a single regression parameter to capture the relation between a covariate and all outcome variables (e.g. O'Brien and Dunson, 2004; Rossi et al., 2005). The latter assumes that all outcome variables vary identically over the full support of the covariate: an assumption that may be too strict to hold in practice.
As a more flexible alternative to capture the complexity of heterogeneous, multivariate treatment effects, we build upon an existing Bayesian multivariate Bernoulli framework for superiority decision-making proposed by Kavelaars et al. (2020). The existing procedure consists of three major components: a) a conjugate multivariate Bernoulli model to estimate unknown (regression) parameters; b) a transformation procedure to interpret treatment effects on the (more intuitive) probability scale; and c) a compatible decision procedure to make inferences regarding treatment superiority. The multivariate Bernoulli model has advantages over several other approaches, as it relies on a multinomial distribution and has the flexibility to allow univariate effects, correlations between outcomes, and multivariate effects to vary with covariates.
Although joint response probabilities can provide useful insights, the transformation procedure facilitates the interpretation of treatment comparison: marginal (i.e. univariate) probabilities, multivariate probabilities, and differences between (multivariate) probabilities can be used in inference as well.
The framework is suitable for estimation and inference among the trial population (i.e. ATEs), but does not incorporate patient characteristics to model heterogeneous treatment effects directly. Therefore, we expand the framework with a Bayesian multivariate logistic regression analysis to incorporate potential treatment heterogeneity via the inclusion of covariates, aiming to facilitate treatment comparison among subpopulations and contribute to personalized treatment assignment. The proposed modelling procedure relies on multinomial logistic regression, can model treatment effects and correlations at the subpopulation level, and is suitable for estimation and inference among populations other than the trial population. The transformation procedure is essential in this extension, as the model produces multinomial regression coefficients, which have no straightforward interpretation in the context of (multivariate) treatment comparison.
Along with the regression model, we include a procedure to compute sample sizes for decision-making with prespecified frequentist error rates.
The paper is organized as follows. In the next section, we introduce the decision framework, including the multivariate logistic regression model to obtain a sample from the multivariate posterior distribution of regression coefficients, a transformation procedure to find posterior treatment differences, and a decision procedure to draw conclusions regarding treatment superiority and inferiority. The section on capturing heterogeneity explains how the framework can be applied to different patient populations. We evaluate frequentist operating characteristics of the framework via simulation in the numerical evaluation section.
Next, we illustrate the methods with data from the International Stroke Trial and conclude the paper with a discussion.

Multivariate logistic regression
Response y_i^k is the binary response of subject i on outcome variable k ∈ {1, . . ., K}, where y_i^k ∈ {0, 1}, with 0 = failure and 1 = success. Vector y_i = (y_i^1, . . ., y_i^K) is the multivariate (or joint) binary response vector of subject i on K outcomes and has configuration H_q•, which is one of the Q = 2^K possible response combinations of length K given in the q-th row of matrix H. The probability of y_i can be expressed in two meaningful and related ways. First, θ_i = (θ_i^1, . . ., θ_i^K) denotes the vector of K-variate success probabilities on individual outcomes 1, . . ., K, where θ_i^k = p(y_i^k = 1). Second, φ_i = (φ_i^1, . . ., φ_i^Q) denotes the vector of Q-variate joint response probabilities, where φ_i^q = p(y_i = H_q•) and φ_i sums to unity. The joint response of subject i can be conditioned on covariates in vector x_i = (x_i1, . . ., x_iP).
In this case, the probabilities of response vector y_i | x_i are expressed as functions of x_i, namely φ_i(x_i) and θ_i(x_i). Joint response probability φ_i^q(x_i) maps the dependency of joint response probabilities on covariates x_i via a multinomial logistic function:

φ_i^q(x_i) = exp{ψ_i^q(x_i)} / Σ_{q'=1}^{Q} exp{ψ_i^{q'}(x_i)},    (2)

for response categories q = 1, . . ., Q − 1. In Equation 2, ψ_i^q(x_i) reflects the linear predictor of response category q and subject i:

ψ_i^q(x_i) = β_0^q + β_1^q x_i1 + . . . + β_P^q x_iP.    (3)

Here, x_ip can be a treatment indicator, a patient characteristic, or an interaction between these. Vector β^q = (β_0^q, β_1^q, . . ., β_P^q) is the vector of regression coefficients of response category q. To ensure identifiability, all regression coefficients of response category Q are fixed at zero, i.e. β^Q = 0, such that ψ_i^Q(x_i) = 0.
The likelihood of the response data follows from taking the product over the n individual joint response probabilities from Equation 2 over the Q response categories:

p(y | X, β) = Π_{i=1}^{n} Π_{q=1}^{Q} φ_i^q(x_i)^{I(y_i = H_q•)},

where I(.) is the indicator function. Bayesian analysis is done via the posterior distribution, which is given by

p(β | y, X) ∝ p(y | X, β) p(β),

where p(β) reflects the prior distribution of the unknown parameters before observing the data. Posterior sampling can be done with a Gibbs sampling algorithm based on a Pólya-Gamma expansion (Polson et al., 2013). Computational details of this procedure can be found in Appendix A.
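As a concrete illustration of the link function in Equation 2 with the identifiability constraint β^Q = 0, the following minimal sketch (with illustrative coefficient values, K = 2 outcomes, and Q = 4 categories) computes a subject's joint response probabilities:

```python
import numpy as np

def joint_response_probs(beta, x):
    """Multinomial logistic link (Equation 2): map covariate vector x to
    the Q joint response probabilities. beta has shape (Q-1, P+1); the
    Q-th category is the reference with coefficients fixed at zero."""
    x1 = np.concatenate(([1.0], x))      # prepend intercept
    psi = beta @ x1                      # linear predictors for categories 1..Q-1
    psi = np.concatenate((psi, [0.0]))   # reference category Q: psi_Q = 0
    e = np.exp(psi - psi.max())          # numerically stabilized softmax
    return e / e.sum()

# Illustrative coefficients: Q = 4 joint categories, one covariate (P = 1)
beta = np.array([[ 0.2, -0.1],
                 [-0.3,  0.4],
                 [ 0.1,  0.0]])
phi = joint_response_probs(beta, np.array([0.5]))  # phi sums to one
```

Subtracting the maximum linear predictor before exponentiating leaves the probabilities unchanged but avoids numerical overflow for large coefficients.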

Transformation to treatment differences
We aim to make the posterior sample of regression coefficients interpretable in terms of a treatment difference, which is defined as the (multivariate) difference between the success probabilities of two treatments. To this end, we execute the following multistep procedure, using a hypothetical setup of the IST as a running example.
Suppose we are interested in the effect of a combined drug therapy (Heparin plus Aspirin; T_H+A) versus a single drug therapy (Aspirin only; T_A) on recurrent stroke in the short term (y^strk) and dependency in the long term (y^dep). There is a total of Q = 4 response categories: {y^strk = 1, y^dep = 1}, {y^strk = 1, y^dep = 0}, {y^strk = 0, y^dep = 1}, and {y^strk = 0, y^dep = 0}, which we refer to as {11}, {10}, {01}, and {00} respectively. The treatments are blood-thinning agents and may thus interact with the patient's blood pressure. Therefore, we include systolic blood pressure at the time of randomization as a covariate in the linear predictor of each response category. The transformation procedure is then as follows:

1. Regression coefficients β to joint response probabilities φ_T(x): In the first step, the posterior sample of regression coefficients β is transformed to a treatment effect in terms of joint response probabilities φ_T(x_i) for each treatment T ∈ {0, 1}, by applying the multinomial logistic function of Equation 2 to linear predictor ψ_T^q(x_i). For example, the probability that patient i in the IST does not experience a new stroke but is dependent after six months is the joint response probability of category {01}:

φ_T^{01}(x_i) = exp{ψ_T^{01}(x_i)} / Σ_{q'=1}^{Q} exp{ψ_T^{q'}(x_i)}.
Note that we are interested in joint response probability φ_T(x), which reflects a treatment effect among a population defined by x and is more general than the joint response probability of an individual patient with covariates x_i. This population can be reflected by an individual patient in some situations, while other cases target the entire study population or a subpopulation of interest. These variations have slightly different computational procedures, which we discuss in more detail in Section 3.
2. Joint response probabilities φ_T(x) to multivariate success probabilities θ_T(x): The next step in the transformation involves the conversion from joint response probabilities φ_T(x) to multivariate success probabilities of the individual outcome variables, θ_T(x). Especially when the number of outcome variables increases, success probabilities are more straightforward to interpret than joint response probabilities.
The relation between both quantities is additive: success probability θ_T^k on outcome k under treatment T equals the sum of a selection of the elements of φ_T, encoded by selection matrix U^k:

θ_T^k = Σ_{q=1}^{Q} U_q^k φ_T^q,

where U^k selects the 2^{K−1} rows of H that have their k-th element equal to 1. More concretely, for the two outcome variables from the IST (dropping the dependency on x for notational simplicity):

θ_T^strk = φ_T^{11} + φ_T^{10},    θ_T^dep = φ_T^{11} + φ_T^{01}.

Hence, the multivariate success probability vector θ_T = (θ_T^strk, θ_T^dep) consists of these univariate success probabilities. The correlation between the outcome variables is captured in joint response probabilities φ_T(x) and automatically taken into account in further transformations (Olkin and Trikalinos, 2015; Dai et al., 2013).
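The selection step can be illustrated as follows: H holds all Q = 2^K joint response configurations, and θ_T^k sums the elements of φ_T whose configuration has a 1 in position k (the φ_T values below are illustrative):

```python
import itertools
import numpy as np

K = 2
# H: all Q = 2^K joint response configurations, one per row (rows 11, 10, 01, 00)
H = np.array(list(itertools.product([1, 0], repeat=K)))

def marginal_success_probs(phi, H):
    """theta_T^k = sum of the joint probabilities phi_T^q over the rows of H
    that have a 1 in column k (the selection encoded by U^k)."""
    return np.array([phi[H[:, k] == 1].sum() for k in range(H.shape[1])])

phi_T = np.array([0.10, 0.20, 0.30, 0.40])  # illustrative phi_{11}, phi_{10}, phi_{01}, phi_{00}
theta_T = marginal_success_probs(phi_T, H)
# theta^strk = phi_11 + phi_10 = 0.30 and theta^dep = phi_11 + phi_01 = 0.40
```

Because the marginals are linear functions of φ_T, the transformation carries the full posterior uncertainty of the joint probabilities through to the marginal scale.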
3. Success probabilities θ_T(x) to treatment differences δ(x): The treatment difference on outcome k, δ^k(x), is defined as the difference between the success probabilities of the two treatments on outcome k:

δ^k(x) = θ_1^k(x) − θ_0^k(x).

The K-variate treatment difference is then δ(x) = (δ^1(x), . . ., δ^K(x)). In the IST, multivariate treatment difference δ = (δ^strk, δ^dep) is the vector of the univariate treatment differences on stroke and dependency. Applying the three above-mentioned steps to each draw of the posterior sample of β results in a posterior sample of multivariate treatment difference δ(x). This sample provides estimates that can be used for prediction, where various posterior summaries (e.g. a mean or a highest posterior density interval) can condense the sample into a point or interval estimate. Moreover, the sample can be used for statistical inference, as outlined in the next subsection.
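Steps 2 and 3 applied to a posterior sample can be sketched as follows; for brevity, the posterior draws of φ_T are generated directly from illustrative Dirichlet distributions rather than from step 1, so all numbers are stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
L = 1000  # number of posterior draws

# Illustrative stand-in for the posterior samples of joint response
# probabilities per treatment (in practice these result from step 1)
phi_1 = rng.dirichlet([40, 20, 20, 20], size=L)  # treatment T = 1
phi_0 = rng.dirichlet([30, 20, 30, 20], size=L)  # treatment T = 0

# Selection matrices U^k as rows: categories ordered 11, 10, 01, 00
U = np.array([[1, 1, 0, 0],   # first outcome = 1 in categories 11 and 10
              [1, 0, 1, 0]])  # second outcome = 1 in categories 11 and 01

theta_1 = phi_1 @ U.T         # (L, K) marginal success probabilities
theta_0 = phi_0 @ U.T
delta = theta_1 - theta_0     # posterior sample of the treatment differences

post_mean = delta.mean(axis=0)                        # point estimate per outcome
cred_int = np.percentile(delta, [2.5, 97.5], axis=0)  # 95% credible interval
```

Each row of `delta` is one posterior draw of the K-variate treatment difference, so summaries and inferential decisions operate on the whole sample at once.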

Posterior decision-making
Decisions rely on estimated treatment effects and their uncertainties. More formally, multivariate treatment difference δ has parameter space S = [−1, 1]^K, which is divided into a rejection region S_R and a non-rejection region S_N. Rejection region S_R reflects the part of the parameter space that indicates the treatment difference of interest, while non-rejection region S_N refers to the part of the parameter space that would not be considered a (relevant) treatment difference. Rejection regions depend on the type of decision and may be composed of multiple subregions if desired (van Ravenzwaaij et al., 2019). We consider the following three (commonly used) decision types: 1. superiority, with rejection region S_R = S_S, where the treatment is better; 2. inferiority, with rejection region S_R = S_I, where the treatment is worse; 3. two-sided, with rejection region S_R = S_S ∪ S_I, where the treatment can be either better or worse.
We would conclude superiority and/or inferiority when the posterior probability that treatment difference δ(x) lies in the rejection region exceeds a prespecified decision threshold p_cut:

p(δ(x) ∈ S_R | y, X) > p_cut.    (12)

When the functional form of the posterior distribution is unknown, the rejection probability can be computed from an MCMC sample of L draws from the posterior distribution of δ(x). Equation 12 is then applied in practice as:

(1/L) Σ_{l=1}^{L} I(δ^(l)(x) ∈ S_R) > p_cut.
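In code, the Monte Carlo approximation of the rejection probability amounts to counting the posterior draws that fall in S_R; the one-dimensional superiority region and the posterior sample below are illustrative:

```python
import numpy as np

def rejection_probability(delta_draws, in_region):
    """Monte Carlo estimate of p(delta in S_R | data): the fraction of
    posterior draws for which the rejection-region indicator is True."""
    return np.mean([in_region(d) for d in delta_draws])

rng = np.random.default_rng(0)
delta_draws = rng.normal(0.10, 0.05, size=10_000)  # illustrative posterior sample
p_cut = 0.95                                       # prespecified decision threshold
p_rej = rejection_probability(delta_draws, lambda d: d > 0)
superior = p_rej > p_cut                           # superiority conclusion
```

The same pattern applies to multivariate regions: only the indicator function `in_region` changes with the decision rule.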

Superiority
In a situation with multiple outcome variables, superiority and inferiority can be defined in multiple ways, resulting in different rejection regions (e.g. Pocock et al., 1987; Pocock, 1997; O'Brien, 1984; Prentice, 1997; Tang et al., 1993; Zhao et al., 2007). Although not intended as an exhaustive overview, we list three possible rules and graphically present their rejection regions in Figure 1. Two of these rules (which we refer to as the "Any" and "All" rules) are presented as part of the regulatory guideline regarding multiple endpoints (Food and Drug Administration, 2017) and have been extensively discussed in the literature (e.g. Chuang-Stein et al., 2006; Sozu et al., 2010, 2016; Xiong et al., 2005). The third rule ("Compensatory") is a relatively unknown, flexible alternative that weighs the benefits and risks of treatments by their (clinical) relevance (Murray et al., 2016; Kavelaars et al., 2020).
1. Any rule: The Any rule results in superiority or inferiority when the difference between success probabilities is larger or smaller than zero, respectively, on at least one of the outcome variables (Sozu et al., 2016). The superiority and inferiority spaces are defined as:

S_S = {δ : δ^k > 0 for at least one k},    S_I = {δ : δ^k < 0 for at least one k}.

2. All rule: The All rule results in superiority or inferiority when the difference between success probabilities is larger or smaller than zero, respectively, on all of the outcome variables (Sozu et al., 2010). The superiority and inferiority spaces are defined as:

S_S = {δ : δ^k > 0 for all k},    S_I = {δ : δ^k < 0 for all k}.

3. Compensatory rule: The Compensatory rule results in superiority or inferiority when the weighted difference between success probabilities is larger or smaller than zero, respectively. With weights w = (w^1, . . ., w^K), where Σ_k w^k = 1, the superiority and inferiority spaces are defined as:

S_S = {δ : Σ_{k=1}^{K} w^k δ^k > 0},    S_I = {δ : Σ_{k=1}^{K} w^k δ^k < 0}

(Murray et al., 2016; Kavelaars et al., 2020).
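The three rejection regions can be expressed as indicator functions over the posterior draws of δ; the draws and weights below are illustrative values, not estimates from the paper:

```python
import numpy as np

def any_rule(delta):              # superiority: delta^k > 0 for at least one k
    return np.any(delta > 0)

def all_rule(delta):              # superiority: delta^k > 0 for all k
    return np.all(delta > 0)

def compensatory_rule(delta, w):  # superiority: weighted difference > 0
    return np.dot(w, delta) > 0

rng = np.random.default_rng(2)
# Illustrative posterior sample of a bivariate treatment difference
draws = rng.multivariate_normal([0.08, -0.02], 0.001 * np.eye(2), size=5000)
w = np.array([0.50, 0.50])        # illustrative equal weights

p_any = np.mean([any_rule(d) for d in draws])
p_all = np.mean([all_rule(d) for d in draws])
p_comp = np.mean([compensatory_rule(d, w) for d in draws])
# The All region is contained in the Any region, so p_all <= p_any
```

Each posterior rejection probability is then compared against p_cut exactly as in Equation 12.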

Sample size computations
To control decision error rates, methods for a priori sample size estimation are available for variables that follow a multivariate Bernoulli distribution and are eligible for large-sample approximation by a (multivariate) normally distributed latent variable (Sozu et al., 2010, 2016; Chow et al., 2017). When combined with a non-informative prior distribution, these procedures have been shown to accurately control the Type I error rate α and the Type II error rate β in a Bayesian multivariate Bernoulli-Dirichlet model for multivariate response data (Kavelaars et al., 2020). Each of the decision rules presented in Subsection 2.3 has an individual procedure to compute sample sizes, as discussed below. These equations provide insight into the required number of observations in the absence of prior information and into the influence of the correlation on the sample size. They also allow for verification that correlated outcome variables might result in smaller sample sizes than uncorrelated outcome variables under some conditions detailed in Food and Drug Administration (2017) and Kavelaars et al. (2020). For notational simplicity, we discard the dependence on x in the remainder of this subsection.

All and Any rules
Sample size computations for the All and Any rules were formulated by Sozu et al. (2010) and Sozu et al. (2016) respectively and rely on the assumption of a multivariate normal latent variable. The power, 1 − β, can be expressed in terms of a cumulative K-variate normal distribution Ψ_K with mean 0 and correlation matrix Σ (Sozu et al., 2016):

1 − β = Ψ_K(c_1, . . ., c_K; 0, Σ).    (17)

In Equation 17, c_k for outcome k is defined by the decision rule of interest. Further, the off-diagonal elements of Σ denote (estimated) pairwise correlations between outcome variables. For the Any rule,

c_k = δ^k √(n / (σ_{k,0}^2 + σ_{k,1}^2)) − z_{1−α_K},    (18)

and for the All rule,

c_k = δ^k √(n / (σ_{k,0}^2 + σ_{k,1}^2)) − z_{1−α}.    (19)

In Equations 18 and 19, n is the sample size per treatment, σ_{k,T}^2 denotes the variance of outcome k under treatment T, and z_(.) refers to the selected (1 − α_K) or (1 − α) quantile of the univariate normal distribution, where α_K is a multiplicity-adjusted significance level.
Since the cumulative multivariate normal distribution does not have a closed form, the sample size that satisfies the targeted decision error rates can be found via the iterative procedure proposed by Sozu et al. (2010): starting from a small n, the power in Equation 17 is evaluated and n is increased stepwise until the power reaches 1 − β, where z_{1−β} is the (1 − β) quantile of the univariate normal distribution. For the Compensatory rule, the corresponding computation relies on the weighted success probabilities w^k θ_T^k (Kavelaars et al., 2020).
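Under the latent-normal approximation of Equation 17, the iterative search for n can be sketched as follows. This is a minimal sketch: the per-outcome variance term θ_T^k(1 − θ_T^k), the unadjusted All-rule critical value, and all numerical inputs are illustrative assumptions, not values from the paper:

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def power_all_rule(n, delta, theta0, theta1, Sigma, alpha=0.05):
    """Approximate power of the All rule with n subjects per arm, using a
    K-variate normal with correlation matrix Sigma (Equation 17)."""
    var = theta0 * (1 - theta0) + theta1 * (1 - theta1)  # per-outcome variances
    c = delta * np.sqrt(n / var) - norm.ppf(1 - alpha)
    mvn = multivariate_normal(mean=np.zeros(len(delta)), cov=Sigma)
    return mvn.cdf(c)  # P(all K statistics exceed their critical values)

def required_n(power_fn, target=0.80):
    """Doubling then bisection search for the smallest n reaching target power."""
    hi = 2
    while power_fn(hi) < target:
        hi *= 2
    lo = hi // 2
    while lo < hi:
        mid = (lo + hi) // 2
        if power_fn(mid) >= target:
            hi = mid
        else:
            lo = mid + 1
    return hi

delta = np.array([0.15, 0.15])                        # illustrative true differences
theta0, theta1 = np.array([0.40, 0.40]), np.array([0.55, 0.55])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])            # assumed latent correlation
n = required_n(lambda m: power_all_rule(m, delta, theta0, theta1, Sigma))
```

Because the power is monotone in n, the bisection returns the smallest per-arm sample size whose approximate power reaches the target.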

Capturing treatment heterogeneity
In the proposed framework, treatment heterogeneity can be captured by joint response probabilities that reflect conditional treatment effects and thus depend on the characteristics of a subpopulation of interest.
We describe two ways to represent subpopulations: by fixed covariate values or by a prespecified interval of the covariate distribution(s). Both representations have their own applications. Specific values of covariates may be relevant when we wish to investigate treatment effects for individual patients or for patient populations that can be accurately represented by a single value of the covariate (such as a mean or a level of a discrete variable). Intervals of covariate distributions may be sensible in particular when multiple consecutive covariate values are sufficiently exchangeable to estimate a marginal treatment effect among a population specified by this range. Although such intervals can be specified for discrete covariates as well, their use is particularly reasonable with continuous covariates, as intervals are inherently consistent with the idea of continuity.
We will discuss procedures for fixed values as well as intervals in more detail in the remainder of this subsection. In these discussions, we use a linear predictor ψ_T^q(x) (cf. Equation 3) that distinguishes between treatments via a treatment indicator and allows for interaction between the treatment and a covariate. For such a model that includes a single population characteristic z, x = (z, T, zT) and ψ_T^q(x) is defined as:

ψ_T^q(x) = β_0^q + β_1^q z + β_2^q T + β_3^q zT.

Fixed values of covariate
For a patient population with fixed values of patient covariates, a posterior sample of joint response probabilities φ_T(x) can be found by plugging a vector of fixed covariate values x into linear predictor ψ_T^q(x) for each posterior draw of β and applying the multinomial logistic function of Equation 2.

Marginalization over a distribution of covariates
When the population is characterized by a range of covariates, the treatment effect can be marginalized over the interval under consideration, based on available information regarding the distribution of the covariate.
A sample of covariate data can be used as input for marginalization. Empirical marginalization involves repeating the fixed-values procedure for each subject in the sample to obtain a sample of joint response probabilities for each posterior draw (l). Averaging the resulting sample of joint response probabilities per treatment results in a marginal joint response probability φ_T^(l)(x) for draw (l). The procedure is presented in Algorithm 2 in the online supplemental materials. Empirical marginalization is computationally efficient for patient populations defined by intervals of more than one continuous covariate. Note, however, that the procedure is prone to sampling variability in x and that estimation might depend on the availability of cases with the selected covariate values. Increasing the specificity of subpopulations, often the result of a higher number of included covariates and/or a limited interval size, will reduce the number of available observations eligible for inclusion.
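Empirical marginalization can be sketched as follows; the model (one covariate plus treatment and interaction), the stand-in posterior draws, and the covariate interval are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
L, Q = 500, 4  # posterior draws and joint response categories

def joint_probs(beta, z, T):
    """Multinomial logit for one subject with covariate z and treatment T,
    using linear predictor (1, z, T, zT); category Q is the reference."""
    design = np.array([1.0, z, T, z * T])
    psi = np.append(beta @ design, 0.0)
    e = np.exp(psi - psi.max())
    return e / e.sum()

beta_draws = rng.normal(0, 0.5, size=(L, Q - 1, 4))      # stand-in posterior sample
z_sample = rng.normal(size=200)                          # observed covariate data
subpop = z_sample[(z_sample > -1.0) & (z_sample < 0.0)]  # covariate interval

# For each posterior draw l: average the subject-level joint response
# probabilities over the subpopulation to obtain phi_T^(l)(x)
phi_marg = np.array([
    np.mean([joint_probs(beta_draws[l], z, T=1) for z in subpop], axis=0)
    for l in range(L)
])
```

Averaging within each draw, rather than across draws, preserves the posterior uncertainty of the marginal joint response probabilities.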

Numerical evaluation
The current section presents an evaluation of the performance of the proposed multivariate logistic regression procedure.The goal of the evaluation was threefold and we aimed to demonstrate: 1. how well the obtained regression coefficients and treatment effects correspond to their true values to examine bias; 2. how often the decision procedure results in an (in)correct superiority or inferiority conclusion to learn about decision error rates; 3. how the model performs under a priori sample size estimation to explore the number of required subjects.

Conditions
The performance of the framework was evaluated in a treatment comparison based on two outcome variables and one covariate.We varied the procedure to compute conditional treatment effects, the effect size, the (sub)population of interest, the procedure to compute the posterior distribution, and the decision rule.Each of these factors will be discussed in the following paragraphs.
Procedure to estimate joint response probabilities We used the two regression-based procedures from Section 3 to find the posterior samples of joint response probabilities for two populations of interest, defined by: 1. fixed covariate values; and 2. empirical marginalization. We also included a reference approach based on stratification to compare the performance of stratified and regression-based analysis: 3. an unconditional multivariate Bernoulli-Dirichlet model. Here, we used the unconditional multivariate Bernoulli model of Kavelaars et al. (2020). This model relies on response data only and can be used via stratification to estimate conditional treatment effects. Samples of treatment-specific joint response probabilities φ_T could be drawn directly from a posterior Dirichlet distribution with parameters α_0 + n_T, where n_T contains the observed frequencies of the Q joint response categories in treatment group T and α_0 is a vector of Q prior hyperparameters.
Effect size We included four treatment differences that varied the degree of treatment heterogeneity: 1. Conditions 1.1 & 1.2: A homogeneous treatment effect, with average and conditional treatment differences of zero. This scenario aims to demonstrate the Type I error rate under a least favorable treatment difference for the Any and Compensatory rules in the trial population as well as the subpopulation.
2. Conditions 2.1 & 2.2: A heterogeneous treatment effect, with an average treatment difference of zero and a conditional treatment effect larger than zero.
3. Conditions 3.1 & 3.2: A heterogeneous treatment effect, with average and conditional treatment differences larger than zero. The conditional treatment difference is larger than the average treatment difference. The effect size is chosen to compare the power of different methods when the sample size should not lead to underpowered decisions for any of the approaches to the estimation of conditional treatment effects.
4. Conditions 4.1 & 4.2: A heterogeneous treatment effect on one of the outcomes, with both average and conditional treatment differences larger than zero. The conditional treatment difference is smaller than the average treatment difference. The effect size is chosen such that the expected sample size after stratification of the study sample is smaller than the sample size required for evaluation of the conditional treatment effect, and aims to investigate the statistical power of regression-based methods when stratification leads to underpowered decisions. Further, this effect size reflects the least favorable treatment difference for a right-sided test of the All rule and should result in a Type I error rate equal to the chosen level of α.
For each of these four effect sizes, we varied the measurement level of the covariate and created a model with a binary covariate and a model with a continuous covariate.This resulted in the eight data generating mechanisms (DGMs) presented in Table 1.

Patient (sub)population
We aimed to assess the treatment difference in two different types of patient populations: 1. Trial population: We assessed the average treatment effect among the trial population. The binary covariate followed a Bernoulli distribution with probability 0.50, while the continuous covariate in the trial population followed a standard normal distribution.

Subpopulation:
We assessed the conditional treatment effect among patients scoring low on the covariate. The low subpopulation of the binary covariate was defined by a value of zero. Note that this subpopulation could not be assigned a range, since subsetting a binary variable inherently results in a single value. As a consequence, marginalization reduces to the procedure for fixed covariate values. For the continuous covariate, we specified two different subpopulations. One subpopulation had a value of one standard deviation below the mean, while the other was used in the marginalization approaches and was defined by a range that entailed all values between the mean and one standard deviation below the mean.
Decision rules and sample size We applied the three decision rules from Subsection 2.3: 1. the Any rule; 2. the All rule; and 3. the Compensatory rule with equal weights (w = (0.50, 0.50)). We computed sample sizes per treatment group via the procedures from Subsection 2.4 for conditions with non-zero true average treatment effects. If the true average treatment difference was equal to zero, we used n = 1,000 per treatment group. The sample size for the average treatment effect was thus leading for the analysis of both average and conditional treatment effects. As a result, the power for conditional treatment effects was not targeted at .80, but should exceed this target when the required sample size for a CTE was smaller than the sample size for an ATE. Similarly, the power for CTEs with a required sample size larger than the ATE sample size should be lower than .80. The required sample sizes are presented in Table 2, where we also include a) the required sample size for the conditional treatment effect in the subpopulation; and b) the sample size after stratification of the trial population. The sample size after stratification is the expected size in subpopulation analysis of a) response data in the reference approach; and b) covariate data in empirical marginalization.
Table 2 Required sample sizes to evaluate the average treatment effect (ATE) and conditional treatment effect (CTE), and expected sample sizes of the subpopulation after stratification (Sub). Bold-faced subsamples are smaller than required for estimation of the CTE.

Procedure
Data generation For each data-generating mechanism and each unique (decision-rule specific) sample size, we sampled 1,000 datasets. We generated one covariate x and included an interaction between the treatment and the covariate as well, resulting in the following linear predictor ψ^q_i:

ψ^q_i = β^q_0 + β^q_1 T_i + β^q_2 x_i + β^q_3 x_i × T_i.

To generate response data, we first applied the multinomial logistic link function (Equation 2) to each true linear predictor ψ_i(x_i) to obtain joint response probabilities φ_i(x_i) for each subject i. Next, we sampled response vector y_i | x_i from a multinomial distribution with probabilities φ_i(x_i).
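The data-generating steps above can be sketched for K = 2 binary outcomes (Q = 4 joint response categories); the coefficient values below are hypothetical, chosen only to make the sketch runnable.

```python
import numpy as np

rng = np.random.default_rng(2)
n, Q = 500, 4                     # Q = 2^K joint categories for K = 2 binary outcomes

# Hypothetical true coefficients per category q (rows); the last category is the
# reference, fixed at zero. Columns: intercept, treatment, covariate, interaction.
beta = np.array([[ 0.3,  0.4, -0.2,  0.5],
                 [-0.1,  0.2,  0.3, -0.4],
                 [ 0.2, -0.3,  0.1,  0.2],
                 [ 0.0,  0.0,  0.0,  0.0]])

T = rng.integers(0, 2, size=n)            # randomized treatment indicator
x = rng.standard_normal(n)                # continuous covariate
X = np.column_stack([np.ones(n), T, x, T * x])

psi = X @ beta.T                          # (n, Q) linear predictors
phi = np.exp(psi)
phi /= phi.sum(axis=1, keepdims=True)     # multinomial logistic link -> joint probabilities

# One multinomial draw per subject, then map the category to the bivariate response
cat = np.array([rng.choice(Q, p=phi_i) for phi_i in phi])
y = np.array([[1, 1], [1, 0], [0, 1], [0, 0]])[cat]   # category -> (y1, y2)
```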
Prior distribution For the multivariate logistic regression analysis, we used a multivariate normal prior with means b_q = 0 and precision matrix B_0q = diag(1e−2, ..., 1e−2) for all regression coefficients. Prior covariances between regression coefficients were set to zero, implying that regression coefficients were independent a priori. For the reference approach, we used an improper prior with hyperparameters α_0 = 0.

Gibbs sampling
The regression coefficients in response categories 1, ..., (Q − 1) were estimated via the Gibbs sampler detailed in the online supplemental materials. We ran two MCMC chains with L = 10,000 iterations plus 1,000 burn-in iterations. Convergence diagnostics implied that there were no signals of non-convergence when the sample size was large enough: multivariate Gelman-Rubin convergence diagnostics were below 1.10 for most of the conditions. We noticed signs of non-convergence (Gelman-Rubin statistics between 1.10 and 1.32) in a few datasets generated under mechanisms 4.1 and 4.2 with small sample sizes (i.e. those belonging to the Any and Compensatory rules). We generated extra data to replace the datasets with questionable convergence.
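For intuition, the (univariate) Gelman-Rubin statistic underlying this check can be computed as below; the paper uses the multivariate version, so this is a simplified single-parameter sketch.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (R-hat) for a single parameter.

    chains : (M, L) array, M chains of L post-burn-in draws each.
    """
    M, L = chains.shape
    chain_means = chains.mean(axis=1)
    B = L * chain_means.var(ddof=1)            # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    var_hat = (L - 1) / L * W + B / L          # pooled posterior-variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(3)
mixed = rng.standard_normal((2, 10_000))               # two well-mixed chains
stuck = np.stack([rng.standard_normal(10_000),
                  rng.standard_normal(10_000) + 3.0])  # chains at different locations
```

A value near 1 signals convergence; the separated chains produce a value well above the 1.10 cutoff used in the simulation.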
Transformation and decision-making We applied the procedures from Subsections 2.2 and 2.3 to arrive at a decision. For marginalization, this included selecting the subjects that belonged to the subpopulation.
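Empirical marginalization over a subpopulation can be sketched as follows: for each posterior draw, joint response probabilities are computed per selected subject and averaged. The function name, the category ordering (1,1), (1,0), (0,1), (0,0), and the toy draws are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def marginal_success_probs(beta_draws, x_sub):
    """Empirically marginalized success probabilities per treatment arm.

    beta_draws : (L, Q, P) posterior draws of category-wise coefficients,
                 with the reference category included as zeros.
    x_sub      : (m,) covariate values of the subjects in the subpopulation.
    Returns (L, 2, K) success probabilities for T = 0, 1 and K = 2 outcomes,
    assuming joint categories ordered as (1,1), (1,0), (0,1), (0,0).
    """
    # outcome 1 succeeds in categories 0, 1; outcome 2 succeeds in categories 0, 2
    success = np.array([[1, 1, 0, 0], [1, 0, 1, 0]], dtype=float)  # (K, Q)
    out = np.empty((beta_draws.shape[0], 2, 2))
    for t in (0, 1):
        X = np.column_stack([np.ones_like(x_sub), np.full_like(x_sub, t),
                             x_sub, t * x_sub])                    # (m, P)
        psi = np.einsum("mp,lqp->lmq", X, beta_draws)              # (L, m, Q)
        phi = np.exp(psi)
        phi /= phi.sum(axis=2, keepdims=True)                      # joint probabilities
        theta_i = phi @ success.T                                  # (L, m, K) per subject
        out[:, t, :] = theta_i.mean(axis=1)                        # average over subjects
    return out

rng = np.random.default_rng(4)
beta_draws = 0.1 * rng.standard_normal((200, 4, 4))
beta_draws[:, 3, :] = 0.0                                          # reference category
theta = marginal_success_probs(beta_draws, rng.standard_normal(50))
delta = theta[:, 1, :] - theta[:, 0, :]                            # treatment differences
```

The posterior draws of `delta` can then be fed directly into the decision rules.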

Bias
Mean estimates of regression coefficients were asymptotically unbiased, implying that bias was negligible (< .01) in conditions with a sufficiently large sample. We observed some bias in conditions with smaller samples (DGMs 3.1, 3.2, 4.1, and 4.2 under the Any and Compensatory decision rules). Although small-sample bias is a well-documented property of logistic regression in general, we discuss these results in more detail in the online supplemental materials. The bias in regression coefficients was not necessarily problematic for our actual parameters of interest (success probabilities and the differences between them), as bias was not necessarily inherited by these transformed quantities. Even when regression coefficients were slightly biased (DGMs 3.1 and 3.2 under sample sizes of the Any and Compensatory rules), success probabilities and treatment differences could be estimated without bias (absolute bias < 0.025), similar to the conditions without biased regression coefficients. More severe bias of regression coefficients in conditions with smaller sample sizes was not fully corrected in the transformation steps: treatment effect estimation based on fixed values under DGMs 4.1 and 4.2 resulted in treatment differences with absolute biases up to 0.077 for the Any and Compensatory rules, as shown in Table 3.
Bias appeared slightly more severe when the covariate was discrete, compared to a continuous covariate. The reference and marginalization approaches could both estimate these effects.

Table 4 Proportions of superiority decisions (p) and their standard errors (SE) for ATEs by data-generating mechanism (DGM), estimation method, and decision rule. Bold-faced proportions represent correct rejections (i.e. power).

Empirical marginalization was generally more powerful than the reference approach. The fixed-values approach could only be compared to the other approaches when the covariate was discrete: in the continuous case, the treatment effect reflected a different (sub)population than in empirical marginalization and the reference approach. Here, the reference approach and the fixed-values approach performed similarly in terms of power.

Illustration
We applied the proposed method to a subset of data from the n = 19,435 subjects in the International Stroke Trial (International Stroke Trial Collaborative Group, 1997). We selected participants who were alive after six months and were treated with either a combined treatment (Aspirin + medium/high-dose Heparin) or Aspirin only.

Table 6 Average and conditional treatment differences (ATE and CTE respectively) and their posterior probabilities (pp) in the IST data, by range of blood pressure (Bp). Superiority or inferiority was concluded when pp exceeded or fell below the decision threshold, respectively.

Results
Results are presented in Table 6 for different intervals and in Table 7 for fixed values of blood pressure.
Among the trial population, the regression-based and reference approaches resulted in similar treatment difference estimates and posterior probabilities. Treatment differences were close to zero and each of the decision rules resulted in the conclusion that it did not matter whether Aspirin was administered alone or in combination with Heparin.
These average treatment effects gave a limited impression of the efficacy of Aspirin and Heparin, since a picture of heterogeneous treatment effects emerged when conditional treatment effects among subpopulations were considered separately. Compared with Aspirin only, the combination of Aspirin and Heparin showed a trend towards higher failure probabilities on both outcome variables for patients with lower blood pressure, while failure probabilities were generally lower among patients with higher blood pressure.
A visual comparison of empirical marginalization and stratification of response data (i.e. the reference approach) showed relatively similar estimates and posterior probabilities in the center of the distribution of blood pressure (e.g. between −1 SD and +1 SD), but the stratification estimates deviated from the regression-based approach in the tails. Point estimates of treatment differences demonstrated a less stable relation between blood pressure and treatment differences after stratification, as shown in Figure 2. If the regression approach is flexible enough to properly model the effects over the full support of blood pressure, the different behavior in the tails of the covariate distribution might be explained by the smaller sample size after stratification, as implied by the larger error bars.

Discussion
The current paper proposed a novel multivariate logistic regression framework to identify heterogeneous treatment effects on multiple correlated outcome variables. When the sample size was large enough, the proposed regression models were able to reproduce average and conditional treatment differences accurately, and with more robustness against bias than posterior regression coefficients. The model could also make accurate superiority and inferiority decisions among subpopulations, and these decisions were more powerful than those obtained by a stratification approach. Under a priori sample size estimation, anticipated decision error rates were found when the sample size was not too small. The illustration with the International Stroke Trial dataset demonstrated how modeling treatment heterogeneity could provide a more in-depth understanding of results beyond average treatment effects.
The model was proposed as an alternative that is flexible enough to model multivariate treatment effects with correlation structures that are free to vary over covariates, supporting accurate decision error rates and a priori sample size computations. This flexibility comes with additional parameters, compared to other multivariate logistic models for correlated binary outcome variables (e.g. Malik and Abraham, 1973; O'Brien and Dunson, 2004), and may result in computational issues when the number of parameters becomes too high. The Gibbs sampling procedure may become unstable when the sample size is too small compared to the number of parameters, although weakly informative priors may be helpful in stabilizing computations (Gelman et al., 2008). Therefore, the model is most suitable for a limited number of outcome variables and covariates.
In practice, researchers are encouraged to consider model assumptions in real data, as highlighted by the illustration with IST data. Additional efforts may be undertaken to verify that the chosen generalized linear model fits the data well enough. If the assumption of linearity on the log-odds scale does not hold, the modelling procedure may benefit from generalization to methods that are more flexible with respect to this assumption, such as (penalized) splines. Again, increased flexibility increases the number of parameters and should be balanced with a) the general risk of overfitting; and b) computational challenges as outlined above. In a more general sense, the researcher should determine which type of flexibility is most appropriate for the research question and data at hand.
Several directions for future research naturally follow from the current results. First, the procedure theoretically lends itself to out-of-sample prediction for populations within or beyond the covariate range of the trial population. The robustness of the framework in these applications remains to be investigated and may include evaluations of model fit.
Second, research might shed light on further sample size considerations. The presented sample size formulas rely on the size of an estimated treatment effect. Under treatment heterogeneity, average and (multiple) conditional treatment effects have different effect sizes by definition, resulting in different sample sizes and raising the question which considerations meaningfully guide this choice. Further, in line with our observations, small-sample bias in regression coefficients is a well-documented property of nonlinear regression methods in general (Firth, 1993; Nemes et al., 2009). Although some bias in regression coefficients disappeared during transformation to joint response probabilities, success probabilities, and treatment differences, the mechanism is not yet fully understood. Hence, more light may be shed on the circumstances under which distributional properties are inherited through the (non-linear) multinomial logistic transformation, to obtain more elaborate insights into the minimum number of observations required for satisfactory model performance. Larger effect sizes (i.e. smaller required sample sizes), complexity of the model (i.e. number of parameters), and events per variable are candidate factors that may interact in their effects on model performance in small samples (Jong et al., 2019). There is no short answer to the sample size question, but in practice power among different subpopulations might be balanced with the number of subjects a researcher is willing or able to include in the trial. Therefore, optimal sample sizes in these regression-based decision approaches remain to be investigated more elaborately.
Lastly, causal inference is less straightforward in (stratified) subgroup analysis, as conditioning upon covariates might interfere with randomization (European Medicines Agency, 2019; Food and Drug Administration, 2019). Causal relationships might require additional checking of assumptions; the tutorials by Hoogland et al. (2021) and Lipkovich et al. (2016) may be of help.

A Details of posterior computation
The current section describes the Gibbs sampling procedure used to obtain parameters. To simplify notation, we omit the dependence on x in denoting functions that rely on covariates (e.g. φ, θ).
Starting from the likelihood of an individual K-variate response y_i (Equation 2), the likelihood of n K-variate responses follows from taking the product over the n individual joint response probabilities in the Q response categories:

p(y | β) = ∏_{i=1}^{n} ∏_{q=1}^{Q} (φ^q_i)^{y^q_i}.   (23)

Following Polson et al. (2013), we introduce the Pólya-Gamma variable by rewriting the multivariate likelihood in Equation 23 as a series of binomial likelihoods. The likelihood of y conditional on the parameters of the q-th response category, β^q, then equals:

p(y | β^q, β^{−q}) = ∏_{i=1}^{n} (e^{η^q_i})^{y^q_i} / (1 + e^{η^q_i}),   (24)

where −q refers to all rows in H not having index q and η^q_i = ψ^q_i − log Σ_{r≠q} e^{ψ^r_i}. The Pólya-Gamma transformation to a Gaussian distribution relies on the following equality (Polson et al., 2013):

(e^{ψ})^a / (1 + e^{ψ})^b = 2^{−b} e^{κψ} E_ω[e^{−ωψ²/2}], with κ = a − b/2 and ω ∼ PG(b, 0).   (25)

Using the equality in Equation 25, the binomial likelihood in Equation 24 can be transformed into a multivariate Gaussian likelihood by including an auxiliary Pólya-Gamma variable ω^q_i, where ω^q_i ∼ PG(1, ψ^q_i) (Polson et al., 2013):

p(y | β^q, ω^q) ∝ ∏_{i=1}^{n} exp{κ^q_i η^q_i − ω^q_i (η^q_i)²/2},   (26)

where κ^q_i = y^q_i − 1/2, κ^q = (κ^q_1, ..., κ^q_n), ω^q = (ω^q_1, ..., ω^q_n), and Ω^q = diag(ω^q).

A.0.1 Prior distribution
The Gaussian likelihood in Equation 26 is conditionally conjugate with a normal prior distribution on regression coefficients β^q, where b_q is the vector of prior means of regression coefficient vector β^q and B_0q is a P × P symmetric matrix reflecting the prior precision of regression coefficients β^q. A researcher who is willing to include prior information regarding treatment effects in the analysis has several options to specify prior hyperparameters for a normally distributed prior that is compatible with the Gibbs sampling procedure (e.g. Sullivan and Greenland, 2012; Chen and Ibrahim, 2000). We discuss the specification of informative prior means b_q in terms of joint response probabilities φ in the next appendix.

A.0.2 Posterior distribution
Bayesian statistical inference is done via the posterior distribution, which is given by p(β|y) ∝ p(y|β, x)p(β). The combination of the Pólya-Gamma transformed Gaussian likelihood (Equation 26) and the normal prior distribution (Equation 27) is proportional to a normal posterior distribution, conditional on the Pólya-Gamma variables in ω^q (Polson et al., 2013):

p(β^q | y, ω^q) ∝ p(y | β^q, ω^q) p(β^q),   (29)

where V^q = (X^T Ω^q X + B_0q)^{−1}. Similarly, subject-specific variable ω^q_i follows a Pólya-Gamma distribution that depends on regression coefficients β^q via linear predictor ψ^q_i. Updating these two conditional distributions via a Gibbs sampling procedure results in a sample from the posterior distribution of β. Specifically, the sampling procedure iterates L times over the following two steps for q = 1, ..., Q − 1, while keeping β^Q fixed at zero:

1. Draw a vector of P + 1 regression coefficients β^q | ω^q from a multivariate normal distribution with mean vector m^q and covariance matrix V^q:

β^q | ω^q ∼ N(m^q, V^q),   (30)

where m^q = V^q (X^T κ^q + B_0q b_q).

2. Sample ω^q | β^q as a vector of n draws ω^q_i | β^q from a Pólya-Gamma distribution:

ω^q_i | β^q ∼ PG(1, ψ^q_i).   (31)

The Gibbs sampling procedure results in a sample of L sets of regression coefficients from the posterior distribution of β.
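The two conditional updates can be illustrated for the simplest case of a single binary outcome (Q = 2), where the scheme reduces to Pólya-Gamma-augmented binary logistic regression. This is a simplified sketch rather than the paper's multivariate sampler: the PG(1, ψ) draw is approximated here by truncating the infinite Gamma sum that defines the distribution, whereas production code would use an exact sampler such as that of Polson et al. (2013).

```python
import numpy as np

rng = np.random.default_rng(5)

def sample_pg(psi, n_terms=200):
    """Approximate draw from PG(1, psi) via the truncated infinite Gamma sum
    omega = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + psi^2 / (4 pi^2))."""
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(1.0, 1.0, size=(psi.size, n_terms))
    denom = (k - 0.5) ** 2 + (psi[:, None] / (2 * np.pi)) ** 2
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

def pg_gibbs(X, y, n_iter=300, prior_prec=1e-2):
    """Gibbs sampler for binary logistic regression with PG augmentation."""
    n, P = X.shape
    B0 = prior_prec * np.eye(P)           # prior precision, zero prior mean assumed
    beta = np.zeros(P)
    kappa = y - 0.5                       # kappa_i = y_i - 1/2
    draws = np.empty((n_iter, P))
    for l in range(n_iter):
        omega = sample_pg(X @ beta)                      # step 2: omega | beta
        V = np.linalg.inv(X.T * omega @ X + B0)          # step 1: beta | omega
        m = V @ (X.T @ kappa)                            # posterior mean
        beta = rng.multivariate_normal(m, V)
        draws[l] = beta
    return draws

# Simulated data with a strong positive slope
n = 400
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
true_beta = np.array([-0.5, 1.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ true_beta))).astype(float)
draws = pg_gibbs(X, y)
post_mean = draws[100:].mean(axis=0)     # discard burn-in
```

The expression `X.T * omega @ X` computes X^T Ω X without forming the diagonal matrix Ω explicitly.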

B Specification of prior means of regression coefficients
In the current section, we introduce a procedure to determine prior means based on beliefs regarding success probabilities and the correlations between them. We outline the procedure for two outcome variables and a linear predictor ψ with one covariate and an interaction between the treatment and the covariate:

ψ^q_T = β^q_0 + β^q_1 T + β^q_2 x + β^q_3 x × T.   (32)

First, choose x_L and x_H as low and high values of covariate x respectively. Next, specify success probabilities and correlations θ_T(x_L), ρ_T(x_L), θ_T(x_H), and ρ_T(x_H) for each treatment T that accompany the low and high covariate values respectively. These success probabilities θ_T(x.) and correlations ρ_T(x.) can be transformed into joint response probabilities φ_T(x.) via the following set of equations:

φ^11_T(x.) = θ^1_T(x.) θ^2_T(x.) + ρ_T(x.) √(θ^1_T(x.)(1 − θ^1_T(x.)) θ^2_T(x.)(1 − θ^2_T(x.)))
φ^10_T(x.) = θ^1_T(x.) − φ^11_T(x.)
φ^01_T(x.) = θ^2_T(x.) − φ^11_T(x.)
φ^00_T(x.) = 1 − θ^1_T(x.) − θ^2_T(x.) + φ^11_T(x.)   (33)
For each response category q, joint response probabilities φ^q_T can be transformed to linear predictors ψ^q_T using the multinomial logistic link function in Equation 2. Solving these linear predictors for β^q results in the following definitions of the elements of β^q:

β^q_0 = (x_H ψ^q_0(x_L) − x_L ψ^q_0(x_H)) / (x_H − x_L)
β^q_2 = (ψ^q_0(x_H) − ψ^q_0(x_L)) / (x_H − x_L)
β^q_1 = (x_H ψ^q_1(x_L) − x_L ψ^q_1(x_H)) / (x_H − x_L) − β^q_0
β^q_3 = (ψ^q_1(x_H) − ψ^q_1(x_L)) / (x_H − x_L) − β^q_2   (34)

Subsequently applying the multinomial logistic link function in Equation 2 to each ψ^(l)_T(x) results in joint response probability φ^(l)_T(x) for treatment T. Applying these steps to each posterior draw (l) of regression coefficients β^(l) results in a sample of posterior joint response probabilities. The procedure is presented in Algorithm 1 in Appendix C.
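The elicitation steps above can be sketched as follows; the function names, the example elicited values, and the category ordering (1,1), (1,0), (0,1), (0,0) with (0,0) as reference category Q are assumptions of this sketch.

```python
import numpy as np

def joint_probs(theta1, theta2, rho):
    """Joint response probabilities (phi11, phi10, phi01, phi00) for two binary
    outcomes with success probabilities theta1, theta2 and correlation rho."""
    phi11 = theta1 * theta2 + rho * np.sqrt(
        theta1 * (1 - theta1) * theta2 * (1 - theta2))
    return np.array([phi11, theta1 - phi11, theta2 - phi11,
                     1 - theta1 - theta2 + phi11])

def prior_means(xL, xH, spec):
    """Prior mean vector (b0, b1, b2, b3) per non-reference category.

    spec[T][x] = (theta1, theta2, rho) elicited for T in {0, 1} and x in {xL, xH}.
    """
    psi = {}
    for T in (0, 1):
        for xv in (xL, xH):
            phi = joint_probs(*spec[T][xv])
            psi[T, xv] = np.log(phi[:3] / phi[3])     # multinomial logit vs category Q
    # Two-point line fit per treatment arm: intercept and slope of psi in x
    b0 = (xH * psi[0, xL] - xL * psi[0, xH]) / (xH - xL)
    b2 = (psi[0, xH] - psi[0, xL]) / (xH - xL)
    b0T = (xH * psi[1, xL] - xL * psi[1, xH]) / (xH - xL)
    b2T = (psi[1, xH] - psi[1, xL]) / (xH - xL)
    return np.column_stack([b0, b0T - b0, b2, b2T - b2])   # (Q - 1, 4)

# Hypothetical elicited success probabilities and correlations at x = -1 and x = 1
spec = {0: {-1.0: (0.4, 0.5, 0.2), 1.0: (0.5, 0.6, 0.2)},
        1: {-1.0: (0.5, 0.6, 0.2), 1.0: (0.7, 0.8, 0.2)}}
b = prior_means(-1.0, 1.0, spec)
```

By construction, plugging the returned coefficients back into Equation 32 reproduces the elicited linear predictors at x_L and x_H.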

C Procedures for estimation and inference over a specified (sub)population

Algorithm 1 Transformation of posterior regression coefficients to posterior joint response probabilities based on fixed covariate values. Define x = (x_2, ..., x_P) as a vector of covariate values of interest and let β^Q = (0, ..., 0).
1: for draw (l) ← 1 : L do
2:   for treatment T ← 0 : 1 do
3:     for joint response q ← 1 : Q do
4:       Compute ψ^{q(l)}(x) from β^{q(l)} and x
5:     Compute φ^{(l)}_T(x) via Equation 2

Algorithm 2 Transformation of posterior regression coefficients to posterior joint response probabilities based on empirical marginalization. Let β^Q = (0, ..., 0).
1: for draw (l) ← 1 : L do
2:   for treatment T ← 0 : 1 do
3:     for subject i ← 1 : n do
4:       for joint response q ← 1 : Q do
5:         Compute ψ^{q(l)}_i from β^{q(l)} and x_i
6:       Compute φ^{(l)}_i via Equation 2
7:     Average φ^{(l)}_i over subjects i

Sample size estimation proceeds as follows:
1. Plug in estimates of θ^k_T in Equation 18 or 19.
2. Plug in a starting value for n in Equation 18 or 19 and calculate the power via Equation 17.
3. Repeat step 2 with gradually increasing n until the power exceeds the desired level.
4. Select n as the sample size per treatment group.
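The iterative search in the sample size steps can be sketched generically. Because Equations 17-19 are not reproduced in this excerpt, a standard normal-approximation power formula for a single difference between two proportions serves as a hypothetical stand-in for the paper's power function.

```python
import math
from statistics import NormalDist

def power_two_proportions(n, p_e, p_c, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference between two
    success proportions with n subjects per arm (a stand-in for the power
    implied by Equations 17-19, which are not reproduced here)."""
    nd = NormalDist()
    se = math.sqrt(p_e * (1 - p_e) / n + p_c * (1 - p_c) / n)
    z = abs(p_e - p_c) / se - nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(z)

def required_n(p_e, p_c, target=0.80, start=10, max_n=100_000):
    """Steps 2-4: gradually increase n until the power exceeds the target."""
    n = start
    while power_two_proportions(n, p_e, p_c) < target and n < max_n:
        n += 1
    return n

# Step 1: plug in anticipated success probabilities per treatment arm
n_req = required_n(0.65, 0.55)
```

The loop returns the smallest n per arm whose power meets the target, mirroring the "gradually increasing n" search in the steps above.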

Table 1
Parameters of average treatment effects (treatment differences and correlations between univariate success probabilities) in the trial and conditional treatment effects in a subpopulation, by data-generating mechanism (DGM).

Table 5
Proportions of superiority decisions for CTEs (p) and their standard errors (SE) by data-generating mechanism (DGM), estimation method, and decision rule. Bold-faced proportions represent correct rejections (i.e. power).

Table 7
Conditional treatment differences and their posterior probabilities (pp) in the IST data, by range of blood pressure (Bp). Superiority or inferiority was concluded when pp exceeded or fell below the decision threshold, respectively.