Massive Parallelization of Massive Sample-Size Survival Analysis

Abstract Large-scale observational health databases are increasingly popular for conducting comparative effectiveness and safety studies of medical products. However, increasing number of patients poses computational challenges when fitting survival regression models in such studies. In this article, we use Graphics Processing Units (GPUs) to parallelize the computational bottlenecks of massive sample-size survival analyses. Specifically, we develop and apply time- and memory-efficient single-pass parallel scan algorithms for Cox proportional hazards models and forward-backward parallel scan algorithms for Fine-Gray models for analysis with and without a competing risk using a cyclic coordinate descent optimization approach. We demonstrate that GPUs accelerate the computation of fitting these complex models in large databases by orders of magnitude as compared to traditional multi-core CPU parallelism. Our implementation enables efficient large-scale observational studies involving millions of patients and thousands of patient characteristics. The above implementation is available in the open-source R package Cyclops.


Introduction
Increasing accessibility of large-scale observational health data provides rich opportunities to study comparative effectiveness and safety of medical products, but also poses unprecedented challenges.
Typical administrative claims and electronic health record (EHR) databases now follow tens to hundreds of millions of individuals (Hripcsak et al., 2016) with tens of thousands of possible health conditions, drugs and procedures occurring over decades of patient lives (Suchard et al., 2013).The massive scales of these databases offer more power for statistical analyses to learn about the effects of these products on health outcomes but also bring taxing computational burden.
The increasing complexity of common statistical models further exacerbated this big-data problem.For instance, the Cox proportional hazards model and the Fine-Gray model are widely applied in comparative effectiveness and safety studies.The computational complexity of likelihood evaluation for the Cox model and the Fine-Gray model naively grows quadratically with sample size.In addition, some form of regularization (Madigan et al., 2010) is often needed to achieve parsimonious model selection when entertaining hundreds to thousands of patient characteristics.This regularization typically requires computationally intensive cross-validation to select the optimal regularization parameter(s), further straining limited computational resources.
One can distribute computationally intensive work to the cloud or dedicated clusters that house multiple central processing units (CPUs) across separate compute nodes that are linked together loosely through an Ethernet or InfiniBand network (Holbrook et al., 2021).However, for problems that require communication between nodes, communication latency may become a severe bottleneck.Since fitting survival models generally requires iterative algorithms, the communication latency costs often overpower the gains from the parallelized work within each iteration.To minimized communications between CPUs, we often follow coarse-scale parallelism, where programs are split into small number of large tasks (Barney et al., 2010).However, this may result in load imbalance and only achieve "embarrassingly parallel" benefits (Suchard et al., 2010).Furthermore, CPU clusters and the cloud can be costly and arcane for many clinical researchers.
Multi-core CPU parallelization is another choice for distributing intensive computational works, as modern CPU chip usually consists of 2 to 18 or more cores that can run independently.One limitation of this architecture is that all cores share a single "memory bandwidth", the amount of data that can be written to or read from memory in a given period of time (Holbrook et al., 2021).This approach also suffers from the limited number of cores.Thus, multi-core CPU parallelization is often only useful for modest-scale problems.
In contrast, graphics processing units (GPUs) offer a relatively inexpensive and efficient approach for speeding up fine-scale parallel computation.In fine-scale parallelism, we decompose programs into a large number of small tasks to facilitate load balancing and achieve much higher level of parallelism than coarse grain approaches (Barney et al., 2010).The GPU is an ideal device for fine-scale parallelism because (1) it consists of hundreds to thousands of compute cores and (2) the shared memory architecture of a GPU's coupled thread block allows for threads to communicate and share data among each other at a very high speed.Finally, GPUs are often conveniently available on standard laptops and desktop computers and can be externally connected to a personal computer.
Accelerating statistical computing via GPUs is an emerging discipline.As examples, Zhou et al.
(2010) attain 100-fold speedups with GPUs in high-dimensional optimization.Suchard et al. (2013) demonstrate that GPU parallelization achieves one to two orders of magnitude improvement over CPUs for a Bayesian self-controlled case series model.Beam et al. (2016) accelerate Hamiltonian Monte Carlo using GPUs by efficient evaluation of their probability kernel and its gradient.Terenin et al. (2019) demonstrate that Gibbs sampling can run orders of magnitude faster than on a CPU.Holbrook et al. (2021) apply GPU computing to a Bayesian multidimensional scaling model and deliver more than 100-fold speedups over serial calculations.Ko et al. (2022) explore the GPU parallelization for proximal gradient descent on modest-sized 1 regularized dense Cox regression using PyTorch.In this paper, we leverage GPU parallelization to the Cox model and the Fine-Gray model through innovative algorithmic mapping that play to the GPU's strengths for accelerating observational studies utilizing massive healthcare data.Specifically, we identify the computational bottleneck of the Cox model and the Fine-Gray model and take advantage of the cutting-edge GPUaccelerated library CUB (Merrill and Garland, 2016) that navigate this bottleneck.We further implement our GPU advances in the easy-to-use R package Cyclops (Suchard et al., 2013).Our implementation supports a sparse data format, considering that observational healthcare data are generally sparse; the vast majority of patient characteristics are encoded as the presence or absence of some clinical condition, drug exposure, medical procedure or laboratory measurement above or below a cutoff point within given time-frames.We finally demonstrate that our GPU implementation accelerates the computation of fitting these complex models by order-of-magnitude compared to a similar CPU implementation on multiple GPUs and CPUs with different technical specifications.

Cox proportional hazards model
We first establish notation under a typical survival analysis setting.Suppose there are N observed individuals available in a study.For individual i = 1, . . ., N, let Y i = min(T i ,C i ) represent their survival time, where T i and C i are the time-to-event time and right-censoring time, respectively.Let δ i be the indicator variable such that δ i = 1 if we observe the event occurrence of individual i and δ i = 0 if the individual i is censored.Let x i be a P-dimensional vector of time-independent covariates for individual i.The survival data then consist of triplets {Y i , δ i , x i } n i=1 .
The cumulative distribution function of survival times gives the probability that the event of interest has occurred by time t, i.e.F(t|x) = Pr(T ≤ t|x).The survival function gives the probability that the event has not occurred by time t, i.e.S(t) = Pr(T > t).Then we define the hazard function of time-to-event time as: where is the density function of random variable T .
Let β β β = (β 1 , β 2 , . . ., β P ) be a P-dimensional vector of unknown, underlying model parameters.Assuming survival times y 1 , y 2 , . . ., y N are independent and identically distributed from density f (y|β β β ) and β β β parameterized the survival function S(y|β β β ), Cox (1972) proposes a semi-parametric hazard function as the product of an unspecified baseline hazard function h 0 (y i |β β β ) and an exponential link function of covariates: Parameter estimation of the Cox proportional hazards model follows from the log-partial likelihood where R 1 (Y i ) = {r : y r ≥ Y i } denotes the set of subjects who are "at risk" for event at time Y i .Then one often estimates β β β by its maximum log-partial likelihood estimator β This log-partial likelihood has a complicated form due to the repeated calculation of the risk sets, and thus brings a high computation burden.In practice, we need to keep track of the sum of many terms for each subject that usually requires O(N 2 ) number of operations and will explode quickly as N increases (Kawaguchi et al., 2020).This computational burden prevents traditional model fitting as there can be millions of observations available in observational health databases.

Fine-Gray model
The Fine-Gray model (Fine and Gray, 1999) generalizes the Cox proportional hazards model for competing risks time-to-event data that consist of more than one type of event.Unlike the standard survival analysis setting such as under the Cox model where individuals are only susceptible to one type of event during follow-up, competing risks arise when individuals can experience more than one type of event and the occurrence of one type of event will either prevent the occurrence or change the underlying risk of the others (Noordzij et al., 2013).For individual i, competing risks data inherit the definition of event time T i , possible right censoring time C i , event indicator δ i , and covariates x i from the standard survival data setting, and additionally include an event type variable ε i .Without loss of generality, we assume there are two types of events, where ε i = 1 indicates that T i refers to the time of primary event and ε i = 2 indicates the competing risk event.
The cumulative incidence function (CIF) for competing risks data describes the probability of failing from the event of interest before the other possible (competing) event.Under the above setting when ε = 1 indicating the event of interest, the CIF and hazards function are defined as: (5) To model the covariate effects on F 1 (t|x), Fine and Gray (1999) propose the proportional subdistribution hazards function: where h 10 (y i |β β β ) is an unspecified baseline subdistribution hazard, and β β β is a P-dimensional vector of model parameters.
Parameter estimation of the Fine-Gray subdistribution proportional hazards model follows from the log-pseudo likelihood where R(Y i ) = {r : (y r ≥ Y i ) or (y r < Y i and ε r = 1)} denotes the risk set at time Y i , and ŵr (t) is a time-dependent weight based on an inverse probability of censoring weighting (IPCW) technique (Robins and Rotnitzky, 1992).Assuming two event types exist, the risk set R(Y i ) contains two

Statistical regularization
Observational healthcare datasets often include a large number of patient characteristics.For example, administrative claims usually contain information on all drug prescriptions, medical procedures and diagnosis codes for patients, and EHRs generally further contain demographics, medical history notes, laboratory results, and other health status indications (Madigan et al., 2014).A statistical regularization approach is typical in such high-dimensional data analysis to avoid overfitting.We can conveniently add a penalty π(β β β ) for β β β to the log-partial likelihood of Cox model or the log-pseudo likelihood of Fine-Gray model and estimate β β β through these joint penalized likelihoods to achieve regularization.
For 1 regularization that shrinks many components of β β β to be zero, we define a separable penalty for each dimension β j in β β β through where the tuning parameters γ j control the degree of regularization for each dimension.Similarly, one may employ an 2 penalty on the dimensions of β β β , such that: Usually one assumes γ j = γ ∀ j and τ j = τ ∀ j, and we choose γ or τ through cross-validation.
Note that statistical inference in the context of regularization remains a challenge.Various stan-dard errors estimators based on the non-parametric bootstrap have been proposed (Casella et al., 2010;Chatterjee and Lahiri, 2011), but an approach that is both computationally efficient and statistically valid still remains out of reach.Since we are focusing on computational bottleneck in this paper, we decide to follow the standard practice of regularization in large-scale and high-dimensional observational health studies (Mueller-Using et al., 2016;Shortreed and Ertefaie, 2017;Bramante et al., 2021) despite the limitations of regularization.

Maximum likelihood estimation using cyclic coordinate descent
Following Genkin et al. (2007) and Mittal et al. (2014), we exploit a cyclic coordinate descent (CCD) algorithm to reduce the high-dimensional penalized likelihood optimization down to a large set of simple one-dimensional optimizations (Wu et al., 2008).This method cycles through each covariate and updates it using a Newton step while holding all other covariates as constants.The advantage of CCD is it only requires the calculation of scalar gradients and Hessians and avoids the inversion of large Hessian matrices in high-dimensional regression.
Specifically, for each one-dimensional optimization problem, we pick the β (new) j by maximizing g(β j ) = l(β j ) + π(β j ) while holding all other β j 's unchanged.The second-order Taylor approximation of the penalized log likelihood at current β j is: Then the new estimate β (new) j falls out We employ a trust region approach similar to Genkin et al. (2007) to restrict the the step size so that the quadratic remains a reasonable approximation to the objective and improve convergence.In particular, we update β j during iteration k by , and ( 12) where we update the trust region half-width as Note that both the negated log likelihood of the Cox model and the Fine-Gray model are convex in β β β , as well as the 1 and 2 penalty terms.Although the 1 -norm is nondifferentiable at origin, we can follow the approach of Wu et al. (2008) to compute the directional derivatives along each forward and backward coordinate direction for our objective function.In particular, we compute the directional derivatives in both directions by plugging in sgn(β j ) = +1 and sgn(β j ) = −1 when β (k) j = 0, and only update β j in a direction when the directional derivative is negative, otherwise we keep β (k+1) j = 0. Since the objective function is convex, it is impossible for both directional derivatives to be negative, but either direction with a negative directional derivative will result in a successful update.Although we lack of rigorous proof of convergence when employing a trust region, as the induced step sizes fail to meet the strict convergence conditions for this optimization problem (Xu and Yin, 2017), we have not observed this issue in our work.

Massive parallelization on GPUs
Parallelization through clusters and multi-core CPUs exhibits a number of drawbacks that makes these devices ill-suited for massive survival analysis, as discussed in the Introduction.As such, this paper exploits massive parallelization on GPUs through new fine-scale algorithm decomposition for speeding up large-scale computations.Here we begin with an overview on GPU computing and summarize its main strengths and weaknesses.
The modern GPU contains an array of multithreaded streaming multiprocessors (SMs), where hundreds to thousands of work threads execute simultaneously (Nickolls et al., 2008).Many threads group together as a thread block, in which threads communicate through shared memory and cooperate through barrier synchronization.Thread blocks are further grouped into kernel grids.The programmer specifies the number of threads per block and number of blocks forming the grid.In our code, we program this ensemble via CUDA, a parallel computing platform that allows generalpurpose computing on GPUs (GPGPU) using a familiar C-like syntax.
Understanding the memory hierarchy of GPUs is important for achieving optimal performance for parallel programs.Each thread has its own set of processor registers and local memory for thread-private variables, which provide the fastest memory access.Each thread block has a limited shared memory pool that is only visible to the threads within this block.All threads also have access to a large high-bandwidth, but off-chip (global) memory embedded on the GPU card.Shared memory provide high-speed access, while accessing global memory is hundreds of times slower (Micikevicius, 2009).
In our implementation, GPUs handle only the most computationally intensive tasks.When such a task is scheduled, relevant data are first copied from host CPU memory to the global memory on the GPU device.Then the GPU kernel is launched, which loads data to on-chip memory for defined operations and writes results to global memory.Finally, results are copied from the device back to the host.This parallel programming model has several limitations that we should keep in mind.First, we should minimize data transfer between the host and device because the transfer is extremely slow.
Second, accessing global memory on the GPU is also relatively slow, so we want to minimize the reads from global memory and the writes to global memory.When we do read or write from global memory, we want sequential threads to access sequential addresses in memory.In this manner, the GPU "coalesces" multiple memory requests into a smaller number of 128-byte transactions, and we want to over-subscribe each GPU core with multiple threads, so that the cores remain active through thread-context-switching and the latency in memory access can be hidden.Third, launching a kernel also has overhead on the order of microseconds, so it is preferred to combine a series of kernels into a larger "fused" one.Finally, contemporary GPUs issue single instructions to a "warp" of 32 threads simultaneously, such that all threads within a warp must execute the same instruction each clock cycle.When threads within a warp follow different data-dependent conditional branches, their execution becomes temporally serialized; this can cause a performance penalty.To avoid this issue, one attempts to minimize the number of diverging branches within a warp.

Tree-based parallel algorithms: reduction and scan
Here we review two useful building blocks for massively parallel algorithms: reduction and scan.
Reductions convert an array of elements into a single result.For example, if the reduction operator is addition, then the reduction takes an array [a 0 , a 1 , . . ., a n−1 ] and returns a single value ∑ n−1 i=0 a i .Reductions are useful for implementing log-likelihood calculations, since independent samples contribute additively to the model log-likelihood.Taking an array [a 0 , a 1 , . . ., a n−1 ], the scan operation returns the array [a 0 , a 0 + a 1 , . . ., ∑ n−1 i=0 a i ].If we start from the beginning of the input array as above, the resulting array is called a prefix sum.While the resulting array is called a suffix sum if we start from the end and proceed towards the beginning.Scans are useful in accumulating statistics about individuals in the risk set in survival analysis.Implementing a sequential version of a reduction or scan both require ~n additions on an array of length n.
[Figure 1 about here.] The parallel versions of reduction and scan use a tree-based approach shown in Figure 1.Note that effective parallelization of these types of binary tree traversals requires low latency sharing of partial sums across threads with appropriate synchronization, both aspects in which the large number of threads concurrently executing on the GPU greatly outperform multi-core CPU threads.
To obtain a parallel, work-efficient, and communication-avoiding prefix scan, we invoke the CUB library (Merrill and Garland, 2016).The efficiency of their prefix-scan approaches a simple copy operation, as their prefix-scan requires the optimal ~2n data movements: n reads and n writes to the global memory.The scan is constructed on two levels of organization: (1) a global device-wide scan and (2) a set of local block-wide scans within each thread block.The local block-wide scan utilize a reduce-then-scan strategy that can be visually resembled as an "hourglass" shape comprising an up-sweep and a down-sweep as shown in Figure 1c.In the up-sweep phase, we traverse the tree from leaves to root for computing the partial sums.Then the running prefixes are aggregated in the block-wide down-sweep traversing back up the tree from root using the partial sums computed in the up-sweep phase.The global scan implementation within CUB applies a single-pass chained-scan approach to achieve just ~2n global data movements.The global scan further propose a decoupled look-back strategy by assigning each thread block a status flag indicating one of the three status: (1) aggregate (partial sum) of the block is available; (2) prefix of the block is available; and (3) no information about the block is available for other processor.Then each block will perform the computation conditional on its predecessor's status flag.We refer readers to Merrill and Garland (2016) for more details on this algorithm.In this paper, we extend these parallel algorithms for Cox models and Fine-Gray models based on the implementation of reduction and scan from the CUB library.

GPU massive parallelization for parameter estimation
CCD is an inherently serial algorithm in which each iteration is based on the result of the last iteration.As we mentioned earlier, even within an iteration t, CCD cycles through each covariate j for j = 1, 2, . . ., P one by one and the computation work for the next covariate cannot begin until the current update finishes.This serial algorithm however can still benefit greatly from parallelization by exploiting fine-grain problem decomposition within each iteration.Within each iterate's covariate update, careful benchmarks reveal that over 95% of the runtime lies in computing the log-likelihood gradient g (β j ) and Hessian g (β j ).
To understand the computational work, let δ = (δ 1 , . . ., δ N ) be an N-dimensional column vector and M be an N × N loading matrix with entries Making these substitutions in Equation 3, we arrive at where we define forming the logarithm (log) as element-wise operation, and S pre [ν] as the prefix sum of arbitrary vector ν.Then the unidimensional gradient and Hessians under the Cox proportional hazards model falls out as g (β j ) = δ X j − δ G and ( 16) where and vector X j is the j-th column of X.Note that we define here multiplication (×) and division (/) as element-wise operations as well.
While matrix-vector multiplication involving M takes O(N 2 ) operations, identifying the cumulative structure reduces the time complexity to linear by calculating prefix sum S pre [ν] in series, and parallelization further reduces the time complexity to O(log 2 N) due to parallel scan's tree-based structure (Harris et al., 2007).Finally, the vector-vector multiplication involving δ can be calculated as an element-wise multiplication in parallel in constant time and a reduction through a binary-tree in O(log 2 N).
Under the Fine-Gray model, let W = (w 1 , . . ., w N ) be an N-dimensional column vector of precomputed censoring weights described in previous section, and N as an N × N loading matrix with entries  16and 17 yields the unidimensional gradient and Hessians under the Fine-Gray model where we define S suf [ν] as suffix sum of vector ν.
It is worth noting that the risk set under the competing risk setting consists of two disjoint sets due to multiple types of event, thus a single pass scan furnishes neither the numerator nor denominator.
Instead, the numerator and denominator of Equation 21 and 22 can be computed through a forward (prefix) scan S pre [ν] plus a backward (suffix) scan S suf [ν] together (Kawaguchi et al., 2020).
In summary, we can decompose the calculation of the gradient and Hessian for β j in the following four sequential steps: (1) Read in non-zero x i j for i = 1, 2, . . ., N and update [Xβ β β ] i as Then Here we should keep in mind that X is generally sparse such that many elements x i j are zeros, and exponentiation is an expensive operation in floating-point.
(2) Scans: (a) Under the Cox model, we perform three forward scans that take the three output vectors of (1) as input, and return S pre [exp (Xβ (b) Under the Fine-Gray model, we perform three forward scans and three backward scans that take the three output vectors of (1) as input, and return S pre [exp (Xβ (3) An element-wise transformation that takes the output vectors of (2) as well as the indicator vector δ as input, and outputs two new vectors δ × G and δ × (H − G×G).
(4) Two reductions that take the output vectors of (3) as input, and output two double summations δ G and g (β j ) = δ (H − G×G).Note that the first term δ X j in gradient g (β j ) can be precomputed and the value does not change during CCD.
Although parallel computing of the above numerical operations is much faster than serial evaluation, one crucial limitation is that a memory transaction involving reading or writing from global memory may take up to two orders of magnitude more time than a regular numerical operation applied to the value sitting in the limited number of on-chip registers within a GPU (or CPU for that matter) (Holbrook et al., 2021).In order to minimize memory transactions, we fuse several of our operations together for both our serial and parallel implementations.First, we fuse the multiple scans in step (2) together as a tuple-scan, which takes a tuple of three vectors as input, and outputs a tuple of three (or six) vectors.Note that in Fine-Gray model, we perform the forward tuple-scan and the backward tuple-scan on the same tuple of three vectors, but read the input in two opposite directions simultaneously.Similarly, we can fuse the multiple reductions in step (4) as a tuple-reduction.
Since the element-wise transformations in step (3) take O(1) time with GPU parallelization, step (3) and step (4) can be regarded as a transformation-reduction.Finally, since both scan and reduction parallelization share the same binary-tree structure, we further fuse steps ( 2) -( 4) together into a single kernel.Through fusion, for example, the output tuple from the scans never need to be written to global memory, nor read back for the later transformation-reductions. The fused kernel saves 2/3 of the reads and writes than executing three separated kernels.It is also worth noting that the computational work of the gradient and Hessian evaluation share a similar structure and even some component such as G, such that we have fused the computation of gradient and Hessian together to circumvent unnecessary kernel overhead and facilitate data reuse of these intermediate terms.
To exploit the sparsity of the design matrix X , we parallelize the transformation in step (1) using a sparse CUDA kernel, which only reads in and processes the non-zero entries while keeping other entries as zeros all the time during CCD updates.This sparse kernel saves data movement as well as reduces memory bandwidth requirements significantly when X is sparse, which is common in real-world scenarios.
[Figure 2  line represent a thread block and recall that all threads within a block can access a shared memory that is a low-latency and on-chip (Nickolls et al., 2008) and is useful for performing the efficient scan in step (2) and reduction in step (4).The threads in parallel first read the values of tuple {exp x i β β β , x i j exp x i β β β , x 2 i j exp x i β β β } for the current covariate column j from global memory and then conduct the single-pass adaptive look-back tuple-scan (Merrill and Garland, 2016) utilizing a reduce-then-scan strategy.Next, the threads read in the values of δ and perform the transformation with the on-chip cumulative sums and δ .The threads then perform a binary-tree tuple-reduction using shared memory, and one thread from each block writes its partial aggregates to global memory.
Finally, a single-block reduction kernel sums over the G partial aggregates and writes the gradient g (β j ) and Hessian g (β j ) back to global memory.

about here.]
When CCD processes the j-th covariate, we only need to update β j and corresponding vector Xβ β β if ∆β j = 0. Recall that the computation of ∆β j requires g (β j ) and g (β j ) when regularization applies.Since both gradient g (β j ) and Hessian g (β j ) are computed on the GPU, we avoid P data transfers between GPU and CPU in one CCD cycle by using a GPU kernel to check if ∆β j = 0 and then update β j and Xβ β β if needed.Figure 3 details the workflow to implement CCD using GPU parallelization.
It is worth noting that thread-divergent branches in a CUDA kernel can substantially impact performance as execution gets temporally serialized.However, this is not an issue in our implementation because such branch divergence penalties only occur when threads within the same warp, but not across warps, need to execute alternative instructions.The branches in scan and reduction kernel in CUB are mainly due to slightly different instructions for the first processing tile and the last processing tile.Thus the divergence occurs across warps and does not have an impact on performance penalty.

Multi-stream cross-validation
We use k-fold cross-validation to search for the optimum tuning parameters γ or τ that controls the strength of regularization.We search for different values for the tuning parameter.For each value, the procedures are as follows: (1) Randomly split data into k partitions.
(2) For each of the k partitions, we fit survival models via CCD on the remaining k − 1 partitions and compute the predictive log-likelihood of this partition using the estimated β β β .
(3) Average out-of-sample likelihood across all k folds.
Finally, we select the tuning parameter with the smallest average out-of-sample likelihood as the desired optimal value.
We further improve the efficiency of cross-validation using multi-stream GPU and multi-threaded CPU approaches.Here, a stream denotes a sequence of operations (kernels) that execute in issueorder on a GPU (Rennich, 2011).Instead of the fitting k partitioned models serially in a single (default) stream, we fit the k models across s streams in parallel, where 1 < s ≤ k and each GPU stream is scheduled by an independent CPU thread.Likewise, for the multi-threaded CPU approach, s CPU threads evaluate the k models in parallel.

Comparison with an alternative massive parallelization of Cox models
In Ko et al. (2022), the authors presented sample PyTorch code for parallelizing proximal gradient descent on modest-sized 1 regularized dense Cox regression.Here we provide a qualitative comparison of our method with this alternative approach, focusing on the per-cycle cost of cyclic coordinate descent and proximal gradient descent.
A single iteration of proximal gradient descent on 1 −regularized Cox regression requires the following steps: (1) A matrix-vector multiplication Xβ β β .
(3) A scan on the output vector of (2).
Without considering the specific parallel library, the above steps contain O(NP) ) operations, respectively.Meanwhile, one sweep of coordinate descent implemented in Cyclops requires no more than O(NP) operations (the worst case is that the data is dense), as each of the four steps outlined in Section 2.7 only at most requires O(N) operations.Additionally, step (4) in the proximal gradient descent approach described earlier can also be reduced to O(N) by utilizing the same trick discussed in Section 2.7.
It is important to note that the number of iterations required to achieve an equivalent termination criterion is highly dependent on the data and is beyond the scope of our manuscript.
We would also like to emphasize that one of the main feature of our implementation is that it supports and benefits from a sparse data format when available, as discussed in Section 2.7.

Results
We examine the performance of GPU vs CPU computing for fitting our massive sample-size survival models.To accomplish this, we conduct a series of synthetic experiments to investigate the relative compute-time of our parallelization across different sample sizes.We then reproduce a real-world study using our GPU implementation under a Cox model and extend the study to the competing risks setting.If not specified, we use a system equipped with a 10 core 3.3 GHz Intel(R) Xeon(R) W-2155 CPU and an NVIDIA Quadro GV100 with 5120 CUDA cores and 32GB RAM that can achieve up to 7.4 Tflops double-precision point performance.

Synthetic experiments
We simulate indicator data X with N = 100, 000 to 1, 000, 000 samples and P = 1000 covariates with sparsity of 95%, where we randomly choose 5% of the entries uniformly to be 1s.We then draw β j ∼ N(0, 1) × Bernoulli(0.80)∀ j and where the Bernoulli(0.80)specifies that, on average, 80% of the β j are set to 0 to induce sparsity, and T i represent the time-to-event time for individual i.For generating competing risk times, we first draw: where β β β 1 is the regression parameter of primary event and β β β 2 is the regression parameter of competing event (Kawaguchi et al., 2020).Then we follow the design of Fine and Gray (Fine and Gray, 1999), where the cumulative incidence function (CIF) of primary event is a unit exponential mixture with mass 1 − p at ∞ when x i = 0: and draw survival time of competing event T 2i using an exponential distribution with rate exp(x i β β β 2 ).
We set p = 0.5 in practice.
[Figure 5 about here.] We fit these simulants under a fixed 1 penalty with γ = √ 2 on a single CPU core and the default CUDA stream.To provide a comprehensive comparison for the performance gains, we conduct the experiments on two systems with different technical specifications.System 1 is equipped with a 3.3 GHz Intel(R) Xeon(R) W-2155 CPU (launch date 2017) and an NVIDIA Quadro GV100 (2018) with 5120 CUDA cores and 32GB RAM that can achieve up to 7.4 Tflops double-precision point performance.System 2 is equipped with a 2.2 GHz Intel(R) Xeon(R) Silver 4214 CPU ( 2019) and an NVIDIA A100 (2020) with 6912 CUDA cores and 80GB RAM that can achieve up to 9.7 Tflops double-precision point performance.Figure 5 presents runtimes comparisons across computing devices with a fixed number of covariates (P = 1000) with 95% sparsity and varying sample size N.We report both the total model fitting time and the time for computing gradients and Hessians, where the latter is our target of parallelization.On system 1 which is equipped with a powerful CPU, we see that the GPU parallelization delivers up to a 42-fold speedup for both Cox and Fine-Gray models in terms of computing gradient and Hessians.Despite additional data transfer and device initialization, GPU parallelization is still 35 × faster for this Cox model and 39 × faster for this Fine-Gray model relatively to our CPU implementation with one million samples.On system 2 which is equipped with a more powerful GPU, we see that the GPU parallelization achieves up to a 52-fold speedup for this Cox model and a 70-fold speedup for this Fine-Gray model.The data transfer between host and device during CCD updates only accounts for ∼ 1% of the total model fitting time.We can see a rapid increase of runtimes on the CPU with increasing sample size, while the GPU approach continues to yield relatively shorter runtimes across varying sample sizes, as the devices is still not fully occupied.

Multi-stream cross-validated experiments
We further use these synthetic experiments to explore the performance of our approach using multithreaded CPU and multi-stream GPU computing by simultaneously searching for an optimal strength of regularization.Here, we use 10-fold cross-validation with 10 repetitions per fold, resulting in 100 cross-validation replicates to estimate an optimal γ under 1 regularization.On the GPU, we distribute the 100 replicates to s CUDA streams driven by s CPU threads.We also allow each of the CPU threads to process the 100 replicates directly on the CPU to demonstrate the performance of corresponding multi-threaded CPU parallelization.Figure 6 shows the runtimes of 1 regularized Cox regression on varying number s of threads or streams.Generally, our parallelization achieves more speedup through multi-stream GPU computing than through multi-core CPU threads alone.
For example, the runtime of fitting a Cox model on 1, 000, 000 samples reduces from nearly 9 hours across 8 CPU threads to 14 minutes across 8 CUDA streams.We also observe that the curves flatten out as number of CUDA streams and CPU threads increases, and this pattern is particularly obvious in the experiments on smaller sample sizes and on the CPU.This pattern suggests that there is both less computation to parallelize with relatively small sample sizes and that multi-core CPU parallelization remains limited by the smaller memory bandwidth available to the CPU.

Cardiovascular effectiveness of antihypertensive drug classes
The large-scale evidence generation and evaluation across a network of databases for hypertension (LEGEND-HTN) study (Suchard et al., 2019)  [Table 2 about here.] Here we examine patients initiating ACEIs and THZs, where the outcome is hospitalization for heart failure from the Optum ® de-identified Electronic Health Record dataset (Optum EHR).Op-tum EHR represents data from 85 million individuals that are commercially or Medicare insured in the United States.The data contain medical claims, pharmacy claims, laboratory tests, vital signs, body measurements, and information derived from clinical notes.A total of 1, 014, 618 patients are included in our study, 75.8% of whom initiated an ACEI and 24.2% of whom initiated a THZ.
We consider the main treatment covariate (ACEI or THZ exposure), in addition to 9, 811 baseline patient characteristic covariates.Table 2 presents a small selection of major patient baseline characteristics.From all baseline characteristics, we build a propensity score model and match ACEI and THZ new-users in a 1 : 1 ratio.A total of 434, 866 patients were kept for further analysis after propensity score matching.We find ACEI new-users are more likely to be male, have diabetes, hyperlipidemia or heart disease comparing with patients initiating a THZ before propensity score matching.The THZ and ACEi cohorts are well-balanced on all 9, 811 baseline patient characteristics after matching, identified through low standardized difference of population proportions.The average sparsity of patient characteristic covariates is 97.11%, which means 2.89% of the entries are non-zero, occupying about 2.96GB RAM in sparse matrix format.We first fit a Cox proportional hazards model to estimate the HR between THZ and ACEi initiation for the risk of hospitalization for heart failure.We also fit a Fine-Gray model in which we consider acute myocardial infarction as a competing risk of hospitalization for heart failure, given myocardial infarction substantially elevates the future risk of heart failure.We include all patient characteristics as additional covariates in the Cox and Fine-Gray models with 1 regularization on all variables except the treatment variable to achieve a limited form of adjustment for possible residual confounding, and again employ a 10-fold 10-replicate cross-validation to search for optimal tuning parameters.The analyses require almost two days to fit the regularized Cox model and more than three days to fit the regularized Fine-Gray model using eight CPU cores, while only taking 3.87 hours and 8.57 hours for the regularized Cox model and the regularized Fine-Gray model with our GPU implementation.Figure 7 reports the runtimes for regularized Cox and Fine-Gray models on varying numbers of CUDA streams versus the corresponding multi-core CPU parallelization.
Through massive parallelization, we find that initializing with a THZ shows better effectiveness than initializing with an ACEI in terms of hospitalization for heart failure risk under both the Cox model (HR 0.83, 95% bootstrapped percentile interval [BPI] 0.72 − 0.94) and the Fine-Gray model (HR 0.83, 95% BPI 0.71−0.95),where we provide the BPIs as crude measures of sampling variability given the challenges in constructing nominal confidence intervals for regularized models (Casella et al., 2010;Chatterjee and Lahiri, 2011).There are 215 and 134 out of 9,811 baseline covariates with non-zero effect sizes in the Cox and Fine-Gray models, respectively; these covariates include age, gender, hyperlipidemia, diabetes, osteoarthritis, and heart disease.While our estimates remain in line with the LEGEND-HTN study, they are reassuring in two ways.First, massive parallelization has enabled us to provide additional adjustment for possible residual confounding due to unobserved imbalance after propensity score matching without overfitting through cross-validation.Second, a consistent estimate after controlling for an obvious competing risk reduces concern over bias from informative censoring.Neither including thousands of baseline covariates nor handling a competing risk at this scale were possible in the original LEGEND-HTN study due to their erroneous time requirements; both are now feasible.

Discussion
This paper presents a time-and memory-efficient GPU implementation of regularized Cox and Fine-Gray regression models for analyzing large-scale time-to-event data with competing risks.We efficiently implement it in the open-source R package Cyclops (Suchard et al., 2013).In simulation studies, our GPU implementation is 35 − 70 times faster than the equivalent CPU implementation with up to 1 million samples.In our real-world example with ∼ 400,000 hypertension patients and ∼ 9,000 covariates, massive parallelization reduces the total runtimes of both regularized Cox regression and regularized Fine-Gray regression with cross-validation from a few days on multi-core CPUs to few hours on a GPU.
The observed speed-up is a combination of algorithmic advances as well as hardware optimization.First, we observe that the cumulative structure of the risk set can be computed by a single-pass expensive memory transactions and kernel overheads.Finally, when optimize an existing program, one should consider off-loading the most computationally intensive routines to the GPU in stages, before rewriting the whole codebase.

Acknowledgments
This research was supported through US National Institutes of Health grant R01 AI153044 and we gratefully acknowledge support from NVIDIA Corporation with the donation of parallel computing resources.

Disclosures
MJS is an employee and share-holder of Johnson & Johnson.The remaining authors report there are no competing interests to declare.

SUPPLEMENTARY MATERIALS
Cyclops: R-package Cyclops containing code to perform the analysis described in the article is available at https://github.com/OHDSI/Cyclops/tree/gpu_cox.R scripts for executing the experiments: R scripts for executing the experiments in Section 3. are available at https://github.com/suchard-group/gpu_survival_analysis_supplement.
Unfortunately, we are unable to provide access to the Optum de-identified Electronic Health Records due to data licensing constraints.However, we provide the cohorts definition that can reproduce our real-world analysis for reader who do license the Optum de-identified Electronic Health Records data source.
Zhou, H., Lange, K., and Suchard, M. A. (2010).Graphics processing units and high-dimensional optimization.Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 25(3), 311.show the naive binary tree-based approach for reduction and scan, respectively.Subfigure (c) presents a work-efficient scan algorithm using reduce-then-scan strategy, which includes an up-sweep phase for accumulating the partial sums and a down-sweep phase for aggregating the prefix sums.The tree-based approach for reduction and scan generally requires much data communication between threads but this remains low latency in shared memory, and thus is suitable for GPU parallelization.hods proportional hazards model We first establish notation under a typical survival analysis setting.Suppose there are n observed individ- 17 The cumulative distribution function of survival times gives the probability that the event of interest has 18 1 Figure 2: Fused kernel for evaluating the gradient and Hessian for β j .fused the single-pass scan with decoupling look-back (represented by the "hourglass"), the element-wise transformation (represented by the rectangle), and reduction (represented by the upside down triangle) together in a fused kernel.Specifically, each of G blocks (showed as a box in dashed line) reads a batch of triplets from global memory, performs a single-pass adaptive tuple-scan and a transformationreduction to compute local partial sum of gradients and Hessians in shared memory, then efficiently adds the pair of partial sums in a binary-tree tuple-reduction within a single block and writes the resulting pair of gradient and Hessian to global memory.
We first establish notation under a typical survival analysis setting.Suppose there are n observed individ- Massive parallelization of massive sample-size survival analysis so N is an upper triangular, binary matrix and N[exp (Xβ β β ) × W] is a suffix sum if we arrange the observations by their observed time Y i in decreasing order.Then making the following substitutions in Equation perform three element-wise embarrassingly parallel transformations that read from the new estimate of [Xβ β β ], and output [exp (Xβ β β )], [exp (Xβ β β )×X j ] and [exp (Xβ β β )×X j ×X j ].
Figure 2 illustrates our fused kernel for evaluating the log-likelihood gradient and Hessian on the GPU.In this kernel, we spawn S = B × IPT × G threads, where B is the number of concurrent threads forming a thread block, IPT is the number of work-items (elements of input) evaluated per thread, and G = provided real-world evidence on the comparative effectiveness and safety of five first-line antihypertensive drug classes using a retrospective, comparative new-user cohort design.Specifically, LEGEND-HTN studied the relative risk of 55 health outcomes of interest , including three major cardiovascular events (acute myocardial infarction, hospitalization for heart failure, and stroke), six secondary effectiveness outcomes, and 46 safety outcomes.Within each of nine observational health data sources, LEGEND-HTN employed propensity score matching or stratification for confounding adjustment and Cox proportional hazards modeling for hazard ratio (HR) estimation between new-users of each of the different drug classes.Interestingly, LEGEND-HTN identified that new-users of thiazide or thiazide-like diuretics (THZs) have a lower risk as compared to new-users of angiotensin-converting enzyme inhibitors (ACEIs) in terms of several effectiveness and safety outcomes, even though ACEIs are the most commonly initiated monotherapy across databases.

Figure 1 :
Figure1: Tree-based parallel implementation of reduction and scan.Each grey box (column) represents a thread.Each thread runs a series of steps, and some steps must wait for results from other threads.Subfigures (a) and (b) show the naive binary tree-based approach for reduction and scan, respectively.Subfigure (c) presents a work-efficient scan algorithm using reduce-then-scan strategy, which includes an up-sweep phase for accumulating the partial sums and a down-sweep phase for aggregating the prefix sums.The tree-based approach for reduction and scan generally requires much data communication between threads but this remains low latency in shared memory, and thus is suitable for GPU parallelization.
notation under a typical survival analysis setting.Suppose there are n observed individ-12 uals available in a study.For individual i = 1,...,N, let Y i = min(T i ,C i ) represent their survival time, where T i 13 and C i are the time-to-event time and right-censoring time, respectively.Let d i be the indicator variable such 14 that d i = 1 if we observe the event occurrence of individual i and d i = 0 if the individual i is censored.Let x i 15 be a P-dimensional vector of time-independent covariates for individual i.The survival data then consist of16 triplets {Y i , d i , x i } n i=1 .17Thecumulative distribution function of survival times gives the probability that the event of interest has notation under a typical survival analysis setting.Suppose there are n observed individ-12 uals available in a study.For individual i = 1,...,N, let Y i = min(T i ,C i ) represent their survival time, where T i 13 and C i are the time-to-event time and right-censoring time, respectively.Let d i be the indicator variable such 14 that d i = 1 if we observe the event occurrence of individual i and d i = 0 if the individual i is censored.Let x i 15 be a P-dimensional vector of time-independent covariates for individual i.The survival data then consist of 16 triplets {Y i , d i , x i } n i=1 .17 The cumulative distribution function of survival times gives the probability that the event of interest has 18 1 rds: Graphic processing unit; Cox proportional hazards model; Fine-Gray model; Regularized Survival analysis.
4) t establish notation under a typical survival analysis setting.Suppose there are n observed individle in a study.For individual i = 1,...,N, let Y i = min(T i ,C i ) represent their survival time, where T i the time-to-event time and right-censoring time, respectively.Let d i be the indicator variable such if we observe the event occurrence of individual i and d i = 0 if the individual i is censored.Let x i ensional vector of time-independent covariates for individual i.The survival data then consist of , d i , x i } n i=1 .mulative distribution function of survival times gives the probability that the event of interest has 1

12
uals available in a study.For individual i = 1,...,N, let Y i = min(T i ,C i ) represent their survival time, where T i 13 and C i are the time-to-event time and right-censoring time, respectively.Let d i be the indicator variable such 14 that d i = 1 if we observe the event occurrence of individual i and d i = 0 if the individual i is censored.Let x i 15 be a P-dimensional vector of time-independent covariates for individual i.The survival data then consist of 16 triplets

[[[Figure 3 :
Figure3: The workflow of implementing CCD using GPU parallelization: for each j = 1, . . ., P, a sparse kernel reads in the entries of [Xβ β β ] for which x i j = 0 and writes the updated tuple of vectors to the global memory of the GPU, then a fused kernel performs a tuple-scan and a transformationreduction to compute the gradient and Hessian of the log-likelihood with respect to β j .Finally a conditional kernel computes ∆β j and updates β j and Xβ β β if ∆β j = 0. Blue arrows represent data transactions to or from global memory.No data transfer between the GPU and CPU is needed until CCD finishes a complete cycle.

Figure 4 :
Figure4: Speedup of the fused kernel and the partially-fused kernels over separated kernels for steps (2) -(4) in Section 2.7.The sample sizes range between N = 10 5 to 10 6 .Solid line shows the speedup of the fused kernel over separated kernels, and dashed line shows the speedup of the partially-fused kernel over separated kernels.

Figure 7 :
Figure7: Runtimes (in hours) for 1 regularized Cox and Fine-Gray models using multi-stream GPU and multi-core CPU computing.The data contain 434, 866 new-users of THZs and ACEIs, each with 9, 811 baseline patient characteristics covariates.We add a 1 penalty on all baseline covariates and use a 10-fold 10-replicate cross-validation to search for optimum tuning parameters.
Fine and Gray (1999)the regular risk set that includes the observations that have an observed time equalling or after Y i and R 2 (Y i ) includes the observations that have observed the competing event before time Y i .Here we follow the design of weighted score functions for incomplete data with right censoring inFine and Gray (1999)

)
Recall that the risk set R 1 (Y i ) contains all observations that have an observed event time equalling or after Y i .Thus if we arrange the observations by their observed time Y i in decreasing order, matrix M is clearly a lower triangular, binary matrix, and matrix-vector multiplication M[exp (Xβ β β )] becomes a prefix sum over the elements in exp (Xβ β β ), where we define exponentiation (exp) as element-wise operation.For example, given Y i > Y i , the set R(Y i ) consists of all the observations from R(Y i ) and the set {t

Table 2 :
Baseline hypertensive patient characteristics for new-user of THZ and ACEi in the Optum EHR database.We conduct a propensity score matching with the matching ratio of 1 : 1. Less standardised difference of population proportions after matching indicate improved balance between two new-user cohorts in terms of confounders.THZ = thiazide or thiazide-like diuretics.ACEi = angiotensin-converting enzyme inhibitors.