Evidence for generalizability of edaravone efficacy using a novel machine learning risk-based subgroup analysis tool

Abstract Introduction: The edaravone development program for amyotrophic lateral sclerosis (ALS) included trials MCI186-16 (Study 16) and MCI186-19 (Study 19). A cohort enrichment strategy was based on a Study 16 post hoc analysis and applied to Study 19 to elucidate a treatment effect in that study. To determine whether the Study 19 results could be generalized to a broader ALS population, we used a machine learning (ML) model to create a novel risk-based subgroup analysis tool. Methods: A validated ML model was used to rank order all Study 16 participants by predicted time to 50% expected vital capacity. Subjects were stratified into nearest-neighbor risk-based subgroups that were systematically expanded to include the entire Study 16 population. For each subgroup, a statistical analysis generated heat maps that revealed statistically significant effect sizes. Results: A broad region of the Study 16 heat map with significant effect sizes was identified, including up to 70% of the trial population. Incorporating participants identified in the cohort enrichment strategy yielded a broad group comprising 76% of the original participants with a statistically significant treatment effect. This broad group spanned the full range of the functional score progression observed in Study 16. Conclusions: This analysis, applying predictions derived using an ML model to a novel methodology for subgroup identification, ascertained a statistically significant edaravone treatment effect in a cohort of participants with broader disease characteristics than the Study 19 inclusion criteria. This novel methodology may assist clinical interpretation of study results and potentially inform efficient future clinical trial design strategies.


Introduction
Amyotrophic lateral sclerosis (ALS) is an incurable, progressive, and fatal neuromuscular disease characterized by degeneration of motor neurons in the brain and spinal cord (1,2). ALS patients typically live 2-5 years from disease onset, and 50% die within 30 months of symptom onset (1,3,4). Current treatment includes disease-modifying therapy, symptom management, and palliative care (5).
Two therapies, riluzole and edaravone, have been approved for treating ALS in a number of countries (6,7). Riluzole, in use since the mid-1990s in the United States (US) and Europe, was shown to prolong survival by 2-3 months (7,8). Radicava® (edaravone) is approved for use in Japan, the US, South Korea, Canada, and Switzerland. Edaravone slows the rate of functional loss by 33% over 24 weeks as measured by scores on the revised ALS Functional Rating Scale (ALSFRS-R) (6,9).
The conduct of ALS clinical trials has been notoriously difficult, complicated by heterogeneity of disease progression and other factors (10,11). Such heterogeneity restricts the ability to identify well-defined patient subgroups and potentially increases type II error resulting from excessive variance in the efficacy endpoint. In the edaravone clinical development program, an enrichment strategy was adopted to study a patient population in which a significant treatment effect might be demonstrated (11-13). Study 16, an early phase 3, 24-week edaravone trial, showed numerical results favoring edaravone on the primary endpoint, change from baseline in ALSFRS-R score, but the effect was not statistically significant (12). Post hoc analysis of Study 16 identified two populations in which significant treatment effects were demonstrated (12,13). The characteristics of these populations were defined in two steps (Steps 1 and 2, Table 1). Analysis of the Step 1 and Step 2 populations showed that these patients experienced significantly less decline in ALSFRS-R score with edaravone versus placebo (12,13). For this reason, the Step 2 post hoc criteria were used as inclusion criteria for the subsequent phase 3 trial, Study 19 (9). The primary objective of Study 19 was to confirm the efficacy of edaravone versus placebo in ALS patients based on the 24-week change in the ALSFRS-R score.
Study 19 demonstrated a significant effect of edaravone in slowing functional decline (9). For example, the least squares mean ± standard error (SE) values for the change in ALSFRS-R score during 24 weeks were −7.50 ± 0.66 for placebo versus −5.01 ± 0.64 for edaravone. The between-group difference was 2.49 ± 0.76 (p = 0.0013). The enrichment strategy used for Study 19 had several important features. First, it likely reduced the clinical heterogeneity of the study population and the proportion of slowly progressing patients (11,13). Second, the combined decrease in heterogeneity and increase in measurable treatment effect increased the likelihood of achieving a statistically significant difference in outcome measures between treatment arms (11,13). Study 19 imposed stringent selection criteria, substantially reducing the eligible pool of patients. This effect is exemplified by the proportion of patients in the Study 16 group selected using those criteria (68/199, 34%). Third, the selection criteria implemented for Study 19 helped identify patients more likely to demonstrate a significant treatment effect. What remains unclear is whether this methodology uncovered a subgroup of patients with an elevated edaravone response or simply demonstrated, in a mathematically optimized subgroup, a treatment effect common to a broad range of ALS patients. Knowing whether Study 19 results can be generalized to a broader ALS population would be valuable for clinicians managing patients in real-world practice. To explore this, we applied a machine learning (ML) model to Study 16 outcomes data, stratifying patients based on predicted outcomes to create a novel, risk-based subgroup analysis tool.

Studies 16 and 19
Data from edaravone clinical Studies 16 and 19 were used as the source for analyses. Study 16 was a randomized, double-blind, parallel-group, placebo-controlled study that included a 24-week double-blind period followed by a 24-week open-label active-treatment extension period (12). Eligible patients fulfilled the following criteria: (1) 20-75 years of age, (2) a diagnosis of "definite," "probable," or "probable laboratory-supported" ALS according to the El Escorial and revised Airlie House criteria, (3) forced vital capacity (FVC) ≥70% predicted, (4) less than 3 years disease duration from first ALS symptom, (5) change in ALSFRS-R score of −1 to −4 during a 12-week observation period prior to the start of the study, and (6) ALS of grade 1 or 2 in the Japan ALS Severity Classification. After the observation period, eligible patients were randomly assigned 1:1 to edaravone or placebo for 24 weeks. The intent-to-treat population included 101 edaravone and 104 placebo patients. Study 19 was also a randomized, 24-week double-blind, parallel-group, placebo-controlled study that included 69 edaravone and 68 placebo patients. Eligibility criteria were developed following a post hoc analysis of Study 16.
Age and change in ALSFRS-R score during an observation period were the same, but Study 19 included an El Escorial diagnosis of "definite" or "probable," FVC ≥80%, less than 2 years disease duration, and scores ≥2 on all ALSFRS-R items, with a respiratory score ≥4. In both studies, edaravone was administered once daily in 60-mg doses via 60-minute intravenous infusions. Infusions were given on 14 consecutive days in the first cycle and on 10 days of each 14-day period in all subsequent cycles. Each treatment cycle was followed by a 14-day drug-free period. The primary efficacy endpoint in both studies was the change in ALSFRS-R from baseline to the end of week 24.

Table 1
Step 1. Original designation: EESP. Criteria: a score of ≥2 points for all items in the ALSFRS-R at baseline; FVC ≥80% at baseline. Identification of the "efficacy-expected subpopulation."
Step 2. Original designation: dpEESP2y. Criteria: a diagnosis of "definite" or "probable" ALS according to the El Escorial revised Airlie House diagnostic criteria; being within 2 years of initial ALS symptom onset at the time of giving informed consent. Identification of the "greater-efficacy-expected subpopulation" within the post hoc EESP.
ALS: amyotrophic lateral sclerosis; ALSFRS-R: revised ALS Functional Rating Scale; dpEESP2y: greater-efficacy-expected subpopulation with a "definite" or "probable" ALS diagnosis and disease onset <2 years; EESP: efficacy-expected subpopulation; FVC: forced vital capacity.

Machine learning model
This analysis used a previously validated gradient boosting machine model to predict the log-hazard risk of patients progressing to 50% expected vital capacity (VC50). This model was developed using the clinical trial records of 4600 unique patients included in the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT) database (the "internal" data set) (14). VC50 was reached by 1926 (42%) of these patients during their respective trials. The model has an average area under the receiver operating characteristic curve (AUC) of 0.868, measured at the 1-year time point in a 10-fold internal cross-validation, and an AUC of 0.923 using the placebo arm of the contemporary, external 6-month-long BENEFIT-ALS clinical trial (14,15).

Detectable effect cluster analysis
The protocol for detectable effect cluster (DEC) analysis is outlined in Figure 1. Using the VC50 model, Study 16 patients were rank-ordered by their predicted log-hazards. Patients with low risk were listed first and patients with high risk were listed last. A preliminary analysis stratified the patients into tertiles based on this ordering. A more detailed analysis was performed by stratifying patients into 50 subgroups using the same ordering. For both the preliminary and detailed analyses, each of the risk-based subgroups was systematically expanded in 2% increments until the entire Study 16 population was included. For example, the low-risk tertile initially included 33% of patients. This group was analyzed to assess treatment effect. The group was then expanded to include 35% of patients, adding the patients with risk scores just greater than the initial group. This group was reanalyzed to assess treatment effect. The process was continued until the initial low-risk group was expanded to the entire full analysis set (FAS). The moderate- and high-risk tertiles were similarly expanded in 2% increments.
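The stepwise expansion of the low-risk tertile described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `expand_low_risk_group`, the rounding rule used to turn a percentage into a patient count, and the synthetic log-hazards are all assumptions made for the example.

```python
def expand_low_risk_group(log_hazards, start_pct=33, step_pct=2):
    """Return [(percent, member_indices), ...] for a low-risk group that
    grows toward higher predicted risk until it covers every patient."""
    # Rank-order patient indices from lowest to highest predicted log-hazard.
    order = sorted(range(len(log_hazards)), key=lambda i: log_hazards[i])
    n = len(order)
    groups = []
    pct = start_pct
    while True:
        pct = min(pct, 100)                  # the final step is capped at the full set
        size = max(1, round(n * pct / 100))  # rounding rule is an assumption
        groups.append((pct, order[:size]))   # nearest neighbors by rank are added last
        if pct == 100:
            break
        pct += step_pct
    return groups


# Hypothetical log-hazards for 200 patients (a shuffled, evenly spaced grid).
log_hazards = [((i * 37) % 200) / 100 - 1.0 for i in range(200)]
groups = expand_low_risk_group(log_hazards)
for pct, members in groups[:3]:
    print(pct, len(members))
```

Each successive group contains the previous one plus the patients ranked immediately above it, so the treatment effect can be re-estimated at every step. How the moderate- and high-risk tertiles grow (toward lower risk, higher risk, or both) is not specified in the text and is not modeled here.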
To visualize the results of the detailed analysis, a matrix was constructed in which each cell defined a subgroup with distinct upper and lower values for predicted log-hazard. One possible subgroup might be patients with log-hazard in the range from the 40th to 60th percentile of the FAS. Another could be from the 36th to 52nd percentile. All possible nearest-neighbor subgroups were created, for a total of 1275 groups (51 × 50 ÷ 2). While every cell includes trial participants that are contiguous by log-hazard rank, note that the subgroups are not distinct and that patients will appear in multiple subgroups.
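The count of 1275 follows from choosing two distinct boundaries out of the 51 percentile cut points (0%, 2%, ..., 100%) that delimit 50 bins of 2% each. The sketch below enumerates those cells; the names `boundaries`, `cells`, and `members_of`, and the integer-division rule for mapping percentiles to ranks, are illustrative assumptions rather than details taken from the study.

```python
from itertools import combinations

STEP_PCT = 2
boundaries = list(range(0, 101, STEP_PCT))  # 0, 2, ..., 100: 51 cut points
cells = list(combinations(boundaries, 2))   # every (lower, upper) with lower < upper


def members_of(ranked_ids, lower_pct, upper_pct):
    """Patients whose rank falls between two percentile cut points.
    ranked_ids must already be sorted from lowest to highest log-hazard."""
    n = len(ranked_ids)
    return ranked_ids[n * lower_pct // 100 : n * upper_pct // 100]


ranked_ids = list(range(200))  # hypothetical patient IDs in rank order
print(len(cells))
print(len(members_of(ranked_ids, 40, 60)))
```

The 50 narrowest cells (width 2%) correspond to the initial subgroups, and the single cell (0, 100) corresponds to the FAS. Every other cell is a contiguous run of neighboring bins, which is why the same patient appears in many cells.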
A statistical analysis was performed for each subgroup as described in the original Study 16 statistical analysis plan. A heat map was developed that revealed combinations of treatment effect and mean square error (MSE) with statistically significant effect sizes (p < 0.05).
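For intuition about how treatment effect and MSE jointly determine significance in a given cell, the sketch below computes both for one subgroup. It deliberately swaps in a simple two-arm comparison (difference in means with a pooled within-arm variance), not the model specified in the Study 16 statistical analysis plan, which is not described here. The function name, arm labels, and data are hypothetical, and no p-value is computed.

```python
from math import sqrt
from statistics import mean, variance


def subgroup_summary(changes, arms):
    """Summarize one subgroup.

    changes: change in score from baseline to week 24, one value per patient.
    arms:    "active" or "placebo" for each patient (hypothetical labels).
    Returns None when either arm has fewer than two patients.
    """
    active = [c for c, a in zip(changes, arms) if a == "active"]
    placebo = [c for c, a in zip(changes, arms) if a == "placebo"]
    n1, n2 = len(active), len(placebo)
    if n1 < 2 or n2 < 2:
        return None  # too few participants for a meaningful comparison
    effect = mean(active) - mean(placebo)  # positive means less decline on active
    # Pooled within-arm variance, i.e. the MSE of a two-group one-way ANOVA.
    mse = ((n1 - 1) * variance(active) + (n2 - 1) * variance(placebo)) / (n1 + n2 - 2)
    t_stat = effect / sqrt(mse * (1 / n1 + 1 / n2)) if mse > 0 else None
    return {"treatment_effect": effect, "mse": mse, "t": t_stat}


# Hypothetical 24-week changes for six patients.
summary = subgroup_summary(
    [-4, -5, -6, -7, -8, -9],
    ["active", "active", "active", "placebo", "placebo", "placebo"],
)
print(summary)
```

The test statistic grows when the treatment effect increases or when the MSE shrinks, which are the two routes to significance that the heat map is designed to expose.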

Fifty percent expected vital capacity predictions
Predicted VC50 log-hazard values for each patient in Studies 16 and 19 were generated using each individual's baseline data. The distribution of VC50 risk in Studies 16 and 19 is shown in Figure 2. The Study 16 population was divided into tertiles, as shown in Figure 2(A). The tertile populations were expanded in a stepwise fashion to gradually include the full study population. For example, the low-risk tertile was expanded to include patients with progressively higher risk scores. The series of Study 16 FAS VC50 subgroups thus created were then analyzed according to effect size (Figure 3). The analysis revealed a population with improved treatment effect that included approximately 70% of the Study 16 FAS (Figure 3, peak in the low-risk line graph, identified by the arrow).

Detectable effect cluster analysis
DEC analysis was used, as described in Methods, to analyze 1275 subgroups of the Study 16 FAS population. The DEC analysis matrix identified a broad region of the Study 16 heat map with significant effect sizes (Figure 4). The VC50 largest population (designated VC50lp) with statistically significant effect size included 70% of the Study 16 FAS population (Figure 4, green, boxed cell). For orientation, the smallest subgroups are located along the diagonal spanning from the lower left corner of the heat map to the upper right corner. The incremental nearest-neighbor 2% expansions are located vertically up from each of the initial subgroups. When constructed this way, the upper left corner of the matrix represents the FAS. Every subgroup between the subgroups along the diagonal and the FAS includes an ever-increasing number of nearest-neighbor participants. It should be noted that in Figure 4, the heat map shows clear areas above the diagonal with no apparent subgroup; these are areas that contained too few participants in either arm for a meaningful comparison and were not included in the matrix for clarity. While each subgroup includes a unique set of participants, all subgroups, except for the initial 50 subgroups, overlap with many of the other subgroups.
When the VC50lp was combined with nonoverlapping participants identified in the Study 16 Step 1 population (VC50lp + Step 1), 76.4% (n = 152) of the original participants were identified with a statistically significant treatment effect (Table 2). Study 16 Step 1 and Step 2 achieved significant effect sizes primarily by an increase in treatment effect, with a modest effect on MSE. In contrast, the VC50lp population achieved a significant effect size mainly by a decrease in heterogeneity, as evidenced by a substantially lower MSE, with only a modest increase in treatment effect. Overall, the VC50lp + Step 1 population displayed intermediate improvement in both treatment effect and MSE.

Baseline characteristics of Studies 16 and 19 full analysis set and Study 16 subgroups
The distributions of VC50 risk profiles in Studies 16 and 19 were similar (Figure 2(B)). Compared with Study 16, Study 19 had fewer patients in the high- and low-risk groups, so the tails of the distribution were less pronounced. Table 3 includes populations of the Study 19 FAS (N = 134), Study 16 FAS (N = 199), VC50lp + Step 1 (n = 152), VC50lp (n = 139), and those "not selected" (n = 47), which comprised the FAS patients minus patients in Step 1 and VC50lp.
The "not selected" subgroup exhibited lower baseline ALSFRS-R (mean 39.8 points) and percent expected vital capacity (VC; mean 79%) than the other four groups. Compared with the Study 16 and 19 FAS, the VC50lp + Step 1 and VC50lp populations had similar baseline characteristics for sex, age, body mass index, days since symptom onset, and ALSFRS-R score (Table 3). However, pulmonary function (i.e. percent expected FVC) appeared to be slightly greater in the VC50lp + Step 1 and VC50lp populations. There were no deaths during the trial in the VC50lp + Step 1 or VC50lp populations compared with 3 deaths in the "not selected" subgroup. Importantly, while the mean baseline ALSFRS-R and percent expected VC values for the "not selected" subgroup were decidedly lower than those of the other four groups, a simple separation of the "not selected" group could not be made due to the marked overlap in the ranges of the values. Although the mean percent expected VC was similar between the Study 19 FAS and the VC50lp + Step 1 population, the range of values for percent expected VC included lower levels of pulmonary function in the VC50lp + Step 1 population due to the Study 19 inclusion criterion of ≥80% expected FVC. Similarly, while the mean and median were similar, the range of baseline ALSFRS-R scores included lower values in the VC50lp + Step 1 population versus the Study 19 FAS. This latter effect on baseline ALSFRS-R was certainly influenced by the Study 19 inclusion criterion of ≥2 points for all items in the ALSFRS-R at baseline.
The distribution of changes in ALSFRS-R score from baseline to week 24 in the VC50lp + Step 1 placebo population was compared with that of the full Study 16 placebo population. The VC50lp + Step 1 placebo population spanned the entire range of ALSFRS-R score changes observed for the Study 16 placebo population (Figure 5).

Discussion
Enrichment strategies in ALS clinical trials are designed to overcome statistical challenges related to the inclusion of slow progressors and disease heterogeneity. However, questions remain about the applicability of results obtained using restrictive inclusion criteria in a registration trial to the broader patient population in real-world situations.
When machine learning techniques were applied to the Study 16 population data, a statistically significant edaravone treatment effect was found in a patient cohort broader than the Study 19 inclusion criteria. For example, the VC50lp + Step 1 population had a greater number of days since symptom onset (mean 625 ± 242) than Study 19 (mean 403 ± 170). This difference is likely due to the different disease duration entry criteria for the two phase 3 studies, which were 3 years for Study 16 and 2 years for Study 19. Also, the VC50lp + Step 1 population had a lower baseline range of percent expected VC values (75.2%-143.4%) than Study 19 (80.2%-151.0%), likely due to differences in FVC entry criteria for the two studies, which were ≥70% expected for Study 16 and ≥80% expected for Study 19. Moreover, the range of ALSFRS-R scores was broader in the VC50lp + Step 1 population (29-47) than in Study 19 (36-47), and Study 16 had no restrictions on ALSFRS-R item scores. Thus, the VC50lp + Step 1 population included patients with a broader range of ALS disease characteristics at baseline than those of the Study 19 patient population. These findings are consistent with a previous Study 19 subgroup analysis, which indicated that the effect of edaravone on disease progression is more generalizable than the Study 19 inclusion criteria might otherwise indicate (16).
As a post hoc analysis, this work is subject to the limitations inherent in such methodologies. For example, these analyses were not prespecified in Study 16. In addition, although p-values were calculated for each of the nearest-neighbor risk-based groups in the DEC analysis, the study was not powered to detect significant differences between the edaravone and placebo groups for each of these populations. Furthermore, the current analysis does not control for multiplicity in the DEC analysis p-value calculations.
Analysis of disease progression in the Study 16 placebo group and the VC50lp + Step 1 population indicated that the latter population included a wide range of disease progression from slow to rapid decline in the ALSFRS-R score. Thus, the VC50lp + Step 1 population, which exhibited an edaravone treatment effect, included patients with a range of disease progression rates reflective of the full Study 16 population. The wider question of generalizability beyond the Study 16 inclusion criteria remains to be determined in real-world analyses (17).
Characteristics of the 47 patients in the "Study 16 not selected" population warrant further discussion (Table 3, last column on right). These patients experienced fewer days since symptom onset, lower baseline ALSFRS-R scores, lower baseline percent expected VC, and included all 3 of the deaths in Study 16. However, separating these patients based on baseline functional score or VC would be very difficult, as the ranges of the "not selected" group overlapped with the ranges of patients included in the other groups.
We initially hypothesized that a model predicting functional decline would be useful for creating groups of patients at a similar point in their disease progression, particularly if the endpoint is ALSFRS-R. However, if Patient A progresses from 46 to 40 points and Patient B progresses from 28 to 22 points, they progress at a similar rate, but Patient B is clearly more impaired and in a more advanced state of the disease than Patient A. In contrast, the VC50 model returns a prediction that can be viewed as a disease milestone, 50% vital capacity. When we apply DEC analysis using the time to 50% VC model, we effectively group patients by risk of achieving this disease milestone. Rather than grouping patients by predicted change in ALSFRS-R who may be in different states of the disease, the VC50 model succeeds in grouping patients who are in similar disease states.
This modeling methodology may assist the clinical interpretation of study results beyond inclusion criteria and may inform future, more efficient clinical trial design strategies for ALS. For example, the Study 16 steps 1 and 2 subgroups and the VC50lp subgroup demonstrate alternate strategies for defining a study population. Steps 1 and 2 achieved statistical significance by increasing the treatment effect while having relatively little impact on lowering the variance. In contrast, the VC50lp subgroup boosted the treatment effect to a lesser degree but had a more dramatic effect in lowering the variance. Also, regarding strategies for trial design, Step 2 included 34% of the Study 16 population, whereas the VC50lp included 70% of the Study 16 population. This latter consideration will have a significant effect on trial recruitment time. Finally, this methodology confirms the possibility of identifying a subpopulation within a failed clinical trial that could form the basis for a subsequent successful trial.