SOD1-ALS-Browser: a web-utility for investigating the clinical phenotype in SOD1 amyotrophic lateral sclerosis

Abstract
Objective: Variants in the superoxide dismutase (SOD1) gene are among the most common genetic causes of amyotrophic lateral sclerosis. Reflecting the wide spectrum of putatively deleterious variants that have been reported to date, it has become clear that SOD1-linked ALS presents a highly variable age at symptom onset and disease duration.
Methods: Here we describe an open access web tool for comparative phenotype analysis in ALS: https://sod1-als-browser.rosalind.kcl.ac.uk/. The tool contains a built-in dataset of clinical information from 1383 people with ALS harboring a SOD1 variant resulting in one of 162 unique amino acid sequence alterations and from a non-SOD1 comparator ALS cohort of 13,469 individuals. We present two examples of analyses possible with this tool, testing how the ALS phenotype relates to SOD1 variants that alter amino acid residue hydrophobicity and to distinct variants at the 94th residue of SOD1, where six are sampled.
Results and conclusions: The tool provides immediate access to the datasets and enables bespoke analysis of phenotypic trends associated with different protein variants, including the option for users to upload their own datasets for integration with the server data. The tool can be used to study SOD1-ALS and provides an analytical framework to study the differences between other user-uploaded ALS groups and our large reference database of SOD1 and non-SOD1 ALS. The tool is designed to be useful for clinicians and researchers, including those without programming expertise, and is highly flexible in the analyses that can be conducted.


Background
Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease characterized by dysfunction and death of motor neurons leading to progressive muscle weakness and paralysis (1). Its clinical presentation can vary greatly. For example, although people most frequently develop the first symptoms between 55 and 65 years of age, onset can occur at any stage of adulthood. Similarly, the median time from symptom onset until death is 3 years, but some people die within a year of onset, and 5-10% of people survive for more than 10 years (2-4).
A plethora of genetic factors can affect the risk of ALS or its phenotype, and mutations in specific genes can lead to distinct clinical outcomes. For example, a hexanucleotide repeat expansion in the C9orf72 gene is the most common known cause of ALS, and carriers of this mutation typically develop ALS earlier and with faster progressing symptoms than sporadic ALS patients (5-8). Furthermore, different mutations within the same gene can also lead to distinct forms of the disease. For example, over 180 variants in the superoxide dismutase (SOD1) gene (9-11) have been found in ALS patients. As well as affecting ALS risk, some of these variants have distinct effects on clinical features such as the age of onset of motor symptoms and disease duration. For example, p.A5V and p.H44R have a marked effect on disease duration while p.G38R is associated with an early onset (10,12-14). Being able to characterize how genetic variants affect the clinical phenotype is essential for optimal development and design of healthcare, treatments, and trial stratification. However, the multitude of genetic factors involved in ALS and their rarity are great challenges for their individual study. To address these limitations, focussing on SOD1 given the recent gene therapy trials (15), we recently collated data from the literature and specialized ALS centers globally on approximately 15,000 people with ALS, over 1000 of whom harbored a variant in the SOD1 gene (10).
In this paper, we describe a web tool (https://sod1-als-browser.rosalind.kcl.ac.uk/) with upload facilities to allow people to perform comparative and bespoke phenotype analyses using data from a database of almost 15,000 people with ALS without the need for informatics proficiency. The tool currently allows users to define and select subgroups of patients with or without variants in SOD1, to stratify by individual or groups of SOD1 variants, and to upload data to combine with our database in the analysis. To show the potential of this tool and how to use it, we present two example case studies that leverage the data from our recent publication, which is accessible to all users. The first example builds upon research suggesting that variants affecting protein hydrophobicity promote aggregation of mutant SOD1 (16) and tests how alterations in amino acid hydrophobicity affect the ALS phenotype. The second example focuses specifically on variation at the 94th amino acid residue of SOD1, which is a site containing multiple reported variants, testing how the phenotype differs for each variant sampled.

Dataset
The tool enables access to a dataset of 14,852 people with ALS, 1383 of whom harbor a potentially deleterious non-synonymous SOD1 gene variant (N without SOD1 variant = 13,469). A total of 162 unique amino acid variants (canonical SOD1 sequence IDs: ENSEMBL = ENST00000270142.11, UniProt = P00441) are represented within these data (see Figure 1; Table S1). The dataset is further described within our previous publication (10) and a summary of the disease characteristics associated with the 49 variants harbored by at least 5 people is provided on the site.

Functionality
Survival analysis methods can be performed for (1) age at symptom onset, and (2) disease duration from symptom onset (with a corresponding censor variable indicating survival status). Kaplan-Meier and Cox proportional-hazards (CPH) approaches are both implemented and relevant descriptive statistics for the analyzed sample are given by strata. Differences between strata in univariate analyses are examined using the log-rank test; global and pairwise log-rank tests are performed when more than two strata are defined. Analyses using CPH models are performed whenever two or more strata are defined or when a single stratum is specified and the user selects at least one of several covariates which can be included in the regression model. Available covariates are clinical diagnosis, family disease history, sex, age of onset, site of onset, and sample source (continent of origin). Users can pick which covariates are included in the analysis depending on requirements, and associations between selected survival analysis strata and available covariates can be tested.
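To make the Kaplan-Meier estimate described above concrete, the following is a minimal Python sketch of the product-limit calculation for one stratum. It is an illustration only and not the tool's own code; the function name `kaplan_meier` and the toy data are hypothetical.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a death was observed.

    times  -- follow-up time per person (e.g. months from symptom onset)
    events -- 1 if death was observed at that time, 0 if the record is censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    exits = Counter(times)  # everyone who leaves the risk set at time t
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy data, not taken from the SOD1-ALS dataset.
toy_times = [6, 12, 12, 20, 30]
toy_events = [1, 1, 0, 1, 0]
curve = kaplan_meier(toy_times, toy_events)
```

Censored records (event 0) never trigger a step in the curve but still reduce the number at risk at later times, which is why the censor variable is needed alongside disease duration.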
Various analysis options are provided. The user can model survival for any number of individual SOD1 variants (including a "no variant" option) and variants can be collapsed into groups of interest (including an "any other SOD1 variant" option). We include three pre-defined options for grouping variants: by functional location (10) in the protein (across the dimer interface, electrostatic loop, zinc loop, and other) or according to the gene exon from which variants are transcribed. The final pre-defined analysis compares people with any SOD1 variant versus the "no variant" group.
Users can further customize the analysis. They can filter by continent of origin and opt to stratify the analysis by sex, family history, site of onset, clinical diagnosis, and country or continent of origin. Time-dependent CPH analyses are also possible, allowing users to define timepoints at which the data are split. This functionality allows time-dependent coefficients to be modeled and enables analysis constrained to a certain timeframe (e.g. only the first 12 months from disease onset).
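The idea of splitting data at user-defined timepoints can be sketched as follows. This is a hedged illustration of the general episode-splitting technique, assuming follow-up is measured from onset at time 0; the helper names `split_follow_up` and `restrict_to_window` are hypothetical and do not describe the tool's internals.

```python
def split_follow_up(duration, event, cut_points):
    """Split one person's follow-up into (start, stop, event) episodes.

    Each episode that ends at a cut point is marked event=0, because the
    person was still alive at that time; only the final episode carries
    the person's real event indicator.
    """
    episodes = []
    start = 0
    for cut in sorted(cut_points):
        if duration <= cut:
            break
        episodes.append((start, cut, 0))
        start = cut
    episodes.append((start, duration, event))
    return episodes


def restrict_to_window(duration, event, horizon):
    """Censor follow-up at `horizon` to analyze only the early timeframe."""
    if duration > horizon:
        return horizon, 0
    return duration, event


# A person who died 30 months after onset, with the data split at 12 and 24 months.
episodes = split_follow_up(30, 1, [12, 24])
# The same person in an analysis restricted to the first 12 months.
early = restrict_to_window(30, 1, 12)
```

Fitting a separate coefficient within each episode interval is one way a model can let an effect differ between, say, the first year and later years of disease.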
We allow users to upload supplemental data that is appended to the native sample, enriching the analysis possible within the tool. There are no restrictions regarding records that can be uploaded as supplemental data; users may provide data associated with SOD1 variants both present in and absent from the built-in dataset or provide data from other groups of patients (e.g. for variants from other genes). Formatting instructions for supplemental data are provided on the site.
The results of the user's analyses are presented on the website, and we provide options to (1) download these within an HTML report and (2) download publication-ready versions of the figures produced, with customizable formatting.

Examples of use
Here we present two examples of analyses possible within this tool. We examined differences in age of onset and disease duration between the strata of each example using Kaplan-Meier analyses and the log-rank test, and CPH models were applied with robust variance estimation as implemented by coxph to examine differences between strata before and after controlling for possible covariates. In the CPH models, we controlled for sex and age of onset when analyzing disease duration, and sex only when analyzing age of onset.
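For readers unfamiliar with the log-rank test used in these comparisons, a minimal two-group version can be sketched in Python as below. This is an illustrative implementation of the standard test under stated assumptions (two strata, chi-square on 1 degree of freedom); the function name and toy data are hypothetical, and it is not the code run by the tool.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-sample log-rank test; returns (chi_square, p_value) on 1 df."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    observed_a = expected_a = variance = 0.0
    for t in event_times:
        n = sum(1 for u, _, _ in data if u >= t)               # at risk, both groups
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)  # at risk, group A
        d = sum(1 for u, e, _ in data if u == t and e)         # events at t
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the group A event count at time t.
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi_square = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    return chi_square, p_value


# Hypothetical toy strata: group A dies early, group B dies late.
chi_square, p_value = logrank_two_groups(
    [1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
    [10, 11, 12, 13, 14], [1, 1, 1, 1, 1],
)
```

The test compares the observed number of events in one group with the number expected if both groups shared a single survival curve, accumulated over every event time.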
Case study 1 examined whether changes to amino acid hydrophobicity influenced the age of ALS onset or disease duration from onset until death. Amino acids were grouped into three hydrophobicity categories (28): hydrophobic (IUPAC amino acid codes (16): F, M, I, L, V), hydrophilic (D, E, H, K, R, N, Q), and intermediate (Y, W, P, G, A, S, T, C). Variants resulting in an amino acid substitution were then categorized based on the hydrophobicity group of the wild type and mutant amino acid; Table S1 presents the assignment of groups and data availability across variants. To specifically examine the consequence of changes in hydrophobicity, three sets of analyses were conducted, each respective to variants occurring in residues that are hydrophilic, intermediate, or hydrophobic in the wild-type protein. In each analysis, variants resulting in altered hydrophobicity were compared against variants where the mutant and wild type amino acids remained in the same hydrophobicity group. The p.A5V variant was excluded from these analyses since it is characterized by a particularly aggressive phenotype and accounted for the majority of records (n = 312) in the "intermediate to hydrophobic" category. A broader hydrophobicity analysis across all groups was also conducted.
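The grouping rule above can be expressed as a short Python sketch. The three residue sets are those listed in the text; the function name `classify_substitution`, the variant-string pattern, and the handling of unrecognized symbols (such as "X" for truncating variants) are illustrative assumptions rather than the tool's implementation.

```python
import re

# Residue groups as listed in the text (single-letter IUPAC codes).
GROUPS = {
    "hydrophobic": "FMILV",
    "hydrophilic": "DEHKRNQ",
    "intermediate": "YWPGASTC",
}
HYDROPHOBICITY = {aa: group for group, residues in GROUPS.items() for aa in residues}


def classify_substitution(variant):
    """Map a substitution such as 'p.A5V' to (wild type group, mutant group, changed).

    Returns None when either symbol is not one of the 20 standard residues,
    e.g. 'X' for a protein truncating variant.
    """
    match = re.fullmatch(r"p\.([A-Z])(\d+)([A-Z])", variant)
    if match is None:
        raise ValueError(f"unrecognized variant format: {variant!r}")
    wild, _position, mutant = match.groups()
    wild_group = HYDROPHOBICITY.get(wild)
    mutant_group = HYDROPHOBICITY.get(mutant)
    if wild_group is None or mutant_group is None:
        return None
    return wild_group, mutant_group, wild_group != mutant_group


a5v = classify_substitution("p.A5V")
g94a = classify_substitution("p.G94A")
```

Here p.A5V falls in the "intermediate to hydrophobic" category, matching the description in the text, while p.G94A stays within the intermediate group.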
Case study 2 examined trends associated with variation at the 94th amino acid residue of SOD1, coding for a glycine in the wild type protein. Six variants were present at this locus. We first analyzed differences in age of onset and disease duration associated with having any p.G94 variant vs any other SOD1 variant. Second, we compared p.G94 variants individually to non-p.G94 variants, aggregating across p.G94R, p.G94S, and p.G94V since they each contained fewer than 5 records. Table 1 summarizes characteristics of the data from both case studies.

Amino acid hydrophobicity analysis
In case study 1, we examined how the ALS phenotype varied by changes in amino acid hydrophobicity. Across all amino acid substitutions sampled: 42.86% were variants that remained in the same hydrophobicity category as wild type SOD1; 42.11% were variants with a hydrophilic or hydrophobic amino acid in the wild type and an intermediate amino acid in the mutant protein; 12.59% were variants with an intermediate amino acid becoming hydrophilic or hydrophobic; and 2.44% were variants with substitutions from hydrophilic to hydrophobic amino acids or vice versa (see Table 1).
Age at symptom onset appeared roughly comparable across variants in all categories of the hydrophobicity analyses (see Table 2; Table S2; Figure 2), with all groups having a mean age of onset between 46 and 51 years (Table 1).
Disease duration analysis (see Table S3 presents an additional CPH model comparing all hydrophobicity groups relative to residues with intermediate to intermediate amino acid substitutions. The analysis indicated that disease duration was shortest in this and the hydrophobic to hydrophilic substitution groups.

p.G94 amino acid residue analysis
In case study 2, we examined trends associated with variation in the 94th SOD1 residue. p.G94A was the most frequent variant at this locus and 5 other variants occurred in the dataset (see Table 1). This case study showed variant-specific trends in age of onset and disease duration, which were not discernible when aggregating across p.G94 variants, when compared with non-p.G94 SOD1 variants (see Table 2; Figure 2).
Age of onset was earlier than in the non-p.G94 SOD1 variant reference category only in the p.G94C (p value: log-rank test = 6.73 × 10⁻⁴; CPH model = 7.39 × 10⁻⁴) and p.G94R/S/V (p value: log-rank test = 5.66 × 10⁻⁴; CPH model = 9.34 × 10⁻⁸) groups; this difference appears considerable since the median age of onset for non-p.G94 SOD1 is over 10 years later than median onset in these two groups (see Table 1; Figure 2(E)). The disease duration analysis indicated that only the p.G94A variant was associated with shorter time to death (p value: log-rank test = 5.95 × 10⁻³; CPH model = 3.00 × 10⁻⁵). Inspection of hazard ratios suggests that p.G94C trended toward longer disease duration compared to non-p.G94 variants even after controlling for age of onset and sex (p value: log-rank test = 0.0672; CPH model = 0.100). Although the median disease duration was longer for variants in the p.G94D and p.G94R/S/V variant groups, data were insufficient to test the association.

Discussion
We have developed a web-tool to facilitate bespoke investigations of the impact of SOD1 variants upon the ALS phenotype, using survival analysis approaches. We have provided two examples of this tool's utility, examining differences in ALS age at symptom onset and disease duration according to (1) variants of varying impact upon residue hydrophobicity across SOD1 and (2) distinct variants at the 94th SOD1 residue.
This online facility has key benefits for research on the heterogeneous ALS phenotype. First, it provides a user-friendly interface for performing survival analysis, with various options for customization in accordance with the user's needs. Second, it provides access to a large in-built SOD1-ALS cohort and non-SOD1 comparator population, which can be further enriched if users provide their own supplementary data.
The hydrophobicity analysis suggested that substitution variants altering residue hydrophobicity from hydrophobic to intermediate or hydrophilic are associated with a shorter disease duration compared to variants in residues remaining hydrophobic across wild type and mutant SOD1. This aligns well with evidence that altered hydrophobicity promotes aggregation of the SOD1 protein (29), and may reflect greater destabilization and misfolding of SOD1 when variants cause more extreme alterations in hydrophobicity (30-32). Interestingly, variants with intermediate to intermediate amino acid substitutions were characterized by particularly short disease duration. Hydrophobic to hydrophilic amino acid substitutions and vice versa were, notably, infrequent relative to other substitutions. Given that these would represent the most extreme hydrophobicity alterations, this could indicate a potential survivorship bias and that these substitutions may be sufficiently deleterious to be evolutionarily suppressed. This appears reasonable since SOD1 is highly conserved, with deficiency being linked to severe and early onset phenotypes (33-35), and on the basis of variants in these hydrophobicity groups being entirely absent from the gnomAD v2.1.1 population database (36) (see Table S4).
Analysis of the variants at p.G94 emphasized the extent to which individual SOD1 variants differentially influence the phenotype. Grouping together all p.G94 variants suggested that age of ALS onset and disease duration are comparable for people with variants at this residue and those with non-p.G94 SOD1 variants. Only by examining variants individually did we observe that p.G94A was associated with shorter, and p.G94C trended toward longer, disease duration than non-p.G94 SOD1-ALS. Likewise, p.G94C and the aggregation of p.G94R, p.G94S, and p.G94V were indicative of substantially earlier age of onset. These findings are consistent with the results of our previous analysis of SOD1-ALS, emphasizing distinction between trends in age of onset and disease duration across individual variants (10). They highlight particularly the importance of making available resources to allow variant-level analyses of the ALS phenotype associated with variation in SOD1.
The tool is not without limitations. Most notable is that a number of the 162 SOD1 variants sampled are harbored by very few individuals and thus are not sufficient for individual variant analysis with the native dataset alone. However, this issue can be partly circumvented by aggregating rarer variants into a single analysis stratum, and by enlarging the dataset with user-supplied data.
Certain considerations apply when providing supplementary data to the tool. First, CPH models may only include covariates that are available in the native dataset. Second, records from supplementary data may overlap with the native dataset. To reduce this possibility, the tool will automatically flag any people among the supplementary dataset who may be a duplicate of a person in the built-in dataset, checking for matches by country of origin, SOD1 amino acid change (if the user indicates that one is present), age of onset, site of onset, sex, and disease duration (if not censored). Users can also consult the cohort description provided and contact ALSoD (https://alsod.ac.uk/) (37) with any concerns.
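The duplicate check described above can be sketched as a field-by-field comparison. The following Python example is a hypothetical illustration: the record keys, the use of exact equality, and the choice to compare disease duration only when death was observed in both records are assumptions, since the text does not specify how the tool performs the matching.

```python
# Hypothetical field names; the tool's real column names are given on the site.
ALWAYS_COMPARED = ("country", "sex", "site_of_onset", "age_of_onset")


def may_be_duplicate(uploaded, native):
    """Return True if an uploaded record matches a native record on all checked fields."""
    if any(uploaded[field] != native[field] for field in ALWAYS_COMPARED):
        return False
    # Compare the amino acid change only when the uploaded record reports one.
    if uploaded.get("variant") and uploaded["variant"] != native.get("variant"):
        return False
    # Assumption: compare disease duration only when neither record is censored.
    if uploaded["died"] and native["died"]:
        return uploaded["duration"] == native["duration"]
    return True


def flag_duplicates(uploaded_records, native_records):
    """Return indices of uploaded records that match at least one native record."""
    return [
        index
        for index, record in enumerate(uploaded_records)
        if any(may_be_duplicate(record, native) for native in native_records)
    ]


# Hypothetical records for illustration only.
native = [{"country": "UK", "sex": "F", "site_of_onset": "spinal",
           "age_of_onset": 48, "variant": "p.A5V", "died": True, "duration": 14}]
uploaded = [
    {"country": "UK", "sex": "F", "site_of_onset": "spinal",
     "age_of_onset": 48, "variant": "p.A5V", "died": True, "duration": 14},
    {"country": "UK", "sex": "M", "site_of_onset": "bulbar",
     "age_of_onset": 61, "variant": None, "died": False, "duration": 30},
]
flagged = flag_duplicates(uploaded, native)
```

A flag produced this way only indicates a possible duplicate; two different people can share all of these characteristics, so flagged records still need manual review.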
Overall, the open-access web-utility we provide (https://sod1-als-browser.rosalind.kcl.ac.uk/) has a potentially substantial benefit for ALS disease research and direct translational use for the design of patient stratification approaches, as well as being useful for mutation adjudication committees needing to make decisions on likely disease course with limited data. It permits an array of analysis options which can be readily implemented by users without any programming knowledge, and can be enriched by the provision of a supplementary dataset. Accordingly, this tool allows clinicians and researchers to circumvent many possible barriers they may otherwise face, for instance, regarding insufficient data availability or in preparing these data for analysis. The potential translational benefit of this tool is substantial, facilitating growth in understanding of the ALS phenotype which may aid the design and implementation of effective healthcare, treatments, and clinical trials.

Figure 1. Variant characteristics for the native dataset. Panel A: The canonical SOD1 amino acid sequence (bold) and variants recorded at each residue, denoted using IUPAC amino acid nomenclature (16), where "X" indicates protein truncating variants. Alternating background shading indicates residues encoded from different exons of the SOD1 gene. Panel B: The number of variants with at least a certain number of records available across different thresholds.

Figure 2. Kaplan-Meier survival curves for age of onset and disease duration analyses. Analysis shown: Panels A-B: trends associated with wild type and variant amino acid hydrophobicity; Panels C-D: Any SOD1 p.G94 variant versus non-p.G94 SOD1 variants (OtherVariant). Panels E-F: individual p.G94 variants versus non-p.G94 SOD1 variants. Panels A, C, and E are for age of onset analysis, and B, D, and F describe disease duration. Panels A and B display all hydrophobicity groups in a single figure for each analysis and confidence intervals are not displayed to maximize visual clarity; Figure S1 visualizes trends in age of onset and disease duration for these groups after stratifying across panels according to the hydrophobicity group of the wild type residue.

Table 2; Figure 2) however, suggested that alterations in amino acid hydrophobicity may affect disease prognosis following onset. Analysis of variants at residues which are hydrophobic in wild type SOD1 indicated that disease duration was shorter in substitutions to hydrophilic amino acids (p value: log-rank test = 1.25 × 10⁻⁵; CPH model = 9.19 × 10⁻⁸) and that substitution into intermediate amino acids also tended toward shorter disease duration (p value: log-rank test = 0.0202; CPH model =

Table 1. Data summary for case studies. For disease duration analysis, restricted mean and estimates are obtained from the survival curve and the standard error (SE) and 95% confidence interval (CI) indicates certainty in this estimate; the distribution of disease duration in non-censored individuals is reported directly. In age of onset analysis no person is censored, therefore estimates correspond with the raw descriptive statistics. * Minimum and maximum shown because disease duration was only available for two people. SD

Table 2. Inferential statistics for survival analyses across case studies. Bold values denote nominal p values <0.05. * Controlling for sex in the age of onset analysis and for sex and age of onset in the disease duration analysis. # Hazard ratios greater than 1 indicate earlier age of onset/shorter disease duration in the non-reference group. † No p.G94S variants were available for the disease duration analysis. CPH = Cox proportional-hazards.
College London, or the Department of Health and Social Care. AI is funded by South London and Maudsley NHS Foundation Trust, MND Scotland, Motor Neurone Disease Association, National Institute for Health and Care Research, Spastic Paraplegia Foundation, Rosetrees Trust, Darby Rimmer MND Foundation, the Medical Research Council (UKRI) and Alzheimer's Research UK. MK is supported by Darby Rimmer MND Foundation and Spastic Paraplegia Foundation. GH is supported by King's College London DRIVE-Health Centre for Doctoral Training, the Medical Research Council and the Perron Institute for Neurological and Translational Science.