Ensemble machine learning identifies genetic loci associated with future worsening of disability in people with multiple sclerosis

Data, study cohort, and inclusion criteria

Using prospective data pooled from the multi-centre (Brisbane, Newcastle, Geelong and Western Victoria, and Tasmania) Australian Longitudinal Cohort Study (the AusLong Study (https://www.menzies.utas.edu.au/research/diseases-and-health-issues/research-projects/the-auslong-study-of-factors-that-contribute-to-the-development-and-progression-of-ms))54 of MS, we analysed 279 prospectively assessed first demyelination event (FDE) participants enrolled between 2003 and 200655. The AusLong Study (https://www.menzies.utas.edu.au/research/diseases-and-health-issues/research-projects/the-auslong-study-of-factors-that-contribute-to-the-development-and-progression-of-ms)54 has ethical approval from the Tasmanian Health and Medical Research Ethics Committee (ref: H0010499, 01/-5/2009); the Queensland Institute of Medical Research Human Research Ethics Committee (ref: P1252, 22/05/2009); the Royal Brisbane and Women’s Hospital Human Research Ethics Committee (ref: HREC/09/QRBW/299, 19/10/2009); the Hunter New England Human Research Ethics Committee (ref: 09/04/15/5.04, HREC/09/HNE/139, SSA/09/HNE/140, 10/08/2009); and the Barwon Health Human Research Ethics Committee (ref: BH 09/24, BH 03/46, 04/08/2009). All experiments (blood collection, genotyping, and clinical examinations) were conducted in accordance with the guidelines of each committee at the participating centres. Written informed consent was obtained from all subjects and/or their legal guardian(s) in accordance with the Declaration of Helsinki56. EDSS scores were acquired prospectively at intervals up to 15 years post FDE by trained and certified neurologists, and a validated telephone EDSS was obtained at yearly computer-assisted telephone interviews from 2 to 3 years post FDE. Initial data extraction (n = 279 cases) was done using the revised 2017 McDonalds criteria57 in which cases were defined at their last review as either remaining as clinically isolated syndrome (CIS), relapsing-onset MS (ROMS), secondary progressive MS (SPMS), or progressive-onset MS (POMS). The selection criteria for the final cohort were done as illustrated in Fig. 1.

Figure 1
figure 1

A flow chart of AUSLONG data extraction and case selection criteria.

Genotyping, imputation, and quality control

The Illumina Infinium Global Screening Array-24 v2.0 BeadChip was used to genotype DNA samples from AusLong Study participants. Genotypes were called using Illumina GenomeStudio software. Strict quality control was conducted according to established protocols58. In brief, samples were excluded for three reasons: a call rate ≤ 99%, duplicate discordance, or gender error. Further, variants were excluded based on a call rate ≤ 99% or deviation from Hardy–Weinberg equilibrium with p < 1.0 × 10–6. Two principal components analyses were conducted, one excluding HapMap samples to identify population outliers, and one including HapMap samples to help interpret the outliers58. To maximise genetic coverage, the dataset were imputed using the algorithm implemented in IMPUTE version 459 using 1000 Genomes phase 360 as the reference genotype panel (GRCh37/hg19). Genetic variants having an imputation score ≤ 0.5 and minor allele frequency (MAF) ≤ 0.05 were discarded. For the remaining variants, those that were previously identified as being related to MS risk by IMSGC (https://doi.org/10.1126/science.aav7188?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed)39 were extracted (n = 208 of 232 SNPs) and considered in the association analysis. To be clearer, this study uses the IMSGC (https://doi.org/10.1126/science.aav7188?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed)39 risk SNP list as a reference source to identify MS risk variants that may also contribute to the risk of worsening of disability.

Imputation of missing EDSS measures

Imputation of missing EDSS (n = 471 of 3065) was based on a Bayesian approach using the JointAI R-package61. Conditional on the observed SNPs genotypes, the EDSS scores were considered missing at random. The imputation model is a cumulative logit mixed-effect proportional odds model1 defined on 8 disability states (1 = EDSS [0–1.5], 2 = EDSS [2–2.5], 3 = EDSS [3–3.5], 4 = EDSS [4–4.5], 5 = EDSS [5–5.5], 6 = EDSS [6–6.5], 7 = EDSS [7–7.5], and 8 = EDSS [8–9.5]). Based on the results from previous studies23,25,62,63,64,65, clinical and environmental factors including sex, age at disease onset,BMI, titre of Epstein–Barr Nuclear Antigen IgG (EBNA), smoking status, hospital anxiety depression scores (HADS), and previous EDSS scores (EDSSPREV), were used as “analysis variables” to impute EDSS levels, whereas latitude (study site), vitamin D supplementation status (VitD status), and MS disease course subtype (MSTYPE) were used as “auxiliary variables” to inform the imputation of any missing value(s) found in the “analysis variables”. These variables were included in the imputation model following their importance in predicting worsening of disability25. In the cumulative logit mixed model, we posit that

$$\mathrm{logit}\left(P\left({y}_{ij}>k\right)\right)={\alpha }_{k}+{Z}_{ij}^{T}\beta +{\zeta }_{ij}^{T}{b}_{i}, k\in 1, \dots , K,$$


$${\gamma }_{1}, {\delta }_{1}, \dots , {\delta }_{K-1}\sim N\left({\mu }_{\gamma }, {\sigma }_{\gamma }^{2}\right),$$

$${\mu }_{y}\sim N\left(0, {\sigma }_{u}\right),$$

$${\sigma }_{\gamma }^{2}\sim \Gamma (\varepsilon , \varepsilon )),$$

$${\gamma }_{k}\sim {\gamma }_{k-1}+\mathrm{exp}\left({\delta }_{k-1}\right), k=2,\dots , K,$$

where \({y}_{ij}\) is the EDSS level for subject \(i\) at visit \(j\), \({\gamma }_{k}\) are 7 intercepts representing the levels of EDSS (\(\mathrm{i}.\mathrm{e}., k=2,\dots ,8\)), \({Z}_{ij}^{T}\) is a fixed-effect design matrix containing the clinical and environmental covariates including time-varying effects (BMI, HADS, VitD status, and MSTYPE), with a corresponding vector of fixed effects regression coefficients \(\beta\); and \({\zeta }_{ij}^{T}\) is a design matrix containing random effects, \({b}_{i}\) are random deviations from the overall intercepts \({\gamma }_{k}\); \({\mu }_{y}\) and \({\sigma }_{\gamma }^{2}\) are hyperpriors61.

Outcome measures, analysis endpoint, and data structure

Based on the study design, a first-order Markov’s assumption for continuous-time EDSS transitions was considered1,66,67,68, and defined as: “the current EDSS state (EDSS score) depends on the previous states (EDSSPREV), and all covariate histories”. In other words, using 8 categories (listed above) of the newly imputed EDSS score, we considered a continuous-time evolution for each disability state, wherein the state at the previous observation is retained until the current visit. Note that an observation may also represent a transition to a different state before arriving at the current state, or a repeated observation of the same underlying state at the end of follow-up.

Using these assumptions, we transformed the data and defined our clinical endpoint to capture continuous-time transitions in disability states as: y = 1 denoting “worsening” events (transitions from a lower to a higher disability state) made by an individual from study entry, and y = 0 denoting “improved” events (transitions from a higher to a lower disability state). All stable-state transitions or stable disease (no change in EDSS) were excluded as these were considered non-informative censoring events, and could lead to likelihood drainage, and potentially alter the results. Therefore, only informative (“improved”) events were censored. The event status for the \(i\)th subject at the \(j\)th visit was defined as

$${y}_{ij}=\left\{\begin{array}{l}1,\quad if\, worsening \,events, \\ \\ 0, \quad otherwise \end{array}\right..$$

Since individuals entered the study at different times, we defined the time-to-worsening of disability as the time to switch disability states. Specifically, it is the continuous time elapsed since MS diagnosis until the current observation. This was achieved using the “msm2Surv” function in mstate R-package69. To enable comparison of baseline hazards, the start time for all cases was set to zero at study entry.

Statistical analysis

All statistical analyses were conducted before (after) imputation of EDSS, respectively. To identify risk SNPs that predicted the time-to-worsening, and/or associated with future worsening events over time, a three-stage process was employed.

Stage 1: variable selection, risk estimation, and prognostic modeling

Variable selection

We randomly split the genotype data into 75% training (n = 202), and 25% test cohorts (n = 67) as depicted in Fig. 2. Utilising the training cohort, we first performed a global test to examine the added prognostic values of all SNPs (\(n\) = 208) that passed the QC stage using the Goeman’s “globaltest” R-package70. Specifically, we tested the hypothesis (\({H}_{0}: {\beta }_{1}={\beta }_{2}=,\dots ,{\beta }_{208}=0\) versus \({H}_{a}: {\beta }_{1}\ne {\beta }_{2}\ne ,\dots ,{\beta }_{208}\ne 0\)) of no additional prognostic values of MS related genetic variants on the risk of worsening, conditional on the effects of clinical and environmental modifiers of disease (mentioned above). The significance level for this test was set to \(p\) < 0.0571. Following the global test results, we applied three widely used penalised multivariable Cox models namely: least absolute shrinkage and a selection operator (LASSO), elastic net (ENET), and non-negative garrotte combined with sure independent screening (NNG-SIS), with tenfold cross-validation (CV) to select important SNPs. Because a SNP can affect one or multiple EDSS transition steps with effects in different directions, we added interactions with EDSSPREV. Utilising the training cohort, LASSO and ENET were fitted using the Goeman’s penalised R-package72, and NNG-SIS using customised survival functions73. SNPs having zero effect sizes were discarded.

Figure 2
figure 2

Outline of ensemble learning, and genetic risk prediction model construction. PT progressive transitions (“worsening” events); RT regressive transitions (“improved” events).

Risk estimation and prognostic modelling

For the remaining SNPs selected across LASSO, ENET and NNG-SIS, a time-dependent multivariable Cox model with backward elimination (α = 0.05) was further employed to identify candidate SNPs8. This was achieved using the “mfp” R function74. Unbiased effect sizes for the candidate SNPs were then estimated in a random effects Cox model using the “coxme” R function75. In the Cox model, we posit that

$${\lambda }_{i}\left(t|x,b\right)={\lambda }_{0}\left(t\right)\mathrm{exp}\left({X}_{i}^{T}\beta +{\zeta }_{i}^{T}b\right),$$


$$b\sim N\left(0,\Sigma \left(\theta \right)\right),$$

where \(\lambda \left(.\right)\) is the hazard function for the \(i\)th subject at time \(t\), \({\lambda }_{0}\) is an unspecified baseline hazard function; \({X}_{i}^{T}\) is a fixed-effects design matrix containing SNPs dosages (including interactions with EDSSPREV), with a corresponding vector of fixed-effects regression coefficients \(\beta\); \({\zeta }_{i}^{T}\) is a design matrix of random effects, with a corresponding vector of random effects estimates \(b\); and \(\theta\) is the variance of the random effects. Note that the random components are subject identifiers nested within MSTYPE.

Stage 2: constructing genetic risk ensembles

To classify PwMS according to their disability status, we trained widely used RF and GBM models using the candidate genetic variants selected from stage 1 and compared their performance with MErf and MEgbm. The time-dynamic area under the receiver operating characteristics curve (AUC) was used to assessed model performance. This was achieved using internal and external fivefold cross-validation on the training cohort. That is, we split the entire training cohort into dynamic lagged training and internal validation sets (Fig. 2), such that predictors in the current visit were used to predict outcomes in the next visit51. The prediction model is a mixed-effects logistic model with random intercepts (subject identifiers nested within MSTYPE), and random slopes (random time effects). Next, mixed-effects logistics decision trees were constructed, and genetic decision rules were extracted using the “inTrees” (interpretable trees) algorithm in the MEml R-package51. To ensure non-redundant rule sets, we selected all rules of length between 2 and 5, with frequency < 0.01, and error < 0.25 in predicting future worsening events. In the mixed-effects logistic model used to classify MS subjects, we posit that

$${y}_{ij}\sim Bern\left({\mu }_{ij}\right), {y}_{ij}=\left\{\begin{array}{l}1, \quad if\, worsening\, status\\ 0, \quad otherwise \end{array}\right.,$$

$$\mathrm{logit}(\frac{{\mu }_{ij}}{1-{\mu }_{ij}})={(\beta }_{0}+{b}_{0i})+{X}_{i}^{T}\delta +{(b}_{1i}+{\beta }_{1}{t}_{ij}),$$


$${b}_{0i},{b}_{1i}\sim N\left(0,\Sigma \right), j=1,\dots {m}_{i}, i=1,\dots , N,$$

$$\sum_{i=1}^{N}{m}_{i}=2786; N=269,$$

where \({y}_{ij}\) is the event status for the \(i\)th subject at the \(j\)th time point; \({b}_{0i}\) are random deviations from the overall intercept (bias term) \({\beta }_{0}\); \({b}_{1i}\) are random deviations from the overall slope \({\beta }_{1}\), while \(\Sigma\) is the variance–covariance matrix for the random effects. \({\beta }_{1}\) is the regression effect of the observation time (\({t}_{ij}\)) since diagnosis; \({X}_{i}^{T}\) is a fixed-effect design matrix containing SNP dosages (including interactions with EDSSPREV), with a corresponding vector of fixed-effects regression coefficients \(\delta\).

Stage 3: validation of the ensembles and their prediction rules

To evaluate the ensembles and the generated decision rules obtained in stage 2, we assessed externally the performance of the ensembles on the test cohort. Time-dynamic ROC (receiver operating characteristic curves) analysis was used to assess how well each model predicted future worsening events. The importance of the SNPs and their genetic decision rules in predicting future worsening of disability was estimated and evaluated on the training and test cohorts, respectively. We prioritised SNPs based on average Gini impurity and relative influence; and by their deleteriousness in the human genome using combined annotation dependent depletion (CADD) scores76.

Functional annotation and gene enrichment analysis

Utilising the candidate prognostic variants, functional annotation and gene enrichment analyses were further conducted using the FUMA software as per the online manual77. The following parameters were used to further identify independent lead SNPs: maximum p-value for lead SNPs < 0.05, maximum p-value for annotation < 0.05, r2-threshold to define LD structure of lead SNPs ≥ 0.6, MAF ≥ 0.01, and maximum distance between LD blocks d < 250 kb. Because the raw p-values were derived from a multivariable analysis, p < 0.05 was used as the threshold cut-off.