Bayesian penalized model for classification and selection of functional predictors using longitudinal MRI data from ADNI

ABSTRACT The main goal of this paper is to employ longitudinal trajectories in a significant number of sub-regional brain volumetric MRI measures as statistical predictors for Alzheimer's disease (AD) classification. We use logistic regression in a Bayesian framework that includes many functional predictors. The direct sampling of regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, the selection of predictors is paramount with the introduction of either spike-and-slab priors, non-local priors, or horseshoe priors. We seek to avoid the complicated Metropolis-Hastings approach and to develop an easily implementable Gibbs sampler. In addition, the Bayesian estimation provides proper estimates of the model parameters, which are also useful for drawing inference. Another advantage of working with logistic regression is that it calculates the log odds of relative risk for AD compared with normal controls based on the selected longitudinal predictors, rather than simply classifying patients based on cross-sectional estimates. Ultimately, however, we combine approaches and use a probability threshold to classify individual patients. We employ 49 functional predictors consisting of volumetric estimates of brain sub-regions, chosen for their established clinical significance. Moreover, the use of spike-and-slab priors ensures that many redundant predictors are dropped from the model.


Introduction
The research literature on applied mathematical approaches and classification methods using longitudinal MRI data has seen massive growth over the past decade. Among the broad range of methods applied with variable degrees of success, several warrant mention. Misra et al. (2009) applied a high-dimensional pattern recognition method to baseline and longitudinal MRI scans to predict conversion from MCI to AD over a 15-month period. Zhang and Shen (2012) used a multi-kernel SVM for classification of patients between MCI and AD, achieving 78.4% accuracy, 79% sensitivity, and 78% specificity. Lee et al. (2016) applied logistic regression to predict conversion from MCI to Alzheimer's, using fused lasso regularization to select important features. Seixas et al. (2014) proposed a Bayesian network decision model for detecting AD and MCI that considered the uncertainty and causality behind different disease stages. Their Bayesian network used a blend of expert knowledge and data-oriented modelling, and the parameters were estimated using an EM algorithm. Adaszewski et al. (2013) employed classical group analyses and automated SVM classification of longitudinal MRI data at the voxel level. Arlt et al. (2013) studied the correlation between test scores over time and fully automated MRI-based volumetry at baseline. However, few studies to date have developed methods that raise the sensitivity, accuracy, and specificity of classification in AD diagnosis or progression above 80%.
Classification using longitudinal data can be a challenge with a large number of predictors. The first significant step in handling longitudinal predictors is to consider each multiple-occasion observation as a single function observed over a time interval. Functional predictors have high correlation between adjacent measurements, and the observational space is high-dimensional. The number of predictors required for estimation often exceeds the number of observations, thus introducing the problem of dimensionality. A regression framework is frequently the most suitable way to model all possible longitudinal effects across ROIs, where the proposed method selects the important predictors. Moreover, many biomedical studies have shown that a limited number of specific brain regions or ROIs are essential for AD classification. Thus, dimension reduction techniques can be applied, and classification can be limited to the reduced feature set. Zhu et al. (2010) advanced a method for classification and selection of functional predictors that entails calculating functional principal component scores for each functional predictor, followed by the use of these scores to classify each individual observation. They proposed Gaussian priors for selection and created a hybrid Metropolis-Hastings/Gibbs sampler algorithm. Although the method reported in the present study is inspired by this work, we develop a simple Gibbs sampler in which MCMC samples are drawn from standard distributions. We also focus on applying penalized regression for dimension reduction. In the Bayesian variable selection literature, the spike-and-slab prior has widespread applications due to its superior selection power. George and McCulloch (1993, 1997) initially proposed that each coefficient β can be modelled either by the 'spike' distribution, where most of the mass is concentrated around zero, or by the 'slab' distribution, which resembles a diffuse distribution.
Instead of imposing the spike-and-slab prior directly on regression coefficients, Ishwaran and Rao (2005) introduced a method in which they placed a spike-and-slab prior on the variance of Gaussian priors. Bayesian variable selection methods also include various Bayesian regularization methods, such as the Bayesian lasso (Park & Casella, 2008), Bayesian group lasso, and Bayesian elastic net (Li & Lin, 2010). We employ a Bayesian group lasso algorithm blended with a spike-and-slab prior obtained from Xu and Ghosh (2015). The group structure among coefficients in our model comes from functional smoothing of the coefficients, and group lasso facilitates the selection of the important functional predictors. Thus, our proposed method extends Bayesian variable selection to a generalized functional linear model with binary responses.
The fundamental challenge of this work is to perform logistic regression in a Bayesian framework while using a large number of functional predictors. The direct sampling of regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, selection of predictors becomes crucial with the introduction of either a spike-and-slab prior, non-local priors, or horseshoe priors. For all such priors, the full posterior distribution of the regression coefficients is analytically inconvenient. We adopt the Pólya-gamma augmentation method together with the priors proposed by Xu and Ghosh (2015), which yields full conditional samples from standard distributions. Our aim is to avoid the complications of Metropolis-Hastings and to develop an easily implementable Gibbs sampler. In addition, Bayesian estimation provides proper estimates of the model parameters, which are also useful for drawing inference. The key advantage of this method is that it calculates the log odds of AD with respect to CN based on the selected longitudinal predictors. Moreover, we use a probability threshold for classifying individual patients to validate our modelling performance. We obtained the data used in this paper from the ADNI server. The volumetric MRI brain data include parcellated sub-regions of the whole brain, with separate subdivisions for the left and right hemispheres. Volumetric measurements of brain sub-regions across multiple occasions over time demonstrate differential patterns of brain atrophy between AD patients and normally aging people. Because not all brain regions are equally related to AD, the redundant features derived from unrelated brain regions can be removed by limiting the selection to brain sub-regions important to classification. The problem of identifying important brain sub-regions from a large number of functional predictors or longitudinal measurements is far from simple.
Various variable selection methods have been designed for single-time-point data with respective target variables. We apply a Bayesian variable selection method to select longitudinal features or functional predictors for our data set. We work with 49 functional predictors consisting of longitudinal volumetric measurements in different sub-regional brain ROIs. The use of the spike-and-slab prior ensures that a large number of redundant predictors are dropped from the model. The ROI sub-regions selected by our method will be helpful for future studies to detect the progression of dementia.
The paper is organized as follows. In Section 2, we introduce Bayesian variable selection with a spike-and-slab prior. Section 3 discusses functional smoothing of the longitudinal predictors. In Section 4, we introduce our methodology and algorithm for simultaneous selection and classification. Theoretical properties and consistency results are shown in Section 5. We then discuss the application results with simulated data and real data in Sections 6 and 7. Finally, Section 8 covers the overall development and limitations of the methodology.

Bayesian variable selection
We briefly discuss Bayesian variable selection below.

Spike-and-slab prior
A Bayesian model with a spike-and-slab prior can be constructed as follows:

$$ \mathbf{Y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}_n), \qquad \boldsymbol{\beta} \mid \boldsymbol{\gamma} \sim N(\mathbf{0}, \boldsymbol{\Gamma}), \qquad \boldsymbol{\gamma} \sim \pi(d\boldsymbol{\gamma}), \qquad \sigma^2 \sim \mu(d\sigma^2), $$

where 0 is a p-dimensional zero vector, Γ is the p × p diagonal matrix diag(γ_1, …, γ_p), π is the prior measure for γ = (γ_1, …, γ_p), and μ is the prior measure for σ². Ishwaran and Rao (2005) proposed this setup and developed optimal properties based on the prior choice of (β | γ).
A popular version of the spike-and-slab model, introduced by George and McCulloch (1993, 1997), identifies zero and non-zero β_i's by using zero-one indicator variables γ_i and assuming a scale mixture of two normal distributions:

$$ \beta_i \mid \gamma_i \sim (1 - \gamma_i)\, N(0, \tau_i^2) + \gamma_i\, N(0, c_i^2 \tau_i^2). $$

The value τ_i² > 0 is some suitably small value, while c_i > 0 is some suitably large value. γ_i = 1 flags coefficients β_i that are significant; these coefficients have large posterior hypervariances and large posterior values of β_i. The opposite occurs when γ_i = 0. The prior hierarchy for β is completed by assuming a prior for γ_i. As τ_i² tends to zero, more mass is placed at 0, serving as the prior for insignificant β's. The prior distribution for the regression coefficients can then be written as

$$ \beta_i \mid \gamma_i \sim (1 - \gamma_i)\, I_0 + \gamma_i\, N(0, \nu^2), $$

where I_0 denotes point mass at 0, and ν² is the limit of c_i² τ_i² as τ_i² tends to zero with c_i² large enough.
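As a small illustration, the two-normal scale mixture above can be simulated directly. The variance values below (τ² = 10⁻⁴, c² = 10⁴) are hypothetical choices for illustration only, not values used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_slab_draw(gamma, tau2=1e-4, c2=1e4, size=10000):
    """Draw beta from the two-normal scale mixture:
    N(0, tau2) when gamma = 0 (spike), N(0, c2 * tau2) when gamma = 1 (slab)."""
    sd = np.sqrt(c2 * tau2) if gamma else np.sqrt(tau2)
    return rng.normal(0.0, sd, size)

spike = spike_slab_draw(0)   # draws concentrated near zero (sd = 0.01)
slab = spike_slab_draw(1)    # diffuse draws (sd = 1)
print(spike.std(), slab.std())
```

The empirical standard deviations make the spike/slab contrast concrete: the spike mass sits almost entirely at zero, while the slab allows large coefficient values.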

Bayesian group lasso
We discussed Bayesian group lasso extensively in the introduction. The form of Bayesian group lasso we work with was initiated by Xu and Ghosh (2015). A multivariate zero-inflated mixture prior can induce sparsity at the group level, as elaborated in Xu and Ghosh (2015). The hierarchical structure, with an independent spike-and-slab prior for each β_g, is as follows.
$$ \mathbf{Y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2 \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}), \qquad \boldsymbol{\beta}_g \mid \sigma^2, \tau_g^2 \sim (1 - \pi_0)\, N\!\left(\mathbf{0}, \sigma^2 \tau_g^2 \mathbf{I}_{m_g}\right) + \pi_0\, \delta_0(\boldsymbol{\beta}_g), \quad g = 1, \ldots, G, $$

$$ \tau_g^2 \sim \text{Gamma}\!\left(\frac{m_g + 1}{2}, \frac{\lambda^2}{2}\right), \qquad \pi_0 \sim \text{Beta}(a, b), $$

where δ_0(β_g) denotes point mass at 0 and m_g is the size of group g. The mixing probability π_0 can be defined as a function of the number of predictors to impose more sparsity as the feature size increases. The choice of λ is critical in Xu and Ghosh's prior setup: large values of λ produce biased estimates, while very small values make the slab part overly diffuse. Xu and Ghosh (2015) suggested an empirical Bayes approach to estimate λ; due to the intractability of the marginal likelihood, they proposed a Monte Carlo EM algorithm for its estimation. Moreover, they showed theoretically and numerically that median thresholding of the posterior β_g samples yields exact zero estimates for the insignificant group predictors.

Functional smoothing for longitudinal data
Classification with the selection of significant functional predictors is challenging. Researchers commonly observe high correlation values between functional predictors. In this paper, we work under the assumption of independence between predictors; hence, we later propose a corresponding prior on the coefficient space. The main advantage of using functional predictors is that they allow us to measure time trends present in the data. We start our methodology by smoothing functional observations using a cubic basis spline. We restrict our data set to patients with at least four time-period observations, so that the smoothed curves are comparable. James (2002) used a similar approach to obtain the estimates of a generalized linear model with functional predictors.
Let us assume that we observe n patients with their functional observations, and each patient has p functions. We assume that not all p functional observations are important. Let x_ij(t) be the j-th function observed from the i-th patient. Let T be the compact domain of x_ij(t), with x_ij(t) ∈ L²[T]. With the functional predictors (x_i1(t), …, x_ip(t)), we assume a binary response variable y_i taking values 0 and 1. We also assume that the predictors have been centred, so that we can ignore the intercept term. Therefore, we have the following logistic regression equation:

$$ \log \frac{P(y_i = 1)}{1 - P(y_i = 1)} = \sum_{j=1}^{p} \int_T x_{ij}(t)\, \beta_j(t)\, dt. \quad (1) $$

Next, we construct an orthonormal basis φ_k(t) that can be used to decompose the functional predictors and the corresponding logistic regression coefficients:

$$ x_{ij}(t) = \sum_{k=1}^{q} c_{ijk}\, \phi_k(t), \qquad \beta_j(t) = \sum_{k=1}^{q} \beta_{jk}\, \phi_k(t), \quad (2) $$

where c_{ijk} and β_{jk} are the coefficients of x_ij(t) and β_j(t) with respect to the k-th orthonormal basis function φ_k(t). For notational convenience, we denote the basis coefficients as β_{jk}; these are different from the functional coefficients β_j(t). We use cubic basis splines as the orthonormal basis for our simulation examples and real data applications. Hence, the choice of q depends entirely on the number of internal knots used in the basis spline construction. By orthonormality, the j-th component in equation (1) can thus be written as

$$ \int_T x_{ij}(t)\, \beta_j(t)\, dt = \sum_{k=1}^{q} c_{ijk}\, \beta_{jk}. $$

To fit the discrete observations x_ij(t), we assume that, at any given time t, instead of x_ij(t) we observe X_ij(t):

$$ X_{ij}(t) = x_{ij}(t) + e(t), $$

where e(t) is a zero-mean Gaussian process. We use the same basis function expansion for X_ij(t), of the form

$$ X_{ij}(t) = \boldsymbol{\phi}(t)' \mathbf{c}_{ij} + e(t), $$

where φ(t) is the q-dimensional spline basis at time t and c_ij is the q-dimensional vector of spline coefficients for the j-th predictor of the i-th patient. We use ordinary least squares estimates for the spline coefficients.
A simple linear smoother is obtained by minimizing the least squares criterion ‖x_ij − Φc_ij‖², as

$$ \hat{\mathbf{c}}_{ij} = \left( \boldsymbol{\Phi}' \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}' \mathbf{x}_{ij}. \quad (3) $$

Once the orthonormal basis coefficients have been estimated, we can combine (1), (2) and (3) by plugging in the estimates:

$$ \log \frac{P(y_i = 1)}{1 - P(y_i = 1)} = \sum_{j=1}^{p} \hat{\mathbf{c}}_{ij}' \boldsymbol{\beta}_j, $$

where β_j = ∫_T β_j(t) φ(t) dt is the coefficient vector for the j-th functional predictor. In the intercept version of the model, the vector c_i has 1 as its first element, followed by the spline coefficients for the i-th patient, and β contains the model intercept as its first element. We use the no-intercept form for our real data and simulation applications, where c_i = (ĉ_i1′, …, ĉ_ip′)′ has no leading 1 and β = (β_1′, …, β_p′)′ has a group structure with each group of size q. Our selection method drops the redundant β_j's and selects the important coefficient groups.
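The least squares smoothing step can be sketched as follows. This is a Python analogue (using scipy in place of R's splines::bs) applied to a hypothetical noisy sine curve; the four-internal-knot, q = 8 basis matches the construction used later in the simulations:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(t, internal_knots, degree=3):
    """Evaluate a cubic B-spline basis at time points t (an analogue of R's bs()
    with an intercept column). Boundary knots are repeated degree + 1 times."""
    knots = np.concatenate(([t[0]] * (degree + 1), internal_knots,
                            [t[-1]] * (degree + 1)))
    n_basis = len(knots) - degree - 1
    # evaluate each basis element via an identity coefficient vector
    return np.column_stack([
        BSpline(knots, np.eye(n_basis)[m], degree)(t) for m in range(n_basis)
    ])

t = np.linspace(0, 1, 50)
Phi = bspline_basis(t, internal_knots=np.array([0.2, 0.4, 0.6, 0.8]))
# hypothetical noisy functional observation x_ij at the 50 time points
x = np.sin(2 * np.pi * t) + 0.1 * np.random.default_rng(1).normal(size=50)
# ordinary least squares for the spline coefficients: c_hat = (Phi'Phi)^{-1} Phi' x
c_hat, *_ = np.linalg.lstsq(Phi, x, rcond=None)
print(Phi.shape, c_hat.shape)  # (50, 8) (8,)
```

With four internal knots and a cubic spline plus intercept, the basis matrix has q = 8 columns, so each functional predictor is summarized by an 8-dimensional coefficient vector ĉ_ij.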
Functional principal component (FPC) analysis is another popular method that could be applied here. Instead of least squares basis estimates, one can work with FPC scores for classification. Zhu et al. (2010) also used FPC scores in their classification model, selecting the functional predictors whose FPC scores were significant. Müller (2005) extended the applicability of FPC analysis for modelling longitudinal data; in particular, FPC scores can be used when we have few, irregularly observed repeated data points. In our functional smoothing method, we expand the functional observation with spline basis functions and use the basis coefficients for classification. The same intuition can also be applied to FPC scores. For functional principal component analysis, we assume that the longitudinal observations come from a smooth random function X(t) with mean function μ(t) = E[X(t)] and covariance function G(s, t) = cov(X(s), X(t)). The covariance function can be represented as

$$ G(s, t) = \sum_{k=1}^{\infty} \lambda_k\, \phi_k(s)\, \phi_k(t), $$

where the φ_k's are eigenfunctions and the λ_k's are eigenvalues. The underlying process can then be written as

$$ X(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_k\, \phi_k(t), $$

where the ξ_k's are frequently referred to as FPC scores. These scores can be used later in the classification model. We do not work with an infinite number of scores; instead, the above sum is approximated with a finite K that explains the majority of the variance in the functional observations. In most cases, the first two FPC scores are enough to build a good classification model. In this paper, we work with the basis spline smoothing method due to its ease of implementation in statistical software: in R, the splines package fits cubic basis splines to longitudinal data with equally placed knots. We do not investigate findings using FPC scores instead of basis spline coefficients, as our main focus is on the classification algorithm, and basis spline coefficients work very well for our classification model.
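A minimal sketch of the FPC computation on a dense common grid, using assumed toy signal functions (sin and cos components); sparse or irregular longitudinal designs would require the machinery of Müller (2005) instead:

```python
import numpy as np

rng = np.random.default_rng(2)
# n curves observed on a common dense grid of T points (simplest FPC setting)
n, T = 100, 50
t = np.linspace(0, 1, T)
X = (rng.normal(size=(n, 1)) * np.sin(np.pi * t)          # first signal component
     + 0.5 * rng.normal(size=(n, 1)) * np.cos(np.pi * t)  # second, weaker component
     + 0.05 * rng.normal(size=(n, T)))                    # measurement noise

mu = X.mean(axis=0)                  # estimate of the mean function mu(t)
Xc = X - mu
G = Xc.T @ Xc / (n - 1)              # empirical covariance surface G(s, t)
eigvals, eigvecs = np.linalg.eigh(G)
order = np.argsort(eigvals)[::-1]    # sort eigenpairs by decreasing eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
K = 2                                # first two FPCs, as suggested in the text
scores = Xc @ eigvecs[:, :K]         # FPC scores xi_ik for each curve
frac = eigvals[:K].sum() / eigvals.sum()
print(scores.shape, round(frac, 3))
```

With two dominant signal components, the first two FPC scores explain nearly all of the variance, which is why a small K typically suffices for classification.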

Classification using Pólya-gamma augmentation
In the following, we discuss Polson et al.'s (2013) algorithm; these authors showed how a Gaussian variance mixture with a Pólya-gamma mixing density can represent the logit likelihood. We start by defining the Pólya-gamma density. A random variable X ∼ PG(b, c) follows a Pólya-gamma distribution with parameters b > 0 and c ∈ ℝ if

$$ X \overset{d}{=} \frac{1}{2\pi^2} \sum_{k=1}^{\infty} \frac{g_k}{(k - 1/2)^2 + c^2/(4\pi^2)}, $$

where the g_k ∼ Gamma(b, 1) are independent gamma random variables and =ᵈ denotes equality in distribution. Polson et al.'s (2013) main result parametrizes the log-odds of the logistic likelihood as a mixture of Gaussians with respect to the Pólya-gamma distribution. The fundamental integral identity, which integrates easily into the Gaussian prior hierarchy, is that, for b > 0,

$$ \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b} e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^2 / 2}\, p(\omega)\, d\omega, \quad (5) $$

where κ = a − b/2 and ω ∼ PG(b, 0). The introduction of latent variables (ω_1, …, ω_n) later helps us derive conjugate posterior distributions. The R package BayesLogit provides an efficient algorithm, proposed by Windle et al. (2014), to sample from the Pólya-gamma distribution.
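The defining infinite sum suggests a naive truncated sampler, sketched below for intuition only; the exact BayesLogit sampler of Windle et al. (2014) should be used in practice, and the truncation level K = 200 is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(3)

def rpolyagamma(b, c, size=1, K=200):
    """Approximate PG(b, c) draws by truncating the infinite sum
    X = (1 / (2 pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2 / (4 pi^2)),
    with g_k ~ Gamma(b, 1). This is a sketch, not the exact sampler."""
    k = np.arange(1, K + 1)
    denom = (k - 0.5) ** 2 + c ** 2 / (4 * np.pi ** 2)
    g = rng.gamma(b, 1.0, size=(size, K))
    return (g / denom).sum(axis=1) / (2 * np.pi ** 2)

omega = rpolyagamma(1, 0, size=20000)
print(round(omega.mean(), 3))  # close to E[PG(1, 0)] = 1/4
```

The sample mean near 1/4 matches the known expectation of PG(1, 0), a quick sanity check on the construction.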

Selection using Bayesian group lasso
As discussed in Section 2.2, Meier et al. (2008) developed the group lasso for logistic regression in a frequentist setup. In our model, we have p functional predictors (x_i1(t), …, x_ip(t)) with binary response y_i ∈ {0, 1}, and each group has q levels. We can write our model as

$$ \log \frac{P(y_i = 1)}{1 - P(y_i = 1)} = \sum_{j=1}^{p} \mathbf{c}_{ij}' \boldsymbol{\beta}_j. \quad (4) $$

Following Meier et al.'s (2008) method, the logistic group lasso estimator with basis spline coefficients would be

$$ \hat{\boldsymbol{\beta}} = \arg\min_{\boldsymbol{\beta}} \left\{ -l(\boldsymbol{\beta}) + \lambda \sum_{j=1}^{p} \sqrt{q}\, \|\boldsymbol{\beta}_j\|_2 \right\}, $$

where l(β) is the log-likelihood. Before moving on to our proposed Bayesian method, we mention a similar model presented by Zhu et al. (2010): they used latent variables for Bayesian logistic regression, with FPC scores representing the functional predictors. They proposed a normal prior for the concatenated coefficients, which correspond to our coefficients β_j. Now, motivated by Polson et al.'s (2013) integral result, we construct a Bayesian prior formulation targeted at binary logistic regression. Equation (4) has a Bernoulli likelihood function with logit link. We propose a spike-and-slab prior motivated by Xu and Ghosh (2015), with a zero-inflated mixture prior that helps us select the important group coefficients. As previously described, we introduce latent variables (ω_1, …, ω_n) to take advantage of the integral identity in Equation (5). Our prior setup is

$$ \boldsymbol{\beta}_j \mid \tau_j^2, \pi_0 \sim (1 - \pi_0)\, N(\mathbf{0}, \tau_j^2 \mathbf{I}_q) + \pi_0\, \delta_0(\boldsymbol{\beta}_j), \quad j = 1, \ldots, p, $$

$$ \tau_j^2 \sim \text{Gamma}\!\left( \frac{q + 1}{2}, \frac{\lambda^2}{2} \right), \qquad \pi_0 \sim \text{Beta}(a, b), \qquad \omega_i \sim \text{PG}(1, 0). \quad (6) $$
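For concreteness, the frequentist criterion of Meier et al. (2008) can be written down directly. The sketch below evaluates the penalized negative log-likelihood for hypothetical groups, with the √q scaling assumed for the group penalty:

```python
import numpy as np

def group_lasso_objective(beta_groups, C_groups, y, lam):
    """Negative log-likelihood plus group-lasso penalty for logistic regression
    (a sketch of the Meier et al. style criterion). beta_groups and C_groups
    are lists of per-group coefficient vectors and design matrices."""
    eta = sum(C @ b for C, b in zip(C_groups, beta_groups))     # linear predictor
    nll = np.sum(np.log1p(np.exp(eta)) - y * eta)               # -log-likelihood
    penalty = lam * sum(np.sqrt(len(b)) * np.linalg.norm(b) for b in beta_groups)
    return nll + penalty

# hypothetical data: 3 groups of size q = 4, n = 30 observations
rng = np.random.default_rng(7)
n, q = 30, 4
C_groups = [rng.normal(size=(n, q)) for _ in range(3)]
y = rng.integers(0, 2, size=n)
obj0 = group_lasso_objective([np.zeros(q)] * 3, C_groups, y, lam=1.0)
print(round(obj0, 3))  # equals n * log(2) at beta = 0
```

At β = 0 the linear predictor vanishes and the penalty is zero, so the objective reduces to n log 2, a useful check when implementing the criterion.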

Gibbs sampler
The likelihood for the i-th observation is

$$ L_i(\boldsymbol{\beta} \mid \omega_i) \propto \exp\!\left\{ \kappa_i\, \mathbf{c}_i' \boldsymbol{\beta} - \frac{\omega_i \left( \mathbf{c}_i' \boldsymbol{\beta} \right)^2}{2} \right\}, $$

where κ_i = y_i − 0.5 and ω_i ∼ PG(1, 0). Considering all n independent observations, given ω = (ω_1, …, ω_n) we can write the joint likelihood as

$$ L(\boldsymbol{\beta} \mid \boldsymbol{\omega}) \propto \exp\!\left\{ \boldsymbol{\kappa}' \mathbf{C} \boldsymbol{\beta} - \frac{1}{2} \left( \mathbf{C}\boldsymbol{\beta} \right)' \boldsymbol{\Omega}\, \mathbf{C}\boldsymbol{\beta} \right\}, \qquad \boldsymbol{\Omega} = \text{diag}(\omega_1, \ldots, \omega_n). $$

Due to the Pólya-gamma augmentation, we can derive a block Gibbs sampler for the posterior distribution of the β_j's. The same scheme is derived in Xu and Ghosh (2015) for continuous Y in the linear model setup. The block Gibbs sampler was introduced by Hobert and Geyer (1998). To build this sampler, we start with some notation. Let β_(j) denote the β vector without the j-th group, with corresponding design matrix C_(j), so that

$$ \mathbf{C}\boldsymbol{\beta} = \mathbf{C}_j \boldsymbol{\beta}_j + \mathbf{C}_{(j)} \boldsymbol{\beta}_{(j)}, $$

where C_j is the design matrix corresponding to β_j.
When β_j ≠ 0, the conditional posterior of β_j is Gaussian with covariance Σ_j = (C_j′ Ω C_j + τ_j⁻² I_q)⁻¹ and mean μ_j = Σ_j C_j′ (κ − Ω C_(j) β_(j)). Hence the posterior full conditional of β_j is a spike-and-slab distribution,

$$ \boldsymbol{\beta}_j \mid \text{rest} \sim l_j\, \delta_0(\boldsymbol{\beta}_j) + (1 - l_j)\, N(\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j), $$

where l_j = p(β_j = 0 | rest). Computing the probability l_j from the ratio of the spike and slab marginals gives

$$ l_j = \frac{\pi_0}{\pi_0 + (1 - \pi_0)\, (\tau_j^2)^{-q/2}\, |\boldsymbol{\Sigma}_j|^{1/2} \exp\!\left( \tfrac{1}{2} \boldsymbol{\mu}_j' \boldsymbol{\Sigma}_j^{-1} \boldsymbol{\mu}_j \right)}. \quad (8) $$

The posterior full conditional distributions of the other parameters are stated below; the derivations of the posteriors are described in the appendix.
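A sketch of the resulting spike-and-slab draw for a single group, assuming the conditional as reconstructed above; the toy design matrix, τ², π₀, and signal values are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_beta_group(Cj, kappa, omega, eta_rest, tau2, pi0):
    """One spike-and-slab draw for the j-th group coefficient vector, given the
    Polya-gamma draws omega, kappa_i = y_i - 1/2, and the linear predictor
    eta_rest contributed by all other groups (notation follows the text)."""
    q = Cj.shape[1]
    Omega = np.diag(omega)
    Sigma_inv = Cj.T @ Omega @ Cj + np.eye(q) / tau2
    Sigma = np.linalg.inv(Sigma_inv)
    mu = Sigma @ Cj.T @ (kappa - omega * eta_rest)
    # l_j = P(beta_j = 0 | rest): spike weight from the marginal ratio in (8)
    log_ratio = (0.5 * np.linalg.slogdet(Sigma)[1]
                 - 0.5 * q * np.log(tau2)
                 + 0.5 * mu @ Sigma_inv @ mu)
    lj = pi0 / (pi0 + (1 - pi0) * np.exp(log_ratio))
    if rng.uniform() < lj:
        return np.zeros(q)            # spike: group dropped this iteration
    return rng.multivariate_normal(mu, Sigma)   # slab: group retained

# toy check: under a strong signal the slab component is chosen
n, q = 200, 4
Cj = rng.normal(size=(n, q))
omega = np.full(n, 0.25)
beta_true = np.array([2.0, -2.0, 2.0, -2.0])
kappa = omega * (Cj @ beta_true)      # idealized kappa consistent with the signal
draw = sample_beta_group(Cj, kappa, omega, np.zeros(n), tau2=1.0, pi0=0.5)
print(np.linalg.norm(draw) > 0)       # True: the slab is chosen under strong signal
```

With no signal (κ near zero), μ_j ≈ 0, the exponential term collapses, and l_j approaches one, so the same routine would return the exact zero vector most iterations.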
The full conditional for τ_j² is

$$ \frac{1}{\tau_j^2} \,\Big|\, \text{rest} \sim \text{Inverse-Gaussian}\!\left( \sqrt{\frac{\lambda^2}{\|\boldsymbol{\beta}_j\|_2^2}},\; \lambda^2 \right) \text{ when } \boldsymbol{\beta}_j \neq \mathbf{0}, \qquad \tau_j^2 \mid \text{rest} \sim \text{Gamma}\!\left( \frac{q + 1}{2},\, \frac{\lambda^2}{2} \right) \text{ when } \boldsymbol{\beta}_j = \mathbf{0}, $$

for all j = 1, …, p. Let G_j = I(β_j ≠ 0) indicate whether a given group is selected. Then

$$ \pi_0 \mid \text{rest} \sim \text{Beta}\!\left( a + \sum_{j=1}^{p} (1 - G_j),\; b + \sum_{j=1}^{p} G_j \right). \quad (10) $$

We sample the augmented variables ω = (ω_1, …, ω_n) using the posterior samples of β:

$$ \omega_i \mid \boldsymbol{\beta} \sim \text{PG}(1,\, \mathbf{c}_i' \boldsymbol{\beta}). $$

Finally, we are left with the choice of λ, the most crucial parameter of our model, which should be treated carefully. A large λ shrinks most of the group coefficients towards zero and produces biased estimates. In our real data analysis, we control λ by assigning it a range of values. Xu and Ghosh (2015) proposed a Monte Carlo EM algorithm for estimating λ.
The k-th EM update for λ from their paper is

$$ \lambda^{(k)} = \sqrt{ \frac{p(q + 1)}{\sum_{j=1}^{p} E_{\lambda^{(k-1)}} \left[ \tau_j^2 \mid \mathbf{y} \right]} }. $$

The expected value of τ_j² | y is intractable for binary response y; instead, it can be approximated by the mean of the posterior samples of τ_j².
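The Monte Carlo EM update can be coded in a few lines, substituting the posterior sample mean for the intractable expectation; the τ² draws below are placeholders for actual Gibbs output:

```python
import numpy as np

def em_update_lambda(tau2_samples, q):
    """Monte Carlo EM update for lambda (a sketch following Xu and Ghosh's
    approach): replace the intractable E[tau_j^2 | y] with the mean of the
    posterior tau_j^2 samples. tau2_samples is an (n_mcmc, p) array."""
    p = tau2_samples.shape[1]
    expected_tau2 = tau2_samples.mean(axis=0)   # approximates E[tau_j^2 | y]
    return np.sqrt(p * (q + 1) / expected_tau2.sum())

# placeholder posterior draws: 1000 iterations, p = 10 groups, all tau_j^2 = 2
tau2_samples = np.full((1000, 10), 2.0)
print(em_update_lambda(tau2_samples, q=8))  # sqrt(10 * 9 / 20) = sqrt(4.5)
```

Each EM iteration thus only requires the τ² draws already produced by the Gibbs sampler at the current λ.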

Marginal prior for β_j
We first study the marginal priors of the β_j's to examine the theoretical properties of the Bayesian group lasso estimators. We aim to establish the connection between the β_j group priors and existing group lasso penalization methods. We integrate out τ_j² from the β_j priors. The marginal priors for the β_j's are calculated following Xu and Ghosh (2015)'s work, extended to binary rather than continuous response. For β_j ≠ 0,

$$ \pi(\boldsymbol{\beta}_j) \propto \lambda^{q} \exp\!\left( -\lambda \|\boldsymbol{\beta}_j\|_2 \right), $$

a multivariate Laplace density. The marginal prior for β_j is therefore also spike-and-slab, with point mass at 0 and a slab consisting of a Multi-Laplace distribution:

$$ \boldsymbol{\beta}_j \sim \pi_0\, \delta_0(\boldsymbol{\beta}_j) + (1 - \pi_0)\, \text{M-Laplace}\!\left( \mathbf{0}, \lambda^{-1} \right), \quad (12) $$

which is the same as the prior considered in the Bayesian group lasso (Casella et al., 2010) and matches the penalization in the Bayesian adaptive lasso (Leng et al., 2014). Combining the spike and slab components facilitates variable selection at the group level while shrinking the coefficients of the selected groups.

Median thresholding as posterior estimates
We previously discussed obtaining the selected group coefficient estimates through median thresholding of the MCMC samples. Xu and Ghosh (2015) generalized the median thresholding proposed by Johnstone and Silverman (2004) to the multivariate spike-and-slab prior. Johnstone and Silverman (2004) showed that median thresholding, under a spike-and-slab prior for normal means, has some desirable properties. In this section, we generalize this idea to a binary classification problem and show that the posterior median estimator performs group variable selection by yielding zero coefficients for the redundant groups. We further demonstrate that the posterior median is a soft thresholding estimator that is consistent in model selection and has an optimal asymptotic estimation rate.
Focusing on only one group, Xu and Ghosh (2015) considered the setup

$$ \mathbf{z} \mid \boldsymbol{\mu} \sim f(\mathbf{z} - \boldsymbol{\mu}), \qquad \boldsymbol{\mu} \sim \pi_0\, \delta_0(\boldsymbol{\mu}) + (1 - \pi_0)\, \gamma(\boldsymbol{\mu}), $$

where z is an m-dimensional random variable, and γ(·) and f(·) are both density functions for m-dimensional random vectors, with f(t) maximized at t = 0. Let Med(μ_i | z) denote the marginal posterior median of μ_i given the data. Xu and Ghosh (2015) proposed the following theorem on median thresholding.

Theorem 5.1: Suppose π_0 > c/(1 + c). Then there exists a threshold t(π_0) > 0 such that, when ‖z‖² < t, Med(μ_i | z) = 0 for all 1 ≤ i ≤ m.

Next, we focus on our problem setup. Assume that β_j follows a Gaussian prior, β_j ∼ N(0, B_j), and that the design matrix satisfies C_j′ Ω C_(j) = 0. Then the posterior estimate of β_j | rest is

$$ \hat{\boldsymbol{\beta}}_j = \left( \mathbf{C}_j' \boldsymbol{\Omega} \mathbf{C}_j + \mathbf{B}_j^{-1} \right)^{-1} \mathbf{C}_j' \boldsymbol{\kappa}. $$

According to Theorem 5.1, assuming π_0 > c/(1 + c), there exists t(π_0) > 0 such that the marginal posterior median of β_jk under prior (6) satisfies Med(β_jk | β̂_j) = 0 for any 1 ≤ k ≤ q when ‖β̂_j‖² < t. We can interpret this result in the context of the same explanation provided by Xu and Ghosh (2015): the median estimator of the j-th group of regression coefficients is zero when the norm of the posterior estimate under any other prior distribution is less than a certain threshold.

Posterior median as soft thresholding
We assume that C_j′ Ω C_j = n I_q and that the C matrix is group-wise orthogonal, with C_j′ Ω C_(j) = 0. We consider the model defined in (6) with τ_j,n² fixed and depending on n. In this setup, the posterior distribution of β_j is similar to the one derived in the previous section, with shrinkage factor

$$ D_{j,n} = \frac{1}{1 + n \tau_{j,n}^2} $$

and spike probability l_j = p(β_j = 0 | rest). Then, the marginal posterior distribution of β_jk (1 ≤ k ≤ q) conditional on the observed data is a spike-and-slab distribution, where C_jk is the k-th column vector of the group design matrix C_j. The corresponding soft thresholding estimator takes the form sgn(·)(|·| − threshold)₊, where (z)₊ denotes the positive part of z and the threshold involves Q_j = Φ⁻¹(1 / (2(1 − min(1/2, l_j)))). Our results follow Xu and Ghosh (2015)'s work in establishing the soft thresholding. Note in particular that the term D_{j,n} depends on τ_{j,n}², which controls the shrinkage factor.

Oracle property: Let β⁰, β_j⁰, β_jk⁰ be the true values of β, β_j, β_jk, respectively. The index vector of the true model is A = (I(‖β_j⁰‖₂ ≠ 0), j = 1, …, p), and the index vector of the model selected by a given thresholding estimator β̂_j is A_n = (I(‖β̂_j‖₂ ≠ 0), j = 1, …, p). Model selection consistency is attained if and only if lim_{n→∞} P(A_n = A) = 1.
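The generic soft-thresholding operator underlying this estimator is simple to state in code (a sketch; the group-specific threshold involving D_{j,n} and Q_j is not reproduced here):

```python
import numpy as np

def soft_threshold(z, thresh):
    """Soft-thresholding operator sgn(z) * (|z| - thresh)_+ : shrink towards
    zero and set small values exactly to zero, as the posterior median does."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

z = np.array([-3.0, -0.5, 0.2, 1.5])
print(soft_threshold(z, 1.0))  # values: -2, 0, 0, 0.5
```

Entries with magnitude below the threshold are mapped to exact zeros, which is precisely the mechanism by which the posterior median estimator performs group selection.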

Posterior consistency
In this section, we conduct a theoretical investigation of the convergence of the group lasso estimator model to the true model. To show model consistency, we refer to the results and theorems in the paper 'On the consistency of Bayesian variable selection for high dimensional binary regression and classification' by Jiang (2006). In that paper, the author set up Bayesian variable selection, similar to Smith and Kohn (1996), by introducing a selection indicator vector γ = (γ_1, …, γ_p) with γ_i ∈ {0, 1}: coefficients with γ_i = 1 receive a continuous prior, while those with γ_i = 0 are set to zero. We can establish a direct connection between our model and this penalized regression. We reparametrize the group coefficient vectors as β_j = γ_j b_j, where γ_j, j = 1, …, p, are 0/1-valued selection indicators. As shown in Section 5.1, the marginal prior of β_j follows a Multi-Laplace distribution. We can place a Bernoulli prior on γ_j, and the marginal prior distribution of β_j is then the same as that in Equation (12). Next, we study the asymptotic results as n → ∞. Let y be the binary response and c the corresponding basis coefficients for any given subject. Let the true model be of the form

$$ \mu_o(\mathbf{c}) = \frac{\exp\left( \sum_{j=1}^{p_n} \mathbf{c}_j' \boldsymbol{\beta}_j \right)}{1 + \exp\left( \sum_{j=1}^{p_n} \mathbf{c}_j' \boldsymbol{\beta}_j \right)} = \psi\!\left( \sum_{j=1}^{p_n} \mathbf{c}_j' \boldsymbol{\beta}_j \right), $$

where β_j is a q × 1 vector and p_n (increasing with n) is the number of group vectors present in the model. As described by Jiang (2006), we assume that the data dimension satisfies 1 ≺ p_n and log(p_n) ≺ n, where a_n ≺ b_n means a_n = o(b_n), i.e., lim_{n→∞} a_n/b_n = 0. We assume sparsity of the regression coefficients at the group level, i.e., lim_{n→∞} Σ_{j=1}^{p_n} ‖β_j‖₂ < ∞, which implies that only a limited number of group coefficients are non-zero. We further assume ‖c_j‖₂ ≤ 1, j = 1, …, p_n, for simplicity.
Next, we define consistency using Jiang's (2006) description of density functions, and measure the distance between two density functions with the Hellinger distance

$$ d_H(f, g) = \left\{ \int \left( \sqrt{f} - \sqrt{g} \right)^2 d\nu \right\}^{1/2}. $$

The definitions below are quoted from Jiang's (2006) article.
Definition 5.1: Suppose D_n is an i.i.d. sample based on density f_0. The posterior π_n(· | D_n) is asymptotically consistent for f_0 over Hellinger neighbourhoods if, for any ε > 0,

$$ \pi_n\left\{ f : d_H(f, f_0) \leq \varepsilon \mid D_n \right\} \to 1 \text{ in probability as } n \to \infty $$

(Density Consistency).
Next, we define consistency in classification following Jiang (2006). Combining Propositions 1 and 3 from Jiang (2006), under conditions I, S and L, density consistency directly implies classification consistency. The proof follows by verifying conditions I, S and L from Jiang's paper (Jiang, 2006), since our prior satisfies his prior setup. To obtain density consistency and classification consistency for the posterior estimates, we need to check that our prior setup satisfies Jiang's conditions. The motivation for the proof and the technique of checking the conditions to establish the theorem were discussed in the theses of Majumder (2017) and Shi (2017).

Condition I: (On inverse link function ψ)
Denote by w(u) the log-odds function w(u) = log[ψ(u)/(1 − ψ(u))]. The derivative w′(u) is continuous and satisfies the following boundedness condition as the size of the domain increases: sup_{|u| ≤ C} |w′(u)| ≤ C^q for some q ≥ 0, for all large enough C.
We verified these conditions; the corresponding proofs are given in the appendix.

Simulation results
We assess the performance of our proposed simultaneous classification and selection methodology on simulated data sets; we apply our method to both simulated and real data. We compare the results from our Bayesian method with those from a frequentist group lasso selection method for binary responses. To the best of our knowledge, no other Bayesian method reported in the literature is as convenient and efficient as the presently proposed one. The following sections test the method on three examples with varying numbers of predictors. We generate a binary response with simulated functional predictors, a significant number of which are inessential.

Example 1
We first generate functional predictors x_ij(t) using a 10-dimensional Fourier basis, φ_0(t) = 1 and φ_k(t) = √2 cos(kπt), k = 1, …, 9, adding an error term. We work with a simulation setup similar to that of Fan et al. (2015), whose model is also based on functional predictors. We generate our predictors as

$$ x_{ij}(t_k) = \boldsymbol{\theta}_{ij}' \boldsymbol{\phi}(t_k) + \sigma\, \epsilon_{ijk}, $$

where φ(t_k) = (φ_0(t_k), φ_1(t_k), …, φ_9(t_k))′. We take σ = 0.5 and generate 200 i.i.d. observations using 20 functional predictors. Each predictor is observed at 50 time points, equally spaced between 0 and 1. θ_ij and ε_ijk are independently sampled. Notationally, 'i' varies from 1 to 200, 'j' from 1 to 20, and 'k' from 1 to 50. We construct a cubic basis spline on (0 = t_1, …, t_50 = 1) with four internal knots equally spaced at the 20%, 40%, 60% and 80% quantiles. We use the R package 'splines' and the 'bs' function to construct the basis matrix Φ. With 4 internal knots, plus intercept and degree = 3, we end up with eight columns in the basis matrix for each predictor, i.e., q = 8. To validate classification and selection performance, we use 75% of the observations as training data and the remaining 25% for testing. We repeat this process 100 times to limit sampling bias and aggregate the results over the 100 repetitions.
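A sketch of this generating scheme in Python, assuming standard normal θ_ij and ε_ijk (distributions the paper does not fully specify here):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, T = 200, 20, 50                 # observations, predictors, time points
t = np.linspace(0, 1, T)

# 10-dimensional Fourier basis: phi_0 = 1, phi_k = sqrt(2) cos(k pi t), k = 1..9
Phi = np.column_stack([np.ones(T)] + [np.sqrt(2) * np.cos(k * np.pi * t)
                                      for k in range(1, 10)])
sigma = 0.5
theta = rng.normal(size=(n, p, 10))   # assumed standard normal basis coefficients
eps = sigma * rng.normal(size=(n, p, T))
X = theta @ Phi.T + eps               # x_ij(t_k) observations, shape (n, p, T)
print(X.shape)  # (200, 20, 50)
```

Each of the 200 subjects then contributes 20 noisy functional trajectories, which are subsequently smoothed with the cubic B-spline basis before classification.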

Example 2 and 3
In Example 2, we increase the number of predictors from 20 to 50 while maintaining 200 observations with 50 time points each. The functional predictor generation in Example 3 follows the same method as in Example 1, but with 500 observations, 100 functional predictors, and 20 time points per observation; we use three internal knots to smooth these predictors.
In both cases, we choose the second and final predictors, i.e., β_2(t) and β_p(t), as non-zero, and the rest of the coefficients are zero. We generate the binary response y ∈ {0, 1} from a Bernoulli distribution using the set of pre-assigned β. In all of the examples, 75% of the data is used for training, and 100 repetitions are used to limit sampling bias. We obtain 20,000 Gibbs samples, and the first one-third of these are discarded as burn-in. All parameter estimates are obtained using the remaining samples. Since Xu and Ghosh (2015) showed that median thresholding gives exact 0 estimates for the redundant group coefficients, we apply the posterior median to the posterior samples to obtain the β estimates. We choose a = 1, b = 1 as the parameter values for the prior distribution of π_0, and β = 0 is used as the initial value for the first iteration. Although we have p functional predictors, the number of coefficients we need to estimate is p × q. In Example 2, we have p = 50, and with four internal knots for each function we obtain q = 8. Hence, we need to estimate 400 coefficients using 200 observations. From this perspective, our algorithm is applicable under 'large p, small n' conditions. The simulation results are presented below.
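The median-thresholding step can be illustrated with hypothetical MCMC draws: when the spike is visited in more than half of the posterior samples, the componentwise posterior median is exactly zero.

```python
import numpy as np

rng = np.random.default_rng(6)
n_mcmc, q = 2000, 8

def mixed_draws(p_spike):
    """Hypothetical posterior draws for one group of size q: exact zero vectors
    (spike) with probability p_spike, otherwise draws near 1 (slab)."""
    spike = rng.uniform(size=(n_mcmc, 1)) < p_spike
    return np.where(spike, 0.0, rng.normal(1.0, 0.1, size=(n_mcmc, q)))

beta1 = mixed_draws(0.7)              # spike dominates -> median is exactly 0
beta2 = mixed_draws(0.2)              # slab dominates -> median stays non-zero
est1 = np.median(beta1, axis=0)
est2 = np.median(beta2, axis=0)
print(np.all(est1 == 0.0), np.all(est2 != 0.0))  # True True
```

This is why median thresholding acts as an automatic group selector: no additional cutoff needs to be tuned on the posterior samples.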

Example 1 results
We obtain a 100% true positive rate and a 0% false positive rate in terms of selection; i.e., the two nonzero coefficients are captured in all 100 iterations, and none of the predictors with truly zero coefficients are selected. In terms of classification, our method shows 97% sensitivity, 93% specificity, 95% accuracy, and AUC = 0.99. Below are the rejection probability plots for β_2(t) and β_1(t), of which the first is nonzero and the second is zero in the true model. In addition, we plot the posterior median estimates of the coefficient functions against their true values. The ROC curve establishes the discriminating power of our method.

Examples 2 and 3 results
In Example 2, we obtain a 100% true positive rate and a 0.73% false positive rate across 100 repetitions, with 97% sensitivity and 95% specificity. In Example 3, we achieve a 100% true positive rate and a 0% false positive rate, with 98% sensitivity and 97% specificity. We compare our simulation results with those of frequentist group lasso for logistic regression in all the setups above. Our methodology yields the best results in classifying subjects into the correct class, far exceeding frequentist group lasso. Although the frequentist group lasso approach successfully identifies the truly significant predictors, it also selects many redundant functional predictors that have zero effect in the true model; its rate of false selection is very high compared with that of our algorithm. The table below summarizes the numerical results of all three examples, with comparisons to frequentist group lasso for logistic regression (Figure 1).

Application on ADNI MRI data
This section reports the results of applying our proposed method to ADNI data. The MRI data used in all analyses were downloaded from the ADNI database (http://www.adni-info.org/). The fundamental goal of ADNI is to develop a large, standardized neuroimaging database with strong statistical power for research on potential biomarkers of AD incidence, diagnosis, and disease progression. ADNI data available at this time include three projects: ADNI-1, ADNI-GO, and ADNI-2. Starting in 2004, ADNI-1 collected prospective data on cognitive performance, brain structure, and biochemical changes every 6 months. Participants in ADNI-1 included 200 CN, 400 MCI, and 200 AD patients. Then, starting in 2009, ADNI-GO continued the longitudinal study of the existing ADNI-1 patients and established a new cohort of early MCI patients, enrolled to identify biomarkers manifesting at earlier stages of the disease. ADNI-GO and ADNI-2 together contain additional MRI sequences plus perfusion and diffusion tensor imaging. The volumetric estimation for our data set was performed using FreeSurfer by the UCSF/SF VA Medical Center.
Considerable research has been conducted to develop automatic approaches for classifying patients into clinical groups, with many ADNI studies identifying ROIs associated with different disease stages. A support vector machine (SVM) is a primary tool in many studies for evaluating patterns in training data and creating classifiers to identify new patients. Fan et al. (2008) used neuroimaging data to create a structural phenotypic score reflecting brain abnormalities associated with AD; in classifying AD vs. CN, a positive score in their framework identified AD-like structural brain patterns. Their classifier obtained 94.3% accuracy for AD vs. CN, although their approach used only left and right whole-brain volumes as potential predictors. Some researchers have used Bayesian statistical methods in studying Alzheimer's data, including a method named automatic relevance determination (ARD), along with predictive ARD, for classifying AD patients; this method outperformed an SVM classifier. Yang et al. (2010) proposed a data-driven approach to the automatic classification of MRI scans based on disease stages. Their methodology was broadly divided into two parts: first, they extracted potentially classifying features from normalized MRI scans using independent component analysis; next, the separated independent coefficients were used for SVM classification of patients. In contrast, our proposed method selects important components and classifies patients simultaneously. Moreover, we consider multiple brain sub-regions to identify those regions whose longitudinal trajectories are specifically related to AD. Another seminal paper by Jack et al. (1999) used MRI-based measurements of hippocampal volume to assess the future risk of conversion from MCI to AD; a bivariate model included hippocampal volume and other factors such as age and APOE genotype, but only hippocampal volume was identified as significant. Wang et al. (2014) employed a functional modelling approach using Haar wavelets and lasso regularization to find ROIs in voxel-level data; large Haar wavelet coefficients corresponded to the most important features, with a sparse structure among redundant features. The majority of these methods are based on SVM classification, which often uses kernel-based methods for functional smoothing. Casanova et al. (2011) utilized a penalized logistic regression approach, calculating estimates with coordinate-wise descent optimization from the GLMNET library. Similarly, our method employs penalized logistic regression with a group lasso penalty; however, it differs in its use of functional predictors and a custom algorithm developed in-house.
We consider the longitudinal volumes of various brain regions, such as the parahippocampal gyrus, cerebellar cortices, entorhinal cortex, fusiform gyrus, and precuneus, among many others. Although the accessed ADNI data set includes corresponding volume, surface area, and cortical thickness information, we work only with the volume information to maintain uniformity across longitudinal predictors. Because the brain is divided into right and left hemispheres, the data include sub-regional brain volumes for both hemispheres. Our main objective is to identify the brain sub-regions whose volumetric trajectories can differentiate AD patients from the normal aging control group. As mentioned in the introduction, dementia is associated with widespread brain atrophy, although the time course and magnitude of shrinkage vary across regions.
The initial sample includes data from 761 ADNI patients, each classified as AD, MCI, or CN throughout their study visits. We exclude all patients classified as MCI, along with any AD or CN patients whose diagnostic status changed over time, because our model assumes that the response does not depend on time. Of the remaining patients, we include those with data from at least four longitudinal measurement occasions. This yields 296 patients with at least four data points and unchanging diagnoses of either AD or CN; the final sample comprises 174 AD patients and 122 normally aging controls. All patients underwent a thorough initial clinical evaluation to measure baseline cognitive and medical scores, including the MMSE, the 11-item Alzheimer Cognitive Subscale (ADAS11), and other standardized neuropsychological tests. In addition, APOE genotyping information was obtained at baseline. Longitudinal structural MRI scans were parcellated into sub-regional brain volumetric measurements. Our initial model includes 49 sub-regional brain volumes chosen by Dr. Andrew Bender based on the extant literature on atrophy patterns in AD. Although these 49 sub-regions are not assumed to change with uniform magnitude, the direction of change over time is hypothesized to be consistent (i.e., shrinking). Thus, the model includes 49 longitudinal predictors that we treat as functional predictors. We assume that not all predictors are potential candidates for classifying patients, so the sparsity assumption is valid. However, because some patients' visits were irregular, we do not have an equal number of time points across patients. We start by comparing the baseline measurements between the AD and CN groups, as shown in Tables 1 and 2. In the next stage, we smooth the longitudinal trajectories of the observed volumes of all brain sub-regions.
A simple least squares approximation is sufficient, as we assume that the residuals of the true curve are independently and identically distributed with mean 0 and constant variance. We use cubic B-spline basis functions for spline smoothing of the observed volumes. Three internal knots are used for spline smoothing with an intercept, which gives us seven basis functions. We seek to ensure that the smoothed estimated curve is a good fit for each patient's observed curve. As we do not have a large number of data points for each patient, we do not consider controlling for potential overfitting of the estimated curve. Besides least squares smoothing, functional principal component scores could also be used for this analysis.
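The smoothing step can be sketched in Python with SciPy's B-spline utilities. The toy visit times and volume trajectory below are hypothetical; with a cubic basis, three equally spaced internal knots, and an intercept, the basis has 7 columns, matching the seven basis functions above:

```python
import numpy as np
from scipy.interpolate import BSpline

rng = np.random.default_rng(2)
t_obs = np.linspace(0.0, 1.0, 8)                            # one patient's visit times (toy)
vol = 1.0 - 0.3 * t_obs + 0.05 * rng.standard_normal(8)     # toy scaled volume trajectory

# Cubic B-spline basis: boundary knots repeated (degree + 1) times plus three
# internal knots gives len(knots) - k - 1 = 11 - 3 - 1 = 7 basis functions
k = 3
knots = np.concatenate([[0.0] * (k + 1), [0.25, 0.5, 0.75], [1.0] * (k + 1)])
B = BSpline.design_matrix(t_obs, knots, k).toarray()        # 8 x 7 basis matrix

# Ordinary least squares yields the basis coefficients used as classification inputs
coef, *_ = np.linalg.lstsq(B, vol, rcond=None)
smooth = B @ coef                                           # fitted (smoothed) curve
```

The per-patient coefficient vectors `coef` play the role of the q-dimensional group of predictors for each brain sub-region.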
Prior to analysis, we scale the brain volumes to the corresponding patient's intracranial volume (ICV) by fitting a simple regression to adjust volume measurements for individual differences in brain size; the aim is to remove systematic variation in brain volumes due to differences in physical size. The formula we use is ROI_adj = ROI_vol − β_0 (ICV − ICV_mean), where β_0 is the coefficient obtained by regressing ROI_vol on ICV (Jack et al., 1998; Raz et al., 2005). We apply this correction separately for each gender group, male and female. Next, we scale the corrected volumes between 0 and 1 to bring all brain regions onto the same scale. We then divide the data set into two parts: two-thirds of the patients are reserved for the training data set (n = 198), and the rest are kept for testing (n = 98). We gather the basis coefficients for each patient in the training data set and use them as predictors for classification. We initialize β at zero to start the iterations. The probability π_0 has a Beta(a, b) distribution with a = b = 1. As a first step, we estimate λ using the Pólya-Gamma transformation of our sample with a spike-and-slab penalty on the training data. After estimating λ, we run the remainder of the algorithm on the training data with 30,000 MCMC samples, discarding the first one-third as a burn-in period. We place a spike-and-slab prior on the β coefficients, which yields posterior estimates of zero for most of the functional predictors. We run our model 100 times with different training samples to nullify sampling bias in the training and test data. Across the 100 iterations, the model does not consistently select many of the brain sub-regions; therefore, we choose the brain regions that frequently appear as significant across iterations.
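The ICV adjustment and subsequent rescaling can be sketched as follows; the ICV and ROI values are simulated toys, and in the actual analysis the regression is fit separately within each gender group:

```python
import numpy as np

def adjust_icv(roi_vol, icv):
    """ROI_adj = ROI_vol - b0 * (ICV - mean(ICV)), where b0 is the slope
    from regressing ROI_vol on ICV."""
    icv_c = icv - icv.mean()
    b0 = np.sum(icv_c * (roi_vol - roi_vol.mean())) / np.sum(icv_c ** 2)
    return roi_vol - b0 * icv_c

rng = np.random.default_rng(3)
icv = rng.normal(1500.0, 150.0, 100)                         # toy intracranial volumes
roi = 3.5 + 0.002 * icv + 0.1 * rng.standard_normal(100)     # toy ROI volumes
roi_adj = adjust_icv(roi, icv)

# After adjustment, min-max scale each region to [0, 1], as in the analysis
roi_scaled = (roi_adj - roi_adj.min()) / (roi_adj.max() - roi_adj.min())
```

By construction, the adjusted volumes are exactly uncorrelated (in sample) with ICV, removing the head-size effect before classification.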
The median thresholding selects the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate gyrus with 100% probability. Other brain regions selected as important are the right parahippocampal gyrus, left caudate nucleus, left medial orbitofrontal cortex, left putamen, left superior temporal gyrus, left thalamus, right hippocampus, and right middle temporal gyrus. In Figure 2, we plot the brain volume changes of the left hippocampus, left lateral orbitofrontal cortex, and left posterior cingulate gyrus over time; orange and green signify the normal aging and dementia groups, respectively, and the bold line represents the mean curve for the corresponding group. The plot shows significant differences in volume between the groups, and our model identifies these regions as significant. In Figure 3, we plot the acceptance probability of the MCMC samples for the left hippocampus and left lateral orbitofrontal regions. The method classifies patients into the correct group with 77% accuracy; we achieve 72% sensitivity, 85% specificity, and a corresponding AUC of 0.87. We use the median predicted probability from the training sample as the threshold for classification validation. We also test the classification after adding clinical measurements such as the ADAS11 (11-item Alzheimer Cognitive Subscale), MMSE score, 'CDRSB,' 'RAVLT immediate,' and 'RAVLT forgetting,' measured over time. In this classification, we first select the significant longitudinal brain volumes and then add the clinical variables. We achieve very high classification measures of 97% accuracy, 97% sensitivity, and 98% specificity. If we drop the MMSE score and run the model with the remaining functional predictors, we observe similar classification accuracy. In all scenarios, we code diseased patients as 1 and CN as 0 for the interpretation of classification sensitivity/specificity (Figure 4).
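The classification rule based on the training-sample median threshold can be sketched as a few lines; the probability values shown are hypothetical:

```python
import numpy as np

def classify(p_train, p_test):
    """Label a test subject AD (1) when its posterior predicted probability
    exceeds the median predicted probability of the training sample."""
    thresh = np.median(p_train)
    return (p_test > thresh).astype(int)

# Toy predicted probabilities; median of the training set is 0.6
p_train = np.array([0.1, 0.2, 0.6, 0.8, 0.9])
p_test = np.array([0.05, 0.7, 0.95])
labels = classify(p_train, p_test)   # -> [0, 1, 1]
```

Using the training median rather than a fixed 0.5 cutoff adapts the decision boundary to the class balance of the training sample.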
In addition to finding functional models of longitudinal trajectories in sub-regional brain volumes to differentiate between the AD and normal groups, we also apply our method to MCI converters vs. MCI nonconverters. We select patients who entered the study as MCI, labelling those who did not transition to AD across all measurement occasions as MCI nonconverters (MCI-nc) and those who did as MCI converters (MCI-c). The total subsample includes 163 patients who were either MCI-c or MCI-nc. We use three-quarters of the patients to train our model and note the significant brain ROIs selected after 100 iterations. Among the selected ROIs contributing to classification are the right posterior cingulate gyrus, right superior parietal cortex, right thalamus, right isthmus cingulate gyrus, right fusiform gyrus, left thalamus, and left precuneus. However, the classification performance is not as good as in the previous model: 62% accuracy and 0.66 AUC. A biological explanation for this result is important to acknowledge: the mean differences in the functional predictors between the MCI-c and MCI-nc groups are not large enough to segment the patients. Moreover, we also had to discard data from some time points for this set of patients.

Discussion
This paper discusses the use of Bayesian group lasso penalization combined with Pólya-Gamma augmentation to build a simultaneous classification and selection method. The Bayesian spike-and-slab prior helps identify functional parameters generated from the longitudinal trajectories of multiple brain ROIs and discriminates the patient group from normal controls. Pólya-Gamma augmentation avoids the Metropolis-Hastings algorithm and other expensive sampling schemes for latent variables. We treat the longitudinal brain ROI volume measurements as functional predictors, smooth the curves over time with cubic basis splines, and then use the smoothed functional predictors as discriminating inputs under a sparsity assumption among them.
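The core Pólya-Gamma Gibbs step can be sketched as follows. This is a minimal illustration, not the paper's full algorithm: the spike-and-slab group prior is replaced by a simple normal (ridge-like) prior, and PG draws use the truncated infinite-sum representation of Polson, Scott and Windle (2013) rather than an exact sampler:

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_pg(b, c, n_terms=200):
    """Approximate draw from PG(b, c) via its truncated sum-of-gammas
    representation: PG(b,c) = (1/2pi^2) * sum_k g_k / ((k-1/2)^2 + c^2/(4pi^2))."""
    k = np.arange(1, n_terms + 1)
    g = rng.gamma(b, 1.0, n_terms)
    return np.sum(g / ((k - 0.5) ** 2 + (c / (2 * np.pi)) ** 2)) / (2 * np.pi ** 2)

def gibbs_step(X, y, beta, prior_prec):
    """One Gibbs sweep for Bayesian logistic regression: draw
    omega_i | beta ~ PG(1, x_i' beta), then beta | omega ~ N(m, V) with
    V = (X' Omega X + B^-1)^-1 and m = V X'(y - 1/2)."""
    omega = np.array([sample_pg(1.0, xi @ beta) for xi in X])
    V = np.linalg.inv(X.T @ (omega[:, None] * X) + prior_prec)
    m = V @ (X.T @ (y - 0.5))
    return rng.multivariate_normal(m, V)

# Toy run: 2 coefficients, normal prior standing in for the spike-and-slab
X = rng.standard_normal((100, 2))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ np.array([1.5, 0.0]))))
beta = np.zeros(2)
for _ in range(50):
    beta = gibbs_step(X, y, beta, prior_prec=np.eye(2))
```

Because both conditional distributions are available in closed form, every sweep is a direct draw; no Metropolis-Hastings accept/reject step is needed.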
The consistency property of the posterior distributions provides a theoretical justification for the convergence of the posterior samples drawn in our simulation and real data analyses. The posterior distribution Π(θ | X_1, . . . , X_n) is said to be consistent at θ_0 if it converges to θ_0 under some measure; this ensures that, given enough observations, the posterior concentrates near the true value. Our posterior density consistency property ensures that the derived posterior distribution achieves classification consistency for the classification problem of interest. In this regard, Doob's theorem in the Bayesian literature guarantees posterior consistency under a properly chosen prior. Some priors are problematic and can raise questions about any Bayesian method; we believe our prior selection is reasonable, so that posterior consistency holds at every point of the parameter space.
In this paper, we assumed the functional predictors are independent. To capture dependence between functional predictors, one could introduce a prior with a correlation structure between group coefficients; however, this would increase model complexity, and such dependency structures would be hard to validate in real data analyses. Further research on this topic would be a worthwhile improvement upon the current modelling proposal. Our proposed method performs well on simulated data sets, outperforming available frequentist methods, and it has been applied to a data set with a large number of predictors.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by Directorate for Mathematical and Physical Sciences [1924724].