Literature DB >> 29664902

A general prediction model for the detection of ADHD and Autism using structural and functional MRI.

Bhaskar Sen¹, Neil C Borle¹, Russell Greiner^1,2, Matthew R G Brown^1,3.

Abstract

This work presents a novel method for learning a model that can diagnose Attention Deficit Hyperactivity Disorder (ADHD), as well as Autism, using structural texture and functional connectivity features obtained from 3-dimensional structural magnetic resonance imaging (MRI) and 4-dimensional resting-state functional magnetic resonance imaging (fMRI) scans of subjects. We explore a series of three learners: (1) The LeFMS learner first extracts features from the structural MRI images using the texture-based filters produced by a sparse autoencoder. These filters are then convolved with the original MRI image using an unsupervised convolutional network. The resulting features are used as input to a linear support vector machine (SVM) classifier. (2) The LeFMF learner produces a diagnostic model by first computing spatial non-stationary independent components of the fMRI scans, which it uses to decompose each subject's fMRI scan into the time courses of these common spatial components. These features can then be used with a learner by themselves or in combination with other features to produce the model. Regardless of which approach is used, the final set of features are input to a linear support vector machine (SVM) classifier. (3) Finally, the overall LeFMSF learner uses the combined features obtained from the two feature extraction processes in (1) and (2) above as input to an SVM classifier, achieving an accuracy of 0.673 on the ADHD-200 holdout data and 0.643 on the ABIDE holdout data. Both of these results, obtained with the same LeFMSF framework, are the best known, over all hold-out accuracies on these datasets when only using imaging data-exceeding previously-published results by 0.012 for ADHD and 0.042 for Autism. Our results show that combining multi-modal features can yield good classification accuracy for diagnosis of ADHD and Autism, which is an important step towards computer-aided diagnosis of these psychiatric diseases and perhaps others as well.

Entities: Chemical Disease Gene Species

Mesh：

Year: 2018 PMID： 29664902 PMCID： PMC5903601 DOI： 10.1371/journal.pone.0194856

Source DB: PubMed Journal: PLoS One ISSN： 1932-6203 Impact factor: 3.240

1 Introduction

Statistical machine learning methods have recently permeated disciplines such as Psychiatry, which specializes in the diagnosis and treatment of neuropsychiatric disorders [1]. The availability of large scale neuroimaging datasets has encouraged researchers to develop computer-aided tools and procedures for understanding the human brain and its disorders. Some studies use structural Magnetic Resonance Imaging (MRI) scans, which provide a non-invasive technique for obtaining a volumetric image of the brain anatomy, while others use functional MRI (fMRI) scans, which measure brain activity by detecting fluctuations in the Blood Oxygenation Level Dependent (BOLD) signal over time. Many neuroscientists are seeking ways to use such MRI and/or fMRI data to detect brain function disorders, such as Autism and Attention Deficit Hyperactivity Disorder (ADHD). While significant work has been done in association studies for finding discriminative group-level characteristics for ADHD [2-4] (resp., Autism [5, 6]), we are seeking a general learning tool that, when given relevant data, can learn an accurate predictive model, which can predict whether an individual subject has a particular disease. This paper explores biologically naive ways to learn combinations of state-of-the-art structural texture features (from structural MRI scans) and functional connectivity (from fMRI scans) to predict whether a subject has ADHD (resp., Autism). Specifically, our learned models use recently-developed unsupervised feature learning from images and independent component analysis of fMRI for prediction of the diseases. From brain images, the structural textures provide information about the spatial arrangements of voxel intensities in 3 dimensions, which in turn can describe neurological aspects of a subject’s brain. On the other hand, functional connectivity captures patterns of deviations from statistical independence between the time signals at distributed, and often spatially remote, neuronal regions [7-9]. Deviations from statistical independence are generally taken to indicate dynamic coupling and can be measured, for example, by estimating the correlation, independent components, etc. As a first step towards creating a generalized prediction tool for ADHD/Autism, we experiment with 3-D texture based models that describe structural arrangements of the brain by learning texture features from MRI scans. These features are then used by our LeFM learner. To use the fMRI data, we experiment with four different source separation techniques, that each decompose the scans into spatio-temporal components that are common across the subjects. (Each component specifies a timecourse. For each each, for each component, a spatial map is created based on how simliar each voxel’s time course is to the component time course. See Methods for details.) This, in turn, allows each subject’s fMRI scan to be described as a weighted sum of the common spatio-temporal components, with the subject’s weights then serving as features for our model. Our LeFM learner uses these fMRI-derived features to produce classifiers designed to predict whether a patient has a particular disease. Finally, our learner LeFM uses both sets of features (structural and functional) to produce a classifier designed to predict if a patient has a particular disease. Here, we use two publicly available, multi-site datasets, ADHD-200 and ABIDE (http://fcon_1000.projects.nitrc.org/indi/abide/), for developing and then testing our models. Since the publication of the ADHD-200 dataset, many researchers [10-13] have explored ways to improve the prediction accuracy of ADHD using this data; see also [14] for Autism. However, even the best prediction results are not at the level of being clinically useful. The main contribution of this work is our approach to feature extraction—i.e., a single method to obtain features, by applying digital image processing algorithms to structural and functional MRI scans, that can be used in learned models to accurately distinguish between ADHD versus Healthy (resp., Autism versus Healthy) subjects. In particular, While standard independent component analysis (ICA)-based source separation is used primarily in analyzing group-level activation differences in fMRI analysis, we devise a novel extension, and algorithm, that is applicable to prediction studies (Section 2.5). We show that the resulting LeFM algorithm, with its learned combination of these learned features, out-performs other fMRI-based algorithms for ADHD/Autism prediction, on these datasets. In addition to using ICA-based source separation in the context of a predictive study, we introduce the approach of using multiple decorrelation [15] for the purpose of fMRI source separation and compare it to three other commonly used source separation algorithms. Here, fMRI volumes are modeled as non-stationary signals and source separation based on second-order criteria is used to separate out common spatial activation maps and subject-specific time courses. Combining texture-based features learned from MRI and features extracted using fMRI source separation, we improve the prediction model accuracy for ADHD to 0.673 (compared to 0.661) and Autism to 0.643 (compared to 0.619). Our application of image processing algorithms for structural/functional MRI data suggests that biologically naive features can lead to effective classifiers of mental disorder (ADHD/Autism), suggesting this approach should be further investigated for these, and other disorders.

2 Materials and methods

This section presents the overall process of the diagnostic system; see Fig 1 We first describe the datasets (Section 2.1) and evaluation criteria (Section 2.2). Then we outline the preprocessing pipeline in Section 2.3. The remaining sections summarize diagnostic methodology from MRI scans (LeFM; Section 2.4), from fMRI scans (LeFM; Section 2.5), and from combined imaging features (LeFM; Section 2.6).

Fig 1

Overall pipeline, including feature extraction steps, for each of our methods: LeFM uses all of these steps (as it uses both MRI and fMRI data); LeFM does not use the red “fMRI feature extraction” box; and LeFM does not use the grey “MRI feature extraction” box.

In each case, in addition to the learner that utilizes the combined features, our system also learns the most effective features from the MRI data (grey box) and from the fMRI data (red box); those features are used to pre-process the test data (see boxes of the same color).

Overall pipeline, including feature extraction steps, for each of our methods: LeFM uses all of these steps (as it uses both MRI and fMRI data); LeFM does not use the red “fMRI feature extraction” box; and LeFM does not use the grey “MRI feature extraction” box.

2.1 Datasets

We used one multi-site dataset for producing, then evaluating, each model: ADHD-200 (including 8 sites) and ABIDE (including 17 sites). Each of the datasets included a structural scan (high resolution, for a single time point), and also one or more resting-state functional scans for each of the subjects. The spatial resolution of the structural MRI scans was 1mm × 1mm × 1mm. In these resting state functional scans, the subject did not perform any explicit task. That functional scan included between 76 to 261 time points for each ADHD-200 subject and between 82 and 320 time points for each ABIDE subject. Different subjects were scanned with different temporal resolutions: ranging from 1.5 seconds through 3 seconds in the ADHD-200 dataset, and from 1 seconds through 3 seconds in the ABIDE data. The field strength of the MRI scanners varied from 1.5T to 3T. Each data collection site used its own scanner(s) and its own MR scanning parameters. The demographics of subjects in these datasets can be found in Tables 1 and 2. More details are available at the ADHD-200 site (http://fcon_1000.projects.nitrc.org/indi/adhd200/) and ABIDE site (http://fcon_1000.projects.nitrc.org/indi/abide/).

Table 1

ADHD-200 data demographics.

Site abbreviations: Peking University (Peking), Kennedy Krieger Institute (KKI), NeuroIMAGE (NI), New York University (NYU), Oregon Health and Science University (Oregon), University of Pittsburgh (Pitt), Washington University in St. Louis (WashU).

	Peking	Brown	KKI	NI	NYU	Oregon	Pitt	WashU
Subjects	245	26	94	73	263	113	98	61
ADHD	130	26	33	50	163	71	9	0
Male/Female	174/71	9/17	64/30	43/30	171/92	61/52	53/45	33/28
Age Mean	11.7	14.54	10.22	17.64	11.45	0.10	15.08	11.47
Age STD	1.96	2.54	1.34	3.05	2.91	1.20	2.78	3.88

Table 2

ABIDE data demographics.

Site abbreviations: Carnegie Mellon University (CMU), California Institute of Technology (Caltech), Kennedy Krieger Institute (KKI), Ludwig Maximilians University Munich (LMU), New York University (NYU), Olin Institute of Living at Hartford Hospital (Olin), Oregon Health and Science University (Oregon), San Diego State University (SDSU), NeuroIMAGE (NI), Stanford University (Stanford), Trinity Centre for Health Sciences (Trinity), University of California, Los Angeles (UCLA), University of Leuven (Leuven), University of Michigan (UMich), University of Pittsburgh School of Medicine (Pitts), University of Utah School of Medicine (Utah), Yale University (Yale).

	CMU	Caltech	KKI	LMU	NYU	Olin	Oregon	Sdsu
Subjects	27	38	55	57	184	36	28	36
Autism	23	19	22	24	79	20	13	14
♂/♀	23/4	32/6	48/7	48/9	154/30	31/5	24/4	30/6
Age Mean	25.4	22.3	10.4	20	13.9	17.2	10	14.4
Age STD	4.5	4.1	1.4	9.1	5.1	3.2	1.8	1.5
	NI	Stanford	Trinity	UCLA	Leuven	UMich	Pitts	Utah	Yale
Subjects	30	40	49	108	64	145	57	101	56
Autism	15	20	24	62	29	68	30	58	28
♂/♀	24/6	17/3	19/5	52/10	24/5	59/9	25/5	51/7	26/2
Age Mean	29.5	9.5	16.6	12.7	21.4	13.8	17.9	24.5	12.4
Age STD	5.9	1.7	3.0	2.1	2.3	2.7	5.5	3.7	2.9

ADHD-200 data demographics.

ABIDE data demographics.

2.1.1 ADHD-200

The ADHD-200 dataset is a multi-site combination of neuroimages taken from 8 sites (of which we use 7). It is partitioned into two disjoint sets (training and holdout), where the prediction model is learned from only the training set, and the learned model’s accuracy is measured on the holdout set. The data from the excluded site (Brown University) are not used because subject labels (ADHD vs. control) are not available for that site and we require labels for our machine learning algorithms. Thus, our training set consists of 776 resting state scans: 491 were taken from healthy controls and 279 from patients. To balance our training set, we used all 279 patients and selected 279 healthy controls evenly taken from all the sites as our training set [11, 12, 16], which means that the baseline classification accuracy for the training set is 0.50. The ADHD-200 competition hold-out dataset (excluding Brown University data) consists of 171 subjects: 94 healthy subjects and 77 ADHD cases (baseline accuracy 0.5497). We only use this set for evaluating the quality of the final model; n.b., it is not used in any way during the training process. The ADHD-200 dataset also includes other non-imaging features for each subject, including gender, age, handedness, site of the imaging, IQ measure, etc. [17]. However, we only use imaging data for our experiments.

2.1.2 ABIDE

The ABIDE dataset consists of 1111 scans, taken from 17 sites consisting of 573 healthy controls and 538 patients with autism. To evaluate each learning model, we use 800 subjects (70%) for model training and 311 subjects (30%) for hold-out testing. We use the same case/control ratio (0.5157) for both training and test set. The ABIDE dataset provides an array of non-imaging information which includes age, gender, handedness, various IQ scores, site of the imaging and eyestat (which indicates whether the person kept their eyes open or not during the scan). Again, we only use imaging data for our experiments.

2.2 Evaluation criteria

For each task, we use 5-fold cross validation (CV) [18] over the training set to set the parameters, based on accuracy. That is, each set of parameter values is tested with 5-fold cross validation on the training set, and the set of values producing the best average accuracy across all 5 folds of the training dataset is selected for final testing on the held out test dataset. We report the training set CV accuracy as well as performance measures of the learned model applied to the holdout set: accuracy, sensitivity, specificity and J-statistics (Jstat = sensitivity + specificity − 1). We also used statistical tests to evaluate the hold-out results obtained by the various learners. For each dataset, we used a binomial test to compare our learners (LeFM and LeFM) to cross-validation baseline accuracies, to determine if each learner is significantly better than chance. We used two additional binomial tests to compare LeFM to the previous state-of-the-art classifiers for ADHD-200 and ABIDE. Finally, four McNemar tests were performed to determine if the results of LeFM were significantly better than those of LeFM or LeFM on the ADHD-200/ABIDE hold-out datasets.

2.3 Preprocessing pipeline

For preprocessing fMRI and MRI scans, we used SPM8 (http://www.fil.ion.ucl.ac.uk/spm/software/spm8/) and our own in-house MATLAB code. Preprocessing was identical to that used in Ghiassian et al. [13]. Preprocessing involved 6 steps: 6-parameter rigid body motion correction of functional scans Co-registration of functional scans to subject-specific structural scans to guide the spatial normalization step Non-linear spatial normalization (parameter estimation and spatial transformation) of structural images to the MNI T1 template (http://imaging.mrc-cbu.cam.ac.uk/imaging/Templates) Non-linear spatial normalization of previously co-registered functional image volumes (in step 2) to the MNI T1 template using warping parameters computed in the structural image normalization Spatial smoothing of functional image volumes with 8mm full-width half-maximum (FWHM) Gaussian kernel Z-normalization of each 3D volume’s intensities for structural and functional image to standardize the intensities of images scanned from different sites. Note that only steps 2, 3 and 6 are applicable to processing the MRI images. For more details about the preprocessing steps, see Ghiassian et al. [13].

2.4 Building a model from structural MRI features, LeFM

This section describes LeFM, which extracts the usable features from structural MRI images, which are then provided to a learning algorithm; see Fig 2. Section 2.7 below describes the Learner, and how we set its hyperparameters. Here, we focus on the grey box, which does this preprocessing: this involves creating 3D patches by extracting all 5 × 5 × 5 cubes from the MRI image (Section 2.4.2), which are given to a sparse autoencoder that learns to encode a set of reduced representations (Section 2.4.3) of the patches producing filters (Section 2.4.1), which are then finally used in a convolutional neural network (Section 2.4.4) with convolution (Section 2.4.1) and max pooling to produce a final reduced feature set.

Fig 2

Process of extracting MRI features to be provided to a learning algorithm.

This expands the “MRI feature extraction” box from Fig 1.

Process of extracting MRI features to be provided to a learning algorithm.

This expands the “MRI feature extraction” box from Fig 1. To understand this process better, we first define some common terms—Filters, Convolutions—that will be used later in this section. The rest of the section then describes the steps in Fig 2, in left-to-right order.

2.4.1 Background: Filters, convolution

While our application deals with 3D images, this section explains the basic ideas using simpler 2D examples. In image processing, filters are transformations that accentuate certain features within an image. Filters are generally defined on a neighborhood. For example, the 2D image filter is defined on 3 × 3 neighborhood. Below, we view h as a function—here h: {−1, 0, 1} × {−1, 0, 1} ↦ ℜ. We then “convolve” a filter against an image. More precisely, a convolution is a mathematical operation on two functions—an image and a filter (where are the integers from −d through + d, inclusive)—to produce a third function that is typically viewed as a modified version of I(x, y), giving the overlap between the two functions. In case of 2-dimensional convolution, I is a 2D image (where I(x, y) is the intensity at position (x, y)), and h(u, v) is the filter that is convolved with the image—such as the h from Eq 1. The result of the convolution can be interpreted as the similarity measures between each pixel of the image and the filter. For 2D images, this result, at location (x, y), is defined as where is the neighborhood over which the filter h is defined, and is the sigmoid function.

2.4.2 3D patch sampling

The 3D patch sampling process converts our MRI image into a set of smaller 3D images. This is done by defining a patch size (in our case 5 × 5 × 5) and moving this cube across the image in all dimensions (step size is 1 voxel), extracting the contents of the cube at each point to create a 3D patch. This produces one such patch for every possible 5 × 5 × 5 cube that can fit within the original image—yielding approximately as many patches as there are voxels. Note that “adjacent” patches will overlap.

2.4.3 Using a sparse autoencoder to generate filters

We then run a sparse autoencoder over the patches, to produce the filters required for convolution. To generate the filters for the convolutional neural network (CNN), we use a single layer unsupervised sparse autoencoder, that seeks to compress the 125-dimensional patches down to just 3-dimensions; see Fig 3. This produces 3 filters (based on the Encoding Weights); see left portion of Eq 1.

Fig 3

Autoencoder that tries to reconstruct the 5 × 5 × 5 patches (represented as a 125-D vector).

Autoencoder that tries to reconstruct the 5 × 5 × 5 patches (represented as a 125-D vector).

For each of the k = 3 filters, there is a link (“Encoding Weights” θ1) from each input node to each hidden node, and then from each hidden node to each Output node (“Decoding Weights”, θ2); we show only a subset for clarity. The sparsity aspect of the autoencoder encourages the network to learn different transformations in each of the nodes [19], which is meant to improve the robustness of the classifiers using the filters’ outputs as features. Formally, we seek the parameters [θ, θ, b, b] of the network (Fig 3) that minimizes: where D ∈ ℜ is the data matrix, where each row is a data point x ∈ ℜ, also h ∈ R is the hidden representation of the data (i.e., there are k nodes in the hidden layer), is reconstructed data, L(a,b) = ∑(a − b)2 is the squared loss, σ(s) is sigmoid function (Eq 2), ρ is sparsity parameter (e.g., ρ = 0.05), is average activation of the j hidden neuron, and is the Kullback-Leibler divergence. We used (internal) 5-fold cross-validation to select the hyperparameters β, λ and ρ, and solve Eq 3 using the L-BFGS package (http://users.iems.northwestern.edu/~nocedal/lbfgsb.html).

2.4.4 Single layer unsupervised convolutional neural net

After learning the k = 3 filters from the sparse autoencoder, the filters are then convolved throughout each entire MRI 3D image (size 79 x 95 x 68), producing a new 3D image (size 74 x 91 x 64) for each filter. (Fig 4 shows a 2D analogue of this.) As we use sigmoid activation functions σ(⋅), these features are a non-linear combination of voxels within the input image [20].

Fig 4

(a) Median axial slice from MRI scan and (b) the result after convolution with a filter.

We then use max-pooling after the convolution step. Here, for each volume (produced by each filter), we divide the entire volume into disjoint and exhaustive 5 × 5 × 5 sub-regions, and then compute the maximum value of each sub-region. This step reduces the sensitivity of the feature map to various distortions, for example, by introducing translational invariance for the features within the max-pooling region. We then collect these values into a single vector—here of size 3 × (15 × 18 × 13) = 10 530. We then describe each subject with a single vector that is the concatenation of all k of these size-10,530 vectors. This is the input to the base learning algorithm; see Fig 1.

2.4.5 Evaluating the model produced by LeFM

To evaluate the LeFM algorithm, we used 5-fold CV on the training set with internal 5-fold CV to obtain hyperparameters that maximize CV accuracy, and then (after finding the apparent best parameters and model), applying that learned model to the holdout set. The base learner is an SVM. We applied the resulting learned model to the holdout set, but found that it did not perform as well as state-of-the-art methods that also only use structural MRI features. This process was still relevant, as it identifies relevant features, which will be used within LeFM; see Section 2.6 below.

2.5 Building a model from fMRI features: LeFM

This section describes our approach to producing a model that can use blind source separation to create features from a subject’s fMRI scan, where those features are then used as input to an SVM to predict the subject’s diagnosis; see Fig 5 (which expands a portion of Fig 1). Sections 2.5.2, 2.5.3 and 2.5.4 describe how we process the initial fMRI data to where source separation can be applied, then Section 2.5.5 describes the different source separation techniques that we considered, to generate spatial maps. As seen in Fig 5, we use these spatial maps to produce the fMRI-based features that we ultimately use to train an SVM classifier. Fig 6 summarizes the complete LeFM algorithm; the following subsections discuss each step in more detail.

Fig 5

Detailed process of how a classifier is produced and used with fMRI scans.

This corresponds to the “Red Box” (i.e., the fMRI processing component) of Fig 1.

Fig 6

Detailed version of the fMRI data to the spatial maps process that elaborates the boxes “Dimensionality Reduction with 2-step PCA” and “Obtain Spatial maps Using PCA/k-PCA/ICA/NSD” in Fig 5.

The first three of four steps show how 2-step PCA is applied to the data, where the concatenation step is temporal concatenation (Fig 7). The first column shows that the scan from subject i is a T × V matrix . The next column, after “Subject Level PCA”, shows the i subject is described as , with dimension T × V. The subsequent “Concatentation” step produces a group matrix with dimension nT × V. The “Group Level PCA” step reduces this to a K × V matrix, which the “Source Separation” step factorizes to produce the mixing matrix size K × C. This final step could be the identify transformation, kernel-PCA, standard ICA or non-linear ICA; each of these might lead to different values of C.

Detailed process of how a classifier is produced and used with fMRI scans.

This corresponds to the “Red Box” (i.e., the fMRI processing component) of Fig 1.

Detailed version of the fMRI data to the spatial maps process that elaborates the boxes “Dimensionality Reduction with 2-step PCA” and “Obtain Spatial maps Using PCA/k-PCA/ICA/NSD” in Fig 5.

Fig 7

Temporal vs. spatial concatenation.

Each ‘fMRI data i” box (for i = 1..n) corresponds to one of the n subjects. This shows two ways we can concatenate these boxes from subjects in Fig 6 to produce X, Here, we use temporal concatenation.

As a high-level summary of the first 3 steps: an fMRI scan for a single subject has a large number of voxels, which are captured for each of a number of time points. Because combining all the fMRI scans from all subjects for source separation would be intractable, it is important to reduce the number of features while preserving the important subject-level variations in the data. To reduce the computational load on group level analysis, previous researchers have proposed various approaches for reducing the dimensionality before group level source separation analysis. We follow Calhoun et al. [21, 22] by using a standard 2-Step principal component analysis, applied to all fMRI images used in our study. The steps in Sections 2.5.2, 2.5.3, and 2.5.4 are implemented in the GIFT software package (http://mialab.mrn.org/software/gift/) from Calhouns’s group. This model is based on the (reasonable) assumption that there are common spatial sources for a set of fMRI scans in a particular study, where subjects differ based on temporal weights of each source. This 2-Step data reduction procedure captures the subject level variations and group level commonalities.

2.5.1 Dimensionality reduction using 2-step PCA

An fMRI scan for a single subject has a large number of voxels, which are captured for each of a number of time points. Because combining all the fMRI scans from all subjects for source separation would be intractable, it is important to reduce the number of features [21, 22] while preserving the important subject-level variations in the data. In order to reduce the computational load on group level analysis, we follow Calhoun et al. [21]’s 2-step principal component analysis (PCA) [21, 22], which applies PCA first to each subject separately and then to the all subjects together. We use the implementation of this procedure in the GIFT software package. This 2-step PCA procedure produces a set of components meant to capture the subject level variations and group level commonalities. Each component yields a “spatial map” [23]. This approach is based on the (reasonable) assumption that there are common spatial sources for a set of fMRI scans in a particular study, where subjects differ based on temporal weights of each source.

2.5.2 Subject level PCA

As shown in Fig 6, we start with a T × V matrix, matrix describing each of the n subjects. T is the number of time points. V is the number of voxels. PCA [21] is used (within GIFT) to reduce this to a T × V matrix by selecting T largest eigenvalues, to capture 99% of the variance. (For details, see the GIFT documentation http://mialab.mrn.org/software/gift/documentation.html).

2.5.3 Subject concatenation

To estimate a set of common components across all subjects, we vertically concatenate the PCA-rotated matrices produced for each subject by the above subject-specific PCA step (as shown in Fig 7). Note that vertical concatenation means we concatenate along the temporal direction (as opposed to spatial concatenation, which is not discussed here). Therefore, we are using source separation to decompose each fMRI scan into a set of spatial maps, where each map is derived from a time course defined by a PCA component. As seen in Fig 7, in this context, the inner product of voxel components and time series components reconstructs an approximation of the original fMRI images, so we can view this as a way of representing each voxel’s time series as a linear combination of more general, shared time series.

Temporal vs. spatial concatenation.

2.5.4 Group level PCA

The previous steps produced the Concatenated Matrix Y[( (3rd column in Fig 6). We next run PCA on this matrix, and again included enough rows K to capture 99% of the variance.

2.5.5 Source separation

Previous studies [21, 24] have explored brain regions that are strongly temporally coherent (co-activated during rest) using source separation techniques like PCA or independent component analysis (ICA). Here, PCA separates the fMRI brain scan into uncorrelated spatial maps or sources based on variations in the time, whereas ICA decomposes the brain fMRI scans into a common set of spatially independent components (a.k.a. sources or spatial maps) and their corresponding user-specific time courses. In this case, ICA assumes that the spatial maps have constant higher order statistics [21, 22, 25]. This paper compares four source separation methods, which are applied after 2-step PCA—including this simple version of ICA, as well as a more complex Non-stationary Spatial Sources Decomposition (NSD) technique—in terms of the accuracy of the models produced by the features obtained from each algorithm. To the best of our knowledge, this is the first study to fully investigate using NSD to produce predictive models from fMRI data. Principal Component Analysis (PCA) for fMRI The simplest of our 4 approaches just uses the “Representative Matrix” X[ as our spatial maps, without any further modification (that is, the 4th step of Fig 6 is just the identity transformation). In this case, we use internal cross-validation to determine the appropriate number of components that optimizes the accuracy of the final model, instead of preserving 99% of the variance. Kernel Principal Component Analysis (k-PCA) for fMRI Here, we use the Radial Basis Function kernel (RBF), with the associated matrix, where x and x are i and j column of X respectively. After computing the kernel similarity matrix, we can obtain the eigenvalue/eigenvector pair of the kernel matrix; the projection onto these eigenvectors produces spatial maps (rows of S). Finally, the inner product of each patient’s fMRI scan on these maps results in the time components (columns of A). The number of eigenvectors retained for this final inner product is again selected through internal cross-validation. Independent Component Analysis (ICA) The goal of the standard ICA algorithm is to find independent sources that maximize the log likelihood of the observed data, in the context of temporally concatenated fMRI data—i.e., given the X appearing in 4th column of Fig 6. We also suggest how one might interpret this process. The standard max likelihood estimate (MLE) interpretation of ICA starts with an observed data matrix X where each row of X (x) is a single observation. In particular, this model assumes that each x is a realization of the random variable x where x is a linear combination of independent source random variables, {s}. We also need a mixing matrix A that defines the contributions of each s in producing each x. Concretely, if we take any column of X (denoted x) we find the relationship where W = A† is pseudo-inverse of A and s is a column of S, the matrix whose rows s are each a realization of s. Note that A and S are never observed. In our fMRI setting, shown in Fig 8, X and S each have V columns and each time point in the concatenated data x is a linear combination of our unknown independent spatial maps S = [s, …, s]. To help understand ICA applied to temporally concatenated fMRI data, think of the data as snapshots of a someone performing different gestures. For each snapshot, we observe the brain activation associated with a particular gesture (our x), which can be viewed as a combination of what would be the brain activations associated with more primitive movements S such as rotating a wrist or lifting an arm. In light of this example, ICA is being used to decompose neurological activity into more fundamental components to identify a disease state such as ADHD or Autism.

Fig 8

Temporal concatenation with ICA elaborated from Fig 7.

This corresponds to the ICA decomposition X ≈ AS, except here, for illustrative purposes, we have replaced X with Y (the concatenated fMRI data from Fig 6 before the Group-level PCA).

Temporal concatenation with ICA elaborated from Fig 7.

This corresponds to the ICA decomposition X ≈ AS, except here, for illustrative purposes, we have replaced X with Y (the concatenated fMRI data from Fig 6 before the Group-level PCA). Now that we have defined our setting and provided some intuition, we now describe how we use MLE to obtain the (unobserved) independent components of S. Our ultimate goal is to find an optimal inverse mixing matrix W* that maximizes the likelihood of the observed data. Here, we assume that the distribution of the h source s is given by a density p(s), and that the joint distribution of the sources are independent of one another; i.e., The probability of the observed signals for j observation (x) is given by where is the transpose of the h column of W [26]. The W that maximizes the log-likelihood given the data X[, is Note that this p(⋅) can be any non-Gaussian distribution [27]. Here, we use GIFT’s default implementation and parameters, which is where each different source h = 1..K will have a different λ(. Note this does not depend on the location of the voxel. Given this W*, and the observed x for patient j, we then use Eq 5 to solve for the time course A. Non-Stationary Spatial Sources Decomposition (NSD) As mentioned earlier, the probability density of each source is parameterized by a constant value that does not change with location of voxels. NSD generalizes this, by allowing the probability density function for each component to be parameterized by the location of the voxels [28, 29]. Here, we define the distribution for the h component S as (Compare to Eq 9) We think that each of these spatial sources should be non-stationary (i.e., have different λ values at different (i, j, k) locations) as: One source (which corresponds to one row in S [see Fig 6]) may not have the same magnitude and variation throughout the whole brain due to in-homogeneous magnetic susceptibility that depends on the location of voxels in the scan. This differs from commonly used source separation models, which require that the sources have same variation for the whole brain scan. The strength and variability within a particular source depends on the brain tissue type in a particular brain region. For example, the activation values in grey matter and white matter (which depends on the amount of oxygenated blood flow in that tissue) would be different. In order to develop the theory for nonstationary source decomposition, consider a representative fMRI scan X[ ≈ A[ S[, where S is an K × V matrix of source components, where each row S (size 1 × V) specifies the contribution of voxel co-ordinates to r source. Each spatial activation refers to one row in S. Column h in matrix A will have the time courses for the corresponding h spatial component. In mathematical terms, suppose we have K independent spatial maps (each corresponding to a row in S) and V observations (each corresponding to a 3D brain scan at a time point). Then, we can formulate the covariance matrix at location r = (i, j, k) as R(r) = 〈x(r) x(r)〉 = A D(r) A where R amd D(r) are of size K × K. Further we let x(r) = X(:, N(r)) where N(r) represents any suitably chosen region around position r for which the signals are assumed to have same higher order statistics. For our experiments, we have chosen N(r) to be a 4 × 4 × 4 bounding box (e.g., if the point r = [200, 100, 50], then this box is defined by corners [199, 99, 49] and [202, 102, 52]). We found, empirically, that increasing the bounding box severely degraded the performance of the model. Assuming the sources are spatially non-stationary, and following [15, 30], However, as we do not have a perfect estimate for R(r), we estimate the covariance matrix R(r) for some spatial interval. We denote the sample estimates as , and define the measurement error as Suppose we have N samples of for r ∈ {r1, r2, .., r}, then we can estimate the parameters by and estimate the source components as Then for each patient, we can calculate the time courses A from Eq 11. This method decorrelates the spatial sources at different regions of the brain, which is desirable for the reasons described above.

2.5.6 Complete LeFM process

Fig 6 shows the steps. First, the SubjectLevelPCA reduces each subject’s [T × V] matrix I to Y = UI where U is the T × T reduction matrix. Note that T is determined by retaining the independent components that preserve 99% of the variance (see the GIFT software package documentation for complete details on individual and group level PCA reductions). The second step, Concatenation, concatenates {Y} over all of the n subjects, to obtain the matrix Y of size nT × V The third step, GroupLevelPCA, runs PCA on this Y to produce a new matrix X ≈ FY, where F is a K × nT reduction matrix. If we divide the F into n sub-blocks F, each of size K × T, then F corresponds to reduction matrix for subject i. We then consider various final steps: (1) Directly using the final F matrix as features. (2) Running kernel PCA on X to obtain a reduction matrix A. (3) Running (stationary) ICA on X to obtain the mixing matrix A. (4) Running NSD on X to obtain the mixing matrix A. In these last three variants we are using different blind source separation algorithms to express X ≈ A × S, where S is a [C × V] matrix with C spatial maps covering V voxels. Hence, for subject i, where , which is different for each patient (while all use the same S matrix). Here and are pseudo-inverses of U and F respectively. Thus, X is approximated as A S.

2.6 Building a model from multimodal features: LeFM

For the set of all n subjects, let be the feature matrix from the one-layer convolutional network on the MRI data, and be feature matrix from applying nonstationary ICA to the fMRI data. The combined feature matrix is , with f = f1 + f2 columns. This matrix was then divided into training set and holdout set by instances as in Section 2.1. The training set was then used to train a linear support vector machine.

2.7 Support vector machine classifier

Each of the approaches—LeFM, LeFM, LeFM—uses the Support Vector Machine (SVM) learning algorithm to generate their models, based on the respective features. LeFM and LeFM use linear kernels and LeFM uses an RBF kernel. This is based on the argument that when we use linear features (i.e., features are extracted using a linear transform) we use a non-linear classifier, and when we use non-linear features (i.e., features are extracted using a non-linear transform) we use a linear classifier. For the RBF kernel, we follow a standard practice of using internal cross-validation to set the parameter γ ∈ {0.1 × 2|i = 0..10}. The parameter C for all SVMs is left at the default of C = 1. Of course, once the model is produced by the learner, it can then be used to predict the class of a novel instance.

3 Results

This section provides the results of cross validation and holdout set evaluation for the three models considered: Section 3.1 discusses the results of the LeFM model; Section 3.2 discusses the results of the LeFM model; and Section 3.3 discusses the results of the LeFM model. We see that our LeFM model is not better than state-of-the-art MRI based models, in general, but our LeFM model outperforms other models that only use fMRI features, and our LeFM model outperforms all other image-based models, on this dataset.

3.1 Results using MRI texture-based features (LeFM)

The best 5-fold cross-validation accuracy achievable for ADHD-200 data is 0.635. When run on the hold-out data, this classifier yielded accuracy of 0.626, with sensitivity, specificity and Jstat of 0.420, 0.842 and 0.262 respectively. The best 5-fold cross-validation accuracy achievable for ABIDE data is 0.614. When run on the hold-out data, this classifier yielded accuracy of 0.617, with sensitivity, specificity and Jstat of 0.490, 0.730 and 0.220 respectively. Here, cross validation found that the optimal number of spatial maps for ADHD was 100, and for ABIDE, it was 175.

3.2 Results using fMRI source separation models (LeFM)

3.2.1 5-fold cross validation accuracy

To assess the predictive ability of a model trained only on fMRI features, we ran 5-fold cross validation (CV) over a varying number independent components. We then ran the version with the optimal number on the holdout set. Table 3 shows the results of this CV selection process for both the ADHD and Autism datasets. As shown, for the ADHD data, the model performed best with 45 independent components, achieving an accuracy of 0.6450 compared to the baseline of 0.50 with significance p = 1.83e-12 (via binomial test). For the Autism data, the model performed best with 45 components with a CV accuracy of 0.6225 relative to the baseline 0.5157 (p = 4.95e-10, via binomial test). Note that the 45 components for the ADHD data are unrelated to the 45 components for the Autism data. Fig 9 plots these values, and shows that NSD is better than the other feature extraction methods (PCA, kPCA and ICA).

Table 3

5-fold CV results for classification accuracy (and standard deviation) using various numbers of independent components.

	# ICs (SVM GAMMA)	5-Fold CV Accuracy	5-Fold CV STD
ADHD	30 (6.4)	0.6216	0.0399
	35 (3.2)	0.6378	0.0374
	40 (1.6)	0.6432	0.0390
	45 (3.2)	0.6450	0.0291
	50 (3.2)	0.6360	0.0261
	55 (3.2)	0.6378	0.0075
Autism	30(51.2)	0.5925	0.0158
	35 (51.2)	0.5825	0.0049
	40 (51.2)	0.6159	0.0078
	45 (51.2)	0.6225	0.0210
	50 (51.2)	0.6000	0.0130
	55 (51.2)	0.5987	0.0059

Note: SVM GAMMA refers to the RBF kernel variance parameter .

Fig 9

5-fold cross validation results of classification.

Note: SVM GAMMA refers to the RBF kernel variance parameter .

3.2.2 Model hold-out test accuracy

Using the optimal parameters selected from cross validation, the resulting hold-out accuracy for ADHD was 0.6491 (baseline 0.5497) and was 0.6233 for ABIDE (baseline 0.5157). The specificity, sensitivity and Jstat for ADHD (resp., ABIDE) are 0.8191, 0.4416 and 0.2607 (respectively, 0.6768, 0.5533 and 0.2301). To our knowledge, our holdout accuracies achieved on ADHD-200 test data and the ABIDE test data are the best known when only using fMRI scans.

3.2.3 Leave-one-out accuracy comparison

To compare our results with yet other systems, we also computed the leave-one-out accuracy of our ABIDE model; Table 4 shows that our LeFM was superior.

Table 4

Leave-one-out results for autism classification (ABIDE) using only imaging data.

Algorithm	Accuracy	Specificity	Sensitivity	J-Statistics
LeFM_F	0.614 ±0.017	0.648	0.578	0.226
Nielsen et al [14]	0.600 ±0.016	0.58	0.62	0.20

Note that Abraham et al. [31] report an “inter-site” cross-validation accuracy of 66.8±5.4% on a version of the dataset containing 871 subjects (patients are ommited because they did not pass a visual quality inspection). Here, Abraham et al. [31] use a 16-fold cross validation, done by training on 15 of 16 sites and testing on the remaining site in each fold. By contrast, our model was trained on a smaller set of only 800 instances—recall we trained on only 70% of the original 1111 instances, and left the remaining 30% (311) for testing. It is likely that their reportedly high accuracy is due in part to having a larger CV dataset where samples were heavily scrutinized so as to exclude those that did not meet their criteria.

Spatial maps from ADHD-200 data:

Figs 10 and 11 show the spatial maps found using multiple de-correlation—showing 9 axial slices for each. The component IC1 appears to be an artifact as it assigns very high values to all the voxels—i.e., as all the voxels in the brain are approximately of equal importance in this component, it is shared by all the voxels. We assume this is probably due to noise like motion, breathing and attention signals, which can modulate voxels throughout the brain [32]. We believe that IC3 also appears to be an artifact as it consists of regions from cerebrospinal fluids which means it might be a result of cardiac pulsatility artifacts [33]. Moreover, we found that removing these two artifacts did not change the performance of the model. The other components consist of some common default mode networks (regions that are shown to be active during rest [34]) and apparently some new resting state networks not previously identified in the literature. Some of the resting state networks found are consistent with [35]. For example, IC5 appears similar to the resting state network for peristriate area, and lateral and superior occipital gyrus, which are areas related to visual cortex and might represent spontaneous brain activities like day-dreaming [36, 37]. It is easy to connect this component to ADHD as ADHD patients are more likely to experience mind-wandering [38]. IC6 captured shared functional properties in the frontal and occipital lobe (responsible for planning and many areas of vision respectively). This area is very important for ADHD as well, as ADHD patients may suffer from lack of effective planning [39]. IC24 consists of pons (responsible for eye movement, sleep, and many other vegetal and automatic functions) regions and temporal lobe (for sensory processing, memory formation and higher order association processing). Their usefulness in prediction suggests that some of the physical signals captured by fMRI may also be indicative of a disease state.

Fig 10

Spatial maps (aka components) 1–25 for ADHD-200 using NSD.

Each 3×3 box shows 9 axial slices of one component. The colorbar is same as Fig 4.