Benoit Da Mota, Radu Tudoran, Alexandru Costan, Gaël Varoquaux, Goetz Brasche, Patricia Conrod, Herve Lemaitre, Tomas Paus, Marcella Rietschel, Vincent Frouin, Jean-Baptiste Poline, Gabriel Antoniu, Bertrand Thirion.
Abstract
Brain imaging is a natural intermediate phenotype to understand the link between genetic information and behavior or risk factors for brain pathologies. Massive efforts have been made in the last few years to acquire high-dimensional neuroimaging and genetic data on large cohorts of subjects. The statistical analysis of such data is carried out with increasingly sophisticated techniques and represents a great computational challenge. Fortunately, increasing computational power in distributed architectures can be harnessed, if new neuroinformatics infrastructures are designed and training to use these new tools is provided. Combining a MapReduce framework (TomusBLOB) with machine learning algorithms (Scikit-learn library), we design a scalable analysis tool that can deal with non-parametric statistics on high-dimensional data. End-users describe the statistical procedure to perform and can then test the model on their own computers before running the very same code in the cloud at a larger scale. We illustrate the potential of our approach on real data with an experiment showing how the functional signal in subcortical brain regions can be significantly fit with genome-wide genotypes. This experiment demonstrates the scalability and the reliability of our framework in the cloud with a two-week deployment on hundreds of virtual machines.
Keywords: cloud computing; fMRI; heritability; machine learning; neuroimaging-genetic
Year: 2014 PMID: 24782753 PMCID: PMC3986524 DOI: 10.3389/fninf.2014.00031
Source DB: PubMed Journal: Front Neuroinform ISSN: 1662-5196 Impact factor: 4.081
Figure 1. Top: Representation of the computational framework: given the data, a permutation and a phenotype index together with a configuration file, a set of computations is performed that involves two layers of cross-validation for setting the hyper-parameters and evaluating the accuracy of the model. This yields a statistical score associated with the given phenotype and permutation. Bottom: Example of a complex configuration file that describes this set of operations. General parameters (Lines 1–3): The model contains covariates, the permutation test makes 10,000 iterations and only one permutation is performed in a task. Prediction score (Lines 4–7): The metric for the cross-validated prediction score is R², the cross-validation loop makes 10 iterations, 20% of the data are left out for the test set and the seed of the random generator is set to 0. Estimator pipeline (Lines 8–13): The first step filters collinear vectors, the second step selects the K best features and the final step is a ridge estimator. Parameters selection (Lines 14–16): Two parameters of the estimator have to be set: the K for the SelectKBest and the alpha of the Ridge regression. A grid of 3 × 5 parameter combinations is evaluated.
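The pipeline described in the Figure 1 caption can be approximated in Scikit-learn as follows. This is a minimal sketch, not the authors' configuration: the collinearity filter is omitted, the inner cross-validation settings, the candidate values for K and alpha (assumed here to be 3 values of K and 5 of alpha), the function names and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.pipeline import Pipeline


def build_search(k_values, alpha_values, seed=0):
    """Inner layer: grid search over K (SelectKBest) and alpha (Ridge)."""
    pipeline = Pipeline([
        ("select", SelectKBest(score_func=f_regression)),
        ("ridge", Ridge()),
    ])
    param_grid = {
        "select__k": list(k_values),
        "ridge__alpha": list(alpha_values),
    }
    # Assumption: the inner loop settings are not given in the caption.
    inner_cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=seed)
    return GridSearchCV(pipeline, param_grid, scoring="r2", cv=inner_cv)


def cv_r2(X, y, k_values, alpha_values, seed=0):
    """Outer layer: 10 shuffle splits, 20% held out, R2 as the score."""
    outer_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=seed)
    search = build_search(k_values, alpha_values, seed)
    scores = cross_val_score(search, X, y, scoring="r2", cv=outer_cv)
    return float(scores.mean())


# Synthetic data standing in for genotypes (X) and one phenotype (y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

K_VALUES = (5, 10, 20)                       # hypothetical values
ALPHA_VALUES = (0.01, 0.1, 1.0, 10.0, 100.0)  # hypothetical values
score = cv_r2(X, y, K_VALUES, ALPHA_VALUES)
```

Nesting the grid search inside `cross_val_score` keeps hyper-parameter selection out of the held-out test folds, which is the purpose of the two cross-validation layers mentioned in the caption.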
Figure 2. Overview of the multi-site deployment of a hierarchical Tomus-MapReduce compute engine. The end-user uploads the data and configures the statistical inference procedure on a webpage. The Splitter partitions the data and manages the workload. The compute engines retrieve job information through the Windows Azure Queues. Compute engines perform the map and reduce jobs. The management deployment is informed of the progress via the Windows Azure Queues system and thus can manage the execution of the global reducer. The user downloads the results of the computation on the webpage of the experiment.
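The split, map and reduce roles in the Figure 2 caption can be illustrated with a purely local sketch. It does not use Windows Azure Queues or TomusBLOB: a standard-library `Queue` stands in for the job queue, and every function name and the toy scoring function are hypothetical. Each task is one (phenotype, permutation) pair, as in the Figure 1 caption.

```python
from collections import defaultdict
from itertools import product
from queue import Queue


def split_tasks(n_phenotypes, n_permutations):
    """Splitter: enqueue one task per (phenotype, permutation) pair."""
    tasks = Queue()
    for task in product(range(n_phenotypes), range(n_permutations)):
        tasks.put(task)
    return tasks


def map_task(task, score_fn):
    """Map job: compute the statistical score for a single task."""
    phenotype, permutation = task
    return phenotype, permutation, score_fn(phenotype, permutation)


def reduce_scores(mapped):
    """Reduce job: gather scores per phenotype, ordered by permutation."""
    by_phenotype = defaultdict(dict)
    for phenotype, permutation, score in mapped:
        by_phenotype[phenotype][permutation] = score
    return {
        phenotype: [scores[i] for i in sorted(scores)]
        for phenotype, scores in by_phenotype.items()
    }


def run_locally(n_phenotypes, n_permutations, score_fn):
    """Single-process stand-in for the compute engines polling the queue."""
    tasks = split_tasks(n_phenotypes, n_permutations)
    mapped = []
    while not tasks.empty():
        mapped.append(map_task(tasks.get(), score_fn))
    return reduce_scores(mapped)


def toy_score(phenotype, permutation):
    # Placeholder for the cross-validated score of one task.
    return phenotype * 10 + permutation


result = run_locally(2, 3, toy_score)
```

Because each task depends only on its own phenotype and permutation index, the map jobs can run in any order on any worker, which is what makes the workload easy to distribute.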
Figure 4. Results of the real data analysis procedure. (Left) Predictive accuracy of the model measured by cross-validation, in the 14 regions of interest (ROIs), and associated statistical significance obtained in the permutation test. (Top right) Distribution of the CV-R² at chance level, obtained through a permutation procedure. The distribution of the maximum over all ROIs is used to obtain the family-wise error corrected significance of the test. (Bottom right) Outline of the chosen ROIs.
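The family-wise error correction mentioned in the Figure 4 caption relies on the null distribution of the maximum score over ROIs. A minimal NumPy sketch of that max-statistic procedure is given below; the function name and the toy numbers are illustrative, and the (1 + count) / (1 + n) p-value convention is an assumption, not something stated in the source.

```python
import numpy as np


def fwe_corrected_pvalues(observed, permuted):
    """Max-statistic permutation p-values.

    observed: shape (n_rois,), scores on the unpermuted data.
    permuted: shape (n_permutations, n_rois), scores under permutation.
    """
    observed = np.asarray(observed, dtype=float)
    permuted = np.asarray(permuted, dtype=float)
    # Null distribution: the largest score over all ROIs per permutation.
    max_null = permuted.max(axis=1)
    # For each ROI, count permutations whose maximum reaches its score.
    exceed = (max_null[None, :] >= observed[:, None]).sum(axis=1)
    return (1 + exceed) / (1 + max_null.size)


# Toy example: 2 ROIs, 3 permutations (made-up numbers).
observed_scores = [0.5, 0.0]
permuted_scores = [[0.1, 0.2], [0.3, 0.0], [0.05, 0.1]]
pvalues = fwe_corrected_pvalues(observed_scores, permuted_scores)
```

Comparing every ROI against the same distribution of maxima controls the chance of any false positive across all ROIs, at the cost of being conservative for ROIs with weaker null scores.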