Minghui Wang1,2, Noam D Beckmann1,2, Panos Roussos1,2,3,4,5, Erming Wang1,2, Xianxiao Zhou1,2, Qian Wang1,2, Chen Ming1,2, Ryan Neff1,2, Weiping Ma1,2, John F Fullard1,3,4, Mads E Hauberg3,4,6,7,8, Jaroslav Bendl1,3,4, Mette A Peters9, Ben Logsdon9, Pei Wang1,2, Milind Mahajan1,2, Lara M Mangravite9, Eric B Dammer10,11, Duc M Duong10,11, James J Lah12,13, Nicholas T Seyfried10,11,12, Allan I Levey12,13, Joseph D Buxbaum1,3,4,14,15, Michelle Ehrlich16,17, Sam Gandy5,16,18, Pavel Katsel3,5, Vahram Haroutunian3,5,18,15, Eric Schadt1,2, Bin Zhang1,2.
Abstract
Alzheimer's disease (AD) affects half the US population over the age of 85 and is universally fatal following an average course of 10 years of progressive cognitive disability. Genetic and genome-wide association studies (GWAS) have identified about 33 risk factor genes for common, late-onset AD (LOAD), but these risk loci fail to account for the majority of affected cases and can neither provide clinically meaningful prediction of the development of AD nor offer actionable mechanisms. This cohort study generated large-scale matched multi-omics data in AD and control brains for exploring novel molecular underpinnings of AD. Specifically, we generated whole genome sequencing, whole exome sequencing, transcriptome sequencing and proteome profiling data from multiple regions of 364 postmortem control, mild cognitive impairment (MCI) and AD brains with rich clinical and pathophysiological data. All the data went through rigorous quality control. Both the raw and processed data are publicly available through the Synapse software platform.
Year: 2018 PMID: 30204156 PMCID: PMC6132187 DOI: 10.1038/sdata.2018.185
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Demographics for the study population stratified by CDR.
| 0 | 44 | 32/12 | 82.4±9.4 | 2.1±2.9 | 2.1±1.3 |
| 0.5 | 47 | 24/23 | 81.7±11.4 | 3.2±5.2 | 2.3±1.5 |
| 1 | 38 | 26/12 | 84.9±10.6 | 6.3±6.3 | 3±1.8 |
| 2 | 49 | 31/18 | 86.5±7.7 | 9±8.5 | 4.5±1.7 |
| 3 | 78 | 56/22 | 86.4±8.5 | 9.8±7.8 | 4.7±1.5 |
| 4 | 47 | 30/17 | 87.7±7.8 | 10.6±7.7 | 4.6±1.6 |
| 5 | 61 | 39/22 | 82.3±10.8 | 18.6±11.4 | 5.1±1.3 |
Figure 1. Cognitive and neuropathological trait phenotype distribution of the present study population.
(a) Bar chart showing the number of female (F) and male (M) samples stratified by CERAD neuropathological category. (b) Boxplot showing the distribution of mean neuritic plaque density in cortical brain regions for each CERAD neuropathological category. (c, d) Bar charts showing the number of samples with different Braak scores (c) or CDR (d), stratified by CERAD neuropathological category.
Figure 2. Summary of WES variant evaluation metrics.
The distribution of the number of variants, Ti/Tv ratio, alternate heterozygous/homozygous (Het/Hom) ratio and indel (Ins/Del) ratio, in all samples and in the three major ethnic groups of the present population.
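As an illustration of one of these metrics, the Ti/Tv ratio is the number of transitions (A<->G, C<->T) divided by the number of transversions (all other single-base substitutions). The following is a minimal sketch, not the authors' pipeline; the function name `ti_tv_ratio` and the `(ref, alt)` input layout are assumptions made for this example.

```python
# Minimal sketch: Ti/Tv ratio from (ref, alt) allele pairs.
# The function name and input layout are hypothetical, chosen for illustration.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
BASES = set("ACGT")


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip indels, ambiguous bases and non-variant records.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


# Three transitions, one transversion, one deletion (ignored).
example = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(ti_tv_ratio(example))  # 3.0
```

A ratio computed per sample this way can be compared across samples or groups to spot outliers.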
Summary of the whole genome sequencing data.
| Sample number | 349 | 258 | 43 | 40 | 8 |
| Total raw bases (Gb) | 47,234.85 | 35,000.30 | 5,781.90 | 5,344.65 | 1,108.00 |
| Total mapped bases (Gb) | 46,905.02 | 34,759.08 | 5,738.92 | 5,306.18 | 1,100.84 |
| Mean raw bases per individual (Gb) | 135.34 | 135.66 | 134.46 | 133.62 | 138.5 |
| Mean mapped bases per individual (Gb) | 134.4 | 134.73 | 133.46 | 132.65 | 137.61 |
| Mean mapped depth (X) | 42.56 | 42.67 | 42.26 | 42 | 43.58 |
| Breadth of coverage (% of genome) | 91.91 | 91.902 | 92.02 | 91.85 | 91.92 |
| Mean read length | 151 | 151 | 151 | 151 | 151 |
| No. of SNPs | 32,452,033 | 28,279,155 | 17,291,042 | 15,350,134 | 7,854,397 |
| bi-allelic | 31,714,836 | 27,648,823 | 16,979,800 | 15,081,052 | 7,773,748 |
| multi-allelic | 737,197 | 630,332 | 311,242 | 269,082 | 80,649 |
| Mean variant SNP sites per individual | 4,138,872 | 4,126,627 | 4,225,036 | 4,138,235 | 4,053,271 |
| Ti/Tv ratio | 2.08 | 2.08 | 2.08 | 2.08 | 2.08 |
| Indels | 5,339,491 | 4,621,397 | 2,569,428 | 2,225,159 | 936,046 |
| Mean variant Indel sites per individual | 264,136 | 262,706 | 273,735 | 265,023 | 254,236 |
| 3-prime UTR variant | 226,130 | 193,011 | 112,299 | 98,632 | 48,104 |
| 5-prime UTR premature start codon gain variant | 8,201 | 6,917 | 3,785 | 3,303 | 1,508 |
| 5-prime UTR variant | 43,102 | 36,417 | 20,741 | 18,279 | 8,727 |
| initiator codon variant | 26 | 20 | 10 | 13 | 6 |
| intergenic region | 12,968,846 | 11,190,326 | 6,909,837 | 6,158,825 | 3,186,627 |
| intragenic variant | 1,720,234 | 1,479,298 | 905,608 | 803,650 | 408,297 |
| intron variant | 12,273,865 | 10,521,791 | 6,335,143 | 5,592,120 | 2,802,259 |
| missense variant | 151,628 | 127,117 | 65,494 | 58,348 | 26,720 |
| missense variant&splice region variant | 3,525 | 2,919 | 1,467 | 1,286 | 564 |
| non coding exon variant | 140,928 | 123,097 | 77,928 | 69,429 | 37,377 |
| protein protein contact | 811 | 674 | 304 | 252 | 92 |
| splice acceptor variant&intron variant | 1,178 | 990 | 521 | 466 | 246 |
| splice acceptor variant&splice donor variant&intron variant | 15 | 9 | 7 | 5 | 4 |
| splice acceptor variant&splice region variant&intron variant | 11 | 8 | 6 | 3 | 1 |
| splice donor variant&intron variant | 1,622 | 1,383 | 771 | 689 | 355 |
| splice donor variant&splice region variant&intron variant | 9 | 5 | 3 | 2 | 0 |
| splice region variant | 995 | 853 | 505 | 441 | 205 |
| splice region variant&intron variant | 22,085 | 18,973 | 11,190 | 9,943 | 4,915 |
| splice region variant&non coding exon variant | 4,027 | 3,506 | 2,163 | 1,976 | 1,029 |
| splice region variant&stop retained variant | 6 | 5 | 2 | 3 | 1 |
| splice region variant&synonymous variant | 2,729 | 2,342 | 1,314 | 1,173 | 572 |
| start lost | 231 | 197 | 92 | 81 | 33 |
| start lost&splice region variant | 7 | 5 | 2 | 1 | 0 |
| stop gained | 2,595 | 2,165 | 911 | 787 | 340 |
| stop gained&splice region variant | 73 | 57 | 24 | 21 | 7 |
| stop lost | 158 | 137 | 84 | 74 | 46 |
| stop lost&splice region variant | 20 | 19 | 13 | 13 | 8 |
| stop retained variant | 80 | 70 | 47 | 42 | 32 |
| synonymous variant | 115,128 | 99,186 | 57,983 | 51,292 | 25,120 |
| upstream gene variant | 2,991,606 | 2,576,772 | 1,586,065 | 1,410,548 | 733,828 |
| downstream gene variant | 2,395,694 | 2,062,423 | 1,268,135 | 1,127,516 | 583,990 |
Figure 3. Quality check of proteomic data.
(a) Histogram of the number of proteins at each missing-rate level. (b) Scatter plot of missing rate versus mean abundance for each protein. The red points indicate the cross-sectional mean at each missing-rate level, with a yellow regression line. (c, d) Distribution of the individual median values within each batch, for all batches in the raw data (c) and in the processed data (d). (e) Correlation of the duplicated samples. The red box represents the correlation of data imputed with the mixed-effect model; the green box, data imputed with the observed protein median; the blue box, data imputed with the observed protein minimum; the sky-blue box, the raw data.
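The two simpler imputation baselines named in panel (e) can be sketched as follows: each missing value for a protein is filled with that protein's observed median or observed minimum. This is an illustrative sketch only. The function name `impute_protein` and the use of `None` to mark missing values are assumptions, and the mixed-effect model imputation is not reproduced here.

```python
# Minimal sketch of median / minimum imputation for one protein.
# `impute_protein` and the None-for-missing convention are hypothetical.
from statistics import median


def impute_protein(values, strategy="median"):
    """Fill None entries with the observed median or minimum of `values`."""
    if strategy not in ("median", "minimum"):
        raise ValueError("strategy must be 'median' or 'minimum'")
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing observed, nothing to impute from
    fill = median(observed) if strategy == "median" else min(observed)
    return [fill if v is None else v for v in values]


abundance = [1.0, None, 3.0, None]
print(impute_protein(abundance, "median"))   # [1.0, 2.0, 3.0, 2.0]
print(impute_protein(abundance, "minimum"))  # [1.0, 1.0, 3.0, 1.0]
```

Comparing the correlation of duplicated samples under each strategy, as in panel (e), is one way to judge which imputation distorts the data least.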
Figure 4. Overview of the sample alignment across data types.
A robust and efficient pipeline for quality control of multi-omics data from the MSBB AD cohort.
Figure 5. Gender imputation in the WGS data.
Density plot showing the distribution of the PLINK F statistic computed from X-chromosome markers. Color denotes the sex from the sample annotation table. NA, sex information not available.
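A sex call can be derived from the X-chromosome F statistic by thresholding: values near 0 indicate a heterozygous X (female) and values near 1 indicate a hemizygous X (male). The sketch below assumes cutoffs of 0.2 and 0.8, which mirror PLINK's documented `--check-sex` defaults; whether the authors used these values is not stated in the source, and the function name `sex_from_f` is hypothetical.

```python
# Minimal sketch: sex call from the X-chromosome inbreeding coefficient F.
# The 0.2 / 0.8 cutoffs are an assumption (PLINK --check-sex defaults).
def sex_from_f(f_stat, female_max=0.2, male_min=0.8):
    """Return 'F', 'M' or 'ambiguous' for a given F statistic."""
    if f_stat < female_max:
        return "F"
    if f_stat > male_min:
        return "M"
    return "ambiguous"


for f in (0.05, 0.95, 0.5):
    print(f, sex_from_f(f))
```

Samples whose imputed sex disagrees with the annotation table are candidates for mislabeling.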
Figure 6. Gender imputation based on sex-specific marker gene expression in the RNA-seq data.
XIST is a female-specific gene, and RPS4Y1 is a male-specific gene. The color denotes the sex by annotation, while the shape denotes the sex by imputation. Samples highlighted in the ellipse in the top right corner are considered ambiguous for sex imputation.
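The rule described in this caption can be sketched as a two-marker classifier: high XIST with low RPS4Y1 suggests female, the reverse suggests male, and high expression of both (the top right corner) is ambiguous. The threshold value, the expression units, the function name `impute_sex_from_expression`, and the treatment of samples with low expression of both genes as ambiguous are all assumptions for illustration, not details from the source.

```python
# Minimal sketch: sex imputation from XIST and RPS4Y1 expression.
# The threshold and the handling of the low/low case are hypothetical.
def impute_sex_from_expression(xist, rps4y1, threshold=1.0):
    """Classify a sample as 'F', 'M' or 'ambiguous' from two marker genes."""
    xist_high = xist >= threshold
    rps4y1_high = rps4y1 >= threshold
    if xist_high and not rps4y1_high:
        return "F"
    if rps4y1_high and not xist_high:
        return "M"
    # Both high (top right corner) or both low: no confident call.
    return "ambiguous"


print(impute_sex_from_expression(xist=6.2, rps4y1=0.1))  # F
print(impute_sex_from_expression(xist=0.0, rps4y1=5.4))  # M
print(impute_sex_from_expression(xist=5.8, rps4y1=5.1))  # ambiguous
```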
Figure 7. Principal component analysis suggests race-mislabeled samples in the WGS data.
Scatter plot showing samples classified by the top principal components (PCs). Color denotes the race information from the sample annotation table.
Figure 8. Identification of mislabeled samples via genetic similarity testing across different sequencing data types.
(a) Sample diagram showing two mislabeled samples. The green lines link sample pairs inferred to be duplicates or monozygotic (MZ) twins but associated with different brains according to the sample annotation. RNA-seq samples hB_RNA_10452 and hB_RNA_10392 were considered mislabeled with respect to their brain source. (b) Network illustrating the sample relationships in a subset of 6 brains. Each node denotes a sample and is labeled with the brain ID from the sample annotation. Each link connects a duplicate or MZ twin pair. Light green denotes mislabeled samples whose brain source can be re-mapped; red denotes a sample without a genetically related pair, which is therefore recommended for exclusion.
Figure 9. Distribution of kinship coefficients estimated between sequencing data types for samples within and between brains.
(a) Distribution of kinship coefficients before sample QC. (b) Distribution of kinship coefficients after sample QC. The density has been scaled to a range between 0 and 1. Highly related samples (duplicates or monozygotic twins) have estimated kinship coefficients close to 0.5, while unrelated samples have estimated kinship coefficients close to 0.
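The QC logic implied here can be sketched as a consistency check: samples annotated to the same brain should look like duplicates (kinship near 0.5), and samples from different brains should look unrelated (kinship near 0). The sketch below flags pairs that violate this expectation. The 0.354 cutoff is an assumption borrowed from the duplicate/MZ-twin bound commonly used with KING-style kinship estimates; the source states only that such pairs lie close to 0.5. The function name, tuple layout and sample IDs are hypothetical.

```python
# Minimal sketch: flag sample pairs whose kinship disagrees with annotation.
# The cutoff (0.354) and all identifiers below are assumptions for illustration.
def flag_inconsistent_pairs(pairs, cutoff=0.354):
    """pairs: iterable of (sample_a, brain_a, sample_b, brain_b, kinship)."""
    flagged = []
    for sample_a, brain_a, sample_b, brain_b, kinship in pairs:
        same_genome = kinship > cutoff   # looks like duplicate / MZ twin
        same_brain = brain_a == brain_b  # according to the annotation
        if same_genome and not same_brain:
            flagged.append((sample_a, sample_b, "unexpected match"))
        elif same_brain and not same_genome:
            flagged.append((sample_a, sample_b, "unexpected mismatch"))
    return flagged


pairs = [
    ("wgs_1", "brain_A", "rna_1", "brain_A", 0.49),  # consistent
    ("wgs_2", "brain_B", "rna_2", "brain_C", 0.48),  # match across brains
    ("wgs_3", "brain_D", "rna_3", "brain_D", 0.01),  # mismatch within brain
    ("wgs_1", "brain_A", "rna_3", "brain_D", 0.00),  # consistent
]
for row in flag_inconsistent_pairs(pairs):
    print(row)
```

After mislabeled samples are re-mapped or excluded, the within-brain distribution should concentrate near 0.5 and the between-brain distribution near 0, as in panel (b).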