Bojian Yin1, Marleen Balvert1,2, Rick A A van der Spek3, Bas E Dutilh2, Sander Bohté1, Jan Veldink3, Alexander Schönhuth1,2.
Abstract
MOTIVATION: Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease caused by aberrations in the genome. While several disease-causing variants have been identified, a major part of heritability remains unexplained. ALS is believed to have a complex genetic basis in which non-additive combinations of variants constitute disease; such combinations cannot be picked up by the linear models employed in classical genotype-phenotype association studies. Deep learning, on the other hand, is highly promising for identifying such complex relations. We therefore developed a deep-learning-based approach for the classification of ALS patients versus healthy individuals from the Dutch cohort of the Project MinE dataset. Based on the recent insight that regulatory regions harbor the majority of disease-associated variants, we employ a two-step approach: first, promoter regions that are likely associated with ALS are identified; second, individuals are classified based on their genotype in the selected genomic regions. Both steps employ a deep convolutional neural network. The network architecture accounts for the structure of genome data by applying convolution only to parts of the data where this makes sense from a genomics perspective.
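The two-step idea can be illustrated with a minimal sketch. It is not the authors' implementation: the ranking criterion (keep the `k` regions with the highest per-region accuracy), the region names, the accuracy values and the 0/1/2 genotype coding are all assumptions made for illustration.

```python
def select_regions(region_accuracy, k):
    """Step 1 (assumed criterion): keep the k regions with the highest accuracy."""
    ranked = sorted(region_accuracy, key=region_accuracy.get, reverse=True)
    return ranked[:k]


def build_features(genotypes, selected):
    """Step 2 input: concatenate one individual's genotype vectors
    over the selected regions, in selection order."""
    features = []
    for region in selected:
        features.extend(genotypes[region])
    return features


# Hypothetical per-region accuracies from the first step.
region_accuracy = {"region_a": 0.58, "region_b": 0.51, "region_c": 0.56}
selected = select_regions(region_accuracy, k=2)

# Hypothetical genotype vectors of one individual (0/1/2 coding is an assumption).
genotypes = {
    "region_a": [0, 1, 2],
    "region_b": [1, 1, 0],
    "region_c": [2, 0, 0],
}
features = build_features(genotypes, selected)
print(selected, features)
```

The concatenated vector would then be passed to the second-stage classifier.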
Year: 2019 PMID: 31510706 PMCID: PMC6612814 DOI: 10.1093/bioinformatics/btz369
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. An overview of the workflow. CV, cross validation; acc, accuracy
Network architecture of the classifier with input data from a single promoter region
| Layer type | Description | Output shape |
|---|---|---|
| Input | | (64, 1) |
| Convolution, BN and Act | 1 × 1 filter, 4 output channels | (64, 4) |
| Convolution, BN and Act | 4 × 4 filter, 32 output channels | (61, 32) |
| Reshape | Flatten | (1952, 1) |
| Dense, BN and Act | | (148, 1) |
| Dense, BN and Act | | (16, 1) |
| Output | Softmax | (2, 1) |
Note: The output shape is given as (width, channels). BN, batch normalization; Act, softplus activation.
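The output shapes in the table can be checked with a little arithmetic. The sketch below assumes unpadded convolutions with stride 1, which is consistent with the width dropping from 64 to 61 under a width-4 filter; the function and layer names are illustrative.

```python
def conv_output_width(width, kernel_width, stride=1):
    """Width after an unpadded 1-D convolution (assumed: no padding, stride 1)."""
    return (width - kernel_width) // stride + 1


shapes = [("Input", (64, 1))]

# First convolution: filter of width 1, 4 output channels.
w = conv_output_width(64, kernel_width=1)
shapes.append(("Convolution 1", (w, 4)))

# Second convolution: filter of width 4 across the 4 channels, 32 output channels.
w = conv_output_width(w, kernel_width=4)
shapes.append(("Convolution 2", (w, 32)))

# Flatten: width times channels gives the length of the dense-layer input.
shapes.append(("Flatten", (w * 32, 1)))

for name, shape in shapes:
    print(name, shape)
```

The flattened length 61 × 32 = 1952 matches the Reshape row.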
Fig. 2. An overview of the network, where ‘GAP’ is global average pooling
Fig. 3. Change of tensor shape throughout Block 1 and Reshape
Fig. 4. Histograms of per-promoter region test accuracies for each chromosome
Promoter regions selected by the deep neural network for individual promoters
| Chr | Positions (range) | Gene | Acc | Prec | Recall | Acc LR |
|---|---|---|---|---|---|---|
| 7 | 108000837–108023643 | LAMB4 | 0.582 | 0.679 | 0.313 | 0.548 |
| | 60061–93116 | LOC105375113 | 0.579 | 0.728 | 0.252 | 0.544 |
| | 108001655–108023688 | LAMB4 | 0.576 | 0.668 | 0.304 | 0.548 |
| | 157112225–157119549 | LOC105375607 | 0.576 | 0.745 | 0.230 | 0.506 |
| | 142254448–142274025 | TRY2P | 0.567 | 0.717 | 0.222 | 0.545 |
| | 72317552–72663411 | TYW1B | 0.566 | 0.672 | 0.259 | 0.507 |
| | 68920–94119 | LOC101929756 | 0.566 | 0.672 | 0.259 | 0.538 |
| | 76486174–76514143 | DTX2 | 0.562 | 0.719 | 0.210 | 0.522 |
| 9 | 136334910–136355183 | GPSM1 | 0.615 | 0.694 | 0.412 | 0.577 |
| | 136336933–136357677 | GPSM1 | 0.611 | 0.693 | 0.398 | 0.571 |
| | 136361582–136378662 | SNAPC4 | 0.609 | 0.635 | 0.514 | 0.578 |
| | 136370707–136389390 | SNAPC4 | 0.603 | 0.677 | 0.395 | 0.565 |
| | 134626387–134643306 | COL5A1 | 0.600 | 0.710 | 0.338 | 0.5621 |
| | 98535635–98557461 | LOC105375972 | 0.595 | 0.753 | 0.282 | 0.561 |
| | 136324634–136347520 | GPSM1 | 0.581 | 0.653 | 0.345 | 0.540 |
| | 136382485–136403936 | SDCCAG3 | 0.581 | 0.673 | 0.314 | 0.537 |
| 17 | 67877749–67891928 | BPTF | 0.592 | 0.793 | 0.250 | 0.555 |
| | 15614822–15661462 | TRIM16 | 0.591 | 0.793 | 0.250 | 0.561 |
| | 138726–158754 | DOC2B | 0.582 | 0.603 | 0.477 | 0.512 |
| | 139747–159851 | DOC2B | 0.579 | 0.593 | 0.506 | 0.510 |
| | 141150–160523 | DOC2B | 0.577 | 0.599 | 0.464 | 0.0.51 |
| | 55245746–55267188 | HLF | 0.577 | 0.747 | 0.234 | 0.578 |
| | 55247456–55268383 | HLF | 0.577 | 0.745 | 0.230 | 0.564 |
| | 139853–160092 | DOC2B | 0.573 | 0.601 | 0.434 | 0.505 |
| 22 | 19403582–19439165 | HIRA | 0.575 | 0.694 | 0.267 | 0.560 |
| | 19404552–19439540 | HIRA | 0.567 | 0.694 | 0.267 | 0.558 |
| | 17230116–17293295 | CECR1 | 0.529 | 0.627 | 0.141 | 0.508 |
| | 19685895–19718733 | LINC00895 | 0.520 | 0.554 | 0.207 | 0.500 |
| | 17747340–17783351 | BID | 0.518 | 0.598 | 0.113 | 0.511 |
| | 17760620–17788896 | MIR3198-1 | 0.517 | 0.589 | 0.111 | 0.500 |
| | 19149580–19162696 | GSC2 | 0.517 | 0.647 | 0.074 | 0.500 |
| | 17541582–17566229 | CECR2 | 0.515 | 0.725 | 0.049 | 0.502 |
Note: Accuracy (Acc), precision (Prec) and recall obtained with Promoter-CNN are reported. Additionally, the accuracy for this promoter region obtained with logistic regression is reported (Acc LR).
Reported as an ALS-associated gene (http://alsod.iop.kcl.ac.uk/) (Abel ).
Reported to be associated with ALS and other neurodegenerative disorders (https://www.wikigenes.org/e/gene/e/2186.html).
Classification results obtained with four classification methods applied to chromosomes 7, 9, 17 and 22 independently and combined
| Classifier | Chr | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| Logistic Regression | 7 | 0.625 | 0.642 | 0.566 | 0.602 |
| | 9 | 0.546 | 0.575 | 0.355 | 0.439 |
| | 17 | | 0.670 | 0.539 | 0.598 |
| | 22 | 0.590 | 0.619 | | |
| Promoter-CNN + ALS-Net | 7 | | 0.667 | | |
| | 9 | | 0.698 | | |
| | 17 | | 0.725 | | |
| | 22 | 0.617 | | 0.410 | 0.517 |
| | All | | 0.711 | | |
| Promoter-CNN + Logistic Regression | 7 | 0.635 | 0.728 | 0.445 | 0.553 |
| | 9 | 0.683 | 0.743 | 0.560 | 0.685 |
| | 17 | 0.642 | 0.734 | 0.445 | 0.554 |
| | 22 | 0.580 | 0.714 | 0.299 | 0.422 |
| | All | 0.739 | 0.759 | 0.699 | 0.728 |
| Promoter-CNN + SVM | 7 | 0.550 | 0.750 | 0.151 | 0.252 |
| | 9 | 0.598 | | 0.266 | 0.397 |
| | 17 | 0.577 | | 0.212 | 0.334 |
| | 22 | 0.521 | 0.743 | 0.267 | 0.393 |
| | All | 0.725 | 0.783 | 0.624 | 0.694 |
| Promoter-CNN + Random Forest | 7 | 0.562 | | 0.175 | 0.285 |
| | 9 | 0.579 | 0.759 | 0.229 | 0.351 |
| | 17 | 0.645 | 0.762 | 0.420 | 0.542 |
| | 22 | 0.587 | | 0.265 | 0.391 |
| | All | 0.596 | | 0.249 | 0.381 |
| Promoter-CNN + AdaBoost | 7 | 0.604 | 0.642 | 0.467 | 0.541 |
| | 9 | 0.621 | 0.668 | 0.481 | 0.559 |
| | 17 | 0.599 | 0.633 | 0.472 | 0.401 |
| | 22 | 0.561 | 0.591 | 0.398 | 0.475 |
| | All | 0.661 | 0.700 | 0.565 | 0.625 |
Note: The result of the best-performing model for the given (set of) chromosome(s) is denoted in italic, while the overall best score is indicated in bold. Chr, chromosome.
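The F1-Score column is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a small sketch, the standard formula reproduces two of the reported rows from their precision and recall; the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Logistic Regression, chromosome 7: precision 0.642, recall 0.566.
f1_lr_chr7 = round(f1_score(0.642, 0.566), 3)

# Promoter-CNN + Logistic Regression, all chromosomes: precision 0.759, recall 0.699.
f1_cnn_lr_all = round(f1_score(0.759, 0.699), 3)

print(f1_lr_chr7, f1_cnn_lr_all)
```

Both values agree with the table (0.602 and 0.728).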
Fig. 5. Saliency maps averaged over (a) 100 randomly selected ALS patients and (b) 100 randomly selected healthy controls
Training accuracy, precision and recall for each cohort, obtained with Promoter-CNN + Logistic regression and Promoter-CNN + ALS-Net
| Classifier | Batch | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Promoter-CNN + ALS-Net | C1 | 0.648 | 0.510 | 0.793 |
| | C3 | 0.711 | 0.829 | 0.757 |
| | C5 | 0.934 | 0.000 | N/A |
| | C44 | 0.760 | 0.753 | 0.967 |
| Promoter-CNN + Logistic Regression | C1 | 0.626 | 0.480 | 0.373 |
| | C3 | 0.434 | 0.766 | 0.313 |
| | C5 | 0.990 | 0.000 | N/A |
| | C44 | 0.657 | 0.740 | 0.768 |
Note: In C5 there are no cases, implying that TP and FN counts are zero, which renders precision (=0) and recall (=undefined) statistics meaningless in the frame of a comparison.
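The degenerate case described in the note can be made concrete with a short sketch that computes the three statistics from confusion counts and returns `None` when a denominator is zero. The counts below are hypothetical and chosen only to illustrate a cohort without cases; they are not taken from the study.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or None when the ratio is undefined."""
    return numerator / denominator if denominator else None


def cohort_metrics(tp, fp, tn, fn):
    """Accuracy, precision and recall from confusion-matrix counts."""
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "precision": safe_ratio(tp, tp + fp),
        "recall": safe_ratio(tp, tp + fn),
    }


# Hypothetical control-only cohort: no cases, so TP = FN = 0.
metrics = cohort_metrics(tp=0, fp=1, tn=99, fn=0)
print(metrics)
```

With at least one false positive, precision is 0 while recall is 0/0 and therefore undefined, so only accuracy is informative for such a cohort.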