Peng Jiang1, Yaofei Hu1, Yiqi Wang1, Jin Zhang1, Qinghong Zhu1, Lin Bai2, Qiang Tong1, Tao Li1, Liang Zhao1,2.
Abstract
Ventricular septal defect (VSD) is a fatal congenital heart disease with severe consequences for affected infants. Early diagnosis plays an important role, particularly through genetic variants. Existing panel-based approaches to variant mining suffer from a shortage of large panels, costly sequencing, and missed rare variants. Although trio-based methods alleviate these limitations to some extent, they are agnostic to novel mutations and computationally intensive. Considering these limitations, we study a novel algorithm for mining variants from trio-based sequencing data and apply it to a VSD trio to identify associated mutations. Our approach starts by filtering irrelevant k-mers from the sequences of a trio via a newly conceived coupled Bloom filter, then corrects sequencing errors using a statistical approach and extends the retained k-mers into long sequences. These extended sequences are used as input for variant calling. The obtained variants are then comprehensively analyzed against existing databases to mine VSD-related mutations. Experiments show that our trio-based algorithm narrows down candidate coding genes and lncRNAs by about 10-fold and 5-fold, respectively, compared with single-sequence-based approaches. Meanwhile, our algorithm is 10 times faster and uses two orders of magnitude less memory than the existing state-of-the-art approach. By applying our approach to a VSD trio, we identify an unreported gene (CD80), a combination of two genes (MYBPC3 and TRDN), and a lncRNA (NONHSAT096266.2), which are highly likely to be VSD-related.
Keywords: association study; k-mer filtering; long non-coding RNA; trio-sequencing; variant calling; ventricular septal defect
Year: 2019 PMID: 31440271 PMCID: PMC6694746 DOI: 10.3389/fgene.2019.00670
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Figure 1. Variant-containing coding genes obtained from the trio that are associated with cardiovascular and neurodegenerative diseases. Panel (A) shows the 18 genes attached to the two categories [generated using the STRING database (Szklarczyk et al., 2017)], panel (B) presents the connections between CD80- and CHD-related genes, and panel (C) illustrates the 3D structure of the mutated CD80 (PDB ID: 1I8L) discovered in this study. The 18 genes were retrieved with DAVID from the OMIM and GAD databases; genes identified by OMIM are shaded by a polygon. Note that all the genes identified by OMIM have also been confirmed by GAD.
Details of the 16 variant-containing coding genes identified from the trio.
| Disease | Gene | Chr. | Pos. | Reference | Alt. | Variant (a) | Transcripts (b) | Coverage (c) | MAF (d) |
|---|---|---|---|---|---|---|---|---|---|
| | DYSF | 2 | 71665193 | C | T | MM | 14 | 36 | 0.52 |
| | MUC16 | 19 | 8888863 | T | C | MM | 4 | 55 | 0.52 |
| | P2RX6 | 22 | 21023596 | C | T | NM | 4 | 33 | 0.54 |
| | ZNF618 | 9 | 114050108 | C | T | MM | 4 | 35 | 0.57 |
| | CD80 | 3 | 119537362 | T | - | FSD | 3 | 39 | 0.56 |
| | CNOT2 | 12 | 70353914 | A | - | IFD | 3 | 15 | 0.40 |
| Cardiovascular | ATXN1 | 6 | 16327634 | TGCTGC | - | IFD | 2 | 16 | 0.31 |
| | RASA1 | 5 | 87383769 | A | G | MM | 2 | 21 | 0.57 |
| | MDFIC | 7 | 114922989 | C | G | MM | 2 | 18 | 0.66 |
| | CSGALNACT1 | 8 | 19458429 | A | T | MM | 2 | 27 | 0.62 |
| | PRDM7 | 16 | 90058384 | G | - | FSD | 1 | 37 | 0.64 |
| | MICALCL | 11 | 12294797 | - | CTCCTC | IFI | 1 | 11 | 0.36 |
| | GJB2 | 13 | 20189347 | G | - | FSD | 1 | 33 | 0.39 |
| | KRT35 | 17 | 41477614 | C | T | MM | 1 | 31 | 0.41 |
| Neurodegenerative | PNPLA6 | 19 | 7556658 | C | A | MM | 5 | 26 | 0.46 |
| | EPB41L1 | 20 | 36209768 | C | T | MM | 4 | 35 | 0.62 |
| | SYN2 | 3 | 12004751 | - | GCCCGCGCCGCA | IFI | 2 | 6 | 0.33 |
| | ATXN1 | 6 | 16327634 | TGCTGC | - | IFD | 2 | 20 | 0.31 |
| | ERO1B | 1 | 236235819 | G | A | MM | 1 | 21 | 0.52 |

(a) Variant classes include MM (missense mutation), NM (nonsense mutation), FSD (frameshift deletion), IFD (in-frame deletion), and IFI (in-frame insertion); (b) number of affected transcripts; (c) read coverage; (d) mutant allele frequency.
CD80 interacting genes in GO and OMIM that are associated with cardiovascular diseases.
| Source | Interacting genes |
|---|---|
| | PIK3CB, TGFB1 |
| CD80-GO | PTEN, IL8, CXCL10, IL10, CCL2, CCR2, THY1, ERBB2 |
| | PIK3CG, TLR3, IL1B, CD34, ANPEP, STAT3, FASLG, VEGFA |
| | JUN, IL18, NRP1, IL6, STAT1, CD40, CXCR3 |
| CD80-OMIM | AKT2, ICAM1, ITIH4, PIK3CG, CD36, CD40LG, NRP1, IL10 |
| | IL6, CD40, SCARB1, INS, IFNA1, LMNA, IL4, VEGFA |
| | IL18, PTEN |
Genes in italics have experimentally determined interactions with CD80, whereas the rest are computationally predicted.
The top 10 lncRNAs obtained from the trio that are potentially related to VSD.
| NONCODE ID | Chr. | Pos. | Ref. | Alt. | FPKM | Dis.(bp) |
|---|---|---|---|---|---|---|
| NONHSAT096266.2 | 4 | 47846448 | C | T | 13.97 | 1000 |
| NONHSAT232531.1 | 12 | 131923913 | G | A | 2.08 | 1000 |
| NONHSAT001273.2 | 1 | 19074132 | T | C | 1.99 | 1000 |
| NONHSAT180457.1 | 19 | 44259481 | T | C | 1.54 | 1000 |
| NONHSAT235401.1 | 15 | 66701900 | G | A | 1.49 | 1000 |
| NONHSAT244710.1 | 22 | 19179167 | A | G | 1.15 | 1000 |
| NONHSAT010771.2 | 1 | 247332298 | AC | A | 0.89 | 1000 |
| NONHSAT022678.2 | 11 | 71452973 | T | C | 0.37 | 1000 |
| NONHSAT229850.1 | 11 | 64779243 | GAAAAAA | G | 0.32 | 1000 |
| NONHSAT246250.1 | 3 | 9396926 | G | A | 0.22 | 1000 |
FPKM: fragments per kilobase of exon per million reads mapped (Mortazavi et al., 2008).
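For reference, the FPKM definition above is conventionally written as the formula below. The symbols are introduced here for illustration and do not come from the source: $C$ is the number of fragments mapped to the transcript, $N$ is the total number of mapped fragments in the sample, and $L$ is the exon length of the transcript in base pairs.

$$
\mathrm{FPKM} = \frac{10^{9} \cdot C}{N \cdot L}
$$

The factor $10^{9}$ combines the per-kilobase ($10^{3}$) and per-million ($10^{6}$) normalizations.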
Figure 2. Variant profile of the trio. Panels (A) and (C) show the distribution of the k-mer count ratio between the parents and the child, panel (B) shows the ratio distribution of the k-mers with a count of less than 20, and panel (D) shows the overall distribution of ratio against count. Note that the ratio distribution in panel (B) is grouped into bins, viz., r = 0 (denoted as “0”), 0 < r ≤ 0.1 (denoted as “0.1”), 0.1 < r ≤ 0.2 (denoted as “0.2”), and 0.2 < r ≤ 0.3 (denoted as “0.3”), where r is the ratio and the count is on a log scale.
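The binning described for panel (B) can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes that r is the parental count of a k-mer divided by its count in the child, and that the "count less than 20" cutoff applies to the child count; the caption does not state either. The function names, the `max_count` parameter, and the extra `">0.3"` bin are hypothetical.

```python
from collections import Counter


def ratio_bin(r):
    """Map a ratio r to the bin labels used in the caption ("0", "0.1", "0.2", "0.3")."""
    if r == 0:
        return "0"
    for upper in (0.1, 0.2, 0.3):
        if r <= upper:
            return str(upper)
    # Catch-all bin for larger ratios (an assumption; the caption lists no such bin).
    return ">0.3"


def bin_ratios(parent_counts, child_counts, max_count=20):
    """Count low-abundance child k-mers per ratio bin.

    parent_counts and child_counts map k-mer strings to occurrence counts.
    Only k-mers whose child count is positive and below max_count are binned.
    """
    bins = Counter()
    for kmer, child_count in child_counts.items():
        if child_count <= 0 or child_count >= max_count:
            continue
        r = parent_counts.get(kmer, 0) / child_count
        bins[ratio_bin(r)] += 1
    return bins
```

Under these assumptions, a k-mer seen 10 times in the child and never in the parents falls into bin "0", which is where child-specific k-mers would accumulate.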
k-mer filtering.
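The body of the k-mer filtering algorithm is not present in this record. The sketch below only illustrates the general idea of trio-based k-mer filtering with a Bloom filter. It makes several assumptions: "irrelevant" k-mers are taken to be child k-mers that also occur in either parent, a plain Bloom filter stands in for the paper's coupled Bloom filter (whose design is not described here), and reverse complements are ignored. All names and parameter values are hypothetical.

```python
import hashlib


class BloomFilter:
    """Plain Bloom filter over strings (not the paper's coupled variant)."""

    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted BLAKE2b digests.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"), digest_size=8, salt=bytes([seed])
            ).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(
            self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item)
        )


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def filter_child_kmers(child_reads, father_reads, mother_reads, k=5):
    """Return child k-mers that the parental Bloom filter does not report.

    A Bloom filter has no false negatives, so a k-mer present in a parent is
    never kept. False positives may discard a few child-specific k-mers.
    """
    parents = BloomFilter()
    for read in list(father_reads) + list(mother_reads):
        for kmer in kmers(read, k):
            parents.add(kmer)
    kept = set()
    for read in child_reads:
        for kmer in kmers(read, k):
            if kmer not in parents:
                kept.add(kmer)
    return kept
```

With a child read that differs from both parents at a single base, only the k-mers overlapping that base should survive the filter, which is what makes the later extension step cheap.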
k-mer extension.
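The body of the k-mer extension algorithm is likewise missing from this record. As a rough illustration of extending retained k-mers into longer sequences, the sketch below uses a simple greedy rule: starting from a seed, it appends or prepends one base at a time as long as exactly one unused k-mer overlaps the current end by k-1 bases. This rule is an assumption chosen for clarity, not the authors' procedure, and the function names are hypothetical. It requires k ≥ 2.

```python
def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def extend_kmers(kmer_set, seed):
    """Greedily extend seed in both directions while the next base is unambiguous."""
    k = len(seed)
    used = {seed}
    contig = seed

    # Extend to the right: look for k-mers starting with the last k-1 bases.
    while True:
        suffix = contig[-(k - 1):]
        candidates = [
            suffix + base
            for base in "ACGT"
            if suffix + base in kmer_set and suffix + base not in used
        ]
        if len(candidates) != 1:
            break  # dead end or branch: stop extending
        used.add(candidates[0])
        contig += candidates[0][-1]

    # Extend to the left: look for k-mers ending with the first k-1 bases.
    while True:
        prefix = contig[:k - 1]
        candidates = [
            base + prefix
            for base in "ACGT"
            if base + prefix in kmer_set and base + prefix not in used
        ]
        if len(candidates) != 1:
            break
        used.add(candidates[0])
        contig = candidates[0][0] + contig

    return contig
```

For example, `extend_kmers(set(kmers("ACGTTGCA", 4)), "GTTG")` rebuilds the full sequence `"ACGTTGCA"` from its five 4-mers. Stopping at branches keeps the sketch conservative; a real implementation would also need to handle sequencing errors and reverse complements.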