Hongxu Ding, Ioannis Anastopoulos, Andrew D Bailey, Joshua Stuart, Benedict Paten.
Abstract
The characteristic ionic currents of nucleotide kmers are commonly used in analyzing nanopore sequencing readouts. We present a graph convolutional network-based deep learning framework for predicting kmer characteristic ionic currents from corresponding chemical structures. We show such a framework can generalize the chemical information of the 5-methyl group from thymine to cytosine by correctly predicting 5-methylcytosine-containing DNA 6mers, thus shedding light on the de novo detection of nucleotide modifications.
Year: 2021 PMID: 34764310 PMCID: PMC8586022 DOI: 10.1038/s41467-021-26929-x
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 Predicting kmer characteristic ionic currents from chemical structures.
A Graphic overview of the proposed deep learning framework for DNA analysis. B Goodness of fit of the DNA canonical random downsample, base-dropout, position-dropout, and model combination analyses. Specifically, “downsample” denotes the random dropout experiment, where we create random train-test splits. “Base” denotes the base-dropout experiment, where we drop DNA 6mers that contain a specific base in any position during training. “Position” denotes the positional base-dropout experiment, where we drop DNA 6mers that contain a specific base in a given position during training. For “combine,” we drop DNA 6mers that contain both of the specified bases during training. C Goodness of fit of the 5mC-containing DNA 6mer imputation analysis. D Goodness of fit of de novo 5mC-containing DNA 6mer prediction. C and 5mC refer to the goodness of fit of canonical DNA 6mers and 5mC-containing DNA 6mers, respectively. In B–D, Train (red) and Test (blue) refer to the goodness of fit of the training and test DNA 6mers, respectively. E Predictive accuracy of C/5mC status, quantified by balanced accuracy. Nanopolish: predictive analysis with the nanopolish model as the baseline control. De Novo: predictive analysis with the 5mC-containing DNA 6mer models described in (D), which were predicted from canonical training. 0.01–0.9: predictive analysis with the different imputation 5mC-containing DNA 6mer models described in (C). FAB39088 (cyan) and FAF01164 (purple) refer to two independent NA12878 cell line native genomic DNA nanopore sequencing datasets. Throughout (B–E), the boxplots show the median, minimum/maximum (excluding outliers), and first/third quartile values.
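The train-test splits named in panel B can be sketched in a few lines of Python. This is only an illustration of the definitions in the caption, not the authors' code: the function names, the 0-based position index, and the `train_fraction` and `seed` parameters are assumptions introduced here.

```python
import random
from itertools import product

BASES = "ACGT"
# All 4**6 = 4096 canonical DNA 6mers.
ALL_6MERS = ["".join(p) for p in product(BASES, repeat=6)]


def random_downsample(kmers, train_fraction, seed=0):
    """'downsample': random train-test split (fraction and seed are assumed knobs)."""
    shuffled = list(kmers)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


def base_dropout(kmers, base):
    """'base': hold out every kmer containing `base` at any position."""
    train = [k for k in kmers if base not in k]
    test = [k for k in kmers if base in k]
    return train, test


def position_dropout(kmers, base, position):
    """'position': hold out every kmer with `base` at the 0-based `position`."""
    train = [k for k in kmers if k[position] != base]
    test = [k for k in kmers if k[position] == base]
    return train, test


def combine_dropout(kmers, base_a, base_b):
    """'combine': hold out every kmer containing both specified bases."""
    test = [k for k in kmers if base_a in k and base_b in k]
    train = [k for k in kmers if not (base_a in k and base_b in k)]
    return train, test


train, test = base_dropout(ALL_6MERS, "A")
print(len(train), len(test))
```

Under these definitions, base dropout is the most severe split: the model never sees the held-out base during training, so only 3^6 of the 4096 6mers remain available for fitting.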
Fig. 2 Visualizing the encoding of chemical structures.
A–C Atom similarity matrix, tSNE visualization, and chemical structure of the example canonical DNA 6mer CGACGT. In A, B, atoms were numbered and colored based on the chemical structure in (C). Carbon, nitrogen, oxygen, and phosphorus were colored as black, blue, red, and orange, respectively. Specifically, in A, nucleobases were highlighted by dashed boxes. D–F Atom similarity matrix, tSNE visualization, and chemical structure of the example 5mC-containing DNA 6mer GT(5mC)AGA. In D, E, atoms were numbered and colored based on the chemical structure in (F). Carbon, nitrogen, oxygen, and phosphorus were colored as black, blue, red, and orange, respectively. Specifically, in D, E, methyl group carbon atoms (#38 in T and #58 in 5mC) were highlighted.
SMILES strings of individual nucleotides.
| Nucleotide | SMILES string |
|---|---|
| A (DNA) | OP(=O)(O)OCC1OC(N3C=NC2=C(N)N=CN=C23)CC1 |
| T (DNA) | OP(=O)(O)OCC1OC(N2C(=O)NC(=O)C(C)=C2)CC1 |
| C (DNA) | OP(=O)(O)OCC1OC(N2C(=O)N=C(N)C=C2)CC1 |
| G (DNA) | OP(=O)(O)OCC1OC(N2C=NC3=C2N=C(N)NC3=O)CC1 |
| 5mC (DNA) | OP(=O)(O)OCC1OC(N2C(=O)N=C(N)C(C)=C2)CC1 |
| 6mA (DNA) | OP(=O)(O)OCC1OC(N3C=NC2=C(NC)N=CN=C23)CC1 |
| A (RNA) | OP(=O)(O)OCC1OC(N3C=NC2=C(N)N=CN=C23)C(O)C1 |
| U (RNA) | OP(=O)(O)OCC1OC(N2C(=O)NC(=O)C=C2)C(O)C1 |
| C (RNA) | OP(=O)(O)OCC1OC(N2C(=O)N=C(N)C=C2)C(O)C1 |
| G (RNA) | OP(=O)(O)OCC1OC(N2C=NC3=C2N=C(N)NC3=O)C(O)C1 |
| 6mA (RNA) | OP(=O)(O)OCC1OC(N3C=NC2=C(NC)N=CN=C23)C(O)C1 |
| 2mG (RNA) | OP(=O)(O)OCC1OC(N2C=NC3=C2N=C(NC)NC3=O)C(O)C1 |
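The DNA entries in the table can be combined into a kmer-level SMILES string. The sketch below assumes that a kmer is formed by plain 5'-to-3' concatenation, so that the last ring carbon of one nucleotide bonds to the leading phosphate oxygen of the next; that joining rule is an assumption made here for illustration, not a procedure stated in the text. The dictionary and function names are likewise hypothetical.

```python
# SMILES strings copied from the table above (DNA entries only).
NUCLEOTIDE_SMILES = {
    "A": "OP(=O)(O)OCC1OC(N3C=NC2=C(N)N=CN=C23)CC1",
    "T": "OP(=O)(O)OCC1OC(N2C(=O)NC(=O)C(C)=C2)CC1",
    "C": "OP(=O)(O)OCC1OC(N2C(=O)N=C(N)C=C2)CC1",
    "G": "OP(=O)(O)OCC1OC(N2C=NC3=C2N=C(N)NC3=O)CC1",
    "5mC": "OP(=O)(O)OCC1OC(N2C(=O)N=C(N)C(C)=C2)CC1",
}

# Substring that carries the 5-methyl group in both T and 5mC.
METHYL_MOTIF = "C(C)=C2"


def kmer_smiles(nucleotides):
    """Join nucleotide SMILES 5'->3' by concatenation (assumed joining rule)."""
    return "".join(NUCLEOTIDE_SMILES[n] for n in nucleotides)


# The 5mC-containing example 6mer from Fig. 2, GT(5mC)AGA.
example = kmer_smiles(["G", "T", "5mC", "A", "G", "A"])
print(example)

# Removing the methyl branch from 5mC recovers the canonical C string.
demethylated = NUCLEOTIDE_SMILES["5mC"].replace(METHYL_MOTIF, "C=C2")
print(demethylated == NUCLEOTIDE_SMILES["C"])
```

The T and 5mC strings share the same `C(C)=C2` fragment, which is the 5-methyl group whose chemical information the abstract describes as generalizing from thymine to cytosine.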
Atom chemical properties included in the study.
| Feature | Description |
|---|---|
| Carbon | 1 if the atom is carbon, 0 otherwise (boolean) |
| Nitrogen | 1 if the atom is nitrogen, 0 otherwise (boolean) |
| Oxygen | 1 if the atom is oxygen, 0 otherwise (boolean) |
| Phosphorus | 1 if the atom is phosphorus, 0 otherwise (boolean) |
| Atom degree | Total number of covalent bonds around an atom (integer) |
| Implicit valence | Valence of the atom minus the valence calculated from its bond connections (integer) |
| Number of hydrogens | Total count of hydrogens (integer) |
| Aromaticity | 1 if atom in an aromatic ring, 0 otherwise (boolean) |
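The eight features in the table map naturally onto a fixed-length vector per atom. The sketch below shows one possible encoding in plain Python. The `AtomRecord` class, the feature order, and the example values for a methyl carbon are hypothetical and hand-entered for illustration; in practice these values would be computed by a cheminformatics toolkit from the SMILES string.

```python
from dataclasses import dataclass

# Elements one-hot encoded in the table, in the order listed there.
ELEMENTS = ("C", "N", "O", "P")


@dataclass(frozen=True)
class AtomRecord:
    """Hypothetical container for the per-atom properties in the table."""
    symbol: str
    degree: int
    implicit_valence: int
    num_hydrogens: int
    is_aromatic: bool


def atom_features(atom):
    """Return the 8-element feature vector: 4 one-hot element flags + 4 properties."""
    one_hot = [int(atom.symbol == element) for element in ELEMENTS]
    return one_hot + [
        atom.degree,
        atom.implicit_valence,
        atom.num_hydrogens,
        int(atom.is_aromatic),
    ]


# Illustrative values (an assumption): a methyl carbon with one heavy-atom
# neighbor and three hydrogens that are implicit in the SMILES string.
methyl_carbon = AtomRecord("C", degree=1, implicit_valence=3,
                           num_hydrogens=3, is_aromatic=False)
print(atom_features(methyl_carbon))
```

Stacking these vectors for every atom in a kmer gives the node feature matrix that a graph convolutional network would consume alongside the bond adjacency matrix.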
Hyperparameters searched in the study.
| Parameter | Space searched | Selected (ATGC DNA) | Selected (AUGC RNA) | Selected (A(6mA)UGC RNA) |
|---|---|---|---|---|
| The number of GCN layers | {2, 3, 4, 5, 6} | 4 | 4 | 6 |
| The number of CNN layers | {2, 3, 4, 5, 6} | 3 | 5 | 6 |
| The kernel size for the CNN layers | {2, 4, 10, 20} | 10 | 10 | 10 |
| The number of nodes in the dense (NN) layer | {32, 128, 512, 2048, 8192} | 8192 | 8192 | 8192 |
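To make the size of the search space concrete, the table can be written as a dictionary and enumerated. The text does not state how the space was explored (exhaustive grid, random sampling, or otherwise), so the full enumeration below is only an illustration; the key names are hypothetical.

```python
from itertools import product

# Search space copied from the table; the key names are assumptions.
SEARCH_SPACE = {
    "n_gcn_layers": [2, 3, 4, 5, 6],
    "n_cnn_layers": [2, 3, 4, 5, 6],
    "cnn_kernel_size": [2, 4, 10, 20],
    "dense_nodes": [32, 128, 512, 2048, 8192],
}

# Values listed in the table for the ATGC DNA model.
ATGC_DNA = {
    "n_gcn_layers": 4,
    "n_cnn_layers": 3,
    "cnn_kernel_size": 10,
    "dense_nodes": 8192,
}


def all_configurations(space):
    """Yield every combination of hyperparameter values as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(all_configurations(SEARCH_SPACE))
print(len(configs))        # 5 * 5 * 4 * 5 combinations
print(ATGC_DNA in configs)
```

All three models in the table settle on the largest dense layer (8192 nodes) and a kernel size of 10, while the layer counts vary by dataset.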