Rongjie Wang, Tianyi Zang, Yadong Wang.
Abstract
BACKGROUND: In recent years, the development of high-throughput genome sequencing technologies has generated a large amount of genome data, raising widespread concern about data storage and transmission costs. However, how to compress genome sequence data effectively remains an unsolved problem.
Keywords: Compression; Human mitochondrial genomes; Machine learning
Year: 2019 PMID: 31639043 PMCID: PMC6805717 DOI: 10.1186/s40246-019-0225-3
Source DB: PubMed Journal: Hum Genomics ISSN: 1473-9542 Impact factor: 4.639
Fig. 1 The architecture of the DeepDNA model. First, the input genome sequence is transformed into a one-hot, 4-dimensional bit matrix. A convolution layer activated by rectified linear units (ReLU) acts as a local feature extractor; its output is a matrix whose columns correspond to the convolution filters and whose rows correspond to positions in the input sequence. A max-pooling step reduces the size of the output matrix and preserves only the main features. The subsequent Long Short-Term Memory (LSTM) layer captures long-term sequence features. A flattened fully connected layer collects the LSTM outputs. The last layer applies a sigmoid non-linear transformation to produce a vector that serves as the probability predictions for the sequence base
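The first step in the caption, the one-hot transformation, can be sketched in a few lines of plain Python. This is a minimal illustration only: the column order A, C, G, T and the all-zero row for ambiguous symbols such as N are assumptions, and the helper name `one_hot` is hypothetical rather than taken from DeepDNA.

```python
# Assumed column order; the source does not state which base maps to which column.
BASES = "ACGT"


def one_hot(seq):
    """Return a list of 4-element rows, one row per base in seq."""
    index = {base: i for i, base in enumerate(BASES)}
    rows = []
    for base in seq.upper():
        row = [0, 0, 0, 0]
        # Assumption: symbols outside A/C/G/T (for example N) stay all-zero.
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


if __name__ == "__main__":
    for base, row in zip("CGTA", one_hot("CGTA")):
        print(base, row)
```

Each row has exactly one set bit for a regular base, so a sequence of length L becomes an L x 4 matrix that a convolution layer can consume.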
Fig. 2 Arithmetic encoding process. The figure illustrates how the sequence label is determined when encoding the sequence 'CGTA', assuming the base probabilities p(A) = p(T) = 0.2, p(C) = 0.5, p(G) = 0.1
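The interval narrowing that the caption describes can be reproduced with exact fractions. The sketch below uses the probabilities given in the caption; the ordering of the sub-intervals (A, C, G, T) is an assumption, so the interval endpoints in the original figure may differ even though the final interval width is the same. The function names are hypothetical.

```python
from fractions import Fraction

# Probabilities from the caption; the A, C, G, T ordering is an assumption.
PROBS = {
    "A": Fraction(1, 5),
    "C": Fraction(1, 2),
    "G": Fraction(1, 10),
    "T": Fraction(1, 5),
}


def cumulative_ranges(probs):
    """Map each symbol to its [start, end) slice of the unit interval."""
    ranges, start = {}, Fraction(0)
    for sym, p in probs.items():
        ranges[sym] = (start, start + p)
        start += p
    return ranges


def encode_interval(seq, probs):
    """Narrow [0, 1) once per symbol and return the final [low, high)."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    for sym in seq:
        width = high - low
        sym_low, sym_high = ranges[sym]
        low, high = low + width * sym_low, low + width * sym_high
    return low, high


def decode(value, length, probs):
    """Recover `length` symbols from any value inside the final interval."""
    ranges = cumulative_ranges(probs)
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        width = high - low
        scaled = (value - low) / width
        for sym, (sym_low, sym_high) in ranges.items():
            if sym_low <= scaled < sym_high:
                out.append(sym)
                low, high = low + width * sym_low, low + width * sym_high
                break
    return "".join(out)


if __name__ == "__main__":
    low, high = encode_interval("CGTA", PROBS)
    print(low, high, decode(low, 4, PROBS))
```

The width of the final interval equals the product of the symbol probabilities (0.5 × 0.1 × 0.2 × 0.2 = 0.002), which is why more confident predictions lead to shorter codes.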
Fig. 3 Training loss (bpb) versus the number of training mini-batches for the DeepDNA model. The model was trained on 700 human mitochondrial genome sequences; the input was a base sequence of length 64, and the output was a classification over the four nucleotides
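A loss expressed in bits per base corresponds to the average negative base-2 log-probability that the model assigns to the true base. The sketch below shows that computation under this standard cross-entropy reading; the source does not spell out the exact loss formula, and the function name and the dictionary-based probability format are assumptions.

```python
import math


def bits_per_base(true_bases, predicted_probs):
    """Average -log2 p(true base) over all positions.

    true_bases: string of observed bases, e.g. "ACGT".
    predicted_probs: one dict per position mapping base -> probability.
    """
    total_bits = 0.0
    for base, probs in zip(true_bases, predicted_probs):
        total_bits += -math.log2(probs[base])
    return total_bits / len(true_bases)


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    # A model with no knowledge of the sequence costs 2 bits per base.
    print(bits_per_base("ACGT", [uniform] * 4))
```

A uniform predictor gives 2 bpb, the cost of a plain 2-bit encoding, so any value below 2 means the model has learned structure in the sequences.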
Compression results for DeepDNA and the other methods on 100 human mitochondrial genomes
| Dataset | Total size (nucleotides) | Gzip (bpb) | MFCompress (bpb) | DMcompress (bpb) | DeepDNA (bpb) |
|---|---|---|---|---|---|
| 100 human mitochondrial genomes | 1,656,779 | 1.45 | 0.07 | 0.07 | 0.03 |
Compressed size is measured in bits per base (bpb)
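For readers who want to relate bpb to file sizes, the conversion is simple arithmetic: bits per base is the compressed size in bits divided by the number of nucleotides. The helper names below are hypothetical, and the byte figure obtained from a table value is only an estimate implied by that value, not a size reported in the source.

```python
def bpb_from_size(compressed_bytes, n_bases):
    """Bits per base for a compressed file of the given size."""
    return 8 * compressed_bytes / n_bases


def size_from_bpb(bpb, n_bases):
    """Approximate compressed size in bytes implied by a bpb value."""
    return bpb * n_bases / 8


if __name__ == "__main__":
    n_bases = 1_656_779  # total size of the 100-genome dataset in the table
    for method, bpb in [("Gzip", 1.45), ("DeepDNA", 0.03)]:
        print(method, round(size_from_bpb(bpb, n_bases)), "bytes (estimate)")
```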
Detailed results for DeepDNA and the other methods on five sequences randomly selected from the 100 human mitochondrial genome sequences
| Genome ID | Gzip (bpb) | MFCompress (bpb) | DMcompress (bpb) | DeepDNA (bpb) |
|---|---|---|---|---|
| KF162105.1 | 2.63 | 2.09 | 2.07 | 0.01 |
| MF058266.1 | 2.64 | 2.09 | 2.07 | 0.05 |
| KC911416.1 | 2.64 | 2.09 | 2.06 | 0.01 |
| AY339411.1 | 2.63 | 2.09 | 2.07 | 0.01 |
| JQ702777.1 | 2.64 | 2.08 | 2.06 | 0.04 |
Compressed size is measured in bits per base (bpb)