Boseon Byeon, Igor Kovalchuk.
Abstract
The utility of next-generation sequencing (NGS) technology rests on the assumption that the DNA or cDNA cleavage required to generate short sequence reads is random. Several previous reports suggest that NGS reads are subject to sequencing bias. To examine this in greater detail, we analyze NGS data from four organisms with different GC content: Plasmodium falciparum (19.39%), Arabidopsis thaliana (36.03%), Homo sapiens (40.91%) and Streptomyces coelicolor (72.00%). Using machine learning techniques, we recognize a pattern: NGS read starts tend to be positioned in local regions where the nucleotide distribution differs from the global nucleotide distribution. We also demonstrate that the mono-nucleotide distribution underestimates sequencing bias, and that the recognized pattern is explained largely by the distribution of multi-nucleotides (di-, tri-, and tetra-nucleotides) rather than mono-nucleotides. This implies that sequencing bias should be corrected on the basis of the multi-nucleotide distribution. We provide companion software that quantifies the effect of the recognized pattern on read positioning, and use it to show that bias correction based on the mono-nucleotide distribution may not be sufficient to remove sequencing bias.
Year: 2016 PMID: 27299343 PMCID: PMC4907491 DOI: 10.1371/journal.pone.0157033
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Data types and sources.
| Data | Type and mapped reads | Source |
|---|---|---|
| | Illumina DNA sequencing data (the sequence data were produced by the DOE Joint Genome Institute). Mapped reads: 64610953 | |
| | Illumina RNA sequencing data. Mapped reads: 7586787 | |
| Human | Illumina DNA sequencing data. Mapped reads: 98428276 | |
| | Illumina RNA sequencing data. Mapped reads: 26590790 | |
Read start positions selected randomly.
Chromosome numbers indicate the chromosomes used to create data pool, the number of positions is the number of positions selected randomly from the chromosomes, and the selected ratio is the ratio of the number of the randomly selected positions to the total number of positions on the chromosomes.
| Read frequency | A. thaliana, chromosomes 1–5: number of positions | A. thaliana, chromosomes 1–5: selected ratio | P. falciparum, chromosomes 1–14: number of positions | P. falciparum, chromosomes 1–14: selected ratio | Human, chromosomes 1–22: number of positions | Human, chromosomes 1–22: selected ratio | S. coelicolor, complete genome: number of positions | S. coelicolor, complete genome: selected ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | 5000 | 0.00 | 10000 | 0.00 | 2000 | 0.00 | 3000 | 0.00 |
| 1 | 500 | 0.00 | 1000 | 0.00 | 200 | 0.00 | 300 | 0.00 |
| 2 | 500 | 0.00 | 1000 | 0.00 | 200 | 0.00 | 300 | 0.00 |
| 3 | 500 | 0.00 | 1000 | 0.01 | 200 | 0.00 | 300 | 0.01 |
| 4 | 500 | 0.00 | 1000 | 0.01 | 200 | 0.02 | 300 | 0.01 |
| 5 | 500 | 0.01 | 1000 | 0.02 | 200 | 0.04 | 300 | 0.02 |
| 6 | 500 | 0.02 | 1000 | 0.03 | 200 | 0.05 | 300 | 0.03 |
| 7 | 500 | 0.04 | 1000 | 0.04 | 200 | 0.07 | 300 | 0.05 |
| 8 | 500 | 0.06 | 1000 | 0.06 | 200 | 0.08 | 300 | 0.07 |
| 9 | 500 | 0.09 | 1000 | 0.08 | 200 | 0.09 | 300 | 0.09 |
| 10 | 500 | 0.14 | 1000 | 0.10 | 200 | 0.11 | 300 | 0.11 |
| 11 or more | 5000 | 0.08 | 10000 | 0.14 | 2000 | 0.15 | 3000 | 0.08 |
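The sampling summarized in this table can be sketched as follows. This is a minimal illustration rather than the authors' code: `frequency_class`, `sample_positions`, the `quota` mapping, and the toy `read_starts` input are hypothetical names, and the selected ratio is assumed to use the number of positions in the same read-frequency class as its denominator, which this excerpt does not state explicitly.

```python
import random
from collections import defaultdict


def frequency_class(count, cap=11):
    """Map a read-start count to a class: 0..10, or `cap` for '11 or more'."""
    return min(count, cap)


def sample_positions(read_starts, genome_length, quota, seed=0):
    """Randomly sample positions from each read-frequency class.

    read_starts: dict mapping position -> number of reads starting there
                 (positions absent from the dict have zero reads).
    quota:       dict mapping class -> maximum number of positions to draw.
    Returns a dict mapping class -> (sorted sampled positions, selected ratio).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for pos in range(genome_length):
        by_class[frequency_class(read_starts.get(pos, 0))].append(pos)

    result = {}
    for cls, positions in by_class.items():
        k = min(quota.get(cls, 0), len(positions))
        chosen = rng.sample(positions, k)
        # Assumed definition: sampled positions / all positions in this class.
        result[cls] = (sorted(chosen), k / len(positions))
    return result
```

With quotas like those in the table (for example 5000 positions for frequency 0 and 500 for each of frequencies 1 to 10), a small ratio means the class is abundant on the chromosomes, while a large ratio means the quota consumed much of the class.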
Fig 1. Generation of sequence data pool.
Gray circles on reference genome indicate random positions, and black bars above reference genome indicate read starts at a given random position. Character “F” in data pool means feature.
Classification accuracies of indicated classifiers.
Accuracies were averaged over four organisms. The number in parentheses is the standard deviation. 0 vs. 1 or more, 0 vs. 6 or more, and 0 vs. 11 or more indicate problems for classification of the positions with no read and with 1 or more reads, 6 or more reads, and 11 or more reads respectively. Random shuffling was performed 10 times for each data set.
| Data set | Classifier | 0 vs. 1 or more | 0 vs. 6 or more | 0 vs. 11 or more |
|---|---|---|---|---|
| Randomly shuffled data | C4.5 | 49.84% (0.82) | 50.01% (0.67) | 49.84% (0.78) |
| Randomly shuffled data | BN | 50.08% (0.81) | 50.00% (0.89) | 50.14% (0.88) |
| Original data | C4.5 | 60.06% (3.90) | 63.28% (6.21) | 65.94% (6.00) |
| Original data | BN | 63.53% (9.21) | 65.49% (10.63) | 66.98% (12.13) |
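The role of the randomly shuffled rows can be illustrated with a small sketch: if labels are shuffled, any classifier should fall back to roughly 50% on a balanced two-class problem, so accuracies well above that on the original data indicate a real pattern. The code below is an assumption-laden stand-in, not the study's setup: it uses a one-dimensional nearest-centroid classifier instead of C4.5 or a Bayesian network, a single hold-out split instead of whatever validation scheme the authors used, and the function names are hypothetical.

```python
import random
from statistics import mean


def nearest_centroid_accuracy(xs, ys, rng, test_fraction=0.3):
    """Hold-out accuracy of a 1-D nearest-centroid classifier (a stand-in)."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, train = idx[:n_test], idx[n_test:]

    # One centroid per label, estimated from the training part only.
    centroids = {
        label: mean(xs[i] for i in train if ys[i] == label)
        for label in set(ys)
    }

    correct = 0
    for i in test:
        predicted = min(centroids, key=lambda c: abs(xs[i] - centroids[c]))
        if predicted == ys[i]:
            correct += 1
    return correct / n_test


def shuffled_baseline(xs, ys, n_repeats=10, seed=0):
    """Mean accuracy after shuffling the labels, repeated n_repeats times."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_repeats):
        shuffled = list(ys)
        rng.shuffle(shuffled)  # break any link between features and labels
        accuracies.append(nearest_centroid_accuracy(xs, shuffled, rng))
    return mean(accuracies)
```

The default of 10 repeats mirrors the caption's statement that random shuffling was performed 10 times for each data set.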
GC contents around the positions with various read frequencies.
Numbers of 0, 1 to 5, 6 to 10, and 11 or more indicate the read frequencies.
| Data set | Whole genome | 0 | 1 to 5 | 6 to 10 | 11 or more |
|---|---|---|---|---|---|
| A. thaliana | 36.03% | 36.28% | 35.30% | 40.17% | 43.24% |
| P. falciparum | 19.39% | 18.92% | 24.46% | 27.48% | 30.54% |
| Human | 40.91% | 41.56% | 36.60% | 36.80% | 38.38% |
| S. coelicolor | 72.00% | 72.47% | 69.30% | 66.44% | 60.45% |
Fig 2. GC contents around the positions with various read frequencies.
X-axis shows the organism analyzed and Y-axis displays the ratio of GC count to the total nucleotide count. Numbers of 0, 1 to 5, 6 to 10, and 11 or more indicate the read frequencies.
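As a rough illustration of how GC content around a position might be measured, the sketch below computes the ratio of GC count to total nucleotide count in a window centered on each position and averages it over a group of positions. The window half-width `flank` is a hypothetical parameter; the excerpt does not give the size of the local region, and the function names are illustrative.

```python
def gc_content(seq):
    """Ratio of G+C count to the total count of A, C, G and T in seq."""
    seq = seq.upper()
    total = sum(seq.count(base) for base in "ACGT")
    if total == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / total


def local_gc(genome, position, flank=50):
    """GC content of the window [position - flank, position + flank].

    `flank` is an assumed half-width; the window is clipped at the
    sequence ends.
    """
    start = max(0, position - flank)
    end = min(len(genome), position + flank + 1)
    return gc_content(genome[start:end])


def mean_local_gc(genome, positions, flank=50):
    """Average local GC content over a group of positions."""
    values = [local_gc(genome, p, flank) for p in positions]
    return sum(values) / len(values) if values else 0.0
```

Applying `mean_local_gc` separately to the positions in each read-frequency group (0, 1 to 5, 6 to 10, 11 or more) and `gc_content` to the whole sequence would give the kind of comparison shown in the table.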
Fig 3. Global and local nucleotide distributions.
X-axis shows mono- and dinucleotides and Y-axis displays the normalized nucleotide count. Numbers of 0, 1 to 5, 6 to 10, and 11 or more indicate the read frequencies.
Euclidean distances between the distribution of all local nucleotides and the global nucleotide distribution.
Numbers of 0, 1 to 5, 6 to 10, and 11 or more indicate the read frequencies. The number in parentheses is the p-value of a t-test whose alternative hypothesis is that the nucleotide distribution of the positions with no read is closer to the global distribution than the distribution of the positions with reads.
| Data set | 0 | 1 to 5 (p = 0.0312) | 6 to 10 (p = 0.0077) | 11 or more (p = 0.0020) |
|---|---|---|---|---|
| A. thaliana | 0.0044 | 0.0118 | 0.0549 | 0.0921 |
| P. falciparum | 0.0074 | 0.0747 | 0.1157 | 0.1556 |
| Human | 0.0081 | 0.0632 | 0.1300 | 0.1683 |
| S. coelicolor | 0.0059 | 0.0381 | 0.0746 | 0.1503 |
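The distance in this table compares a local nucleotide distribution with the global one. A minimal sketch of that comparison is given below, assuming that each distribution is the normalized frequency of overlapping k-mers for k = 1 to 4 and that the four distributions are concatenated into one vector. The normalization, the use of overlapping windows, and the function names are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_distribution(seq, k):
    """Normalized frequency of every k-mer over ACGT (overlapping windows)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)  # ignores k-mers containing N etc.
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def feature_vector(seq, ks=(1, 2, 3, 4)):
    """Concatenate mono-, di-, tri- and tetra-nucleotide distributions."""
    vector = []
    for k in ks:
        dist = kmer_distribution(seq, k)
        vector.extend(dist[m] for m in sorted(dist))
    return vector


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Under these assumptions, `euclidean_distance(feature_vector(local_seq), feature_vector(genome))` would give one local-versus-global distance; passing `ks=(1,)` restricts the comparison to mono-nucleotides.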
Classification accuracies of K-nearest neighbor classifier and pattern effects.
0 vs. 1 or more, 0 vs. 6 or more, and 0 vs. 11 or more indicate problems for classification of the positions with no read and with 1 or more reads, 6 or more reads, and 11 or more reads respectively. Mono, multi, and all indicate the data sets which are composed of the features extracted from the distributions of mono-nucleotides, multi-nucleotides (di-, tri-, and tetra-nucleotides), and all nucleotides (mono- and multi-nucleotides), respectively. PEI was calculated from the classification accuracy for the feature set of all nucleotides and averaged over three classification problems.
| Data set | 0 vs. 1 or more (Mono) | 0 vs. 1 or more (Multi) | 0 vs. 1 or more (All) | 0 vs. 6 or more (Mono) | 0 vs. 6 or more (Multi) | 0 vs. 6 or more (All) | 0 vs. 11 or more (Mono) | 0 vs. 11 or more (Multi) | 0 vs. 11 or more (All) | Average PEI |
|---|---|---|---|---|---|---|---|---|---|---|
| A. thaliana | 58.06% | 70.84% | 71.20% | 62.56% | 79.62% | 79.50% | 66.02% | 89.12% | 88.84% | 0.60 |
| P. falciparum | 63.80% | 67.70% | 67.81% | 67.24% | 72.80% | 72.84% | 68.71% | 78.11% | 78.26% | 0.46 |
| Human | 71.20% | 82.55% | 82.15% | 74.85% | 92.25% | 92.20% | 77.75% | 93.75% | 94.20% | 0.79 |
| S. coelicolor | 65.10% | 64.77% | 65.37% | 72.03% | 73.90% | 73.77% | 77.03% | 82.73% | 82.87% | 0.48 |
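This excerpt does not define the pattern effect index (PEI). One reading that reproduces the tabulated averages is a rescaling of accuracy so that chance level (50%) maps to 0 and perfect classification (100%) maps to 1. That formula is an inference from the numbers in this table, not a definition taken from the paper, and `pattern_effect_index` is a hypothetical name. The sketch below applies it to the "All" accuracies of each row, in table order.

```python
def pattern_effect_index(accuracy_percent, chance_percent=50.0):
    """Assumed PEI: accuracy rescaled so chance -> 0 and 100% -> 1."""
    return (accuracy_percent - chance_percent) / (100.0 - chance_percent)


# "All" feature-set accuracies for the three classification problems
# (0 vs. 1 or more, 0 vs. 6 or more, 0 vs. 11 or more), in table row order.
all_feature_accuracies = [
    (71.20, 79.50, 88.84),  # first row
    (67.81, 72.84, 78.26),  # second row
    (82.15, 92.20, 94.20),  # Human
    (65.37, 73.77, 82.87),  # fourth row
]

average_pei = [
    round(sum(pattern_effect_index(a) for a in row) / len(row), 2)
    for row in all_feature_accuracies
]
print(average_pei)  # [0.6, 0.46, 0.79, 0.48]
```

The rounded results agree with the Average PEI column (0.60, 0.46, 0.79, 0.48), which supports the assumed formula without confirming it.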
Classification accuracies of K-nearest neighbor classifier for features selected by genetic algorithm.
The values in parentheses indicate the running time in hours:minutes.
| Data set | 0 vs. 1 or more | 0 vs. 6 or more | 0 vs. 11 or more |
|---|---|---|---|
| A. thaliana | 73.56% (16:09) | 83.42% (16:11) | 91.90% (16:49) |
| P. falciparum | 69.92% (45:35) | 74.70% (41:12) | 79.01% (54:36) |
| Human | 85.80% (02:41) | 95.50% (02:29) | 97.00% (02:25) |
| S. coelicolor | 70.23% (05:45) | 77.07% (05:39) | 84.67% (05:32) |
Features selected by genetic algorithm and the number of data sets for which the features were selected.
Features which were selected for 4 to 10 data sets are not shown.
| Selected features | Number of data sets |
|---|---|
| TC, GAT, AAAA, CAAG, CGCG, CGGC, CGTA, GAAT, TCTT | 12 |
| AT, CA, GC, GT, TT, AAG, ACG, AGC, CCA, CCG, GGT, GTT, TTC, TTT, ACAC, ACAT, ACGT, AGCG, AGCT, AGGA, AGGG, ATCA, CACG, CAGG, CCAT, CCCT, CCTC, CGCA, CGGT, CGTC, CGTT, CTAA, CTAC, CTGA, CTGG, GAAA, GCAT, GGAT, GGCA, GTAC, GTTA, TATA, TCCC, TCGC, TGGC, TTGT, TTTG | 11 |
| A, C, T, CACC, GCTG, GTAA, TGGG | 3 |
| G, ATC | 2 |
Pattern effect index for chromosome 1 of Arabidopsis thaliana.
Original and corrected data indicate that reads are uncorrected and GC-corrected, respectively. 0 vs. 1 or more, 0 vs. 6 or more, and 0 vs. 11 or more indicate problems for classification of the positions with no read and with 1 or more reads, 6 or more reads, and 11 or more reads, respectively. PEIs were measured 5 times by the companion software and averaged. The number is the average PEI and the number in parentheses is the standard deviation.
| Data set | 0 vs. 1 or more | 0 vs. 6 or more | 0 vs. 11 or more |
|---|---|---|---|
| Original data | 0.22 (0.03) | 0.43 (0.03) | 0.73 (0.06) |
| Corrected data | 0.08 (0.05) | 0.14 (0.03) | 0.64 (0.12) |
Sequence complexity.
The numbers of 0, 1 to 5, 6 to 10, and 11 or more indicate the read frequencies. The numbers outside and inside parentheses are the average and standard deviation of sequence complexities, respectively. The complexities were calculated from the sequence data pools.
| Data set | 0 | 1 to 5 | 6 to 10 | 11 or more |
|---|---|---|---|---|
| A. thaliana | 0.65 (0.06) | 0.65 (0.06) | 0.67 (0.05) | 0.67 (0.06) |
| P. falciparum | 0.66 (0.10) | 0.69 (0.08) | 0.70 (0.07) | 0.71 (0.07) |
| Human | 0.69 (0.21) | 0.74 (0.08) | 0.75 (0.10) | 0.74 (0.13) |
| S. coelicolor | 0.81 (0.05) | 0.83 (0.05) | 0.83 (0.05) | 0.84 (0.04) |
Classification accuracies of K-nearest neighbor classifier on the sequence complexity data.
| Data set | 0 vs. 1 or more | 0 vs. 6 or more | 0 vs. 11 or more |
|---|---|---|---|
| A. thaliana | 48.78% | 52.22% | 59.82% |
| P. falciparum | 54.93% | 55.20% | 55.76% |
| Human | 60.65% | 62.00% | 64.75% |
| S. coelicolor | 56.87% | 60.47% | 62.73% |