Berat Z Haznedaroglu, Darryl Reeves, Hamid Rismani-Yazdi, Jordan Peccia.
Abstract
BACKGROUND: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, evaluations of assembly strategies have not considered the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly.
Year: 2012 PMID: 22808927 PMCID: PMC3489510 DOI: 10.1186/1471-2105-13-170
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Transcriptome sequencing and assembly summary
| | | | | | | | | | |
| Raw sequencing reads | | | | | 44,568,122 | | | | |
| Read length | | | | | 99 | | | | |
| | | | | | | | | | |
| Reads requiring trimming | | | | | 29,264,547 | | | | |
| Minimum read length | | | | | 1 | | | | |
| Lower quartile read length | | | | | 65 | | | | |
| Median read length | | | | | 87 | | | | |
| Upper quartile read length | | | | | 99 | | | | |
| Maximum read length | | | | | 99 | | | | |
| | | | | | | | | | |
| Number of reads assembled | 18,097,635 | 18,481,043 | 18,018,855 | 16,878,820 | 16,918,312 | 17,188,419 | 26,970,166 | 30,231,540 | 28,058,816 |
| Number of reads mapped | 28,449,200 | 31,893,197 | 33,199,219 | 32,903,930 | 33,395,304 | 32,523,018 | 31,939,290 | 30,267,756 | 27,808,027 |
| Number of contigs (≥ 100 bp) | 98,094 | 64,000 | 47,448 | 46,461 | 40,965 | 46,442 | 34,489 | 33,344 | 32,639 |
| Number of contigs (≥ 5,000 bp) | 470 | 1,296 | 1,587 | 1,315 | 1,025 | 742 | 636 | 253 | 115 |
| Number of contigs (≥ 8,000 bp) | 42 | 155 | 215 | 165 | 119 | 72 | 105 | 17 | 22 |
| Average length of contigs | 700 | 1,114 | 1,463 | 1,356 | 1,383 | 1,115 | 1,402 | 1,120 | 914 |
| Longest contig length | 46,754 | 29,394 | 16,393 | 14,115 | 13,754 | 13,685 | 12,571 | 9,484 | 12,582 |
| N50 | 1,594 | 2,415 | 2,745 | 2,624 | 2,497 | 2,202 | 2,349 | 1,836 | 1,498 |
| N90 | 249 | 470 | 795 | 696 | 733 | 488 | 730 | 515 | 372 |
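The N50 and N90 rows in the table above follow the usual assembly convention: Nx is the contig length at which contigs of that length or longer account for at least x% of all assembled bases. The following is a minimal sketch of that calculation, not the authors' script. The function name `nx` and the toy contig lengths are assumptions made for illustration and do not come from the study.

```python
def nx(contig_lengths, fraction=0.5):
    """Return the Nx statistic for a list of contig lengths.

    Nx is the length L such that contigs of length >= L together
    cover at least `fraction` of the total assembled bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating bases.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    # Reached only through floating-point edge cases; the shortest contig
    # is the correct answer when every contig is needed.
    return min(contig_lengths)


# Hypothetical contig lengths in bp (not data from the study).
toy_lengths = [100, 200, 300, 400, 500]
n50 = nx(toy_lengths, 0.5)
n90 = nx(toy_lengths, 0.9)
```

A higher N50 indicates that more of the assembly sits in long contigs, which is why it is reported alongside the contig counts and average length for each assembly.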
Figure 1. Cumulative contig length frequency distributions for individual assemblies.
Figure 2. Total contig and KOI counts for each k-mer assembly.
Figure 3. Comparative matrix of number of unique KOIs missing in each single k-mer assembly. Each value represents the number of unique KOIs (for a specific row k-mer value assembly) not identified in the set of column k-mer assemblies.
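The pair-wise comparison described in the Figure 3 caption amounts to a set difference for every ordered pair of assemblies: the KOIs annotated in the row assembly minus those annotated in the column assembly. The sketch below illustrates the idea under stated assumptions. The function name `missing_koi_matrix`, the assembly labels, and the KOI sets are hypothetical toy values, not data or code from the study.

```python
def missing_koi_matrix(koi_sets):
    """Count, for each ordered pair (row, col), the KOIs found in the
    row assembly that are absent from the column assembly.

    `koi_sets` maps an assembly label to the set of KOIs annotated in it.
    Returns a nested dict: matrix[row][col] -> count.
    """
    labels = sorted(koi_sets)
    return {
        row: {col: len(koi_sets[row] - koi_sets[col]) for col in labels}
        for row in labels
    }


# Hypothetical annotation results for three single k-mer assemblies.
toy_kois = {
    "k21": {"K00001", "K00002", "K00003"},
    "k31": {"K00002", "K00003", "K00004"},
    "k41": {"K00003"},
}
matrix = missing_koi_matrix(toy_kois)
```

The matrix is not symmetric: `matrix["k21"]["k41"]` counts KOIs that the hypothetical k21 assembly recovered and k41 did not, while `matrix["k41"]["k21"]` counts the reverse. The same set difference, applied between a single k-mer assembly and a clustered assembly, gives the kind of count reported in the comparisons with CAs.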
Figure 4. Contig coverages of representative single k-mer assemblies as a function of the number of mismatches allowed by Bowtie.
Number of KOIs present in individual k-mer assemblies but missing from the combined assemblies generated with different programs
| 58 | 108 | 86 | 33 | 481 | |
| 19 | 55 | 41 | 9 | 433 | |
| 9 | 37 | 23 | 3 | 402 | |
| 7 | 44 | 28 | 4 | 409 | |
| 8 | 45 | 29 | 5 | 415 | |
| 8 | 53 | 36 | 7 | 420 | |
| 8 | 47 | 30 | 8 | 413 | |
| 10 | 43 | 27 | 9 | 413 | |
| 13 | 51 | 35 | 8 | 422 | |
| 17 | 60 | 40 | 10 | 432 | |
| 19 | 57 | 38 | 9 | 412 | |
| 17 | 59 | 45 | 9 | 412 | |
| 22 | 63 | 49 | 14 | 413 | |
| 22 | 67 | 52 | 14 | 420 | |
| 24 | 76 | 58 | 19 | 426 | |
| 33 | 81 | 63 | 25 | 432 | |
| 36 | 89 | 73 | 28 | 446 | |
| 38 | 95 | 75 | 33 | 447 | |
| 42 | 100 | 80 | 35 | 449 | |
| 46 | 103 | 83 | 37 | 460 | |
| 55 | 113 | 93 | 42 | 473 | |
| 56 | 112 | 97 | 43 | 468 | |
| 68 | 125 | 110 | 52 | 472 | |
1 Represents sequence identity.
Figure 5. Number of missing KOIs compared in reverse between clustered assemblies and single k-mer assemblies. Data represent a reverse comparative analysis showing the number of KOIs annotated in the CAs but missing in single k-mer assemblies (open triangles for CD-HIT-EST 1.0 and squares for Oases), and the number of KOIs annotated in the single k-mer assemblies but missing in the CAs (closed triangles for CD-HIT-EST 1.0 and squares for Oases).