Ksenia V Krasileva, Vince Buffalo, Paul Bailey, Stephen Pearce, Sarah Ayling, Facundo Tabbita, Marcelo Soria, Shichen Wang, Eduard Akhunov, Cristobal Uauy, Jorge Dubcovsky.
Abstract
BACKGROUND: The high level of identity among duplicated homoeologous genomes in tetraploid pasta wheat presents substantial challenges for de novo transcriptome assembly. To solve this problem, we develop a specialized bioinformatics workflow that optimizes transcriptome assembly and separation of merged homoeologs. To evaluate our strategy, we sequence and assemble the transcriptome of one of the diploid ancestors of pasta wheat, and compare both assemblies with a benchmark set of 13,472 full-length, non-redundant bread wheat cDNAs.
Year: 2013 PMID: 23800085 PMCID: PMC4053977 DOI: 10.1186/gb-2013-14-6-r66
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
The T. urartu and T. turgidum final assembly statistics
| | T. urartu | T. turgidum |
|---|---|---|
| 100-bp paired-end reads | 248.5 million | 488.9 million |
| Reads after digital normalizationᵃ | 47.3 million | 110.7 million |
| Contigs | 86,247 | 140,118 |
| Mean contig size (bp) | 1,417 bp | 1,299 bp |
| Min contig size (bp) | 212 bp | 298 bp |
| Max contig size (bp) | 17,959 bp | 26,226 bp |
| GC content (%) | 49% | 49% |
| Total transcriptome size (Mb) | 122 Mb | 181 Mb |
| Reads mapping to the assembly (% of total reads) | 82.2% | 81.5% |
| Reads mapped in proper pairs (% of total reads) | 73.0% | 71.5% |
| Unique alignments (% of total mapped) | 52.8% | 76.7% |
| Benchmark genesᵇ assembled > 50% length in a single contig | 12,693 (94%) | 12,961 (96%) |
| Benchmark genesᵇ assembled > 90% length in a single contig | 10,727 (80%) | 10,197 (76%) |

ᵃ Elimination of Homo sapiens, Escherichia coli, wheat mitochondrial, rRNA, and chloroplast sequences resulted in the elimination of 0.5% of the digitally normalized reads in T. urartu and 0.6% in T. turgidum.
ᵇ 13,472 full-length cDNAs from the RIKEN Plant Science Center Japan [35].
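The derived figures in the assembly statistics table can be re-computed from the raw counts. The following is a minimal sketch, not part of the original study: the dictionary keys and the `summarize` helper are hypothetical names, and the transcriptome size is only approximated as contigs times mean contig size, so it can differ slightly from the rounded value reported in the table.

```python
# Values copied from the assembly statistics table; key names are illustrative.
BENCHMARK_TOTAL = 13_472  # full-length benchmark cDNAs

assemblies = {
    "T. urartu": {
        "raw_reads_m": 248.5, "normalized_reads_m": 47.3,
        "contigs": 86_247, "mean_contig_bp": 1_417,
        "benchmark_50": 12_693, "benchmark_90": 10_727,
    },
    "T. turgidum": {
        "raw_reads_m": 488.9, "normalized_reads_m": 110.7,
        "contigs": 140_118, "mean_contig_bp": 1_299,
        "benchmark_50": 12_961, "benchmark_90": 10_197,
    },
}

def summarize(stats):
    """Recompute percentages and an approximate transcriptome size."""
    return {
        # Share of reads kept after digital normalization.
        "reads_kept_pct": round(
            100 * stats["normalized_reads_m"] / stats["raw_reads_m"], 1),
        # Approximation: number of contigs times mean contig size.
        "approx_size_mb": round(
            stats["contigs"] * stats["mean_contig_bp"] / 1e6, 1),
        "benchmark_50_pct": round(100 * stats["benchmark_50"] / BENCHMARK_TOTAL),
        "benchmark_90_pct": round(100 * stats["benchmark_90"] / BENCHMARK_TOTAL),
    }

summary = {name: summarize(stats) for name, stats in assemblies.items()}
for name, values in summary.items():
    print(name, values)
```

Under these assumptions, digital normalization keeps roughly a fifth of the reads in both species, and the benchmark percentages match the rounded values shown in the table.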
Figure 1. Divergence of A and B transcripts. (A) Distribution of percent identity between A/B homoeologous genes in a set of 26 experimentally validated genes (52 homoeologs). Mean = 97.3%; SD = 1.20%. (B) Distribution of distances between 707 single nucleotide polymorphisms (SNPs) between homoeologs in tetraploid wheat coding regions. Mean = 37.8 bp; SD = 47.1 bp; Median = 27 bp.
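The two panels of Figure 1 can be cross-checked with simple arithmetic: if mismatches were spread evenly along a sequence, the mean distance between them would be the reciprocal of the divergence. This is a rough sketch under that simplifying assumption (the function name is hypothetical, and the two panels are based on different gene sets, so only approximate agreement should be expected).

```python
mean_identity_pct = 97.3         # Figure 1A, mean A/B identity
observed_mean_spacing_bp = 37.8  # Figure 1B, mean distance between SNPs

def expected_spacing_bp(identity_pct):
    """Mean distance between mismatches if they were evenly spread."""
    divergence = 1 - identity_pct / 100
    return 1 / divergence

expected = expected_spacing_bp(mean_identity_pct)
print(f"{expected:.1f}")  # about 37.0 bp
```

The estimate of about 37 bp is close to the reported mean of 37.8 bp. The reported median (27 bp) is lower than the mean, which is what a right-skewed distribution of distances would produce.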
Figure 2. Strategies for genome-specific assembly and annotation of the tetraploid wheat transcriptome. (A) Overall assembly pipeline. Functional steps are listed on the left and specific programs used for each step on the right. Programs developed during the course of this study are underlined. (B) Steps used in the annotation. (C) Post-assembly processing pipeline using phasing to separate homoeolog-specific sequences. (D) Illustration of the phasing process. Reads are re-aligned to the reference transcriptome, single nucleotide polymorphisms (SNPs) between homoeologs are identified (in red), and phased. The example shows the phasing of A and C SNPs at positions 5 and 16 in phase 0 and G and T SNPs in phase 1.
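The phasing idea in panel D can be illustrated with a toy example. The sketch below is not the software used in the study; it assumes a much simpler setting in which each read is represented as a mapping from SNP position to observed base, and it calls the two most frequent allele pairs on reads spanning both SNPs the two phases. The function name and the read data are invented for illustration.

```python
from collections import Counter

def phase_two_snps(reads, pos_a, pos_b):
    """Return the two most frequent allele pairs on reads spanning both SNPs."""
    pair_counts = Counter(
        (read[pos_a], read[pos_b])
        for read in reads
        if pos_a in read and pos_b in read  # only spanning reads link the SNPs
    )
    return [pair for pair, _ in pair_counts.most_common(2)]

# Hypothetical reads reduced to the bases they show at SNP positions 5 and 16.
reads = [
    {5: "A", 16: "C"},
    {5: "A", 16: "C"},
    {5: "A", 16: "C"},
    {5: "G", 16: "T"},
    {5: "G", 16: "T"},
    {5: "A"},  # covers one SNP only, so it carries no linkage information
]

phases = phase_two_snps(reads, 5, 16)
print(phases)
```

With these reads, the first pair corresponds to phase 0 in the caption (A at position 5, C at position 16) and the second to phase 1 (G and T). A real phasing tool also has to handle sequencing errors, more than two SNPs, and gaps in read coverage.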
Figure 3. Comparison of the effect of different k-mers on transcriptome assembly metrics in diploid and tetraploid wheat. T. urartu values are indicated by the red dotted line and T. turgidum by the blue solid line. (A) Average contig length. (B) Total number of contigs. (C) Percent of total reads mapped back to the assembly. (D) Percent of total reads that are mapped in proper pairs. (E) Fraction of 13,472 full-length benchmark wheat cDNAs that are assembled in a single contig. (F) Venn diagram showing the number of benchmark cDNAs assembled full-length (>90%) at k-mer sizes 21 and 63.
Figure 4. Distribution of percent identities between T. turgidum and T. urartu merged assemblies. The graph represents the distribution of percent identity between T. turgidum and T. urartu merged assemblies as calculated by BLASTN (E-value cutoff 1e-20). Densities are colored by the k-mer which contributed each contig to the merged assembly.
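A percent-identity distribution like the one in Figure 4 could be collected from BLASTN results along the following lines. This is a sketch under stated assumptions: it assumes the standard 12-column tabular output (query, subject, percent identity, alignment length, mismatches, gap opens, query start and end, subject start and end, E-value, bit score), which the source does not specify, and it keeps only the best-scoring hit per query. The function name and contig identifiers are hypothetical.

```python
def best_hit_identities(lines, max_evalue=1e-20):
    """Map each query to the percent identity of its best hit under the cutoff."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query = fields[0]
        pident = float(fields[2])
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue > max_evalue:
            continue  # hit is weaker than the E-value cutoff
        if query not in best or bitscore > best[query][1]:
            best[query] = (pident, bitscore)
    return {query: pident for query, (pident, _) in best.items()}

# Invented example rows in 12-column tabular format.
example = [
    "tt_contig1\ttu_contig9\t99.50\t800\t4\t0\t1\t800\t1\t800\t0.0\t1450",
    "tt_contig1\ttu_contig3\t95.00\t400\t20\t0\t1\t400\t1\t400\t1e-150\t600",
    "tt_contig2\ttu_contig7\t88.00\t60\t7\t0\t1\t60\t1\t60\t1e-08\t50",
]

identities = best_hit_identities(example)
print(identities)
```

In this example the second query is dropped because its only hit has an E-value above the cutoff, and the first query keeps the hit with the higher bit score.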
Open reading frame predictionᵃ
| | T. turgidum | T. urartu |
|---|---|---|
| Contigs | 140,118 | 86,247 |
| Non-wheat sequencesᵇ (eliminated) | 558 | 518 |
| BLASTX, E-value cutoff 1e-3 | 96,244 | 59,439 |
| Contigs with a Pfam domain (1e-3) | 59,917 | 39,965 |
| Contig sequences without BLASTX (1e-3) or Pfam (1e-3) | 42,999 | 26,070 |
| Predicted ORFs (non-redundant, >30 amino acids) | 76,570 | 43,014 |
| Full-length | 32,548 | 22,868 |
| Missing 5' end | 26,723 | 12,225 |
| Missing 3' end | 12,792 | 5,376 |
| Missing 5' and 3' end | 4,507 | 2,545 |
| Putative pseudogenes (frameshift and/or premature stop codon) | 9,937 | 5,208 |
| Contigs with BLASTX on inconsistent strand | 4,376 | 3,628 |
| Contigs with >1 predicted ORFs (>30 amino acids, no repetitive elements, not a pseudogene) | 2,164 | 1,349 |
| Putative fused transcripts (excluding overlaps) | 6,409 | 4,866 |

ᵃ Open reading frames were predicted with a comparative genomics approach using the findorf program and BLASTX alignments (E-value cutoff 1e-5) between contigs and proteomes of barley, Brachypodium, rice, maize, sorghum, and Arabidopsis.
ᵇ Non-wheat sequences were identified based on taxonomic distribution of top 10 BLASTX hits against nr.
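The four completeness categories in the ORF table (full-length, missing 5' end, missing 3' end, missing both ends) add up to the number of predicted ORFs in each species. The short sketch below re-checks that arithmetic and derives the full-length share; the dictionary keys and the helper name are illustrative only.

```python
# Counts copied from the open reading frame prediction table.
orf_counts = {
    "T. turgidum": {"predicted": 76_570, "full_length": 32_548,
                    "missing_5": 26_723, "missing_3": 12_792,
                    "missing_both": 4_507},
    "T. urartu": {"predicted": 43_014, "full_length": 22_868,
                  "missing_5": 12_225, "missing_3": 5_376,
                  "missing_both": 2_545},
}

def completeness(row):
    """Return (categories sum to total?, percent of ORFs that are full-length)."""
    parts = (row["full_length"] + row["missing_5"]
             + row["missing_3"] + row["missing_both"])
    full_pct = round(100 * row["full_length"] / row["predicted"], 1)
    return parts == row["predicted"], full_pct

orf_summary = {name: completeness(row) for name, row in orf_counts.items()}
for name, (consistent, full_pct) in orf_summary.items():
    print(name, consistent, full_pct)
```

By this calculation the diploid assembly has the larger share of full-length ORFs (about 53% against about 43% in the tetraploid assembly).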
Figure 5. Comparison of codon usage in predicted genes and pseudogenes. A multidimensional scaling scatterplot was generated from a random set of 3,000 full-length and 3,000 pseudogene-containing contigs. Pseudogenes were predicted by findorf by the presence of internal frameshifts or stop codons compared with known plant proteins.
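A comparison of codon usage starts from a per-sequence vector of codon frequencies, which can then be passed to a dimensionality-reduction method such as multidimensional scaling. The sketch below covers only the first step and is an assumption about how such vectors might be built, not the procedure used for Figure 5; the function name and the example sequence are invented.

```python
from collections import Counter

def codon_frequencies(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}

freqs = codon_frequencies("ATGGCCGCCTAA")
print(freqs)
```

For this four-codon example the result assigns 0.5 to GCC and 0.25 each to ATG and TAA. Sequences with internal frameshifts would be read in the wrong frame after the shift, which is one reason pseudogene-containing contigs could show a different codon usage profile.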
Comparison of predicted ORFs (excluding pseudogenes) with T. aestivum genomic DNA contigs
| Transcriptome | T. urartu | T. turgidum |
|---|---|---|
| | 14,678 | 32,554 |
| ≥95% coverage in more than one CS contig | 489 | 911 |
| ≥65% coverage in one or more CS contigs | 2,094 | 3,136 |
| | 12,239 | 17,437 |
| ≥95% coverage in more than one CS contig | 1,146 | 1,549 |
| ≥65% coverage in one or more CS contigs | 2,416 | 3,262 |
| Not aligned | | |
| (Id <94% or coverage <65%) | 4,549 | 7,370 |
| Number of query sequences with no significant BLAST hits (e-10) | 195 | 414 |
| Total number of query sequences | 37,806 | 66,633 |
Figure 6. Identification and phasing of A/B contigs merged during the assembly. (A) Schematic illustration of a contig merged during the assembly. Empty circles represent nucleotides that are common between homoeologs. Grey and black circles correspond to biological polymorphisms between homoeologs. (B) Density plots of percent identity between T. turgidum and T. urartu for contigs with <2 SNPs. The 95% identity peak represents mostly B genome contigs and suggests a relatively good separation of A and B genome contigs in this dataset. (C, D) Density plots of percent identity between T. turgidum and T. urartu for contigs with ≥2 SNPs. (C) Distribution before phasing (note the absence of a bimodal distribution) and (D) after phasing (bimodal distribution as in B).
Polymorphism detection in the tetraploid wheat assembly and polymorphism phasing
| Category | Count |
|---|---|
| Polymorphisms | 1,179,465 |
| Single-nucleotide polymorphisms (SNP) | 958,362 |
| Multi-nucleotide polymorphisms (MNP) | 23,424 |
| Insertions | 72,144 |
| Deletions | 39,882 |
| Complexᵃ | 84,457 |
| Other (>2 alleles)ᵇ | 1,089 |
| Contigs with <2 SNP/MNP | 65,238 |
| Contigs with >1 SNP/MNP | 74,880 |
| Phased contigs | 67,169 |
| Phased blocks | 81,413 |
| Phased SNPs/MNPs | 864,865 |
| Chimeric reference contigs | 34,029 |
| Reads filtered due to mapping quality <30 | 106,003,190 |
| Reads filtered due to indels | 6,544,331 |
| Reads passed to MIRA | 256,016,046 |

ᵃ Complex: composite insertions and substitution events.
ᵇ Other: includes cases with >1 alternative allele.
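A few ratios help put the phasing counts in context. The sketch below is an illustration rather than an analysis from the study: the choice of denominators (contigs with more than one SNP/MNP for phased contigs, and all SNPs plus MNPs for phased variants) is an assumption made here, and the variable names are hypothetical.

```python
# Counts copied from the polymorphism and phasing table.
snps, mnps = 958_362, 23_424
contigs_lt2, contigs_gt1 = 65_238, 74_880
phased_contigs, phased_blocks, phased_variants = 67_169, 81_413, 864_865

# The two contig classes should cover the whole tetraploid assembly.
total_contigs = contigs_lt2 + contigs_gt1

# Assumed denominators; see the note above.
phased_contig_pct = round(100 * phased_contigs / contigs_gt1, 1)
phased_variant_pct = round(100 * phased_variants / (snps + mnps), 1)
blocks_per_phased_contig = round(phased_blocks / phased_contigs, 2)

print(total_contigs, phased_contig_pct, phased_variant_pct,
      blocks_per_phased_contig)
```

The contig classes sum to 140,118, the contig count of the T. turgidum assembly. Under the assumed denominators, roughly nine in ten eligible contigs and SNPs/MNPs were phased, with slightly more than one phased block per phased contig on average.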