Matteo Chiara, David S Horner, Alberto Spada.
Abstract
De novo transcriptome characterization from Next Generation Sequencing data has become an important approach in the study of non-model plants. Despite notable advances in the assembly of short reads, the clustering of transcripts into unigene-like (locus-specific) clusters remains a somewhat neglected subject. Indeed, closely related paralogous transcripts are often merged into single clusters by current approaches. Here, a novel heuristic method for locus-specific clustering is compared to that implemented in the de novo assembler Oases, using the same initial transcript collections, derived from Arabidopsis thaliana and the developmental model Streptocarpus rexii. We show that the proposed approach improves cluster specificity in the A. thaliana dataset, for which a reference genome is available. Furthermore, for the S. rexii data, our filtered transcript collection matches a larger number of distinct annotated loci in reference genomes than the Oases set, while containing a smaller overall number of loci. A detailed discussion of the advantages and limitations of our approach in processing de novo transcriptome reconstructions is presented. The proposed method should be widely applicable to other organisms, irrespective of the transcript assembly method employed. The S. rexii transcriptome is publicly available as an augmented online database.
Year: 2013 PMID: 24324652 PMCID: PMC3855653 DOI: 10.1371/journal.pone.0080961
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Schematic representation of strategy for re-clustering of transcript loci generated by Oases.
The flow diagram illustrates the strategy used to re-assemble clusters of transcripts into LSTCs, using a single hypothetical Oases locus as an example. A) An original Oases locus. B) Secondary loci are constructed by CD-HIT-EST such that, within each secondary cluster, all transcripts show at least 85% identity to the longest transcript; here the original Oases locus is split into 4 secondary loci. C) Representative sequences from each cluster are assigned “Blast annotations”, and clusters with identical Blast annotation profiles across the proteomes used in annotation are merged into tertiary loci. D) All-against-all Blastn searches within clusters are used to estimate the distributions of overlap and identity of transcripts within tertiary loci. E) All-against-all Blast searches of transcripts in tertiary loci are used to merge clusters into LSTCs where overlaps and identities exceed the cutoffs obtained in the previous step.
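The merging in steps C and E amounts to grouping clusters that share a key (C) or a qualifying pairwise hit (E). The following is a minimal sketch under stated assumptions, not the authors' pipeline: the input formats, the function names `merge_by_annotation` and `merge_by_hits`, the choice to keep unannotated clusters separate, and the single-linkage merge in step E are all hypothetical.

```python
from collections import defaultdict


def merge_by_annotation(profiles):
    """Step C (sketch): merge clusters with identical annotation profiles.

    profiles: dict mapping cluster id -> tuple of best-hit ids, one entry
    per reference proteome, None where there is no hit (assumed format).
    Returns tertiary loci as a sorted list of sorted cluster-id lists.
    """
    groups = defaultdict(list)
    for cluster_id, profile in profiles.items():
        if any(hit is not None for hit in profile):
            key = ("annotated", profile)
        else:
            # Assumption: clusters without any hit are never merged here.
            key = ("unannotated", cluster_id)
        groups[key].append(cluster_id)
    return sorted(sorted(members) for members in groups.values())


def merge_by_hits(cluster_ids, hits, min_identity, min_overlap):
    """Step E (sketch): single-linkage merge of clusters via union-find.

    hits: iterable of (cluster_a, cluster_b, identity, overlap) tuples,
    with identity and overlap expressed as fractions (assumed format).
    The cutoffs would come from the distributions estimated in step D.
    """
    parent = {c: c for c in cluster_ids}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    for a, b, identity, overlap in hits:
        if identity >= min_identity and overlap >= min_overlap:
            parent[find(a)] = find(b)

    merged = defaultdict(list)
    for c in cluster_ids:
        merged[find(c)].append(c)
    return sorted(sorted(members) for members in merged.values())
```

Single linkage is the simplest choice for a sketch; a stricter rule (for example, requiring every pair in a merged cluster to pass the cutoffs) would produce smaller, more conservative clusters.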
Summary of assembly statistics.
| Raw data: characteristics of sequence data used in the assembly | | | | |
| Read type | 2*90 nt paired end | 2*100 nt paired end | | |
| Total pairs | 25,748,028 | 92,353,379 | | |
| Pairs after trimming | 20,015,515 | 64,243,530 | | |
| Unpaired reads | 1,503,212 | 2,503,141 | | |
| Total nt | 4,171,180,536 | 11,134,167,085 | | |
| Results of the original assembly by Oases and initial clustering by CD-HIT-EST | | | | |
| Original Oases loci | 33,272 | 44,303 | | |
| Original Oases transcripts | 77,980 | 843,303 | | |
| Secondary loci (CD-HIT-EST) | 39,816 | 239,406 | | |
| Statistics regarding representative sequences of tertiary loci | | | | |
| Annotated | 26,060 | 35,704 | | |
| Unannotated | 13,810 | 38,588 | | |
| Contaminants | 280 | 9,503 | | |
| Shorter than 350 annotated | 208 | 3,504 | | |
| Shorter than 350 unannotated | 13,410 | 15,512 | | |
| Statistics regarding LSTCs and filtered Oases clusters | | | | |
| Annotated | 25,807 | 23,203 | 24,832 | 20,252 |
| Unannotated | 683 | 400 | 8,281 | 6,895 |
| N50 representative | 1,542 | 1,556 | 2,201 | 2,252 |
| N50 confidently annotated | 1,683 | 1,694 | 2,803 | 2,728 |
| Number of reads mapping to the reference transcript collections | | | | |
| To LSTC representative | 17,176,209 | 29,401,372 | | |
| To LSTC complete | 18,576,209 | 60,334,068 | | |
| To contaminants | 950,203 | 11,303,201 | | |
Numbers of distinct matched genes from reference genomes and mean numbers of best matching representative transcripts for Oases loci and LSTCs.
| | Oases | | LSTCs | | Oases | | LSTCs | |
| Species (total loci) | match | mean | match | mean | match | mean | match | mean |
| | 10,394 | 1.95 | 10,961 | 1.67 | 11,253 | 1.94 | 11,554 | 1.5 |
| | 16,128 | 1.58 | 16,900 | 1.37 | 12,109 | 1.97 | 12,427 | 1.45 |
| | 13,392 | 1.65 | 14,000 | 1.43 | 14,058 | 1.73 | 14,428 | 1.29 |
| | 11,348 | 1.92 | 11,960 | 1.66 | 12,798 | 2.02 | 13,106 | 1.44 |
| | 11,621 | 1.89 | 12,101 | 1.64 | 12,936 | 2.03 | 13,317 | 1.44 |
Total loci: loci in the reference proteome.
Match: distinct genomic loci matched by the transcript collections.
Mean: average number of loci in the collection per matched genomic locus.
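The "match" and "mean" columns can be derived from a table of best hits. A minimal sketch, assuming a hypothetical input that maps each locus in the transcript collection to its best matching reference gene (the function name and input format are illustrative, not taken from the source):

```python
from collections import defaultdict


def matched_gene_stats(best_hits):
    """Return (distinct matched genes, mean loci per matched gene).

    best_hits: dict mapping transcript-locus id -> best matching
    reference gene id, or None when the locus has no match.
    """
    loci_per_gene = defaultdict(set)
    for locus, gene in best_hits.items():
        if gene is not None:
            loci_per_gene[gene].add(locus)
    n_matched = len(loci_per_gene)
    if n_matched == 0:
        return 0, 0.0
    mean_loci = sum(len(loci) for loci in loci_per_gene.values()) / n_matched
    return n_matched, mean_loci
```

A mean closer to 1 indicates less fragmentation: fewer separate loci in the collection compete for the same reference gene.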
Comparison between gene family size correlation coefficients between LSTC collection and the filtered Oases collection.
| | Oases | LSTCs | Oases | LSTCs |
| | 0.87 | 0.91 | 0.75 | 0.84 |
| | 0.88 | 0.91 | 0.76 | 0.85 |
| | 0.8 | 0.83 | 0.75 | 0.85 |
| | 0.86 | 0.85 | 0.79 | 0.9 |
| | 0.85 | 0.87 | 0.69 | 0.78 |
| | 0.86 | 0.9 | 0.78 | 0.88 |
| | 0.87 | 0.9 | 0.76 | 0.85 |
| | 0.84 | 0.87 | 0.67 | 0.76 |
| | 0.84 | 0.88 | 0.79 | 0.88 |
| | 0.67 | 0.7 | 0.54 | 0.61 |
| | 0.75 | 0.78 | 0.68 | 0.77 |
| | 0.83 | 0.88 | 0.69 | 0.89 |
| | 0.86 | 0.88 | 0.78 | 0.85 |
| | 0.75 | 0.78 | 0.63 | 0.72 |
| | 0.82 | 0.83 | 0.71 | 0.8 |
| | 0.82 | 0.84 | 0.73 | 0.82 |
| | 0.79 | 0.83 | 0.78 | 0.88 |
Oases: correlation coefficients between the size of gene families in the Plaza hom fam database for different reference species and that inferred from the original Oases assembly plus filtering.
LSTCs: correlation coefficients between the size of gene families in the Plaza hom fam database for different reference species and that inferred from our LSTC collection.
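A correlation of this kind could be computed as follows. This is a sketch only: the source does not state which coefficient was used, so Pearson's r is an assumption, as are the function names, the dict-based input format, and the choice to count families absent from the inferred collection as size 0.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def family_size_correlation(reference_sizes, inferred_sizes):
    """Correlate family sizes in a reference species with inferred sizes.

    Both arguments map family id -> number of member genes or loci.
    Assumption: families missing from inferred_sizes count as size 0.
    """
    families = sorted(reference_sizes)
    xs = [reference_sizes[f] for f in families]
    ys = [inferred_sizes.get(f, 0) for f in families]
    return pearson(xs, ys)
```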
Impact of inferred UTRs and putative retained introns on N50 of LSTC representative transcripts.
| Inferred N50 | N50 annotation | N50 transcripts | ||
|
|
| 1,481 | 1,536 | 1,650 |
|
| 1,624 | 1,815 | ||
|
|
| 1,464 | 1,636 | 2,300 |
|
| 1,741 | 1,841 |
N50 of A. thaliana orthologs.
N50 of CDS, excluding inferred UTR and putative retained introns.
N50 of CDS and UTR, excluding putative retained introns.
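For reference, N50 is the length L such that sequences of length at least L account for at least half of the total bases. A minimal sketch using that standard definition (the function name is illustrative; the source does not show how its N50 values were computed):

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

Because N50 is weighted by length, trimming inferred UTRs or retained introns from long transcripts can lower it noticeably even when the number of transcripts is unchanged.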
Mean GC content and read coverage of inferred CDS and putative retained introns.
| | CDS | Intronic | P-value | Exon/Intron fold |
| | 43.9 (3.44) | 37.1 (2.88) | 2.90E-61 | 8.72 |
| | 46.18 (4.81) | 39.67 (3.04) | 3.30E-64 | 4.42 |
CDS: mean G+C% (and standard deviation) in inferred CDS.
Intronic: mean G+C% (and standard deviation) in inferred intronic regions.
P-value: P-value for difference of GC content between CDS and putative retained introns (t-test).
Exon/Intron fold: ratio between the RPKM calculated on the inferred exonic and intronic regions.
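The exon/intron fold is a ratio of two RPKM values. A minimal sketch using the standard RPKM definition (reads per kilobase of region per million mapped reads); the function names and arguments are illustrative, and the read counts would come from whatever mapping was used in the study:

```python
def rpkm(read_count, region_length_nt, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_nt <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return read_count * 1e9 / (region_length_nt * total_mapped_reads)


def exon_intron_fold(exon_reads, exon_length_nt,
                     intron_reads, intron_length_nt,
                     total_mapped_reads):
    """Ratio of exonic RPKM to intronic RPKM."""
    intron_rpkm = rpkm(intron_reads, intron_length_nt, total_mapped_reads)
    if intron_rpkm == 0:
        raise ValueError("intronic RPKM is zero; fold is undefined")
    exon_rpkm = rpkm(exon_reads, exon_length_nt, total_mapped_reads)
    return exon_rpkm / intron_rpkm
```

The total number of mapped reads cancels in the ratio, so the fold depends only on read density (reads per nucleotide) in the two region types.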
COSII gene discovery and cluster fragmentation.
| Clusters per COSII | Oases | LSTC | Oases | LSTC |
| ≥ 4 | 98 | 65 | 141 | 75 |
| 3 | 115 | 83 | 184 | 107 |
| 2 | 423 | 494 | 298 | 344 |
| 1 | 1,789 | 1,990 | 1,699 | 1,992 |
| absent | 444 | 257 | 567 | 351 |
number of COSII genes matching the specified number of reference transcripts.
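A tally of this shape can be produced by counting, for each COSII gene, how many clusters match it and then binning the counts. A minimal sketch, assuming a hypothetical mapping from cluster id to matched COSII gene (names and input format are illustrative):

```python
from collections import Counter


def clusters_per_gene_histogram(gene_ids, cluster_hits):
    """Bin COSII genes by the number of clusters that match them.

    gene_ids: iterable of all COSII gene ids under consideration.
    cluster_hits: dict mapping cluster id -> matched COSII gene id.
    Returns a dict with keys "absent", "1", "2", "3" and ">=4"
    (only bins that are non-empty appear).
    """
    per_gene = Counter(cluster_hits.values())
    bins = Counter()
    for gene in gene_ids:
        n = per_gene.get(gene, 0)
        if n == 0:
            label = "absent"
        elif n >= 4:
            label = ">=4"
        else:
            label = str(n)
        bins[label] += 1
    return dict(bins)
```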