Fernando G Razo-Mendivil, Octavio Martínez, Corina Hayano-Kanashiro.
Abstract
BACKGROUND: RNA-Seq is the preferred method to explore transcriptomes and to estimate differential gene expression. When an organism has a well-characterized and annotated genome, reads obtained from RNA-Seq experiments can be directly mapped to that genome to estimate the number of transcripts present and relative expression levels of these transcripts. However, for unknown genomes, de novo assembly of RNA-Seq reads must be performed to generate a set of contigs that represents the transcriptome. These contig sets contain multiple transcripts, including immature mRNAs, spliced transcripts and allele variants, as well as products of close paralogs or gene families that can be difficult to distinguish. Thus, tools are needed to select a set of less redundant contigs to represent the transcriptome for downstream analyses. Here we describe the development of Compacta to produce contig sets from de novo assemblies.
Keywords: Corset; Grouper; RNA-Seq; Transcriptomics; de novo assembly
Year: 2020 PMID: 32046653 PMCID: PMC7014741 DOI: 10.1186/s12864-020-6528-x
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Data sources. Sources and characteristics of the RNA-Seq data used in this study
| Organism | Source | Accession | Reads (Gb) | Contigs |
|---|---|---|---|---|
| Arabidopsis | [ | ERP016911 | 36.0 | 106,895 |
| Mango | [ | SRP043494 | 62.5 | 107,744 |
| Mouse | [ | PRJNA474181 | 41.0 | 327,616 |
Fig. 1 Execution time for Compacta, Corset and Grouper in three assemblies. Bar diagram of running time in hours for the Compacta, Corset and Grouper algorithms when analyzing assemblies from Arabidopsis, mango and mouse. The numbers above the bars for Corset and Grouper indicate how many times longer the corresponding program ran relative to the Compacta execution time
Fig. 2 Compacta results for the Arabidopsis assembly. Values for d are displayed on the X-axis, and the Y-axis shows the percentage of clusters (z; red line), the number of Arabidopsis sequences identified (n; blue dotted line) and the efficiency (Ef = n/z; green dashed line) as a function of d
Fig. 3 Estimated Recall and Precision of 5 programs in two assemblies. Bar plots for Recall (upper) and Precision (lower) of 5 de novo assembly clustering algorithms applied to two assemblies, Arabidopsis (left) and mouse (right)
Number of representative contigs selected by each algorithm from each transcriptome when run with default parameters
| Algorithm | Arabidopsis (real) | Arabidopsis (simulated) | Mouse (real) | Mouse (simulated) | Mango (real) |
|---|---|---|---|---|---|
| Compacta | 33,542 | 21,518 | 223,169 | 19,844 | 28,356 |
| Corset | 27,080 | 26,414 | 95,079 | 23,716 | 38,448 |
| Grouper | 27,949 | 23,026 | 57,501 | 18,652 | 38,063 |
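As a rough illustration of how much each program compacts an assembly, the representative-contig counts for the real datasets can be divided by the contig totals listed in the data sources table. This is only a sketch: it assumes that the "Contigs" column of that table gives the size of the same assemblies that were clustered, and the variable names are illustrative rather than taken from any of the tools.

```python
# Total contigs per assembly, copied from the data sources table.
# Assumption: these are the assemblies that the three programs clustered.
total_contigs = {"Arabidopsis": 106_895, "Mango": 107_744, "Mouse": 327_616}

# Representative contigs for the real datasets (default parameters),
# copied from the table above.
representatives = {
    "Compacta": {"Arabidopsis": 33_542, "Mouse": 223_169, "Mango": 28_356},
    "Corset": {"Arabidopsis": 27_080, "Mouse": 95_079, "Mango": 38_448},
    "Grouper": {"Arabidopsis": 27_949, "Mouse": 57_501, "Mango": 38_063},
}

# Fraction of the original contigs kept as representatives.
retained = {
    algorithm: {
        organism: count / total_contigs[organism]
        for organism, count in counts.items()
    }
    for algorithm, counts in representatives.items()
}

for algorithm, fractions in retained.items():
    for organism, fraction in fractions.items():
        print(f"{algorithm:9s} {organism:12s} {fraction:.1%} of contigs retained")
```

A lower fraction means a more compact representative set; it says nothing by itself about whether the retained contigs are the right ones.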
Compacta results for the Arabidopsis assembly. d - Parameter value, z - Number of clusters (representative contigs), n - Number of Arabidopsis sequences identified
| d | z | n |
|---|---|---|
| 0.000 | 13,770 | 3,344 |
| 0.035 | 28,704 | 18,381 |
| 0.500 | 34,860 | 20,455 |
| 0.955 | 40,656 | 21,254 |
| 1.000 | 103,262 | 23,607 |
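The efficiency defined in the Fig. 2 caption, Ef = n/z, can be recomputed from the tabulated values. The sketch below uses the raw cluster counts from the table, whereas Fig. 2 plots z as a percentage, so the absolute scale of Ef may differ from the figure. The function and variable names are illustrative and not part of Compacta.

```python
# Rows copied from the table above: (d, z, n).
rows = [
    (0.000, 13_770, 3_344),
    (0.035, 28_704, 18_381),
    (0.500, 34_860, 20_455),
    (0.955, 40_656, 21_254),
    (1.000, 103_262, 23_607),
]


def efficiency(n: int, z: int) -> float:
    """Ef = n / z: identified sequences per cluster."""
    return n / z


ef_by_d = {d: efficiency(n, z) for d, z, n in rows}

# Value of d with the highest efficiency among the tabulated rows only.
best_d = max(ef_by_d, key=ef_by_d.get)

for d, ef in ef_by_d.items():
    print(f"d = {d:.3f}  Ef = {ef:.3f}")
print(f"highest Ef among tabulated values: d = {best_d}")
```

Among these five rows the ratio is largest at d = 0.035; values of d that are not tabulated here are not covered by this calculation.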
Case classification for contig pairs after clustering. a, b, c and d are frequencies resulting from a clustering experiment
| | | Same locus? | | |
|---|---|---|---|---|
| | | Yes | No | Total |
| Clustered? | Yes | a | b | a + b |
| | No | c | d | c + d |
| | Total | a + c | b + d | a + b + c + d |
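Recall and Precision, as reported in Fig. 3, can be derived from a table of this kind. The sketch below assumes the conventional cell layout: a = pairs from the same locus that were clustered together, b = pairs clustered together but from different loci, c = pairs from the same locus that were not clustered together, and d = pairs from different loci that were not clustered together. It also uses the standard definitions Recall = a / (a + c) and Precision = a / (a + b); neither the layout nor the formulas are quoted from this record, and the counts in the example are made up.

```python
def recall(a: int, c: int) -> float:
    """Share of same-locus pairs that ended up in the same cluster."""
    return a / (a + c)


def precision(a: int, b: int) -> float:
    """Share of co-clustered pairs that really come from the same locus."""
    return a / (a + b)


# Hypothetical frequencies, for illustration only (not data from the study).
a, b, c, d = 80, 20, 10, 890

example_recall = recall(a, c)
example_precision = precision(a, b)

print(f"Recall    = {example_recall:.3f}")
print(f"Precision = {example_precision:.3f}")
```

Note that d does not enter either measure, so the typically very large number of correctly separated pairs does not inflate the scores.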