Literature DB >> 28224052

Compacting and correcting Trinity and Oases RNA-Seq de novo assemblies.

Cédric Cabau1, Frédéric Escudié2, Anis Djari3, Yann Guiguen4, Julien Bobe4, Christophe Klopp1,2.   

Abstract

BACKGROUND: De novo transcriptome assembly of short reads is now a common step in expression analysis of organisms lacking a reference genome sequence. Several software packages are available to perform this task. Even if their results are of good quality, it is still possible to improve them in several ways, including redundancy reduction or error correction. Trinity and Oases are two commonly used de novo transcriptome assemblers. The contig sets they produce are of good quality. Still, their compaction (number of contigs needed to represent the transcriptome) and their quality (chimera and nucleotide error rates) can be improved.
RESULTS: We built a de novo RNA-Seq Assembly Pipeline (DRAP) which wraps these two assemblers (Trinity and Oases) in order to improve their results regarding the above-mentioned criteria. DRAP reduces the number of resulting contigs from 1.3- to 15-fold, depending on the read set and the assembler used. This article presents seven assembly comparisons showing, in some cases, drastic improvements when using DRAP. DRAP does not significantly impair assembly quality metrics such as read realignment rate or protein reconstruction counts.
CONCLUSION: Transcriptome assembly is a challenging computational task. Even if good solutions are already available to end-users, these solutions can still be improved while conserving the overall representation and quality of the assembly. The de novo RNA-Seq Assembly Pipeline (DRAP) is an easy-to-use software package that produces compact and corrected transcript sets. DRAP is free, open-source and available under the GPL V3 license at http://www.sigenae.org/drap.

Keywords:  Compaction; Correction; De novo assembly; Quality assessment; RNA-Seq

Year:  2017        PMID: 28224052      PMCID: PMC5316280          DOI: 10.7717/peerj.2988

Source DB:  PubMed          Journal:  PeerJ        ISSN: 2167-8359            Impact factor:   2.984


Background

Second-generation sequencing platforms have enabled the production of large amounts of transcriptomic data, allowing gene expression to be analyzed for a large variety of species and conditions. For species lacking a reference genome sequence, the now-classical processing pipeline includes a de novo transcriptome assembly step. Assembling an accurate transcriptome reference is difficult because of the raw data variability. This variability comes from several factors, including: 1. The variability of gene expression levels, usually ranging between one and millions of copies; 2. The biology of mRNA synthesis, which goes through an early stage of pre-mRNA still containing introns and a late stage in which mRNA can be decayed; 3. The synthesis from pre-mRNA of numerous alternative transcripts; 4. Potential sample contaminations; 5. Sequencing quality biases; 6. The fact that most of the genome can be expressed at low abundance depending on the biological condition, as presented by Djebali et al. (2012) in the results of the ENCODE project. Today there is no unique best solution to these RNA-Seq assembly problems, but several software packages have been proven to generate contig sets comprising most of the expressed transcripts correctly reconstructed. Trinity (Grabherr et al., 2011) and Oases (Schulz et al., 2012) are good examples. The assembled contig sets produced by these packages often contain multiple copies of complete or partial transcripts and also chimeras. Chimeras are structural anomalies of a unique transcript (self-chimeras) or of multiple transcripts (multi-transcript chimeras). They are called “cis” if the transcripts are in the same direction and “trans” if they are in opposite directions. Natural chimeric transcripts exist in some cancer tissues but are rare (Frenkel-Morgenstern et al., 2013). Yang & Smith (2013) have shown the tendency of de novo transcriptome assemblers to produce self-chimeric contigs.
The prevalence of the phenomenon depends on the assembly parameters. Multi-transcript chimeras distort contig annotation. The functions of the transcripts merged in the same contig can be very different, and therefore the often-unique annotation given to such a chimeric contig does not reflect its content. Assemblies also include contigs corresponding to transcription or sequencing noise, a phenomenon often referred to as illegitimate transcription (Chelly et al., 1989). These contigs often have low coverage and are not found in the different replicates of the same condition. Some contigs contain local biological variations or sequencing errors such as substitutions, insertions or deletions. These variations and errors can deeply impact the read alignment rate, create frameshifts which hinder annotation, limit the efficacy of primer design and generate false variations. Assemblies also contain polyA/T tails, which are posttranscriptional marks; they are usually removed before publication. For all these reasons, contig sets usually need error correction. Trinity and Oases have different algorithms, which give them advantages or disadvantages depending on gene expression levels. The main difference comes from their assembly strategy: Trinity chains a greedy algorithm with a de Bruijn graph one, while Oases uses multiple de Bruijn graphs with different k-mers. The first step of Trinity is very effective in assembling parts of highly expressed transcripts, which are then connected in the second step. As shown by Surget-Groba & Montoya-Burgos (2010), the Oases multi-k assembly approach is able to build contigs corresponding to transcripts with very low to very high expression levels. However, highly expressed genes with multiple transcripts generate very complex graphs, mainly because of the presence of variations or sequencing errors, which form new paths possibly considered as valid by the assembler and produce numerous erroneous contigs.
No assembler produces the best contig set in all situations. Bio-informaticians and biologists therefore use different strategies to maximize the quality of the reference contig set (Mbandi et al., 2015; Bens et al., 2016; He et al., 2015; Nakasugi et al., 2014). The simplest approach is to produce a reference set per software package or parameter set, to compare their metrics and to choose the best one. It is also possible to merge different results and filter them. Assemblies can be compared on different criteria. The usual ones are simple contig metrics such as total count, total length, N50, and average length. Assembling equals summarizing (compressing the expression dimension), and therefore a good metric to check the summary quality is the proportion of reads mapped back to the contigs. As a large part of the transcripts correspond to mRNA, it is also possible to use as a quality metric the number of correctly reconstructed proteins, either using a global reference as done by CEGMA (Parra, Bradnam & Korf, 2007) or BUSCO (Simão et al., 2015), or using a protein reference set from a phylogenetically closely related organism. Last, some software packages also rate the contig set or the individual contigs using the above-mentioned criteria (Honaas et al., 2016) or others, for example related only to the way reads map back to the contigs (Smith-Unna et al., 2016; Li et al., 2014; Davidson & Oshlack, 2014). We have built a de novo RNA-Seq Assembly Pipeline (DRAP) in order to correct the following assembly problems: multiple copies of complete or partial transcripts, chimeras, lowly expressed intergenic transcription, insertions and deletions generated by the assemblers, and polyA tails. The pipeline implementation is presented in the next section. The “results and discussion” section compares raw and DRAP assembly metrics for seven different datasets.

Implementation

DRAP is written in Perl, Python, and shell. The software is a set of three command-line tools respectively called runDrap, runMeta and runAssessment. runDrap performs the assembly, including compaction and correction. It produces a contig set but also an HTML log report presenting different assembly metrics. runAssessment compares different contig sets and gathers the results in a global report. runMeta merges and compacts different contig sets; it should be used for very large datasets for which memory or CPU requirements do not enable a unique global assembly, or for highly complex datasets. The modules chained by each tool are presented in a graphical manner in Figs. 1, 2 and 3. Details on the compaction, correction and quality assessment steps of the tools are described hereafter. All software versions, parameters and corresponding default values are presented in Table S1.
Figure 1

Steps in runDRAP workflow.

This workflow is used to produce an assembly from one sample/tissue/development stage. It takes as input R1 from single-end sequencing, or R1 and R2 from paired-end sequencing, and optionally a reference protein set from the closest species with known proteins.

Figure 2

Steps in runMeta workflow.

This workflow is used to produce a merged assembly from several samples/tissues/development stages output by runDRAP. Inputs are runDRAP output folders and optionally a reference protein set.

Figure 3

Steps in runAssessment workflow.

This workflow is used to evaluate the quality of one assembly or to compare several assemblies produced from the same dataset. Inputs are the assembly/ies, R1 and optionally R2, and a reference protein set.


Contig set compaction

Contig compaction removes redundant and lowly expressed contigs. Four different approaches are used to compact contig sets. The first is only implemented for Oases assemblies and corresponds to the sub-selection of only one contig per locus (NODE) produced by the assembler. Oases resolves the connected components of the de Bruijn graph and, for complex sub-graphs, generates several longest paths corresponding to different possible forms. These forms have been shown (https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/velvet-and-oases-transcriptome) to correspond to subparts of the same transcript, which are usually included in one another. Oases provides the locus (connected component of the assembly graph) of origin of each contig as well as its length and depth. The Oasesv2.0.4BestTransChooser.py script sub-selects the longest and most covered contig of a locus. The second compaction method removes contigs included in longer ones: CD-HIT-EST (Fu et al., 2012) orders the contigs by length and removes all the included ones given identity and coverage thresholds. The third method elongates the contigs through a new assembly step; TGICL (Pertea et al., 2003) performs this assembly in DRAP. The last approach filters contigs using their length, or the length of their longest ORF if users are only interested in coding transcripts, and using read coverage, according to the idea that lowly covered contigs often correspond to noise. A last optional filter selects contigs whose TransRate quality score is above the calculated threshold (–optimize parameter). By default, runDrap produces eight contig sets: four include only protein-coding transcripts and four others contain all transcripts. Each group comprises a contig set filtered for low coverage with respectively 1, 3, 5 and 10 fragments per kilobase per million (FPKM) thresholds. Compaction tends to favor contigs containing multiple ORFs.
Because a unique ORF is expected for contig annotation, DRAP splits multi-transcript chimeras into mono-ORF contigs. runMeta also performs a three-step compaction of the contigs. The first step is based on the contig nucleotide content and uses CD-HIT-EST. The second runs CD-HIT on the protein translation of the longest ORF found by EMBOSS getorf. The third, in the same way as runDrap, filters contigs using their length (global or longest ORF), their expression level and optionally their TransRate score, producing the eight result files described in the previous paragraph.
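As an illustration of the first compaction step, here is a minimal Python sketch of a best-transcript-per-locus chooser in the spirit of Oasesv2.0.4BestTransChooser.py. It is a hypothetical simplification, not the actual script: it assumes Oases-style FASTA headers starting with `Locus_<n>_` and keeps only the longest contig per locus, whereas the real script also uses the reported depth.

```python
import re


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def best_transcript_per_locus(records):
    """Keep one contig per Oases locus: the longest one
    (ties broken by input order)."""
    best = {}
    for header, seq in records:
        m = re.match(r"Locus_(\d+)_", header)
        locus = m.group(1) if m else header  # fall back to the header itself
        if locus not in best or len(seq) > len(best[locus][1]):
            best[locus] = (header, seq)
    return list(best.values())
```

A compacted set would then be obtained with `best_transcript_per_locus(read_fasta("oases_transcripts.fa"))` (file name illustrative).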

Contig set corrections

Contig correction splits chimeras and removes duplicated parts, insertions, deletions and polyA/T tails. DRAP corrects contigs in three ways. It first searches for self-chimeras and removes them by splitting contigs into parts or removing duplicated chimeric elements. An in-house script aligns contigs on themselves using bl2seq and keeps only matches having an identity greater than or equal to 96%. A contig is defined as a putative chimera if (i) the longest self-match covers at least 60% of the contig length or (ii) the sum of partial non-overlapping self-matches covers at least 80% of its length. In the first case, the putative chimera is split at the start position of the repeated block. In the second case, the contig is only a repetition of a short single block and is therefore discarded. For the second correction step, DRAP searches for substitutions, insertions and deletions in the read realignment file. When found, it corrects the consensus according to the most represented allele at the given position. Low-coverage alignment areas are usually not very informative, therefore only positions having a minimum depth of 10 reads are corrected. The manual assessment made on DRAP assemblies has shown that a second pass of this algorithm improves consensus correction, because part of the reads change alignment location after the first correction; runDrap consequently runs this step twice. The last correction script eases the publication of the contig set in TSA (https://www.ncbi.nlm.nih.gov/genbank/tsa), the NCBI transcript sequence assembly archive. TSA stores the de novo assembled contig sets of over 1,300 projects. In order to improve the data quality, it performs several tests before accepting a new submission. These tests search for different elements such as sequencing adapters or vectors, polyA or polyT tails and stretches of unknown nucleotides (N). The thresholds used by TSA are presented at https://www.ncbi.nlm.nih.gov/genbank/tsaguide. DRAP performs the same searches on the contig set and corrects the contigs when needed.
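The self-chimera rules above can be expressed compactly. The sketch below is not the in-house bl2seq script, only a hypothetical re-implementation of its decision logic: it takes the off-diagonal self-match intervals (assumed already filtered at ≥96% identity, as 0-based half-open coordinates) and applies the 60%/80% coverage thresholds.

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals, 0-based half-open."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def classify_self_chimera(contig_len, self_matches,
                          longest_frac=0.60, summed_frac=0.80):
    """Classify a contig from its off-diagonal self-matches.

    Returns ("split", position) when the longest repeat covers at least
    60% of the contig (split at the start of the repeated block),
    ("discard", None) when merged repeats cover at least 80% of it
    (the contig is a repetition of one short block), and ("keep", None)
    otherwise.
    """
    if not self_matches:
        return ("keep", None)
    longest = max(self_matches, key=lambda m: m[1] - m[0])
    if (longest[1] - longest[0]) / contig_len >= longest_frac:
        return ("split", longest[0])
    covered = sum(e - s for s, e in merge_intervals(self_matches))
    if covered / contig_len >= summed_frac:
        return ("discard", None)
    return ("keep", None)
```

For example, a 1,000 nt contig whose longest self-match spans positions 300–1,000 (70% of its length) would be split at position 300.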

Quality assessment

All three workflows create an HTML report. The report is a template including HighCharts (http://www.highcharts.com) graphics and tables using JSON files as a database. These files are generated by the different processing steps; the report can therefore also be used to monitor processing progression. Each graphic included in the report can be downloaded in PNG, GIF, PDF or SVG format. Some of the graphics can be zoomed by selecting the area to be enlarged with the mouse. The report tables can be sorted by clicking on the column headers and exported in CSV format. For runDrap and runMeta, the reports present the results of a single contig file. runAssessment processes one or several contig files and one or several read files. It calculates classical contig metrics, checks for chimeras, searches for alignment discrepancies, produces read and fragment alignment rates and assesses completeness against an external global reference by running BUSCO. If provided, it aligns a set of proteins on the contigs to measure their overlap. Last, it runs TransRate, a contig validation software package using four alignment-linked quality measures to generate a global quality criterion for each contig and for the complete set. runAssessment does not modify the contig set content but enables users to check and select the best candidate among different assemblies.
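Among the classical contig metrics reported here (and used throughout Tables 2–4), N50 and L50 are the least self-explanatory. A minimal sketch of the conventional computation from a list of contig lengths, for readers who want to reproduce the table values:

```python
def assembly_metrics(lengths):
    """Classical contig metrics.

    N50 is the length of the contig at which the cumulative length of
    the contigs, sorted from longest to shortest, reaches half the total
    assembly size; L50 is the number of contigs needed to reach it.
    """
    lengths = sorted(lengths, reverse=True)
    total = sum(lengths)
    half, cumulative = total / 2, 0
    for rank, length in enumerate(lengths, start=1):
        cumulative += length
        if cumulative >= half:
            return {"count": len(lengths), "sum": total,
                    "N50": length, "L50": rank}
    return {"count": 0, "sum": 0, "N50": 0, "L50": 0}  # empty assembly
```

For instance, contigs of lengths 10, 8, 6, 4 and 2 give N50 = 8 and L50 = 2, since the two longest contigs already cover half of the 30 nt total.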

Parallel processing and flow control

DRAP runs on Unix machines or clusters. Different steps of the assembly or assessment process are run in parallel if the needed computer infrastructure is available. All modules have been implemented to take advantage of an SGE-compliant HPC environment; they can be adapted to other schedulers through configuration file modification. DRAP first creates a set of directories and shell command files and then launches these files in the predefined order. The ‘–write’ command-line parameter forces DRAP to stop after this first step. At this stage, the user can modify the command files, for example to set parameters which are not directly accessible from runDRAP, runMeta or runAssessment, and then launch the process with the ‘–run’ command-line option. DRAP checks execution outputs at each processing step. If an error has occurred, it adds an error file to the output directory indicating at which step of the processing it happened. After correction, DRAP can be launched again: it will scan the result directory and restart after the last error-free step. The pipeline can easily be modified to accept other assemblers by rewriting the corresponding wrapper, using the same input files and producing correctly named output files.

Results and Discussion

DRAP has been tested on seven different datasets corresponding to five species. These datasets are presented in Table 1 and include five real datasets (Arabidopsis thaliana: At, Bos taurus: Bt, Drosophila melanogaster: Dm, Danio rerio: Dr and Homo sapiens: Hs), one set comprising a large number of diverse samples (Danio rerio multi-samples: Dd) and one simulated dataset (Danio rerio simulated: Ds). The simulated reads have been produced using rsem-simulate-reads (version rsem-1.2.18) (Li & Dewey, 2011). The theta0 value was calculated with the rsem-calculate-expression program on read files from the Danio rerio pineal gland sample (SRR1048059). Table 1 also presents for each dataset: the number, length, type (paired or not) and strandedness of the reads, the public accession number, the tissue and the experimental condition of origin. The results presented hereafter compare the metrics collected from Trinity, Oases, DRAP Trinity and DRAP Oases assemblies of the first six datasets. The multi-sample dataset has been used to compare a strategy in which all reads of the different samples are gathered and processed as one dataset (pooled) to a strategy in which the assemblies are performed by sample and the resulting contigs joined afterwards (meta-assembly). The same assembly pipeline has been used in both strategies, except the contig set merging step, which is specific to the meta-assembly strategy.
Table 1

Datasets.

| Name | Species | Paired | Stranded | Length (nt) | Nb R1 | SRA ID | Tissue | Condition |
|---|---|---|---|---|---|---|---|---|
| At | Arabidopsis thaliana | Yes | | 100 | 32,041,730 | SRR1773557 | Root | Full nutrition |
| | | Yes | | 100 | 30,990,531 | SRR1773560 | Shoot | Full nutrition |
| | | Yes | | 100 | 24,898,527 | SRR1773563 | Root | N starvation |
| | | Yes | | 100 | 54,344,171 | SRR1773569 | Flower | Full nutrition |
| | | Yes | | 150 | 31,467,967 | SRR1773580 | Shoot | N starvation |
| Bt | Bos taurus | Yes | No | 100 | 30,140,101 | SRR2635009 | Milk | Day 70 with low milk production |
| | | Yes | No | 75 | 15,339,206 | SRR2659964 | Endometrium | |
| | | Yes | Yes | 50 | 13,542,516 | SRR2891058 | Oviduct | |
| Dd | Danio rerio | Yes | No | 100 | 35,368,936 | SRR1524238 | Brain | 5 months female |
| | | | | | 54,472,116 | SRR1524239 | Gills | 5 months female |
| | | | | | 85,672,616 | SRR1524240 | Heart | 5 months male and female |
| | | | | | 34,032,976 | SRR1524241 | Muscle | 5 months female |
| | | | | | 59,248,034 | SRR1524242 | Liver | 5 months female |
| | | | | | 46,371,614 | SRR1524243 | Kidney | 5 months male and female |
| | | | | | 96,715,965 | SRR1524244 | Bones | 5 months female |
| | | | | | 43,187,341 | SRR1524245 | Intestine | 5 months female |
| | | | | | 55,185,501 | SRR1524246 | Embryo | 2 days embryo |
| | | | | | 24,878,233 | SRR1524247 | Unfertilized eggs | 5 months female |
| | | | | | 22,026,486 | SRR1524248 | Ovary | 5 months female |
| | | | | | 59,897,686 | SRR1524249 | Testis | 5 months male |
| Dm | Drosophila melanogaster | Yes | Yes | 75 | 21,849,652 | SRR2496909 | Cell line R4 | Time P17 |
| | | | | | 21,864,887 | SRR2496910 | Cell line R4 | Time P19 |
| | | | | | 20,194,362 | SRR2496918 | Cell line R5 | Time P17 |
| | | | | | 22,596,303 | SRR2496919 | Cell line R5 | Time P19 |
| Dr | Danio rerio | Yes | No | 100 | 5,072,822 | SRR1048059 | Pineal gland | Light |
| | | | | | 8,451,113 | SRR1048060 | Pineal gland | Light |
| | | | | | 8,753,789 | SRR1048061 | Pineal gland | Dark |
| | | | | | 7,420,748 | SRR1048062 | Pineal gland | Dark |
| | | | | | 9,737,614 | SRR1048063 | Pineal gland | Dark |
| Ds | Danio rerio | Yes | No | 100 | 30,000,000 | | Simulated | |
| Hs | Homo sapiens | No | No | 25–50 | 15,885,224 | SRR2569874 | TK6 cells | pretreated with the protein kinase C activating tumor |
| | | | | | 15,133,619 | SRR2569875 | TK6 cells | pretreated with the protein kinase C activating tumor |
| | | | | | 19,312,543 | SRR2569877 | TK6 cells | pretreated with the protein kinase C activating tumor |
| | | | | | 21,956,840 | SRR2569878 | TK6 cells | pretreated with the protein kinase C activating tumor |
Summary

Tables 2 and 3 present the metrics collected for the first six datasets. Table 2 provides metrics related to compaction and correction, while Table 3 includes validation metrics. Table 4 collects all three metric types for the pooled versus meta-assembly strategies.
Table 2

Compaction and correction in DRAP and standard assembler.

| Dataset | Assembler | Nb contigs | N50 (nt) | L50 | Sum (nt) | Median length (nt) | Included contigs (%) | Contigs with multi-ORF (%) | Contigs with multi-prot (%) | Chimeric contigs (%) | Contigs with bias* (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| At | Oases | 381,440 | 2,971 | 92,020 | 843,329,264 | 1,816 | 72.75 | 27.89 | 0.26 | 0.80 | 13.88 |
| | DRAP Oases | 32,269 | 2,014 | 9,563 | 56,122,047 | 1,547 | 0.00 | 0.24 | 1.40 | 0.04 | 2.78 |
| | Trinity | 95,008 | 2,198 | 19,140 | 130,969,737 | 991 | 4.05 | 15.63 | 1.22 | 0.20 | 11.29 |
| | DRAP Trinity | 54,923 | 1,761 | 15,857 | 80,258,659 | 1,287 | 0.00 | 0.20 | 0.52 | 0.00 | 2.68 |
| Bt | Oases | 147,163 | 2,739 | 31,441 | 269,085,141 | 1,359 | 71.19 | 7.45 | 0.06 | 0.66 | 6.29 |
| | DRAP Oases | 29,685 | 2,441 | 6,029 | 47,727,730 | 1,111 | 0.00 | 0.28 | 0.32 | 0.03 | 1.23 |
| | Trinity | 89,520 | 2,184 | 12,080 | 90,989,611 | 431 | 4.12 | 3.69 | 0.17 | 0.12 | 5.98 |
| | DRAP Trinity | 46,561 | 2,129 | 9,183 | 64,809,448 | 927 | 0.00 | 0.23 | 0.14 | 0.00 | 1.50 |
| Dm | Oases | 178,696 | 2,220 | 29,086 | 232,776,717 | 756 | 75.48 | 5.14 | 0.18 | 0.35 | 13.11 |
| | DRAP Oases | 21,550 | 2,309 | 3,674 | 29,372,261 | 804 | 0.00 | 0.09 | 0.45 | 0.06 | 2.27 |
| | Trinity | 55,214 | 2,266 | 7,126 | 57,209,890 | 438 | 5.19 | 4.58 | 0.95 | 0.22 | 13.33 |
| | DRAP Trinity | 27,236 | 2,146 | 5,240 | 37,249,612 | 914 | 0.00 | 0.07 | 0.31 | 0.00 | 3.59 |
| Dr | Oases | 702,640 | 2,715 | 114,042 | 1,059,904,844 | 857 | 70.99 | 2.80 | 0.01 | 1.39 | 11.52 |
| | DRAP Oases | 46,831 | 2,757 | 9,046 | 82,268,872 | 1,173 | 0.00 | 0.15 | 0.27 | 0.16 | 13.05 |
| | Trinity | 126,210 | 1,279 | 21,003 | 96,279,046 | 418 | 5.56 | 0.81 | 0.08 | 0.56 | 23.63 |
| | DRAP Trinity | 58,114 | 1,644 | 13,022 | 68,900,396 | 866 | 0.00 | 0.07 | 0.12 | 0.00 | 7.41 |
| Ds | Oases | 131,982 | 2,975 | 28,618 | 280,469,694 | 1,619 | 75.05 | 3.05 | 0.06 | 0.14 | 4.07 |
| | DRAP Oases | 21,191 | 3,000 | 4,872 | 46,994,928 | 1,744 | 0.00 | 0.08 | 0.25 | 0.02 | 1.10 |
| | Trinity | 40,335 | 2,398 | 7,159 | 58,571,859 | 910 | 3.12 | 1.82 | 0.37 | 0.09 | 6.47 |
| | DRAP Trinity | 31,113 | 2,381 | 6,492 | 51,580,407 | 1,205 | 0.00 | 0.04 | 0.14 | 0.00 | 1.15 |
| Hs | Oases | 101,271 | 2,048 | 20,131 | 132,681,065 | 895 | 55.73 | 5.55 | 0.03 | 0.11 | 7.51 |
| | DRAP Oases | 30,201 | 1,880 | 5,542 | 34,670,862 | 540 | 0.00 | 0.15 | 0.08 | 0.00 | 0.68 |
| | Trinity | 57,195 | 1,687 | 7,843 | 47,639,190 | 384 | 2.63 | 2.85 | 0.12 | 0.09 | 5.79 |
| | DRAP Trinity | 39,489 | 1,705 | 6,621 | 38,557,758 | 540 | 0.00 | 0.11 | 0.06 | 0.00 | 0.59 |

Notes.

* Contigs with consensus variations corrected by DRAP.

Bold values are “best in class” values between raw and DRAP assemblies.

Table 3

Validation DRAP against standard assembler.

| Dataset | Assembler | Contigs with 0 ORF (%) | Contigs with 1 ORF (%) | Contigs with complete ORF (%) | Contigs with 0 protein (%) | Contigs with 1 protein (%) | Nb reference proteins aligned | Reads mapped (%) | Properly paired (%) | TransRate score × 100 |
|---|---|---|---|---|---|---|---|---|---|---|
| At | Oases | 18.96 | 53.15 | 65.72 | 94.27 | 5.57 | 23,457 | 97.18 | 90.33 | 2.39 |
| | DRAP Oases | 9.90 | 89.86 | 72.38 | 39.38 | 59.22 | 20,895 | 96.53 | 90.21 | 33.16 |
| | Trinity | 38.97 | 45.40 | 40.32 | 81.09 | 17.69 | 20,290 | 93.81 | 85.78 | 10.04 |
| | DRAP Trinity | 13.89 | 85.91 | 55.51 | 69.85 | 29.64 | 17,916 | 92.99 | 85.44 | 24.77 |
| Bt | Oases | 36.07 | 56.48 | 28.29 | 93.33 | 6.61 | 10,560 | 90.53 | 87.20 | 2.71 |
| | DRAP Oases | 32.59 | 67.13 | 25.70 | 67.63 | 32.05 | 10,456 | 91.03 | 88.59 | 23.30 |
| | Trinity | 64.13 | 32.18 | 15.33 | 89.48 | 10.35 | 10,313 | 92.18 | 86.66 | 4.99 |
| | DRAP Trinity | 38.55 | 61.23 | 24.86 | 79.95 | 19.91 | 10,144 | 91.03 | 85.97 | 13.51 |
| Dm | Oases | 46.19 | 48.67 | 20.27 | 96.43 | 3.39 | 6,873 | 92.86 | 83.24 | 2.21 |
| | DRAP Oases | 48.80 | 51.11 | 31.45 | 70.30 | 29.25 | 6,731 | 92.02 | 82.21 | 41.17 |
| | Trinity | 67.53 | 27.89 | 18.49 | 89.63 | 9.42 | 6,494 | 93.24 | 85.07 | 17.56 |
| | DRAP Trinity | 45.94 | 53.99 | 32.23 | 77.76 | 21.93 | 6,358 | 85.77 | 78.09 | 34.23 |
| Dr | Oases | 56.81 | 40.39 | 23.37 | 97.98 | 2.01 | 15,186 | 85.73 | 75.16 | 0.67 |
| | DRAP Oases | 40.20 | 59.65 | 33.43 | 70.89 | 28.84 | 14,901 | 88.26 | 82.84 | 25.19 |
| | Trinity | 66.76 | 32.43 | 9.79 | 92.34 | 7.58 | 10,734 | 84.11 | 75.70 | 5.81 |
| | DRAP Trinity | 39.74 | 60.19 | 20.16 | 82.44 | 17.44 | 11,272 | 81.33 | 75.43 | 18.25 |
| Ds | Oases | 24.52 | 72.43 | 41.60 | 89.47 | 10.47 | 14,929 | 83.62 | 74.34 | 8.56 |
| | DRAP Oases | 12.80 | 87.11 | 53.73 | 35.56 | 64.19 | 14,913 | 90.32 | 88.22 | 59.08 |
| | Trinity | 37.72 | 60.46 | 30.29 | 67.37 | 32.26 | 14,394 | 88.79 | 85.37 | 38.77 |
| | DRAP Trinity | 22.85 | 77.11 | 37.65 | 57.53 | 42.33 | 14,364 | 88.28 | 85.59 | 50.51 |
| Hs | Oases | 44.51 | 49.94 | 21.18 | 93.04 | 6.93 | 7,554 | 88.30 | NA | NA |
| | DRAP Oases | 46.95 | 52.91 | 20.06 | 77.28 | 22.64 | 7,463 | 86.90 | NA | NA |
| | Trinity | 69.02 | 28.13 | 11.70 | 88.53 | 11.35 | 7,199 | 86.76 | NA | NA |
| | DRAP Trinity | 55.48 | 44.41 | 16.07 | 83.46 | 16.48 | 7,124 | 84.08 | NA | NA |

Notes.

Bold values are “best in class” values between raw and DRAP assemblies.

Table 4

Pooled samples vs meta-assembly strategies on the Danio rerio multi-samples dataset (Dd).

| Assembly strategy | Pooled Oases | Meta Oases | Pooled Trinity | Meta Trinity |
|---|---|---|---|---|
| Compaction | | | | |
| Nb seq | 42,726 | 43,049 | 62,327 | 65,271 |
| N50 (nt) | 3,565 | 3,379 | 2,027 | 2,237 |
| L50 | 10,409 | 9,259 | 14,956 | 13,106 |
| Sum (nt) | 114,371,598 | 99,928,206 | 94,993,910 | 98,421,439 |
| Median length (nt) | 2,182 | 1,766 | 1,217 | 1,052 |
| Contigs with multi-ORF (%) | 0.33 | 0.50 | 0.13 | 0.17 |
| Contigs with multi-prot (%) | 1.39 | 1.73 | 0.64 | 0.95 |
| Correction | | | | |
| Chimeric contigs (%) | 0.11 | 0.21 | 0.00 | 0.00 |
| Contigs with bias* (%) | 75.19 | 68.00 | 58.79 | 61.88 |
| Validation | | | | |
| Contigs with 0 ORF (%) | 24.79 | 38.77 | 37.24 | 50.63 |
| Contigs with 1 ORF (%) | 74.88 | 60.72 | 62.63 | 49.20 |
| Contigs with complete ORF (%) | 61.84 | 46.36 | 38.80 | 31.55 |
| Contigs with 0 protein (%) | 58.52 | 57.15 | 75.23 | 72.02 |
| Contigs with 1 protein (%) | 40.09 | 41.13 | 24.13 | 27.03 |
| Nb reference proteins aligned | 32,367 | 35,432 | 26,041 | 33,385 |
| Reads mapped (%) | 87.38 | 87.57 | 77.82 | 85.19 |
| Properly paired (%) | 78.88 | 80.13 | 70.13 | 77.30 |
| TransRate score × 100 | 28.66 | 29.49 | 17.97 | 23.36 |

Notes.

* Contigs with consensus variations corrected by DRAP.

Bold values are “best in class” values between raw and DRAP assemblies.

The improvement in compactness is measured by three criteria. The first is the number of assembled contigs, presented in Fig. 4. The differences between raw Oases and Trinity assemblies and DRAP assemblies are very significant, ranging from 1.3-fold to 15-fold. The impact of DRAP on Oases assemblies (from 3.4- to 15-fold) is much more significant than on Trinity assemblies (from 1.3- to 2.2-fold). The Oases multi-k assembly strategy generates a lot of redundant contigs which are not removed at the internal Oases merge step. The second criterion is the percentage of inclusions, i.e., contigs which are part of longer ones. Oases and Trinity inclusion rates range respectively from 55 to 75% and from 2.3 to 5.5% (Table 2). Because of its inclusion removal step, this rate is null for DRAP assemblies. The last compaction criterion presented here is the total number of nucleotides in the contigs. The ratios between raw and DRAP assembly sizes for Oases and Trinity range respectively from 3.4- to 14.8-fold and from 1.1- to 2.6-fold (Table 2). All these metrics show that DRAP produces fewer contigs with less redundancy, resulting in an assembly with a much smaller total size.
Figure 4

Number of contigs.

The figure shows for the different assemblers (Oases, DRAP Oases, Trinity, DRAP Trinity) the number of contigs produced for each dataset.

Another metric that can be negatively correlated to compactness, but has to be taken into account, is the number of multi-ORF contigs found in the assemblies. The ratios of multi-ORF contigs between raw and DRAP assemblies range from 11- to 116-fold (Table 2). The DRAP multi-transcript chimera splitting procedure significantly improves this criterion. In order to check whether the compaction step only selects one isoform per gene, we compared the number of genes with several transcripts aligning on different contigs before and after DRAP. A transcript is linked to a contig if its best blat hit has over 90% query identity and 90% query coverage. The test has been performed on the Dr and Ds datasets assembled with Oases and Trinity. The number of alternatively spliced isoforms decreases more between raw and DRAP assemblies for Oases than for Trinity (Table 5). This reduction is 69% and 23% in the real dataset (Dr) and 83% and 18% in the simulated dataset for Oases and Trinity respectively. However, the spliced-form reduction does not impact the gene representation in the compacted sets (Table 5). Remarkably, the gene representation is even increased for the real dataset when processed with DRAP Oases. This results from the different merging strategies used by Oases and DRAP Oases. Using TGICL, DRAP is able, in some cases, to correctly merge gene parts generated by the Oases multi-k assemblies, and to do so more efficiently than the built-in Oases merge procedure.
Table 5

Compaction vs gene representation on Danio rerio simulated dataset (Ds) and Danio rerio dataset (Dr).

| Dataset | Assembly | Nb seq | All genes | Multi-isoform genes | Raw/DRAP all genes (%) | Raw/DRAP multi-isoform genes (%) |
|---|---|---|---|---|---|---|
| Ds | Raw Oases | 131,982 | 14,396 | 3,593 | −1.74 | −82.99 |
| | DRAP Oases | 21,191 | 14,145 | 611 | | |
| | Raw Trinity | 40,335 | 12,457 | 1,792 | −2.04 | −17.97 |
| | DRAP Trinity | 31,113 | 12,203 | 1,470 | | |
| Dr | Raw Oases | 702,640 | 11,613 | 2,177 | +10.40 | −69.09 |
| | DRAP Oases | 46,831 | 12,821 | 673 | | |
| | Raw Trinity | 126,210 | 8,310 | 801 | −2.33 | −22.60 |
| | DRAP Trinity | 58,114 | 8,116 | 620 | | |

Notes.

Bold values are “best in class” values between raw and DRAP assemblies.

DRAP corrects contigs in two ways: removing self-chimeras and rectifying consensus substitutions, insertions and deletions when the consensus does not represent the major allele at the position in the read re-alignment file. Self-chimeras appear in Oases and Trinity contig sets at rates ranging respectively from 0.11 to 1.39% and from 0.09 to 0.56%. In DRAP, the corresponding figures drop to 0.01–0.16% and 0.00–0.01%. Concerning consensus correction, only five datasets can be taken into account, i.e., At, Bt, Dm, Ds and Hs. The Dr Oases assembly generates such a large number of contigs and total length that it significantly decreases the average coverage and therefore limits the number of positions for which the correction can be made. As shown in Fig. 5 and Table S2, the Dr dataset is an outlier on this criterion. Regarding the five other datasets, raw-versus-DRAP correction ratios range from 1.7 to 18.6 for insertions, 3.1 to 27.1 for deletions and 2.7 to 14.1 for substitutions. The DRAP correction step significantly lowers the number of positions for which the consensus does not correspond to the major allele found in the alignment. In order to check the positive impact of the correction step, the Danio rerio reference proteome has been aligned to the simulated dataset (Ds) contigs before and after correction. 94.5% of DRAP Oases contigs and 86.2% of DRAP Trinity contigs which have been corrected have improved alignment scores (Data S1, section “Contig set correction step assessment”).
Figure 5

Consensus error rates.

(A) presents the ratio of the global error rates between raw and DRAP assemblies for each dataset (data from Table 2, column 12). (B), (C) and (D) present the ratios of the error rates for substitutions, insertions and deletions respectively, between raw and DRAP assemblies for each dataset (data from Table S2).


Assembly quality assessment

The two previous parts have shown the beneficial impact of DRAP on assembly compactness and error rates, but this should not impair quality metrics such as read and read-pair alignment rates, the number of ORFs and complete ORFs found in the contigs, the number of proteins of the known proteome mapped on the contigs, or TransRate scores. Read and read-pair alignment rate differences between raw and DRAP assemblies are usually very low, between 1 and 2%, and can sometimes be in favor of DRAP (Fig. 6). In our test sets, the difference is significant (7.5%) for Dm when comparing Trinity to DRAP Trinity. This comes from the removal by DRAP of a highly expressed transcript (Ensembl: FBtr0100888, mitochondrial large ribosomal RNA) because it does not fulfill the criterion of having at least one 200 base pair long ORF, despite having over 11M reads aligned on the corresponding contig in the Trinity assembly. The DRAP Oases assembly was not impacted because it builds a longer contig for this transcript, with an ORF long enough for it to be selected in the additional step.
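The 200 bp ORF criterion that caused this removal can be sketched as follows. This is a simplified, forward-strand-only illustration (a real ORF filter would also scan the reverse complement); the function names are hypothetical, not taken from DRAP.

```python
def longest_orf(seq, stops=("TAA", "TAG", "TGA")):
    """Length in bases of the longest ATG-to-stop ORF on the forward
    strand, scanning all three reading frames."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG" and start is None:
                start = i           # open a candidate ORF
            elif codon in stops and start is not None:
                best = max(best, i + 3 - start)  # include the stop codon
                start = None
    return best

def keep_contig(seq, min_orf=200):
    """Apply the filter: keep contigs with at least one ORF >= 200 bp."""
    return longest_orf(seq) >= min_orf
```

A contig consisting of `"ATG" + "AAA" * 70 + "TAA"` carries a 216 bp ORF and passes, while a short `"ATGAAATAA"` contig (9 bp ORF) is filtered out.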
Figure 6

Reads re-alignment rates.

(A) and (B) show respectively the alignment rates for reads and read pairs for the four assemblies of each dataset.

The reference proteome has been aligned on the contigs, and matches with over 80% identity and 80% protein coverage have been counted (Fig. 7). These figures give a good overview of the number of well-reconstructed proteins in the contig sets. For all datasets except one (At), the numbers of proteins are very close between raw and DRAP results. For the At dataset the difference is 12.2% for Oases and 13.2% for Trinity. This is due to the FPKM filtering step performed by DRAP and to the expression profile of this dataset, which mixes different tissues (root, shoot and flower) and conditions (full nutrition and starvation). Contigs corresponding to low expression in one condition do not have sufficient overall expression to pass the DRAP expression filter threshold and are therefore eliminated from the final set. Mixed libraries can benefit from the meta-assembly approach presented in the next section.
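The 80% identity / 80% coverage counting rule can be sketched as a simple filter over alignment records. The tuple layout (protein id, percent identity, percent of the protein length covered) is an assumed representation of parsed alignment output, not a DRAP data structure.

```python
def count_well_reconstructed(hits, min_identity=80.0, min_coverage=80.0):
    """Count distinct reference proteins with at least one alignment
    passing both thresholds. `hits` is a list of
    (protein_id, percent_identity, percent_protein_coverage) tuples."""
    kept = {pid for pid, ident, cov in hits
            if ident >= min_identity and cov >= min_coverage}
    return len(kept)

hits = [("P1", 95.0, 99.0),   # passes both thresholds
        ("P2", 92.0, 60.0),   # coverage too low
        ("P2", 85.0, 88.0),   # a second alignment for P2 that passes
        ("P3", 70.0, 95.0)]   # identity too low
# P1 and P2 are counted once each; P3 is rejected.
```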
Figure 7

Proteins realignment rates.

The figure shows the number of proteins which have been aligned on the contig sets with more than 80% identity and 80% coverage for each assembler and dataset.

TransRate global scores (Fig. 8) are much higher for DRAP assemblies compared to raw ones. This comes from the compaction performed by DRAP and the limited impact it has on the read alignment rate.
Figure 8

TransRate scores.

The figure presents the TransRate scores of the four assemblers for each dataset.

DRAP has a limited negative effect on the assembly quality metrics, and sometimes even improves some of them. Cases in which multiple libraries from very distinct conditions are mixed can affect the results, and it is good practice to systematically compare raw and DRAP assemblies. It is also worth noting that the Oases multi-k strategy outperforms Trinity for all datasets regarding the number of well-reconstructed proteins.

Pooled versus meta-assembly strategies

In the previous sections we compared results from raw and DRAP assemblies. This section compares results from the pooled versus meta-assembly strategies, both using the DRAP assembly pipeline (Table 4). Because of the read re-alignment filtering thresholds used in DRAP, we expect different metrics between a pooled assembly and merged per-sample assemblies (meta-assembly). DRAP includes the runMeta workflow, which performs this task.

Differences in compaction and correction are larger between Trinity and Oases than between the pooled and meta-assembly strategies. Pooled assemblies yield significantly worse results for the number of reference proteins and the number of read pairs aligned on the contigs. This comes from the filtering strategy, which eliminates contigs with low expression in a given condition when all samples are merged, but keeps these contigs in a per-sample assembly and meta-assembly strategy. We therefore recommend using runMeta when the assembly input samples mix distinct conditions with specific and variable expression patterns.

Assemblies fidelity check using simulated reads

The simulation process links each read with its transcript of origin. With this information it is possible to link contigs and transcripts. Here, the transcript-contig link was calculated using exon content and order in both sets (method explained in Data S1). The results presented in Table 6 first show that the assembly process loses between 15.76 and 19.97% of the exons compared to the initial transcript set. This loss is close to 22% for all assemblies when the exon order is taken into account. As shown in Fig. 9, this is mainly the case for transcripts with low read coverage. The figures show once more that DRAP has a very limited negative impact on the number of exons retrieved in the correct order.
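The exon-order part of this check can be sketched as follows. This is a simplified stand-in for the method of Data S1: it only tests whether the exons recovered in a contig keep the relative order they have in the source transcript; exon identifiers and the function name are illustrative.

```python
def exons_in_order(transcript_exons, contig_exons):
    """True when the exons recovered in the contig appear in the same
    relative order as in the source transcript. Exons absent from the
    transcript are ignored."""
    ranks = {exon: i for i, exon in enumerate(transcript_exons)}
    # Positions (in transcript order) of the exons found in the contig.
    recovered = [ranks[e] for e in contig_exons if e in ranks]
    # The order is preserved when the rank sequence is strictly increasing.
    return all(a < b for a, b in zip(recovered, recovered[1:]))
```

For example, a contig recovering exons e1, e3, e4 of a four-exon transcript preserves the order (one exon is merely lost), whereas a contig presenting e2 before e1 does not.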
Table 6

Structure validation on Danio rerio simulated dataset (Ds).

Assembly        Retrieved exons    Exons in right contig    Exons in right order    Contigs with more than 1 gene    Max number of genes by contig
Real assembly   99.81%             99.81%                   99.50%                  0.16% (46)                       5
Raw Oases       80.03%             77.83%                   77.61%                  2.77% (537)                      221
DRAP Oases      80.21%             77.54%                   77.29%                  4.13% (671)                      203
Raw Trinity     84.24%             77.30%                   77.10%                  3.65% (717)                      339
DRAP Trinity    83.30%             76.65%                   76.47%                  3.17% (602)                      327

Notes.

Bold values are “best in class” values between raw and DRAP assemblies.

Figure 9

Gene reconstruction versus expression depth using simulated reads.

The figure presents the proportion of correctly built transcripts (method presented in Data S1 section “Contig validation using exon re-alignment and order checking”) versus the read count per transcript.

Table 6 shows the number of contigs linked to more than one gene. DRAP compaction and its ORF splitting feature could have antagonistic effects on this criterion, and depending on the assembler the figures are or are not in favor of DRAP. Table 6 also presents the maximum number of genes linked to a single contig. These clusters correspond to zinc finger gene family members which have been assembled as a single contig; between 92.3 and 93.7% of the clustered transcripts belong to this family. De novo assembly tools are not able to distinguish transcripts originating from different genes when the nucleotide content is highly similar.

Conclusion

Different software packages are available to assemble de novo transcriptomes from short reads. Trinity and Oases are commonly used packages which produce good quality references. The DRAP assembly pipeline is able to compact and correct contig sets with usually very low quality loss. As no package outperforms the others in all cases, producing different assemblies and comparing their metrics is good general practice.
References

1.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09       Impact factor: 6.937

2.  Inferring bona fide transfrags in RNA-Seq derived-transcriptome assemblies of non-model organisms.

Authors:  Stanley Kimbung Mbandi; Uljana Hesse; Peter van Heusden; Alan Christoffels
Journal:  BMC Bioinformatics       Date:  2015-02-21       Impact factor: 3.169

3.  Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels.

Authors:  Marcel H Schulz; Daniel R Zerbino; Martin Vingron; Ewan Birney
Journal:  Bioinformatics       Date:  2012-02-24       Impact factor: 6.937

4.  RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.

Authors:  Bo Li; Colin N Dewey
Journal:  BMC Bioinformatics       Date:  2011-08-04       Impact factor: 3.307

5.  Combining transcriptome assemblies from multiple de novo assemblers in the allo-tetraploid plant Nicotiana benthamiana.

Authors:  Kenlee Nakasugi; Ross Crowhurst; Julia Bally; Peter Waterhouse
Journal:  PLoS One       Date:  2014-03-10       Impact factor: 3.240

6.  Evaluation of de novo transcriptome assemblies from RNA-Seq data.

Authors:  Bo Li; Nathanael Fillmore; Yongsheng Bai; Mike Collins; James A Thomson; Ron Stewart; Colin N Dewey
Journal:  Genome Biol       Date:  2014-12-21       Impact factor: 13.583

7.  TransRate: reference-free quality assessment of de novo transcriptome assemblies.

Authors:  Richard Smith-Unna; Chris Boursnell; Rob Patro; Julian M Hibberd; Steven Kelly
Journal:  Genome Res       Date:  2016-06-01       Impact factor: 9.043

8.  Full-length transcriptome assembly from RNA-Seq data without a reference genome.

Authors:  Manfred G Grabherr; Brian J Haas; Moran Yassour; Joshua Z Levin; Dawn A Thompson; Ido Amit; Xian Adiconis; Lin Fan; Raktima Raychowdhury; Qiandong Zeng; Zehua Chen; Evan Mauceli; Nir Hacohen; Andreas Gnirke; Nicholas Rhind; Federica di Palma; Bruce W Birren; Chad Nusbaum; Kerstin Lindblad-Toh; Nir Friedman; Aviv Regev
Journal:  Nat Biotechnol       Date:  2011-05-15       Impact factor: 54.908

9.  CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors:  Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal:  Bioinformatics       Date:  2012-10-11       Impact factor: 6.937

10.  Landscape of transcription in human cells.

Authors:  Sarah Djebali; Carrie A Davis; Angelika Merkel; Alex Dobin; Timo Lassmann; Ali Mortazavi; Andrea Tanzer; Julien Lagarde; Wei Lin; Felix Schlesinger; Chenghai Xue; Georgi K Marinov; Jainab Khatun; Brian A Williams; Chris Zaleski; Joel Rozowsky; Maik Röder; Felix Kokocinski; Rehab F Abdelhamid; Tyler Alioto; Igor Antoshechkin; Michael T Baer; Nadav S Bar; Philippe Batut; Kimberly Bell; Ian Bell; Sudipto Chakrabortty; Xian Chen; Jacqueline Chrast; Joao Curado; Thomas Derrien; Jorg Drenkow; Erica Dumais; Jacqueline Dumais; Radha Duttagupta; Emilie Falconnet; Meagan Fastuca; Kata Fejes-Toth; Pedro Ferreira; Sylvain Foissac; Melissa J Fullwood; Hui Gao; David Gonzalez; Assaf Gordon; Harsha Gunawardena; Cedric Howald; Sonali Jha; Rory Johnson; Philipp Kapranov; Brandon King; Colin Kingswood; Oscar J Luo; Eddie Park; Kimberly Persaud; Jonathan B Preall; Paolo Ribeca; Brian Risk; Daniel Robyr; Michael Sammeth; Lorian Schaffer; Lei-Hoon See; Atif Shahab; Jorgen Skancke; Ana Maria Suzuki; Hazuki Takahashi; Hagen Tilgner; Diane Trout; Nathalie Walters; Huaien Wang; John Wrobel; Yanbao Yu; Xiaoan Ruan; Yoshihide Hayashizaki; Jennifer Harrow; Mark Gerstein; Tim Hubbard; Alexandre Reymond; Stylianos E Antonarakis; Gregory Hannon; Morgan C Giddings; Yijun Ruan; Barbara Wold; Piero Carninci; Roderic Guigó; Thomas R Gingeras
Journal:  Nature       Date:  2012-09-06       Impact factor: 49.962
