Literature DB >> 31337347

Impact of sequencing depth and technology on de novo RNA-Seq assembly.

Jordan Patterson1, Eric J Carpenter2, Zhenzhen Zhu3, Dan An3, Xinming Liang3, Chunyu Geng3, Radoje Drmanac3, Gane Ka-Shu Wong4,5.   

Abstract

BACKGROUND: RNA-Seq data is inherently nonuniform for different transcripts because of differences in gene expression. This makes it challenging to decide how much data should be generated from each sample. How much should one spend to recover the less expressed transcripts? The sequencing technology used is another consideration, as there are inevitably always biases against certain sequences. To investigate these effects, we first looked at high-depth libraries from a set of well-annotated organisms to ascertain the impact of sequencing depth on de novo assembly. We then looked at libraries sequenced from the Universal Human Reference RNA (UHRR) to compare the performance of Illumina HiSeq and MGI DNBseq™ technologies.
RESULTS: On the issue of sequencing depth, the amount of exomic sequence assembled plateaued using data sets of approximately 2 to 8 Gbp. However, the amount of genomic sequence assembled did not plateau for many of the analyzed organisms. Most of the unannotated genomic sequences are single-exon transcripts whose biological significance will be questionable for some users. On the issue of sequencing technology, both of the analyzed platforms recovered a similar number of full-length transcripts. The missing "gap" regions in the HiSeq assemblies were often attributed to higher GC contents, but this may be an artefact of library preparation and not of sequencing technology.
CONCLUSIONS: Increasing sequencing depth beyond modest data sets of less than 10 Gbp recovers a plethora of single-exon transcripts undocumented in genome annotations. DNBseq™ is a viable alternative to HiSeq for de novo RNA-Seq assembly.

Entities:  

Keywords:  Rna-seq assembly; Sequencing depth; Sequencing technology

Year:  2019        PMID: 31337347      PMCID: PMC6651908          DOI: 10.1186/s12864-019-5965-x

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

RNA-Seq is a widely used next-generation sequencing (NGS) methodology for transcriptome profiling [1], both to identify novel transcript sequences and for differential expression studies. Much has been written about this methodology and it is not our intention to rehash the many excellent articles that can be found in the literature [2, 3]. We focus instead on how continuing improvements in NGS technologies have brought new perspectives to two fundamental questions that many scientists ask before they initiate a RNA-Seq experiment. First, with the decreasing costs of generating these data, one can now sequence a given library many times deeper than before. Motivated by ongoing projects on the sequencing of phylodiverse species with no reference genomes, e.g. 1KP for plants [4] and 1KITE for insects [5], we wanted to see how many novel transcript sequences can be recovered by de novo assemblies of RNA-Seq data if a project is willing to spend more money. This is not an issue that will soon be mooted by the ever lower costs of sequencing complete genomes, because genome size variations (e.g. 2,342-fold for land plants [6]), polyploidy, and the outbred nature of many samples collected in the wild make genome assembly a continuing challenge. Choice among sequencing platforms is the second issue to be addressed. We will explore how DNBSeq™, a recent platform from MGI (a subsidiary of BGI Group), may serve as an alternative to the market-leading platform from Illumina. It uses DNA nanoballs (DNB) and combinatorial probe-anchor synthesis (cPAS) [7], building on technology from Complete Genomics. Both platforms provide short but high-quality reads, in contrast to the long but low-quality reads offered by Pacific Biosciences and Oxford Nanopore. There are several technical differences in the two sequencing pipelines, which are illustrated in Fig. 1.
Fig. 1

Technical comparison of DNBSeq™ and Illumina platforms

Technical comparison of DNBSeq™ and Illumina platforms In both platforms, DNA molecules (1) are fragmented. Adapters are ligated to these fragments and are processed to produce libraries (2) containing single-strand DNA circles with an adapter in the middle for DNBSeq™ and linear double stranded DNA with adapters at each end for Illumina. These DNA sequences are then replicated to produce an amplified signal for sequencing (3), using rolling circle replication for DNBSeq™ (producing DNBs) and bridge PCR amplification for Illumina (producing clusters). Rolling circle replication is a linear amplification where each replicate is generated from the original fragment and does not produce detectable clonal amplification errors or molecular switching of sample barcodes [8], produces a small percentage of spot duplicates, and has reduced coverage bias, resulting in better coverage of some GC-rich regions. The exponential amplification performed on Illumina libraries is known to have issues with molecular barcode switching [9] and GC-rich sequence coverage [10-12]. DNBSeq™ and newer Illumina instruments use patterned flow cells (4) for higher nanoball / cluster density. Both platforms use stepwise sequencing by polymerase on ssDNA template with fluorescently labeled nucleotides (5). The small size of the DNBs (~ 200 nm) relative to PCR clusters results in smaller and more concentrated DNA spot fluorescence, giving DNBSeq™ a higher signal-to-noise ratio, higher spot densities, and faster sequencing times. DNBseq™ has also been investigated for use in analyzing small non-coding RNAs [7], palaeogenomic sequencing [13], metagenomic sequencing [14], germline and somatic variant identification in whole genomes [15], and transcriptome analysis in plants [16].

Analyses

Increased sequencing depth enriches for single-exon transcripts not reported in genome annotations

Sequencing depth is an important consideration for RNA-Seq because of the tradeoff between the cost of the experiment and the completeness of the resultant data. Transcript abundance follows an exponential distribution, and greater sequencing depths are required to recover less abundant transcripts. As more sequencing is done to assemble less abundant transcripts, a greater proportion of the additional reads will come from transcripts that already have sufficient depth to assemble. Hence there are diminishing returns as sequencing depth increases, and intuitively one might expect the number of recovered transcripts to asymptote. What happens in practice is surprising, as we will show for our representative species.

Datasets

RNA-seq datasets are usually a few Gbp in size, but in the public databases there are some unusually large datasets with sizes in the many 10’s of Gbp. Importantly for our purposes, these were sequenced from a single library, not by pooling sequences from multiple libraries. We could therefore sub-sample these datasets to simulate the consequences of doing RNA-seq at varying sequencing depths. To benchmark the resultant assemblies, we used species with longstanding (decade old) reference genome annotations, i.e. Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, Oryza sativa, Arabidopsis thaliana. Tables 1 and 2 show the datasets used and their related genomes, respectively. The data was downloaded from the National Center for Biotechnology Information’s (NCBI) Sequence Read Archive (SRA).
Table 1

Datasets used to study effect of sequencing depth

SRA Run IDSpeciesTissue DescriptionPlatform# of Spots# of BasesDate PublishedEndLength
SRR1047863 Homo sapiens post-mortem brain (dorsolateral prefrontal cortex / Brodmann area 46)Illumina HiSeq 1000258 733 82752 264 233 0542014-01-08101 bp
SRR980471Homo sapiensCD19+ primary cells, hispanic male age 37Illumina HiSeq 2000263 034 15539 981 191 5602013-09-1276 bp
SRR1732347 Mus musculus male brain age 8 weeks, strain C57BL/6 JIllumina HiSeq 2000354 274 08771 563 365 5742014-12-23101 bp
SRR1509508 Drosophila melanogaster adult; strain: y; cn bw spIllumina HiSeq 2000140 645 54028 410 399 0802014-07-09101 bp
SRR1523365 Caenorhabditis elegans at least 100 adult wormsIllumina HiSeq 2000228 557 93945 711 587 8002014-07-31100 bp
SRR1178906 Oryza sativa panicle from O. sativa japonica (Nipponbare cultivar)Illumina HiSeq 2000207 489 21741 497 843 4002014-03-03100 bp
DRR018424 Arabidopsis thaliana 4 day old seedlingsIllumina HiSeq 2000192 531 28538 891 319 5702014-11-27101 bp
SRR1061361Arabidopsis thalianaleaves – when first flower openIllumina HiSeq 2000202 019 33440 807 905 4682015-07-22101 bp

All datasets were sourced from the NCBI and DDJB Sequence Read Archives

Table 2

Reference genome and annotation (GFF) sources

SpeciesRefSeq Assembly and AnnotationGenome (bp)Total Gaps (bp)Exome (bp)
Homo sapiensGCF_000001405.29_GRCh38.p33 226 010 022161 368 151120 562 222
Mus musculusGCF_000001635.24_GRCm38.p42 803 568 84079 356 756114 986 282
Drosophila melanogasterGCF_000001215.4_Release_6_plus_ISO1_MT143 726 0021 152 97835 879 647
Caenorhabditis elegansGCF_000002985.6_WBcel235100 286 401026 801 799
Oryza sativaGCF_000005425.2_Build_4.0382 778 12510 060 00449 757 833
Arabidopsis thalianaGCF_000001735.3_TAIR10119 667 750185 64457 812 822

All genome sequence and GFF reference files were obtained from the NCBI Assembly database

Datasets used to study effect of sequencing depth All datasets were sourced from the NCBI and DDJB Sequence Read Archives Reference genome and annotation (GFF) sources All genome sequence and GFF reference files were obtained from the NCBI Assembly database

Quality control

All libraries were preprocessed with Trim Galore! to remove adapter/primer sequences. Only reads aligning to the host genome using TopHat2 [17] were kept, to prevent confounding effects from library contaminations. Random sampling of each library was done at sequencing depths of 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, and 16 Gbp.

Assembly results

Each sub-sampled dataset was de novo assembled by SOAPdenovo-Trans and gap-filled using GapCloser [18]. Scaffolds were aligned to their host genome using BLAT [19]. Low-quality and chimeric assemblies were filtered out. Alignments of the high-quality scaffolds remaining were compared to the transcriptome annotations in the reference genomes. We counted the total number of unique bases in these alignments, but did so in two different ways, depicted in Fig. 2 as “genome” or “exome”. For the former we counted everything, but for the latter we only counted transcriptome bases within the annotated exons.
Fig. 2

Effect of sequencing depth on transcriptome recovery. We count the total number of unique bases in the alignments, based either on the genome or the exome. The vertical scale normalizes the exome size of the 16 Gbp assemblies to unity

Effect of sequencing depth on transcriptome recovery. We count the total number of unique bases in the alignments, based either on the genome or the exome. The vertical scale normalizes the exome size of the 16 Gbp assemblies to unity Our a priori expectation was that the total number of unique bases would asymptote as the sequencing depth increased. For the exome curves, this was universally observed, typically at 2~8 Gbp, almost regardless of species analyzed. For the genome curves, such asymptotic behavior was best observed in Drosophila melanogaster, but not at all in Homo sapiens, with other species falling somewhere between these two extremes of behavior. It is not clear when the Homo sapiens genome data might asymptote. For Homo sapiens, the genome data is nearly seven times the exome data at 16 Gbp of sequencing depth. Even in the curves that reach the asymptote, it is not because the genome or annotated exome has been completely covered, as only 60% of the genome bases and 75% of the exon bases have transcripts aligned against them in the most complete case (Drosophila). Given our stringent alignment criteria, we believe the assembled sequences are genuine transcripts. Why so many are not in the “official” annotations is best deferred to the discussions. An important consideration is the proportion of the assembled sequences that align with and without introns. This is because the latter case, single-exon transcripts, can arise from a variety of sources including protein-coding genes, long non-coding RNAs (lncRNAs), and improperly spliced pre-mRNAs. The results for assemblies at 16Gbp of sequencing are shown in Fig. 3. For almost all species, unannotated transcripts were overwhelmingly single-exon, essentially 100%, with the exception of Drosophila melanogaster. A much smaller proportion of annotated transcripts were classified as single-exon, 50 to 80% depending on parameter settings. In Homo sapiens, Mus musculus, and Arabidopsis thaliana libraries, a majority of the unannotated single-exon material is intronic, suggesting but not proving that they are simply unprocessed mRNAs. We would however be cautious about overly interpreting the single-exon fractions, as incomplete assemblies can produce single-exon transcripts, despite their underlying genes having multiple exons. That said, the difference between the proportions for unannotated and annotated transcripts is striking.
Fig. 3

Single-exon and ORF proportions in annotated and unannotated scaffolds. Many scaffolds align partially to the exome. A scaffold is said to be “annotated” when it has an alignment that exceeds an arbitrary fraction, i.e. annotation threshold, of either the scaffold length or of the reference transcript length. Annotated and unannotated SE refers to the proportion of annotated and unannotated transcripts that are single-exon. Unannotated SEI refers to the proportion of unannotated single-exon transcripts that are intronic. Annotated and Unannotated ORF refers to the proportion of scaffolds in each category that have ORFs of at least 100 amino acids in length, out of the scaffolds that are at least 300 bases long

Single-exon and ORF proportions in annotated and unannotated scaffolds. Many scaffolds align partially to the exome. A scaffold is said to be “annotated” when it has an alignment that exceeds an arbitrary fraction, i.e. annotation threshold, of either the scaffold length or of the reference transcript length. Annotated and unannotated SE refers to the proportion of annotated and unannotated transcripts that are single-exon. Unannotated SEI refers to the proportion of unannotated single-exon transcripts that are intronic. Annotated and Unannotated ORF refers to the proportion of scaffolds in each category that have ORFs of at least 100 amino acids in length, out of the scaffolds that are at least 300 bases long Of further interest is whether the annotated and unannotated scaffolds might contain viable open reading frames (ORFs). In analyzing the scaffolds for ORFs, we only considered scaffolds with a minimum length of 300 bases and looked for ORFs of at least 100 amino acids. As shown in Fig. 3, the annotated scaffolds contain a higher proportion of ORFs than the unannotated scaffolds in all cases. Some species contain higher levels of ORFs in the unannotated scaffolds, which may be partially explained by missing annotations in the references. The proportion of annotated scaffolds not containing ORFs is likely affected by the completeness of the assembly. Transcripts that are partially assembled into separate scaffolds are less likely to have ORFs of sufficient length, even though they would be considered annotated. The observed differences in genome sequence recovered relative to exome in Fig. 2 are consistent with the total amounts of non-minimal introns [20] in the underlying genomes. Caenorhabditis elegans and Arabidopsis thaliana are noted for their small genomes (100 Mbp and 120 Mbp respectively) and small proportions of introns greater than a kilobase. Drosophila melanogaster has a comparably small genome of 144 Mbp, but larger proportions of introns of kilobase length or larger. Oryza sativa has a slightly larger genome of 383 Mbp, but its introns are shorter in length than Drosophila. Homo sapiens and Mus musculus have much larger genomes (3.2 Gbp and 2.8 Gbp respectively) and comparatively large proportions of introns of 10 kilobases in length or larger, with some reaching a megabase in length.

HiSeq and DNBseq™ platforms are nearly equivalent except in the most GC-rich regions

Here, we compare two short-read NGS sequencing platforms, the market-leading Illumina platform, HiSeq, and a recent platform from MGI, DNBseq™, based on technology from Complete Genomics. Our primary interest is to recover as many complete transcript sequences as possible using a de novo assembly. Within the context of the previous discussion, and in particular Fig. 2, we are most interested in recovering the exome that appears in the genome annotations. Obviously, this will be a function of sequencing depth, but given an equivalent amount of sequencing, might there be significant differences between platforms?

Datasets

All of the analyzed sequence was for libraries created from the Universal Human Reference RNA (UHRR), which is comprised of RNA from ten human cell lines, and is commonly used as a control for microarray gene-expression experiments. Nine libraries were constructed by MGI with the MGIEasy RNA Library Prep Set (1000006383, 1000006384) kit, consisting of three sets with three replicates each. Nine sequencing runs with paired-end 100 bp (PE100) data were performed on the BGISEQ-500, giving a total of 416,161,025 reads. Five approximately PE100 runs from the Illumina HiSeq-2000 were used for comparison; they totaled 525,070,317 reads. Notice however that the HiSeq libraries were produced in two different labs. Table 3 shows the datasets used.
Table 3

Datasets used to compare DNBseq™ and HiSeq platforms

SRA Run IDLabPlatform# of Spots# of BasesDate PublishedEndLength
ERR18313621BGISEQ-50048 148 8219 629 764 2002017-02-21100 bp
ERR18313631BGISEQ-50029 782 9595 956 591 8002017-02-21100 bp
ERR18313641BGISEQ-50054 940 05610 988 011 2002017-02-21100 bp
ERR18313651BGISEQ-50036 073 2107 214 642 0002017-02-21100 bp
ERR18313661BGISEQ-50043 664 0658 732 813 0002017-02-21100 bp
ERR18313671BGISEQ-50055 025 94611 005 189 2002017-02-21100 bp
ERR18313681BGISEQ-50053 296 16110 659 232 2002017-02-21100 bp
ERR18313691BGISEQ-50065 455 75413 091 150 8002017-02-21100 bp
ERR18313701BGISEQ-50029 774 0535 954 810 6002017-02-21100 bp
SRR12611682HiSeq 2000134 921 15426 984 230 8002014-04-24100 bp
SRR12611702HiSeq 200072 897 48214 579 496 4002014-04-24100 bp
SRR9500783HiSeq 2000100 387 01020 278 176 0202013-08-29101 bp
SRR9500803HiSeq 200091 781 47718 539 858 3542013-08-29101 bp
SRR9500843HiSeq 2000125 083 19425 266 805 1882013-08-28101 bp

All datasets were sourced from the NCBI and ENA Sequence Read Archives

Datasets used to compare DNBseq™ and HiSeq platforms All datasets were sourced from the NCBI and ENA Sequence Read Archives SOAPnuke [21] was used to filter the reads based on the amount of low-quality bases, ambiguous bases (Ns), or adapter sequence. In total, the DNBseq™ libraries had 4.70% of their reads filtered (0.25% adapter, 2.83% low-quality, and 1.62% ambiguous); the HiSeq libraries had 27.91% of their reads filtered (1.48% adapter, 24.38% low-quality, and 2.06% ambiguous). Notice however that this large difference in number of low-quality reads filtered can potentially be explained by differences in base callers. It does not necessarily reflect any intrinsic quality difference between the platforms. Detailed results are given in Table 4.
Table 4

Read quality filtering on two sequencing platforms

PlatformNameAdapterLow QualityN%
DNBseq™ERR1831362125 9260.26%1 424 4922.96%769 5031.60%
ERR183136379 5330.27%871 0082.92%479 6351.61%
ERR1831364127 0690.23%1 476 4962.69%869 7641.58%
ERR1831365103 9940.29%1 051 5682.92%588 4271.63%
ERR1831366110 3240.25%1 060 4622.43%711 2361.63%
ERR1831367132 4610.24%1 581 4702.87%901 1231.64%
ERR1831368133 4440.25%1 588 8432.98%875 8651.64%
ERR1831369152 0270.23%1 825 7082.79%1 049 3091.60%
ERR183137083 6740.28%898 7763.02%484 5411.63%
HiSeqSRR12611683 4970.00%30 319 01922.47%466 9000.35%
SRR12611705090.00%33 381 93745.79%6 147 8228.43%
SRR9500781 313 4631.31%20 004 66819.93%1 307 6431.30%
SRR9500802 022 5592.20%18 689 01620.36%1 194 2691.30%
SRR9500844 443 8423.55%25 596 16020.46%1 675 6491.34%

The columns show the number and percentage of reads filtered out based on them containing adapter sequence, having too many low quality bases, and having too many ambiguous bases

Read quality filtering on two sequencing platforms The columns show the number and percentage of reads filtered out based on them containing adapter sequence, having too many low quality bases, and having too many ambiguous bases To reduce the effect of differences in library preparations, we also filtered the reads by aligning against the Genome Reference Consortium human genome build38 (GRCh38) with HISAT2 [22]. This removed likely contaminations, and eliminated the spike-in reads that were added to some libraries. For DNBseq™ libraries, 3.97% of remaining reads could not be aligned; for HiSeq libraries, 3.68% of remaining reads could not be aligned. To reduce differences resulting from input material amounts or PCR cycles, we also filtered duplicate reads using Picard Tools [23]. For DNBseq™ libraries, 27.23% of remaining reads were filtered out; for HiSeq libraries, 28.85% of remaining reads were filtered out. Full results are given in Table 5.
Table 5

Read filtering for differences in library preparation

Preprocessing
PlatformNameRawFilteredMappedDeduplicated
DNBseq™ERR183136248 148 82145 828 90044 015 46932 497 907
ERR183136329 782 95928 352 78327 237 35421 102 998
ERR183136454 940 05652 466 72750 421 99336 693 200
ERR183136536 073 21034 329 22132 932 43824 745 818
ERR183136643 664 06541 782 04340 108 61529 250 048
ERR183136755 025 94652 410 89250 302 19736 153 502
ERR183136853 296 16150 698 00948 688 10734 475 744
ERR183136965 455 75462 428 71059 984 28841 622 545
ERR183137029 774 05328 307 06227 164 67620 606 457
HiSeqSRR1261168134 921 154104 132 308101 903 99870 430 950
SRR126117072 897 48233 367 21432 597 42227 498 074
SRR950078100 387 01077 761 23673 979 29350 505 406
SRR95008091 781 47769 875 63366 896 95549 310 706
SRR950084125 083 19493 367 54389 192 06961 653 799
DNBseq™Total416 161 025396 604 347380 855 137277 148 219
% Removed4.70%3.97%27.23%
HiSeqTotal525 070 317378 503 934364 569 737259 398 935
% Removed27.91%3.68%28.85%

Here we show reads remaining after each preprocessing step. The columns indicate read counts after SOAPnuke filtering (Filtered), aligning to GRCh38 with HISAT2 (Mapped), and PCR deduplication with Picard Tools (Deduplicated)

Read filtering for differences in library preparation Here we show reads remaining after each preprocessing step. The columns indicate read counts after SOAPnuke filtering (Filtered), aligning to GRCh38 with HISAT2 (Mapped), and PCR deduplication with Picard Tools (Deduplicated) We performed multiple de novo assemblies, all using SOAPdenovo-Trans [18], with randomly-selected subsets of each library. Target sizes were 1, 2, 3, 4, 5, 6, 8, and 10 Gbp, to the extent that sufficient data was available in the source library. These assemblies were aligned against GRCh38 with BLAT [19] and evaluated against the GENCODE v28 [24] annotations. As before we differentiate between alignments to the genome and exome. Results are shown in Fig. 4. Looking only at the genome curves, the HiSeq libraries appear superior; but looking at the exome curves, there is no appreciable difference between platforms. Each user will have to decide for him/herself if this additional genome coverage is worthwhile, given that it was not included in the exome annotations from GENCODE.
Fig. 4

Genome vs exome coverage from HiSeq and DNBseq™. Each line corresponds a subsampled library, with total sizes of 1, 2, 3, 4, 5, 6, 8, and 10 Gbp

Genome vs exome coverage from HiSeq and DNBseq™. Each line corresponds a subsampled library, with total sizes of 1, 2, 3, 4, 5, 6, 8, and 10 Gbp Next, the scaffolds were aligned against the GRCh38 transcriptome using the LAST aligner [25], to evaluate the completeness of the RNA-seq assembly. To be declared complete, at least 95% of the annotated transcript must be aligned to by a single RNA-seq scaffold. This definition recognizes that a complete RNA-seq assembly is often longer than the corresponding annotation, because the former will include UTR sequences while the latter typically does not. The results are depicted in Table 6. DNBseq™ results were fairly consistent in the number of complete transcripts recovered for each subset size. HiSeq results showed greater variation with subset size, which we believe is due to differences in library preparation among different labs. In particular, the libraries SRR1261168 and SRR1261170 seemed of especially high quality, as determined by transcript completeness, and both were generated at the same sequencing center.
Table 6

Completeness of transcripts assemblies per library

Complete Transcripts Assembled
PlatformName1Gbp2Gbp3Gbp4Gbp5Gbp6Gbp8Gbp10Gbp12Gbp20Gbp
DNBseq™COMBINED2 4833 8914 7205 2765 7055 9946 4586 7166 9797 472
ERR18313622 0493 3064 1404 7195 1895 530
ERR18313631 9813 2354 0264 641
ERR18313642 0523 3934 2534 8055 1875 527
ERR18313651 9973 2954 1254 667
ERR18313661 9953 2934 1114 7395 193
ERR18313672 0613 3514 1734 7755 2165 511
ERR18313681 9513 3044 1314 7075 1645 471
ERR18313692 0323 3234 1704 7725 1785 5755 989
ERR18313701 9203 2604 0914 622
HiSeqSRR12611682 3633 6964 4834 9875 4395 7206 3046 6886 933
SRR12611701 9083 1353 9084 4784 9165 107
SRR9500788091 4021 8202 1902 4442 6733 0243 362
SRR9500809821 6202 1262 5042 8163 0943 471
SRR9500849361 5732 0682 4332 7222 9853 3793 666

Complete transcript counts are shown for each randomly-selected subset for each library

Completeness of transcripts assemblies per library Complete transcript counts are shown for each randomly-selected subset for each library We also combined all of the DNBseq™ libraries and assembled subsets of different sizes. The results showed that the combined library performs better in terms of complete transcripts than any other DNBseq™ library at every subset size (see COMBINED row in Table 6). This is likely because the sequences that are sampled in different libraries are complementary and occur in sufficient quantity such that they will be assembled in the combined libraries. To get an idea of the overlap in complete assembled transcripts between the two sequencing platforms, we compared the complete transcripts for the 4 Gbp subset assemblies, as that was the largest available subset in most of the libraries. Results are depicted in Fig. 5. Complete overlap does not exist. This was the case whether comparing libraries from different sequencing platforms or from the same sequencing platform. The implication is that we are mostly seeing the inevitable differences in sampling of lower level transcripts. However, the amount of overlap between libraries from the same platform was higher than that between libraries from different platforms, indicating that some transcripts are more likely to be completely assembled when we use a particular sequencing platform.
Fig. 5

Overlap in complete assembled transcripts. Comparisons of complete transcripts between libraries at the 4 Gbp subsets. For the libraries in each row, the fraction is calculated as the number of complete transcripts that are shared with the library from each column, divided by the total number of complete transcripts for the library in that row

Overlap in complete assembled transcripts. Comparisons of complete transcripts between libraries at the 4 Gbp subsets. For the libraries in each row, the fraction is calculated as the number of complete transcripts that are shared with the library from each column, divided by the total number of complete transcripts for the library in that row Next, we looked at the 4 Gbp assemblies for transcripts that were not complete in any library on one sequencing platform, but complete for at least one library on the other platform. We then compared the GC contents of the assembled regions to the gap regions. The results are shown in Fig. 6. For DNBseq™ libraries, gap regions are uniformly distributed among low, mid, and high GC-content. In contrast, albeit for only some of the HiSeq libraries, the gap regions reveal a bias against GC-rich sequence. The fact Illumina libraries can be susceptible to both high and low GC biases has previously been reported [10-12], although there are techniques that can reduce the magnitude of the biases. And indeed, the two best HiSeq libraries from our study showed much less of a GC-content bias than the other three.
Fig. 6

GC-content of assembled versus gap regions. Cumulants of GC-content in the assembled versus gap regions for (a) DNBseq™ libraries and (b) HiSeq libraries. Each pair (solid and dotted lines) represents a single library

GC-content of assembled versus gap regions. Cumulants of GC-content in the assembled versus gap regions for (a) DNBseq™ libraries and (b) HiSeq libraries. Each pair (solid and dotted lines) represents a single library Looking further into the GC-content biases, we examined the read depths from the HISAT2 alignments. We established a set of 565 non-overlapping transcripts with a minimum average depth of 10 across all our 4 Gbp datasets. To ensure that the subset of transcripts are representative of the complete set of transcripts, we counted the number of transcripts with GC-content in 1% segments for both sets. These two sets have a Spearman correlation of 0.904, showing that the subset has a similar GC-content distribution to the full set of transcripts. Then, for each 100 base pair window along each transcript, we determined their GC-content and average read depth. The ratio of each window depth to the entire transcript average depth is taken. Figure 7 depicts the average of these ratios for each GC-content bin from 0 to 100% on a few representative libraries. The DNBseq™ library ERR1831362 and the HiSeq library SRR1261168 look rather similar, not surprisingly considering that these were among the best in terms of assembly completeness. However the HiSeq library SRR950078 exhibited a substantial drop-off in read depth at high GC-content, consistent with its inferior assembly completeness.
Fig. 7

GC-bias plots for representative libraries. Relative coverage as a function of GC-content, computed on 100-base windows across the set of 565 transcripts described in the text. A relative coverage of one would indicate no bias

GC-bias plots for representative libraries. Relative coverage as a function of GC-content, computed on 100-base windows across the set of 565 transcripts described in the text. A relative coverage of one would indicate no bias Finally, we estimated the transcript abundances using Kallisto [26], focusing again on the 3 representative libraries, plotting the ratio of transcript abundances from the DNBseq™ library to each of the two HiSeq libraries. These results are shown in Fig. 8. Transcripts were only included when their estimated abundances were at least 10 transcripts per million in both libraries, to avoid the higher-variance low-abundance transcripts. Comparing the 9,687 transcripts (4.75% of all transcripts) with abundances meeting the threshold in ERR1831367 and SRR1261168, we see a slight slope in the regression-fitted line, as expected if there is a bias against higher GC-content reads in the HiSeq data. Comparing the 11,753 transcripts (5.77% of all transcripts) with abundances meeting the threshold in ERR1831367 and SRR950078, we see a similar slope, consistent with the other comparison.
Fig. 8

A slight bias in the transcript abundances vs GC-content. The log ratio of the expression levels of (a) SRR1261168, the HiSeq run with the most complete assemblies and (b) SRR950078, the HiSeq run with the least complete assemblies, compared to ERR1831364, the DNBseq™ run with most the complete assemblies. Regression fit is shown in blue, with numerical parameters indicated

A slight bias in the transcript abundances vs GC-content. The log ratio of the expression levels of (a) SRR1261168, the HiSeq run with the most complete assemblies and (b) SRR950078, the HiSeq run with the least complete assemblies, compared to ERR1831364, the DNBseq™ run with most the complete assemblies. Regression fit is shown in blue, with numerical parameters indicated

Discussion

How much RNA-Seq data is optimal? It is well-known that there are diminishing returns to ever deeper transcriptome sequencing and the exact choice will always be a function of budget vs ambition. However, it is less well-known that deeper transcriptome sequencing will generate a plethora of single-exon transcripts that are not typically included in most genome annotations, especially but not exclusively for human sequencing. What these additional transcripts might be has long been a subject of debate. The first high-profile mention of this phenomenon was the FANTOM Consortium publication of full-length mouse cDNAs, where it was claimed that 15,815 of 33,409 non-redundant cDNAs were non-protein-coding RNA genes [27]. However, most of these genes were poorly conserved across mammalian evolution [28] and it was unclear how many were biologically functional. These debates escalated when the ENCODE Consortium assigned biochemical functions to 80% of the human genome [29]. The arguments focussed on the definition of biological functionality, and the relevance or not of evolutionary conservation [30-34]. We do not wish to revisit these arguments. The point of Figs. 2 and 3 was simply to demonstrate what you get if you sequence more deeply on an RNA-Seq library. It is up to the individual user to decide if such additional transcripts are worth the extra expenditures. Which sequencing platform is better? For most users, next-generation sequencing is a choice between higher-quality shorter reads, as exemplified by the market-leading Illumina platform, or lower-quality longer reads, as exemplified by Pacific Biosciences and Oxford Nanopore. In the former category, the most pertinent question is if the DNBseq™ platform (BGISEQ-500 and more recent MGISEQ-2000 and MGISEQ-T7, which are capable of PE150 reads) is a viable alternative to Illumina. Here, we show that for recovery of transcript sequences from de novo assembled RNA-Seq libraries the two platforms give equally good results. Some of the Illumina libraries under-represented GC-rich sequences, leading to gaps in the assemblies. However, other Illumina libraries did not exhibit such a GC-content bias, and without a systematic analysis of library making protocols that is beyond the scope of this publication, it is unclear if this is an intrinsic disadvantage of the Illumina platform. However, the importance of good library construction, and the consequent biases if this is not done right, is worth emphasizing regardless of sequencing platform used.

Conclusions

Increasing sequencing depth of RNA-Seq experiments has quickly diminishing returns in terms of exomic sequence assembled. A large portion of the additional sequences assembled as sequencing depth increases appears to be unannotated single-exon transcripts. Of these sequences, a majority appears from intronic regions. DNBseq™ is a viable alternative to HiSeq for de novo RNA-Seq assembly. Libraries sequenced on both technologies recovered similar numbers of full-length transcripts after assembly. Higher levels of GC-bias are seen in some of the Illumina libraries, which is likely attributable to differences in library preparation.

Methods

Reads are trimmed by Trim Galore! 0.3, which removes adapter/primer sequences. The --paired parameter is used, with the --quality parameter set to 0 so that no bases are removed for low-quality scores (these are dealt with later). To improve alignment rates in the next step, five bases were also trimmed from the 5′-ends of reads in the SRR1523365 (C. elegans) data. Each library is split into million read-pair chunks. Each chunk is aligned to their reference genome by Tophat 2.0.13, using the --b2-very-sensitive parameter to increase alignment sensitivity. Reads are considered “clean” if both ends align and they are consistent in location and direction. Clean reads from all chunks are merged and used for further processing.

Assembly and alignment

Clean read-pairs from each library are randomly subsampled to generate 1, 2, 3, 4, 5, 6, 8, 10, 12, 14, and 16 Gbp of sequencing data. These subsets are assembled with SOAPdenovo-Trans 1.03 (2014-01-23) using default parameters, but with the -F argument set to enable gap filling and the average insert size defined as 250 bp. Post-processing with GapCloser filled in more gaps. Assembled sequences are aligned against their host genome by BLAT [19] using the -fine (looks harder for smaller initial and terminal exons) and -ooc (speeds up alignment by skipping overly common 11-mers) parameters. Scaffolds that align to the genome over at least 98% of their length are deemed to be correctly assembled. Alignments are compared with the reference annotations for the host genome. To compute the proportion of unannotated and annotated scaffolds with single vs multi-exon alignments, we set a percentage threshold that the alignment must reach, from 0 to 100% at 1% intervals. A scaffold is considered to be annotated if has an alignment that is greater than that percentage threshold, where the denominator on the percentage calculation is the length of the scaffold or the length of the reference transcript (whichever is more favorable). To analyze the proportion of unannotated and annotated scaffolds containing viable ORFs, we looked at the set of scaffolds at least 300 bases long in each category, and computed the proportion of those that had ORFs at least 100 amino acids long. Reads from each library are first filtered with SOAPnuke 1.5.6. We chose filtering parameters matching the previous study of a reference human genome dataset generated on DNBseq™ [35]. Reads are eliminated when more than 10% of the bases have a PHRED score of less than 10 or when more than 1% of the bases are ambiguous N’s. Reads containing at least 50% of an adapter with no more than one mismatch are also filtered. To reduce the effect of contamination or otherwise unwanted sequence, we filter the reads by aligning them to the Genome Reference Consortium human genome build 38 (GRCh38) using HISAT2 2.1.0 and default parameters. Both ends must be aligned concordantly for the read to be kept. This removes contamination, spike-in reads, etc. PCR duplicates are marked by the Picard Tools 2.18.5 MarkDuplicates command, and then filtered using Samtools 1.8 and an awk command to reduce the effect of differences in input materials and PCR cycles between libraries. The clean deduplicated reads are randomly subsampled from each library to generate 1, 2, 3, 4, 5, 6, 8, and 10 Gbp of sequencing data, to whatever extent that the desired amount of data is available. The datasets are assembled with SOAPdenovo-Trans 1.04 using default parameters, with the -F argument set to enable gap filling and the average insert size set to 200 bp. This is followed with GapCloser post-processing to fill in more gaps. Assembled sequences are aligned against the GRCh38 genome with BLAT 36 [19], using parameters -fine (looks harder for smaller initial and terminal exons) and -ooc = 11.ooc (speeds up alignment by skipping overly common 11-mers). Scaffolds aligning to the genome above 98% of their length are said to be assembled correctly. Their alignments are compared with the Gencode v28 [24] annotations to determine exome vs genome coverage.

Completeness

Assembled sequences are also aligned with lastal 941 [25], using default parameters, to the set of transcripts in GRCh38 generated from the Gencode v28 annotations. For a reference transcript to be considered complete, there must be a scaffold that can align over 95% of that reference transcript. Only scaffolds which have an unambiguous top-scoring alignment are considered.

GC content gaps/biases

GC content analysis is done on the 4 Gbp subsets, because all libraries except for SRR515084 have a 4 Gbp subset. For comparing assembled versus gap regions, transcripts are chosen if they are not assembled in any other library from the same sequencing platform, but assembled in a library from the other sequencing platform. Regions are declared to be a gap when there are no scaffolds that align to that region of the annotated transcript and, moreover, there are no reads (assembled or unassembled) that align to that region. Only transcripts with at least 10 bases worth of gap region are used for our comparisons. To analyze GC-bias in reads coverage, we must exclude genes with multiple splice variants, because their reads coverage cannot be confidently assigned. Hence we limit the analysis to a set of 565 transcripts that have no overlap with any other annotated transcripts and that also have a minimum average read depth of 10 across their lengths. For each 100 bp window along each transcript, we calculate the ratio of the read depth for that window against the average read depth along the transcript, as well as the GC-content of that window. The set of all depth ratios is averaged and plotted against GC-content. To analyze the effect of GC-bias on transcript abundance, we run Kallisto 0.44.0 [26] to estimate transcript abundance. For each library pair, we plot the ratio of the transcript abundance for each transcript, as a function of GC-content. Only transcripts that have a transcripts per million (TPM) estimate of 10 or greater are included.
  8 in total

1.  Transcriptome Comparison Reveals the Difference in Liver Fat Metabolism between Different Sheep Breeds.

Authors:  Taotao Li; Meilin Jin; Xiaojuan Fei; Zehu Yuan; Yuqin Wang; Kai Quan; Tingpu Wang; Junxiang Yang; Maochang He; Caihong Wei
Journal:  Animals (Basel)       Date:  2022-06-27       Impact factor: 3.231

2.  TagSeq for gene expression in non-model plants: A pilot study at the Santa Rita Experimental Range NEON core site.

Authors:  Hannah E Marx; Stephen Scheidt; Michael S Barker; Katrina M Dlugosch
Journal:  Appl Plant Sci       Date:  2020-11-22       Impact factor: 1.936

3.  Comparison of the MGISEQ-2000 and Illumina HiSeq 4000 sequencing platforms for RNA sequencing.

Authors:  Sol A Jeon; Jong Lyul Park; Jong-Hwan Kim; Jeong Hwan Kim; Yong Sung Kim; Jin Cheon Kim; Seon-Young Kim
Journal:  Genomics Inform       Date:  2019-09-27

4.  Pilot RNA-seq data from 24 species of vascular plants at Harvard Forest.

Authors:  Hannah E Marx; Stacy A Jorgensen; Eldridge Wisely; Zheng Li; Katrina M Dlugosch; Michael S Barker
Journal:  Appl Plant Sci       Date:  2021-02-14       Impact factor: 1.936

5.  Low-pass sequencing increases the power of GWAS and decreases measurement error of polygenic risk scores compared to genotyping arrays.

Authors:  Jeremiah H Li; Chase A Mazur; Tomaz Berisa; Joseph K Pickrell
Journal:  Genome Res       Date:  2021-02-03       Impact factor: 9.043

6.  Assembly-free rapid differential gene expression analysis in non-model organisms using DNA-protein alignment.

Authors:  Anish M S Shrestha; Joyce Emlyn B Guiao; Kyle Christian R Santiago
Journal:  BMC Genomics       Date:  2022-02-04       Impact factor: 3.969

7.  Transcriptome analysis identifies putative genes involved in triterpenoid biosynthesis in Platycodon grandiflorus.

Authors:  Hanwen Yu; Mengli Liu; Minzhen Yin; Tingyu Shan; Huasheng Peng; Jutao Wang; Xiangwei Chang; Daiyin Peng; Liangping Zha; Shuangying Gui
Journal:  Planta       Date:  2021-07-21       Impact factor: 4.116

8.  An improved de novo assembling and polishing of Solea senegalensis transcriptome shed light on retinoic acid signalling in larvae.

Authors:  José Córdoba-Caballero; Pedro Seoane; Fernando M Jabato; James R Perkins; Manuel Manchado; M Gonzalo Claros
Journal:  Sci Rep       Date:  2020-11-26       Impact factor: 4.379

  8 in total

北京卡尤迪生物科技股份有限公司 © 2022-2023.