Dong An, Hieu X Cao, Changsheng Li, Klaus Humbeck, Wenqin Wang.
Abstract
Single-molecule real-time (SMRT) sequencing developed by PacBio, also called third-generation sequencing (TGS), offers longer reads than second-generation sequencing (SGS). Given its ability to obtain full-length transcripts without assembly, isoform sequencing (Iso-Seq) of transcriptomes by PacBio is advantageous for genome annotation, identification of novel genes and isoforms, and the discovery of long non-coding RNA (lncRNA). In addition, Iso-Seq gives access to the direct detection of alternative splicing, alternative polyadenylation (APA), gene fusion, and DNA modifications. Such applications of Iso-Seq facilitate the understanding of gene structure, post-transcriptional regulatory networks, and subsequently proteomic diversity. In this review, we summarize its applications in plant transcriptome studies, point out the challenges associated with each step of the experimental design, and highlight the development of bioinformatic pipelines. We aim to provide the community with an integrative overview of and comprehensive guidance on Iso-Seq, and thus to promote its application in plant research.
Keywords: alternative splicing; fusion genes; genome annotation; isoform sequencing; long reads; novel genes; plant; single-molecule real-time sequencing; transcriptomics
Year: 2018 PMID: 29346292 PMCID: PMC5793194 DOI: 10.3390/genes9010043
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. Definition of the polymerase read, subreads, and the read of insert (ROI). The DNA template is labeled in blue and the adapter in green. SMRT: Single-Molecule Real-Time sequencing.
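The relationship between the three read types in Figure 1 can be illustrated with a toy sketch. It assumes exact adapter matches, error-free strand alternation, and equal-length subreads without indels, none of which holds for real SMRT data; the adapter sequence, the template, and the function names are hypothetical placeholders and do not reflect PacBio's software.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def split_polymerase_read(polymerase_read: str, adapter: str) -> list[str]:
    """Cut a polymerase read at every exact adapter occurrence to get subreads."""
    return [piece for piece in polymerase_read.split(adapter) if piece]


def read_of_insert(subreads: list[str]) -> str:
    """Build a toy ROI: orient the subreads, then take a per-position majority vote.

    Assumes subreads alternate strands (forward, reverse, forward, ...)
    and all have the same length.
    """
    oriented = [
        s if i % 2 == 0 else reverse_complement(s)
        for i, s in enumerate(subreads)
    ]
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*oriented))


ADAPTER = "GGGGCCCC"    # hypothetical adapter sequence
template = "AAGCTTGGA"  # hypothetical insert (forward strand)

# Three passes around the circular template; the third carries one base error.
passes = [template, reverse_complement(template), "AAGCTTGCA"]
polymerase_read = ADAPTER.join(passes)

subreads = split_polymerase_read(polymerase_read, ADAPTER)
roi = read_of_insert(subreads)
print(subreads)
print(roi)
```

The majority vote shows why multiple passes over the same insert reduce the error rate of the ROI relative to any single subread.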
Figure 2. Schematic workflow of isoform sequencing.
Summary of sample collection, RNA extraction, library construction, sequencing platform, and throughput for isoform sequencing in plants.
| Species | Sample Collection | RNA Extraction | Size-Fractionated Libraries | Platform and Throughput | Ref. |
|---|---|---|---|---|---|
| sorghum (BTx623) | Control and drought treatment of 7-day-old seedlings for 6 h | TRIzol reagent (Invitrogen, Carlsbad, CA, USA) with DNaseI (Fermentas, Waltham, MA, USA) | 1–2 kb and 2–6 kb | PacBio RS II with 28 SMRT cells | [ |
| maize B73 | Root, pollen, embryo, endosperm, immature ear and immature tassel | TRIzol reagent (Invitrogen, Carlsbad, CA, USA) with RQ1 DNase (Promega, Madison, WI, USA) | <1 kb, 1–2 kb, 2–3 kb, 3–5 kb, 4–6 kb and >5 kb | PacBio RS II with 47 SMRT cells | [ |
| wheat Xiaoyan | Unfertilized caryopses and developing grains | RNA extraction kit (Takara Biotechnology, Dalian, Liaoning, China) with TURBO DNaseI (Promega, Madison, WI, USA) | <2 kb, ≥2 kb | PacBio RS II with 8 SMRT cells | [ |
| | Young leaves and female flowers | CTAB method and RNeasy Mini extraction kit (Qiagen, Hilden, Germany) with TURBO DNA-free Kit | 1–2 kb, 2–3 kb and >3 kb | PacBio RS II with 19 SMRT cells | [ |
| wild strawberry | Receptacle of five different stages | Plant Total RNA Isolation Kit (Sangon Biotech, Shanghai, China) | 1–2 kb, 2–3 kb and >3 kb | PacBio RS with 13 SMRT cells | [ |
| moso bamboo | Underground rhizome, lateral rhizome, shoot, root, and leaf | RNAprep Pure Plant Kit (Tiangen, Beijing, China) with DNase I | 1–2 kb, 2–3 kb and >3 kb | PacBio RS II with 7 SMRT cells | [ |
| | Periderm, phloem, and xylem from roots | RNeasy Plus Mini Kit (#74134, Qiagen, Hilden, Germany) | <1 kb, 1–2 kb, 2–3 kb and >3 kb | PacBio RS with 8 SMRT cells | [ |
| cotton | Root, hypocotyl, leaf, petal, anther, stigma; fibre samples | Spectrum Plant Total RNA kit (Sigma-Aldrich, St. Louis, MO, USA) | 1–2 kb, 2–3 kb and 3–6 kb | PacBio RS II with 30 SMRT cells | [ |
| sugarcane | Leaf, internode, and root tissues of different stages | TRIzol (Invitrogen) and Qiagen RNeasy Plant minikit (#74134, Qiagen, Hilden, Germany) | 0.5–2.5 kb, 2–3.5 kb, 3–6 kb and 5–10 kb | PacBio RS II with 6 SMRT cells | [ |
| sugar beet | Seedlings | Nucleospin Plant RNA kit (Macherey-Nagel, Düren, Germany) | 1–2 kb, 2–3 kb and >3 kb | PacBio RS with 6 SMRT cells | [ |
| coffee bean | Immature, intermediate, and mature fruits | TRIzol plus RNA purification kit (Invitrogen, Carlsbad, CA, USA) and RNeasy Plant Mini Kit (#74903, Qiagen, Hilden, Germany) | 0.5–2.5 kb, 2–3.5 kb, 3–6 kb and 5–10 kb | PacBio RS II with 2 SMRT cells | [ |
Sequencing depth in Iso-Seq.
| Species | ROI | Full-Length ROI | Error-Corrected FLNC Reads | Mapped Reads |
|---|---|---|---|---|
| sorghum (BTx623) | 1,838,330 | 884,638 | NA | 867,089 |
| maize B73 | 3,716,604 | 1,553,692 | 643,330 | 606,145 |
| wheat Xiaoyan | 240,312 | NA | 197,709 | 91,881 |
| | 660,458 | 217,954 | 146,686 ¹ | 124,509 ² |
| wild strawberry | 442,601 | 354,393 | 85,416 | 82,360 |
| moso bamboo | 288,312 | 147,362 | 146,225 | 145,522 |
| | 796,011 | 223,368 | NS | NA |
| cotton | 2,542,318 | 1,096,932 | NA | 339,230 |
| sugarcane | 290,393 | 186,999 | 107,604 | 74,716 |
| sugar beet | 395,038 | 109,920 | NA | 107,721 |
| coffee bean | 433,877 | 233,464 | NA | NA |
ROI: read of insert; FLNC reads: full-length non-chimeric reads; NA: data not presented in the literature; NS: detailed numbers are not shown although the analysis was done; ¹ reads corrected by ICE-Quiver; ² mapped reads obtained from full-length non-chimeric reads of insert (flncROIs) with two full passes.
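As a small worked example, the share of ROIs classified as full-length can be computed directly from the table above. The sketch below copies four rows from the table; the dictionary layout and the helper name `fraction` are illustrative choices, not part of any Iso-Seq tool.

```python
# Values copied from the sequencing-depth table; None marks NA entries.
depth = {
    "sorghum (BTx623)": {"roi": 1_838_330, "full_length_roi": 884_638},
    "maize B73": {"roi": 3_716_604, "full_length_roi": 1_553_692},
    "wheat Xiaoyan": {"roi": 240_312, "full_length_roi": None},
    "moso bamboo": {"roi": 288_312, "full_length_roi": 147_362},
}


def fraction(numerator, denominator):
    """Return numerator / denominator, or None when either value is missing."""
    if numerator is None or denominator is None:
        return None
    return numerator / denominator


full_length_fraction = {
    species: fraction(row["full_length_roi"], row["roi"])
    for species, row in depth.items()
}

for species, frac in full_length_fraction.items():
    label = "NA" if frac is None else f"{frac:.1%}"
    print(f"{species}: full-length ROI fraction = {label}")
```

For these rows, roughly 40–50% of ROIs are full-length, which is worth keeping in mind when estimating how many SMRT cells a project needs.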
Plant genome annotation by using Iso-Seq.
| Species | Isoform | Novel Transcripts | AS | APA | Novel Genes | lncRNA | Mis-Annotated Genes |
|---|---|---|---|---|---|---|---|
| sorghum (BTx623) | 27,860 | 11,342 | 10,053 | 11,013 | 2171 | 540 | 941 |
| maize B73 | 111,151 | 65,350 | NS | NA | 2253 | 867 | 2199 * |
| wheat Xiaoyan | 22,768 | 9591 | NS | NA | 3026 | NA | 180 |
| | 10,617 | 3680 | 4879 | NA | 510 | NA | 3255 |
| wild strawberry | 33,236 | 5501 | 17,260 | NA | 3649 | NA | NA |
| moso bamboo | 42,280 | 35,447 | 21,154 | 6311 | 8091 | 3096 | 2241 |
| | 160,468 | NA | 4165 | NA | NA | 11,046 | NA |
| cotton | 176,849 | 13,551 | 133,329 | 43,784 | NA | 2447 | NA |
| sugarcane | 107,598 | 2450 | 4870 | NA | NA | 2426 | NA |
| sugar beet | NA | NA | NA | NA | NA | NA | 4000 |
| coffee bean | 95,995 | NA | NS | NS | 1213 | NA | NA |
AS, alternative splicing; APA, alternative polyadenylation; lncRNA, long non-coding RNA; NA means the data is not available, and NS means detail numbers are not shown although the analysis has been done; * There were 2199 transcripts from Iso-Seq data covering more than one annotated V3 gene. It was confirmed that 682 (81%) out of 844 Gramene gene models were mis-annotated, while the remaining genes need further evidence to support whether they were mis-annotated.
Figure 3. Schematic representation of alternative splicing (AS) and alternative polyadenylation (APA). (a) Alternative splicing; (b) alternative polyadenylation. The first two APA types, both termed tandem 3′ untranslated region (UTR) APA, generate multiple isoforms that differ in 3′ UTR length without changing the protein sequence encoded by the gene. The other three APA types can affect the coding sequence: alternative terminal exon APA, in which the isoforms differ in their last exon; intronic APA, in which cleavage at a cryptic intronic polyA signal (PAS) extends the terminal exon; and internal exon APA, in which premature polyadenylation occurs at a PAS within the coding region. Filled dark blue boxes denote retained exons, and filled light blue boxes denote alternative exons. Blue solid lines represent introns, and black dashed lines represent AS events. Filled yellow boxes represent 3′ UTRs of different lengths, and arrows denote PASs.
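To make the tandem 3′ UTR case concrete, the sketch below flags candidate events among full-length isoforms: transcripts of the same gene that share an identical intron chain but end at clearly different positions. This is a simplified heuristic written for illustration, not the method used by TAPIS or any other tool named in this review. The record fields (`gene`, `introns`, `polya_site`), the coordinates, and the `min_gap` threshold are all hypothetical, and plus-strand coordinates are assumed.

```python
from collections import defaultdict


def tandem_utr_apa(transcripts, min_gap=30):
    """Return {(gene, intron_chain): [distinct polyA sites]} for candidate events.

    Sites closer together than min_gap are merged into one site, since
    cleavage positions are naturally imprecise.
    """
    groups = defaultdict(set)
    for t in transcripts:
        groups[(t["gene"], tuple(t["introns"]))].add(t["polya_site"])

    hits = {}
    for key, sites in groups.items():
        ordered = sorted(sites)
        distinct = [ordered[0]]
        for site in ordered[1:]:
            if site - distinct[-1] >= min_gap:
                distinct.append(site)
        if len(distinct) > 1:  # same splicing, more than one 3' end
            hits[key] = distinct
    return hits


# Hypothetical isoforms of one gene (plus strand, arbitrary coordinates).
isoforms = [
    {"gene": "g1", "introns": [(200, 300), (450, 600)], "polya_site": 900},
    {"gene": "g1", "introns": [(200, 300), (450, 600)], "polya_site": 905},
    {"gene": "g1", "introns": [(200, 300), (450, 600)], "polya_site": 1200},
    {"gene": "g1", "introns": [(200, 300)], "polya_site": 900},
]

candidates = tandem_utr_apa(isoforms)
print(candidates)
```

Grouping by intron chain keeps splicing differences out of the comparison, so only 3′ end variation is reported; the other APA types in panel (b) change the exon structure and would need separate rules.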
Bioinformatic programs applied in plant Iso-Seq analysis.
| Species | Read Processing | Correction | Mapping | AS | Novel Gene | APA |
|---|---|---|---|---|---|---|
| sorghum (BTx623) | TAPIS | LoRDEC, proovread and TAPIS | GMAP | SpliceGrapher | TAPIS | TAPIS |
| maize B73 | ToFU | ICE-Quiver | GMAP | AStalavista | BLASTN | NA |
| wheat Xiaoyan | SMRT analysis | SMRT analysis, proovread | GMAP | In-house perl script | GMAP | NA |
| | SMRT analysis_v2.2.0 | minFullPasses, LSC-corrected and ICE-Quiver | GMAP, BLAT | PASA, de novo AS detection | NA | NA |
| wild strawberry | RS_IsoSeq_v2.3 | ICE-Quiver, LoRDEC | GMAP | AStalavista | NA | NA |
| moso bamboo | SMRT analysis_2.3.0 | LSC | GMAP | AStalavista | TAPIS | TAPIS |
| | SMRT analysis_2.2.0 | LSC | GMAP | SPLICEMAP | SPLICEMAP | NA |
| cotton | SMRT analysis | pipeline-for-Iso-Seq | GMAP | alternative_splice.py | BLAST | SMRT analysis |
| sugarcane | SMRT analysis_2.3.0 | ICE-Quiver, proovread, and LoRDEC | GMAP | TAPIS | BLAST | NA |
| sugar beet | SMRT analysis_v2.0 | proovread, normalize-by-median.py | GMAP, AUGUSTUS | NA | NA | NA |
| coffee bean | RS_IsoSeq_v2.3 | ICE-Quiver | BLAST | BLAST | BLAST | BLAST |
GMAP, genome mapping and alignment program; ICE, iterative clustering for error correction; PASA, program to assemble spliced alignments; IDP, isoform detection and prediction tool; TAPIS, transcriptome analysis pipeline for isoform sequencing; SMRT, single-molecule real-time. alternative_splice.py [29]. NA means the data is not available.
Comparison of detection efficiency between Iso-Seq and SGS in terms of isoform number, average gene length, AS events, and fusion genes.
| Metric | Species | Iso-Seq | SGS/Sanger | Reference |
|---|---|---|---|---|
| Isoform number per gene | cotton | 3.93 | 1.35 | [ |
| | maize B73 | 6.56 * | 2.84 * | [ |
| Total isoform number | wild strawberry | 26,676 | 20,705 | [ |
| | moso bamboo | 42,280 | 10,471 | [ |
| Average gene length (bp) | | 2044 | 950 ¹ | [ |
| | maize B73 | 2632 | 1684 | [ |
| | wild strawberry | 2466 | 1187 | [ |
| | cotton | 2175 | 1462 | [ |
| AS events | wild strawberry | 17,260 | 12,080 | [ |
| | cotton | 133,229 | 16,437 | [ |
| Number of fusion genes | maize B73 | 1430 | 134 | [ |
¹ denotes data downloaded from the Amborella Genome database [54]. * The PacBio long-read data identified 15,146 genes with an average of 6.56 isoforms per gene, more than twice the number in the maize V3 annotation based on SGS/Sanger data [11]. The PacBio Iso-Seq data have been included in the current version of the annotation, which greatly improved existing gene models, although only ~70% of the total maize genes were captured [11].