Dong Jia, Yuanxin Wang, Yanhong Liu, Jun Hu, Yanqiong Guo, Lingling Gao, Ruiyan Ma.
Abstract
This study aimed to generate the full-length transcriptome of the flea beetle Agasicles hygrophila (Selman and Vogt) using single-molecule real-time (SMRT) sequencing. Total RNA was isolated from four developmental stages of A. hygrophila: eggs, larvae, pupae, and adults. The mixed samples were used for SMRT sequencing to generate the full-length transcriptome. Based on the resulting transcriptome data, alternative splicing analysis, simple sequence repeat (SSR) analysis, coding sequence prediction, transcript functional annotation, and lncRNA prediction were performed. A total of 9.45 Gb of clean reads was generated, including 335,045 reads of insert (ROI) and 158,085 full-length non-chimeric (FLNC) reads. Transcript clustering analysis of the FLNC reads identified 40,004 consensus isoforms, including 31,015 high-quality ones. After redundancy removal, 28,982 transcripts were obtained. A total of 145 alternative splicing events were predicted. Additionally, 12,753 SSRs were identified by SSR analysis, and 16,205 coding sequences were predicted. Furthermore, 24,031 transcripts were annotated in eight functional databases, and 4,198 lncRNAs were predicted. This is the first study to perform SMRT sequencing of the full-length transcriptome of A. hygrophila. The resulting transcriptome may facilitate further exploration of the genetic data of A. hygrophila and help uncover the interactions between this insect and its ecosystem.
Year: 2018 PMID: 29396453 PMCID: PMC5797098 DOI: 10.1038/s41598-018-20181-y
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Polymerase reads sequence statistics.
| Sample name | cDNA size | SMRT cells | Polymerase reads | Post-filter polymerase reads | Post-filter total number of subread bases | Post-filter number of subread | Post-filter subreads N50 | Post-filter mean subread length |
|---|---|---|---|---|---|---|---|---|
| T01 | 0.5–1 K | 2 | 300,584 | 158,893 | 3,021,283,493 | 3,163,026 | 938 | 955 |
| T01 | 1–2 K | 2 | 300,584 | 199,791 | 4,169,421,605 | 2,534,817 | 1,648 | 1,644 |
| T01 | 2–6 K | 1 | 150,292 | 98,310 | 2,263,026,061 | 801,048 | 2,835 | 2,825 |
cDNA size: insert fragment size of cDNA libraries; SMRT cells: the number of cells used for library construction; Polymerase reads: the number of polymerase reads obtained from sequencing; Post-filter polymerase reads: the number of polymerase reads after filtration; Post-filter total number of subread bases: the number of subread bases after filtration; Post-filter number of subread: the number of subreads after filtration; Post-filter subreads N50: subread N50 length after filtration; Post-filter mean subread length: average subread length after filtration.
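
The last column of the table above can be cross-checked against the two subread columns. The following is a minimal sketch, not the authors' pipeline: it assumes the mean subread length is the total subread bases divided by the number of subreads, truncated to an integer (the truncation is an assumption that happens to reproduce the tabulated values). The variable names are illustrative.

```python
# Post-filter subread statistics copied from the table above.
libraries = {
    "0.5-1 K": {"bases": 3_021_283_493, "subreads": 3_163_026},
    "1-2 K": {"bases": 4_169_421_605, "subreads": 2_534_817},
    "2-6 K": {"bases": 2_263_026_061, "subreads": 801_048},
}

# Mean subread length per library, truncated to whole bases (assumed convention).
mean_lengths = {
    name: stats["bases"] // stats["subreads"] for name, stats in libraries.items()
}

# Total post-filter subread bases across the three libraries, in gigabases.
total_gb = sum(stats["bases"] for stats in libraries.values()) / 1e9

for name, length in mean_lengths.items():
    print(f"{name}: mean subread length = {length} bp")
print(f"total subread bases = {total_gb:.2f} Gb")
```

Under these assumptions the summed subread bases come to about 9.45 Gb, which agrees with the clean-read total given in the abstract.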
Reads of insert (ROI) statistics.
| Sample | cDNA size | Reads of insert | Read bases of insert | Mean read length of insert | Mean read quality of insert | Mean number of passes |
|---|---|---|---|---|---|---|
| T01 | 0.5–1 K | 122,928 | 156,155,506 | 1,270 | 0.93 | 19 |
| T01 | 1–2 K | 142,751 | 279,590,957 | 1,958 | 0.92 | 12 |
| T01 | 2–6 K | 69,366 | 198,654,998 | 2,863 | 0.93 | 9 |
cDNA size: insert fragment size of cDNA libraries; Reads of insert: the number of ROI sequences; Read bases of insert: the total number of ROI bases; Mean read length of insert: average length of ROI; Mean read quality of insert: quality value of ROI sequences; Mean number of passes: the mean sequencing depth of sequences in a zero-mode waveguide (ZMW).
Full-length sequences statistics
| Sample | cDNA size | Reads of insert | Number of five prime reads | Number of three prime reads | Number of poly-A reads | Number of filtered short reads | Number of non-full-length reads | Number of full-length reads | Number of full-length non-chimeric reads | Average full-length non-chimeric read length | Full-length Percentage (FL%) | Artificial concatemers (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T01 | 0.5–1 K | 122,928 | 68,904 | 78,474 | 73,850 | 27,518 | 40,363 | 55,047 | 53,115 | 688 | 44.78% | 3.51% |
| T01 | 1–2 K | 142,751 | 76,467 | 85,726 | 83,554 | 23,630 | 55,571 | 63,550 | 63,051 | 1,225 | 44.52% | 0.79% |
| T01 | 2–6 K | 69,366 | 49,331 | 50,528 | 50,045 | 2,192 | 25,127 | 42,047 | 41,919 | 2,697 | 60.62% | 0.30% |
cDNA size: insert fragment size of cDNA libraries; Reads of insert: the number of reads of insert (ROI) sequences; Number of five prime reads: the number of ROI sequences containing the 5′ primer; Number of three prime reads: the number of ROI sequences containing the 3′ primer; Number of poly-A reads: the number of ROI sequences containing poly-A; Number of filtered short reads: the number of filtered ROI of <300 bp; Number of non-full-length reads: the number of non-full-length ROI; Number of full-length reads: the number of full-length ROI; Number of full-length non-chimeric reads: the number of full-length non-chimeric ROI; Average full-length non-chimeric read length: average length of full-length non-chimeric sequences; Full-length percentage (FL%): the percentage of full-length sequences among ROI sequences; Artificial concatemers (%): the percentage of full-length chimeric sequences among full-length sequences.
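
The two percentage columns follow directly from the legend's definitions and can be recomputed from the read counts. The sketch below is illustrative only: it takes FL% as full-length reads over ROI, and artificial concatemers as the full-length reads that are not FLNC over all full-length reads. The helper `pct` and the variable names are hypothetical.

```python
# (cDNA size, reads of insert, full-length reads, full-length non-chimeric reads)
rows = [
    ("0.5-1 K", 122_928, 55_047, 53_115),
    ("1-2 K", 142_751, 63_550, 63_051),
    ("2-6 K", 69_366, 42_047, 41_919),
]

def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

# For each library: (FL%, artificial concatemers %).
summary = {
    size: (pct(fl, roi), pct(fl - flnc, fl)) for size, roi, fl, flnc in rows
}

# Totals across libraries, comparable to the figures quoted in the abstract.
total_roi = sum(roi for _, roi, _, _ in rows)
total_flnc = sum(flnc for _, _, _, flnc in rows)

for size, (fl_pct, concat_pct) in summary.items():
    print(f"{size}: FL% = {fl_pct}, artificial concatemers = {concat_pct}%")
print(f"total ROI = {total_roi:,}; total FLNC = {total_flnc:,}")
```

The summed ROI and FLNC counts reproduce the 335,045 and 158,085 reads reported in the abstract.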
Comparison results between SMRT sequencing transcript and Illumina sequencing contig and unigene.
| Length distribution (bp) | SMRT sequencing transcript (number) | SMRT sequencing transcript (percentage) | Illumina sequencing assembled contig (number) | Illumina sequencing assembled contig (percentage) | Illumina sequencing assembled unigene (number) | Illumina sequencing assembled unigene (percentage) |
|---|---|---|---|---|---|---|
| 200–300 | 1 | 0.00% | 67378 | 70.41% | 11994 | 25.99% |
| 300–500 | 1663 | 5.74% | 11176 | 11.68% | 11981 | 25.96% |
| 500–1000 | 7331 | 25.30% | 9331 | 9.75% | 10993 | 23.82% |
| 1000–2000 | 8796 | 30.35% | 5600 | 5.85% | 7472 | 16.19% |
| 2000+ | 11191 | 38.61% | 2215 | 2.31% | 3711 | 8.04% |
Comparison of assembly indicators between SMRT sequencing transcript and Illumina sequencing contig and unigene
| Indicator | SMRT sequencing transcript | Illumina sequencing assembled contig | Illumina sequencing assembled unigene |
|---|---|---|---|
| Total Number | 28982 | 95700 | 46151 |
| Total Length | 48811662 | 35633777 | 38506958 |
| N50 Length | 2331 | 731 | 1312 |
| Mean Length | 1684.206128 | 372.348767 | 834.3688761 |
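
For readers unfamiliar with the indicators in this table, the sketch below shows one common way to compute N50 and checks the mean lengths from the tabulated totals. It is an assumption-laden illustration, not the authors' code: N50 is taken as the length L such that sequences of length at least L cover half or more of the total bases, and `toy_lengths` is made-up data, since the individual transcript lengths are not given here.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length

# Hypothetical lengths, used only to demonstrate the function.
toy_lengths = [2, 3, 4, 5, 6]
toy_n50 = n50(toy_lengths)

# (total length, total number) copied from the table above.
assemblies = {
    "SMRT transcript": (48_811_662, 28_982),
    "Illumina contig": (35_633_777, 95_700),
    "Illumina unigene": (38_506_958, 46_151),
}

# Mean length = total length / total number of sequences.
mean_lengths = {
    name: round(total / count, 2) for name, (total, count) in assemblies.items()
}

print(f"toy N50 = {toy_n50}")
for name, mean in mean_lengths.items():
    print(f"{name}: mean length = {mean} bp")
```

The recomputed means agree with the Mean Length row once rounded to two decimals.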
Results of Iterative Clustering for Error Correction (ICE) clustering analysis.
| Samples | Size | Number of consensus isoforms | Average consensus isoforms read length | Number of polished high-quality isoforms | Number of polished low-quality isoforms | Percent of polished high-quality isoforms(%) |
|---|---|---|---|---|---|---|
| T01 | 0–2 kb | 22,147 | 1,036 | 18,973 | 3,174 | 85.67% |
| T01 | 2–3 kb | 11,876 | 2,483 | 8,826 | 3,050 | 74.32% |
| T01 | 3–6 kb | 5,548 | 3,647 | 3,212 | 2,336 | 57.89% |
| T01 | >6 kb | 433 | 8,785 | 4 | 429 | 0.92% |
Size: insert fragment size of cDNA libraries; Number of consensus isoforms: the number of consensus isoforms obtained from ICE clustering analysis; Average consensus isoforms read length: average sequence length of consensus isoforms; Number of polished high-quality isoforms: the number of high-quality transcripts; Number of polished low-quality isoforms: the number of low-quality transcripts; Percent of polished high-quality isoforms (%): percentage of high-quality transcripts among consensus isoforms.
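
The ICE table is internally consistent, which can be confirmed with a short check. This sketch assumes only what the legend states, namely that the high-quality percentage is the number of polished high-quality isoforms divided by the number of consensus isoforms. The variable names are illustrative.

```python
# (size bin, consensus isoforms, high-quality isoforms, low-quality isoforms)
ice_rows = [
    ("0-2 kb", 22_147, 18_973, 3_174),
    ("2-3 kb", 11_876, 8_826, 3_050),
    ("3-6 kb", 5_548, 3_212, 2_336),
    (">6 kb", 433, 4, 429),
]

# Percentage of consensus isoforms that are high quality, per size bin.
hq_percent = {
    size: round(100 * hq / consensus, 2) for size, consensus, hq, _ in ice_rows
}

# High- and low-quality counts should add up to the consensus count in each bin.
bins_consistent = all(hq + lq == consensus for _, consensus, hq, lq in ice_rows)

# Totals across bins, comparable to the figures quoted in the abstract.
total_consensus = sum(consensus for _, consensus, _, _ in ice_rows)
total_hq = sum(hq for _, _, hq, _ in ice_rows)

for size, percent in hq_percent.items():
    print(f"{size}: {percent}% high quality")
print(f"consensus = {total_consensus:,}; high quality = {total_hq:,}")
```

The totals match the 40,004 consensus isoforms and 31,015 high-quality isoforms given in the abstract.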
Figure 1. The distribution of the coding sequence lengths of the complete open reading frame. The x-axis represents the coding sequence length; the y-axis represents the number of predicted open reading frames.
Figure 2. Homologous species distribution of Agasicles hygrophila annotated in the NR database.
Figure 3. Gene Ontology (GO) functional annotation of Agasicles hygrophila transcripts. Green represents biological process; blue represents molecular function; and red represents cellular component. The x-axis represents GO categories; the y-axis (right) represents the number of transcripts; and the y-axis (left) represents the percentage of transcripts.
Figure 4. Clusters of Orthologous Groups of protein (COG) annotation of Agasicles hygrophila transcripts. The x-axis represents COG categories; the y-axis represents the number of transcripts.
Figure 5. Venn diagram of the number of lncRNAs predicted by Coding Potential Calculator (CPC), Coding-Non-Coding Index (CNCI), Coding Potential Assessment Tool (CPAT), and Pfam protein structure domain analysis.