Vincent Magrini, Xin Gao, Bruce A Rosa, Sean McGrath, Xu Zhang, Kymberlie Hallsworth-Pepin, John Martin, John Hawdon, Richard K Wilson, Makedonka Mitreva.
Abstract
BACKGROUND: The advantages of Pacific Biosciences (PacBio) single-molecule real-time (SMRT) technology include long reads, low systematic bias, and high consensus read accuracy. Here we use these attributes to improve on the genome annotation of the parasitic hookworm Ancylostoma ceylanicum using PacBio RNA-Seq.
Keywords: Ancylostoma ceylanicum; Gene loci; Genome annotation improvement; Hookworm; Pacific bioscience mRNA sequencing
Year: 2018 PMID: 29495964 PMCID: PMC5833154 DOI: 10.1186/s12864-018-4555-7
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Comparison of genome statistics to other nematode species
| Species | Phylogenetic clade | # Genes | CEGMA completeness | Average gene length (bp) | Assembly length (bp) | # Scaffolds | N50 (#) | N50 length (bp) | GC content (%) | Mean contig length (bp) | Median contig length (bp) | Max contig length (bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | I | 16,380 | 95.6% | 952.8 | 63,525,422 | 6863 | 4 | 6,373,445 | 33.9 | 9256 | 1071 | 12,041,450 |
| | I | 11,004 | 96.1% | 1245.7 | 84,674,602 | 1683 | 59 | 400,602 | 44.8 | 50,312 | 1914 | 1,774,400 |
| | I | 8813 | 96.1% | 1293.9 | 75,496,503 | 4156 | 265 | 70,602 | 42.2 | 18,166 | 3965 | 533,758 |
| | III | 15,260 | 98.9% | 1188.5 | 265,545,801 | 31,538 | 260 | 290,558 | 37.8 | 8420 | 226 | 1,465,500 |
| | III | 18,074 | 97.4% | 1012.2 | 94,136,243 | 9827 | 62 | 191,089 | 29.6 | 9579 | 1340 | 5,235,760 |
| | III | 12,857 | 98.0% | 1134.2 | 88,309,529 | 16,061 | 219 | 71,281 | 28 | 5498 | 754 | 1,085,577 |
| | III | 15,445 | 97.8% | 987.8 | 91,373,458 | 5773 | 130 | 174,388 | 31 | 15,828 | 1186 | 1,325,655 |
| | IV | 16,403 | 83.8% | 1079.6 | 124,672,549 | 6873 | 298 | 121,687 | 36.7 | 18,139 | 1699 | 600,076 |
| | IV | 14,420 | 97.2% | 1044.7 | 53,017,507 | 3452 | 372 | 37,608 | 27.4 | 15,358 | 5814 | 360,446 |
| | V | 30,697 | 100.0% | 1229.5 | 100,286,401 | 7 | 3 | 17,493,829 | 35.4 | 14,326,629 | 15,279,421 | 20,924,180 |
| | V | 24,466 | 95.0% | 1124.8 | 369,846,877 | 23,860 | 1151 | 83,287 | 43.1 | 15,501 | 1515 | 947,606 |
| | V | 19,153 | 97.2% | 804.6 | 244,075,060 | 11,864 | 284 | 211,861 | 40.2 | 20,573 | 1315 | 1,890,151 |
| | V | 24,217 | 98.0% | 994.3 | 172,494,865 | 18,083 | 39 | 1,244,534 | 42.8 | 9539 | 685 | 5,268,024 |
| AC-Orig | V | 16,155 | 98.9% | 894.3 | 348,994,891 | 8098 | 263 | 373,206 | 43.5 | 43,096 | 1515 | 2,174,208 |
| AC-PB | V | 17,540 | | 962.8 | | | | | | | | |
P4 Chemistry Sequencing Statistics
| Movie length (min) | Total bases (Mb) | Polymerase read length (bp) | Polymerase read quality | Read of insert length (bp) | Read of insert quality | ZMW loading (P0) | ZMW loading (P1) | ZMW loading (P2) |
|---|---|---|---|---|---|---|---|---|
| 75 | 409.5 | 5858 | 0.84 | 813 | 0.93 | 34,281 (23%) | 69,907 (47%) | 46,104 (31%) |
| 75 | 432.28 | 5938 | 0.84 | 811 | 0.93 | 26,892 (18%) | 72,796 (48%) | 50,604 (34%) |
| 75 | 408.76 | 5837 | 0.84 | 808 | 0.93 | 30,481 (20%) | 70,029 (47%) | 49,782 (33%) |
| 75 | 398.05 | 6142 | 0.84 | 795 | 0.94 | 35,605 (24%) | 64,808 (43%) | 49,879 (33%) |
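The loading-efficiency percentages in the table appear to be each ZMW count's share of the P0 + P1 + P2 total for that run. A minimal Python sketch of that arithmetic, using only the counts from the table (the variable and function names are illustrative, not from the paper):

```python
# ZMW counts (P0, P1, P2) copied from the four runs in the table above.
runs = [
    (34281, 69907, 46104),
    (26892, 72796, 50604),
    (30481, 70029, 49782),
    (35605, 64808, 49879),
]


def loading_percentages(counts):
    """Return each count as a whole-number percentage of the run total."""
    total = sum(counts)
    return tuple(round(100 * count / total) for count in counts)


zmw_percentages = [loading_percentages(run) for run in runs]

for run, percentages in zip(runs, zmw_percentages):
    print(sum(run), percentages)
```

Under this reading, the rounded shares match the percentages printed in the table for all four runs.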
Fig. 1 An overview of the gene prediction process. The gene prediction process, as described in the methods, is divided into major and minor steps
Illumina and PacBio RNA-Seq read coverage over predicted gene sets
| Read type | Gene set | Subset of genes | Total count | # of expressed genes (breadth ≥50%) | % expressed | Average read depth (any) | Average read depth (expressed) |
|---|---|---|---|---|---|---|---|
| Illumina Reads | AC-Orig | All genes | 16,026 | 10,405 | 64.9% | 161.1 | 245.6 |
| Illumina Reads | AC-Orig | Overlapping AC-PB genes | 15,808 | 10,254 | 64.9% | 160.9 | 245.5 |
| Illumina Reads | AC-Orig | Not overlapping AC-PB genes | 218 | 151 | 69.3% | 171.3 | 246.4 |
| Illumina Reads | AC-PB | All genes | 17,540 | 11,721 | 66.8% | 156.3 | 231.6 |
| Illumina Reads | AC-PB | Overlapping AC-Orig genes | 15,931 | 10,365 | 65.1% | 158.3 | 240.8 |
| Illumina Reads | AC-PB | Not overlapping AC-Orig genes | 1609 | 1356 | 84.3% | 136.5 | 161.1 |
| Illumina Reads | Schwarz et al., [ | | 36,687 | 16,376 | 44.6% | 90.4 | 199.4 |
| PacBio Reads | AC-Orig | All genes | 16,026 | 3166 | 19.8% | 3.3 | 8.9 |
| PacBio Reads | AC-Orig | Overlapping AC-PB genes | 15,808 | 3128 | 19.8% | 3.2 | 8.8 |
| PacBio Reads | AC-Orig | Not overlapping AC-PB genes | 218 | 38 | 17.4% | 4.1 | 12.8 |
| PacBio Reads | AC-PB | All genes | 17,540 | 4209 | 24.0% | 3.6 | 8.8 |
| PacBio Reads | AC-PB | Overlapping AC-Orig genes | 15,931 | 3398 | 21.3% | 3.4 | 8.7 |
| PacBio Reads | AC-PB | Not overlapping AC-Orig genes | 1609 | 811 | 50.4% | 6.4 | 9.1 |
| PacBio Reads | Schwarz et al., [ | | 36,687 | 4903 | 13.4% | 2.1 | 8.8 |
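The "% expressed" column appears to be the number of expressed genes divided by the total gene count for each subset. A short Python sketch, assuming that definition and using a handful of rows copied from the table (the dictionary keys and function name are illustrative):

```python
# (read type, gene set, subset) -> (total gene count, expressed gene count),
# copied from selected rows of the coverage table above.
coverage_rows = {
    ("Illumina", "AC-Orig", "All genes"): (16026, 10405),
    ("Illumina", "AC-PB", "All genes"): (17540, 11721),
    ("Illumina", "AC-PB", "Not overlapping AC-Orig genes"): (1609, 1356),
    ("PacBio", "AC-Orig", "All genes"): (16026, 3166),
    ("PacBio", "AC-PB", "All genes"): (17540, 4209),
    ("PacBio", "AC-PB", "Not overlapping AC-Orig genes"): (1609, 811),
}


def percent_expressed(total, expressed):
    """Share of genes counted as expressed, as a percentage to one decimal."""
    return round(100 * expressed / total, 1)


expressed_pct = {
    key: percent_expressed(total, expressed)
    for key, (total, expressed) in coverage_rows.items()
}

for key, pct in expressed_pct.items():
    print(key, f"{pct:.1f}%")
```

On these rows the computed values agree with the table, including the higher expressed fraction among AC-PB genes that do not overlap AC-Orig genes (84.3% for Illumina reads, 50.4% for PacBio reads).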
Fig. 2 Differences in gene lengths and UTR lengths between AC-Orig and AC-PB. Differences in gene lengths are shown for (a) genes not split or merged between the annotations, (b) genes split in AC-PB compared to AC-Orig, and (c) genes merged in AC-PB compared to AC-Orig. (d) Differences in UTR lengths (summed for 5′ and 3′) between AC-Orig and AC-PB. Additional file 1 shows the UTR lengths separately for the 5′ and 3′ regions
Fig. 3 Frequency distribution plots for AC-Orig (blue) and AC-PB (orange) for (a) CDS lengths, (b) exon lengths, (c) intron lengths, (d) 3′ UTRs and (e) 5′ UTRs
Fig. 4 An example of improved gene structure prediction in AC-PB (black) compared to AC-Orig (orange). The BLASTX (red) and protein2genome (blue) predictions used for AC-Orig yielded a short gene model, but additional PacBio evidence (green) extended the existing gene and also predicted a second gene at the 5′ end of the original gene. Shaded grey areas represent masked repeat sequences in the assembly
Summary of CDS and 5′ and 3’ Untranslated Region (UTR) statistics, including overlaps between gene sets and with assembled Illumina transcripts and RNAseq Illumina reads
| Gene region | Statistic | AC-Orig: all genes | AC-Orig: overlapping genes | AC-Orig: unique genes | AC-PB: all genes | AC-PB: overlapping genes | AC-PB: unique genes |
|---|---|---|---|---|---|---|---|
| CDS | # of genes | 16,026 | 15,808 | 218 | 17,540 | 15,931 | 1609 |
| CDS | Average length (bp) | 962.8 | 962.4 | 994.3 | 894.3 | 922.0 | 619.8 |
| CDS | Total length (kbp) | 15,430.0 | 15,213.2 | 216.8 | 15,685.3 | 14,688.0 | 997.3 |
| CDS | Coverage by Illumina reads: % of genes | 78.1% | 78.0% | 80.7% | 79.6% | 78.2% | 93.7% |
| CDS | Coverage by Illumina reads: % of bases | 66.9% | 66.8% | 68.7% | 67.5% | 66.4% | 83.7% |
| CDS | Coverage by Illumina reads: length not covered (kbp) | 5111.7 | 5043.9 | 67.8 | 5104.8 | 4942.2 | 162.7 |
| CDS | Coverage by Illumina Stringtie contigs: % of genes | 65.2% | 65.1% | 67.9% | 66.9% | 64.8% | 84.0% |
| CDS | Coverage by Illumina Stringtie contigs: % of bases | 61.8% | 61.8% | 64.3% | 62.4% | 61.2% | 79.5% |
| CDS | Coverage by Illumina Stringtie contigs: length not covered (kbp) | 5895.8 | 5818.4 | 77.4 | 5899.6 | 5694.7 | 204.9 |
| 5’ UTR | # of genes | 1101 | 1083 | 18 | 3404 | 2702 | 702 |
| 5’ UTR | Average length (bp) | 57.7 | 58.2 | 24.3 | 88.2 | 83.9 | 104.8 |
| 5’ UTR | Total length (kbp) | 63.5 | 63.1 | 0.4 | 300.2 | 226.7 | 73.6 |
| 5’ UTR | Coverage by Illumina reads: % of genes | 74.1% | 74.1% | 72.2% | 78.7% | 79.8% | 74.2% |
| 5’ UTR | Coverage by Illumina reads: % of bases | 61.5% | 61.3% | 89.7% | 73.1% | 74.7% | 68.4% |
| 5’ UTR | Coverage by Illumina reads: length not covered (kbp) | 18.7 | 18.6 | 0.1 | 80.7 | 57.4 | 23.3 |
| 5’ UTR | Coverage by Illumina Stringtie contigs: % of genes | 77.3% | 77.2% | 83.3% | 79.5% | 80.4% | 76.2% |
| 5’ UTR | Coverage by Illumina Stringtie contigs: % of bases | 70.6% | 70.5% | 86.7% | 66.8% | 69.2% | 59.5% |
| 5’ UTR | Coverage by Illumina Stringtie contigs: length not covered (kbp) | 24.5 | 24.4 | 0.0 | 99.5 | 69.8 | 29.8 |
| 3’ UTR | # of genes | 1234 | 1218 | 16 | 6608 | 5363 | 1245 |
| 3’ UTR | Average length (bp) | 78.4 | 78.8 | 52.3 | 232.1 | 234.7 | 221.0 |
| 3’ UTR | Total length (kbp) | 96.8 | 95.9 | 0.8 | 1533.9 | 1258.8 | 275.1 |
| 3’ UTR | Coverage by Illumina reads: % of genes | 50.1% | 50.3% | 31.3% | 73.3% | 74.6% | 67.5% |
| 3’ UTR | Coverage by Illumina reads: % of bases | 73.9% | 73.9% | 75.6% | 77.0% | 78.2% | 71.2% |
| 3’ UTR | Coverage by Illumina reads: length not covered (kbp) | 31.1 | 30.8 | 0.3 | 513.6 | 402.5 | 111.1 |
| 3’ UTR | Coverage by Illumina Stringtie contigs: % of genes | 56.3% | 56.7% | 25.0% | 70.4% | 71.5% | 65.3% |
| 3’ UTR | Coverage by Illumina Stringtie contigs: % of bases | 67.9% | 67.9% | 60.0% | 66.5% | 68.0% | 59.6% |
| 3’ UTR | Coverage by Illumina Stringtie contigs: length not covered (kbp) | 25.3 | 25.1 | 0.2 | 353.1 | 273.9 | 79.2 |
A Summary of characteristics of assemblies annotated without and with PacBio mRNA sequences
| Statistic | Original | Improved | AC-Orig genes overlapping (10%) AC-PB genes with PacBio evidence | AC-PB genes with PacBio evidence (with or without EST evidence) |
|---|---|---|---|---|
| Number of genes | 16,026 | 17,540 | 6734 | 8238 |
| Number of single exon genes | 805 | 863 | 154 | 211 |
| Total length of all exons (bp) | 15,590,301 | 17,519,546 | 7,915,169 | 9,931,507 |
| Total number of exons | 117,877 | 121,578 | 63,273 | 67,714 |
| Average exon length (bp) | 132.3 | 144.1 | 125.1 | 146.7 |
| Average # exons/gene | 7.4 | 6.9 | 9.4 | 8.2 |
| Total length of all CDS exons (bp) | 15,429,981 | 15,685,322 | 28,877,304 | 27,490,435 |
| Total number of CDS exons | 117,657 | 119,866 | 63,129 | 66,083 |
| Average CDS exon length (bp) | 131.1 | 130.9 | 123.5 | 123.2 |
| Average # coding exons/gene | 7.3 | 6.8 | 9.4 | 8.0 |
| Total length of all introns (bp) | 63,133,642 | 60,868,345 | 28,930,431 | 28,142,643 |
| Total number of introns | 101,851 | 104,038 | 56,566 | 59,476 |
| Average intron length (bp) | 621.2 | 594.9 | 511.7 | 473.2 |
| Average # introns/gene | 6.4 | 5.9 | 8.4 | 7.2 |
| Total UTR length (bp) | 160,320 | 1,834,224 | 117,930 | 1,788,957 |
| Number of genes with UTR | 1889 | 7295 | 1205 | 6567 |
| Average size of UTR per gene with UTR | 84.9 | 251.4 | 97.9 | 272.4 |
| Number of genes with UTR < 10 bp | 1228 | 3966 | 744 | 3451 |
| Number of genes with UTR 10 bp - 100 bp | 423 | 1058 | 287 | 908 |
| Number of genes with UTR > 100 bp | 238 | 2271 | 174 | 2208 |
| Total 5’ UTR length (bp) | 63,556 | 300,333 | 39,954 | 273,401 |
| Number of genes with 5’ UTR | 1150 | 3488 | 710 | 3013 |
| Number of genes with spliced 5’ UTR | 127 | 745 | 82 | 696 |
| Total 3’ UTR length (bp) | 96,764 | 1,533,891 | 77,976 | 1,515,556 |
| Number of genes with 3’ UTR | 1238 | 6611 | 817 | 6145 |
| Number of genes with spliced 3’ UTR | 45 | 400 | 33 | 388 |
| # of ESTs at 3’ | – | – | 3610 | 105,053 |
| polyA signal ‘aataaa/attaaa’ | – | – | 1307 | 61,069 |
| polyA signal ‘agtaaa’ only | – | – | 215 | 7775 |
| polyA total (any signal) | – | – | 1522 | 68,844 |
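Several rows of this table are arithmetically linked: the total UTR length should equal the 5’ and 3’ totals added together, and the average UTR size per gene should equal the total UTR length divided by the number of genes with a UTR. The following Python sketch checks those two relationships for the Original and Improved columns, taking the improved 3’ UTR total as 1,533,891 bp (the dictionary layout and function names are illustrative assumptions):

```python
# Values copied from the "Original" and "Improved" columns of the table above.
# The improved 3' UTR total is read as 1,533,891 bp.
annotations = {
    "Original": {
        "utr5_bp": 63_556,
        "utr3_bp": 96_764,
        "utr_total_bp": 160_320,
        "genes_with_utr": 1889,
    },
    "Improved": {
        "utr5_bp": 300_333,
        "utr3_bp": 1_533_891,
        "utr_total_bp": 1_834_224,
        "genes_with_utr": 7295,
    },
}


def summed_utr(entry):
    """Add the 5' and 3' UTR totals (bp)."""
    return entry["utr5_bp"] + entry["utr3_bp"]


def mean_utr_per_gene(entry):
    """Total UTR length divided by the number of genes with a UTR."""
    return round(entry["utr_total_bp"] / entry["genes_with_utr"], 1)


for name, entry in annotations.items():
    consistent = summed_utr(entry) == entry["utr_total_bp"]
    print(name, consistent, mean_utr_per_gene(entry))
```

Both columns are internally consistent under these assumptions, and the computed averages agree with the 84.9 bp and 251.4 bp reported in the table.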