Alexandre Souvorov, Richa Agarwala.
Abstract
BACKGROUND: Illumina is currently the dominant sequencing technology. Short read length, short insert size, some systematic biases, and low-level carryover contamination in Illumina reads continue to make assembly of repeated regions a challenging problem. Some applications also require finding multiple well-supported variants for assembled regions.
Keywords: Antimicrobial resistance; De-novo assembly; Illumina reads; RNA-seq; de Bruijn graphs
Year: 2021 PMID: 34289805 PMCID: PMC8293564 DOI: 10.1186/s12859-021-04174-9
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Forks for filtering by reads and read pairs
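To make the notion of a fork concrete, here is a minimal Python sketch of a de Bruijn graph built from toy reads, with a fork taken to mean a node that has more than one successor. This is a generic illustration only: the function names `build_de_bruijn` and `find_forks`, the toy reads, and the choice of k are assumptions for this example, and the sketch does not reproduce SAUTE's filtering by reads and read pairs.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def find_forks(graph):
    """Return every node with more than one successor, with sorted successors."""
    return {node: sorted(succ) for node, succ in graph.items() if len(succ) > 1}


# Two toy reads (hypothetical) that share the prefix CGTA and then diverge.
toy_reads = ["ACGTAC", "CGTAGG"]
toy_graph = build_de_bruijn(toy_reads, k=4)
toy_forks = find_forks(toy_graph)
print(toy_forks)
```

In this toy case the only fork is at the node `GTA`, where the two reads disagree on the next base. An assembler has to decide which branches of such a fork are supported well enough to keep.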
Fig. 2 Four subgraphs resulting from assembly of mouse reads using the NP_631888.1 target sequence by SAUTE low are shown as graphs A, B, C, and D. Graph D consists of k-mers from a low-complexity region of the target
Read and target information for RNA-seq BUSCO set
| Read sp. | SRA runs | Clade | Count |
|---|---|---|---|
| Corn | SRR1588569 | liliopsida | 15 |
| Thale cress | SRR5344669, SRR5344670 | eudicots | 31 |
| Worm | SRR10005501 | nematoda | 7 |
| Mouse | SRR10982198 | mammalia | 23 |
| Human | SRR1957703, SRR1957706 | mammalia | 23 |
Count in the last column is the number of species in OrthoDB v10.1 for the clade after excluding Ornithorhynchus anatinus from the mammalian clade
Number of benchmark proteins in orthologous pairs recovered perfectly or as essentially complete by coding regions assembled by different methods
| Read set | Target species | Ortho pairs (count) | Median identity (%) | Perfect: rnaSP | Perfect: Trinity | Perfect: SPAln | Perfect: Clust | Perfect: SP10 | Essentially complete: rnaSP | Essentially complete: Trinity | Essentially complete: SPAln | Essentially complete: Clust | Essentially complete: SP10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Corn | Z. marina | 2710 | 57.78 | 537 | 539 | 213 | 312 | 1434 | 1037 | 916 | 1344 | | |
| | A. tauschii | 2888 | 77.94 | 550 | 555 | 410 | 510 | 1491 | 1514 | 1360 | 1395 | | |
| | S. bicolor | 2871 | 92.60 | 549 | 554 | 463 | 572 | 1488 | 1507 | 1399 | 1464 | | |
| Thale cress | P. axillaris | 2287 | 61.56 | 0 | 598 | 1216 | 1598 | 1902 | 0 | 1364 | 1503 | | |
| | B. rapa | 2312 | 85.18 | 1636 | 0 | 1255 | 1405 | 1924 | 0 | 1879 | 1697 | | |
| | A. thaliana | 2317 | 100.00 | 1640 | 0 | 1689 | 1375 | 1928 | 0 | 1991 | 1668 | | |
| Worm | T. spiralis | 2194 | 40.92 | 0 | 156 | 646 | 1046 | 0 | 638 | 867 | 1269 | | |
| | C. briggsae | 3112 | 86.46 | 2230 | 0 | 1216 | 2016 | 2707 | 0 | 2336 | 2486 | | |
| | C. elegans | 3130 | 100.00 | 2244 | 0 | 2276 | 2176 | 2724 | 0 | 2720 | 2624 | | |
| Mouse | P. cinereus | 9005 | 77.04 | 2824 | 2880 | 1954 | 2185 | 3790 | 3743 | 3393 | 3202 | | |
| | H. sapiens | 9109 | 86.73 | 2831 | 2889 | 2212 | 2568 | 3803 | 3753 | 3537 | 3640 | | |
| | M. musculus | 9177 | 100.00 | 2858 | 2914 | 2862 | 3001 | 3832 | 3781 | 3850 | 3982 | | |
| Human | P. cinereus | 8984 | 78.80 | 2796 | 3107 | 2144 | 2206 | 4383 | 4645 | 3995 | 3551 | | |
| | M. musculus | 9109 | 86.73 | 2824 | 3134 | 2372 | 2401 | 4425 | 4684 | 4168 | 3784 | | |
| | H. sapiens | 9157 | 100.00 | 2839 | 3152 | 2994 | 2601 | 4448 | 4708 | 4630 | 4032 | | |
SAUTE_PROT low (SP10) with a maximum of 10 variants reported per graph, SPAligner (SPAln), and CLUSTER (Clust) used proteins from the target species for assembling the read set. rnaSPAdes (rnaSP) and Trinity are de-novo assemblers. The median percent identity between orthologous protein pairs varies from 40.92% to 100%. In each row, the counts for the method that finds the largest number of proteins as perfect or as essentially complete are in bold
Fig. 3 RNA-seq assembly comparison using BUSCO set: The number of additional proteins recovered perfectly by SAUTE_PROT low with a maximum of 10 variants reported per graph, compared to rnaSPAdes, is shown as a function of the percent identity of the alignment between the target and read protein. SAUTE_PROT low performs worse than rnaSPAdes only for the worm reads assembled using the Trichinella spiralis target, but slightly outperforms rnaSPAdes for the small subset of Trichinella spiralis target sequences whose alignments to worm proteins have identity 75%
Drosophila melanogaster and Drosophila innubila proteins in THO complex genes, their lengths, and alignment percent identity between orthologous pairs
| Gene | Isoform | D. melanogaster protein | D. melanogaster length (aa) | D. innubila protein | D. innubila length (aa) | Identity (%) |
|---|---|---|---|---|---|---|
| tho2 | A | NP_722763.1 | 1641 | XP_034472414.1 | 1660 | 79.42 |
| tho2 | B | NP_608646.3 | 1642 | XP_034472414.1 | 1660 | 79.38 |
| tho2 | C | NP_001259905.1 | 1641 | XP_034472414.1 | 1660 | 79.32 |
| thoc5 | All | NP_611856.1 | 616 | XP_034478414.1 | 616 | 63.21 |
| thoc6 | All | NP_648557.1 | 350 | XP_034481127.1 | 346 | 73.45 |
| Hpr1 | All | NP_649594.1 | 701 | XP_034485608.1 | 716 | 77.59 |
| thoc7 | A | NP_728489.2 | 288 | XP_034481424.1 | 273 | 70.55 |
| thoc7 | B | NP_612011.1 | 287 | XP_034481424.1 | 273 | 70.18 |
Comparison of Drosophila melanogaster protein recovery for genes in THO complex
| Gene and isoform | Method/Reads | SRR10541157 | SRR10541159 | SRR10541107 | SRR10541200 | SRR10541164 |
|---|---|---|---|---|---|---|
| Read bases (Gb) | | 2.51 | 2.78 | 2.95 | 3.47 | 4.49 |
| tho2 isoform A | SAUTE w/ D. mel | 1..1527, 100% | | | | |
| | SAUTE w/ D. inn | 1..1527, 100% | | | | |
| | rnaSPAdes | Not found | Not found | Not found | Not found | Not found |
| | Trinity | Not found | Not found | Not found | Not found | |
| | Clust w/ D. mel | Not found | Not found | Not found | Not found | |
| | Clust w/ D. inn | Not found | Not found | Not found | Not found | Not found |
| tho2 isoform B | SAUTE w/ D. mel | | | | | |
| | SAUTE w/ D. inn | | | | | |
| | rnaSPAdes | | | | | |
| | Trinity | 1..1528, 100% | | | | |
| | Clust w/ D. mel | Not found | | | | |
| | Clust w/ D. inn | Not found | Not found | Not found | Not found | Not found |
| tho2 isoform C | SAUTE w/ D. mel | Not found | Not found | Not found | Not found | 1..1539, 100% |
| | SAUTE w/ D. inn | Not found | Not found | Not found | Not found | 1..1539, 100% |
| | rnaSPAdes | Not found | Not found | Not found | Not found | Not found |
| | Trinity | Not found | Not found | Not found | Not found | Not found |
| | Clust w/ D. mel | Not found | Not found | Not found | Not found | Not found |
| | Clust w/ D. inn | Not found | Not found | Not found | Not found | Not found |
| thoc5 | SAUTE w/ D. mel | Full, 100% | Full, 99.5% | | | |
| | SAUTE w/ D. inn | Full, 100% | Full, 99.5% | | | |
| | rnaSPAdes | Full, 99.8% | Full, 100% | Full, 99.5% | Full, 99.7% | |
| | Trinity | Full, 99.8% | Full, 100% | Full, 99.5% | Full, 99.7% | |
| | Clust w/ D. mel | Full, 100% | Full, 99.5% | Full, 99.7% | | |
| | Clust w/ D. inn | Full, 99.8% | Full, 100% | Full, 99.5% | | |
| thoc6 | SAUTE w/ D. mel | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% | |
| | SAUTE w/ D. inn | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% | |
| | rnaSPAdes | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% |
| | Trinity | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% |
| | Clust w/ D. mel | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% |
| | Clust w/ D. inn | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% | Full, 99.7% |
| Hpr1 | SAUTE w/ D. mel | Full, 100% | Full, 100% | Full, 100% | Full, 100% | Full, 100% |
| | SAUTE w/ D. inn | Full, 100% | Full, 100% | Full, 100% | Full, 100% | Full, 100% |
| | rnaSPAdes | Full, 100% | Full, 100% | Full, 100% | Full, 100% | Full, 100% |
| | Trinity | Full, 100% | Full, 100% | Full, 100% | Full, 100% | Full, 100% |
| | Clust w/ D. mel | Full, 100% | Full, 100% | Full, 100% | Full, 100% | Full, 100% |
| | Clust w/ D. inn | Full, 100% | Full, 100% | Full, 100% | Full, 100% | Full, 100% |
| thoc7 isoform A | SAUTE w/ D. mel | Not found | Not found | 24..288, 100% | 10..288, 100% | Not found |
| | SAUTE w/ D. inn | Not found | Not found | 24..288, 100% | 10..288, 100% | Not found |
| | rnaSPAdes | Not found | Not found | Full, 99.7% | 10..288, 100% | Not found |
| | Trinity | Not found | Not found | Full, 99.7% | 10..288, 100% | Not found |
| | Clust w/ D. mel | Not found | Not found | Full, 99.7% | 10..288, 100% | Not found |
| | Clust w/ D. inn | Not found | Full, 99.7% | 12..288, 99.6% | 10..288, 100% | Not found |
| thoc7 isoform B | SAUTE w/ D. mel | Not found | Not found | Not found | 10..287, 100% | Not found |
| | SAUTE w/ D. inn | Not found | Not found | Not found | Not found | Not found |
| | rnaSPAdes | Not found | Not found | Not found | Not found | Not found |
| | Trinity | Not found | Not found | Not found | Not found | Not found |
| | Clust w/ D. mel | Not found | Not found | Not found | Not found | Not found |
| | Clust w/ D. inn | Not found | Not found | Not found | Not found | Not found |
SAUTE_PROT low (SAUTE) and CLUSTER (Clust) used proteins in Table 3 for Drosophila melanogaster (D. mel) and Drosophila innubila (D. inn) as targets. Cells in bold and highlighted in yellow show cases where the D. mel protein is recovered perfectly and there is at least one method that does not recover it perfectly. Proteins recovered as full length are marked as 'Full'; otherwise, the recovered coordinates on the D. mel protein are provided
Fig. 4 Variants reported in assembly of SRR10541157 by SAUTE low for thoc5 protein. SAUTE produces correct variants using pairing information in reads for region A while the variant produced by both rnaSPAdes and Trinity is not supported by any paired read. Region B shows haplotyping achieved using reads alone as highlighted in yellow and additional haplotyping achieved using pairing information
Sensitivity and precision achieved by different methods using AMRFinderPlus calls made on assemblies of 763 read sets and the corresponding finished assembly in FDA-ARGOS set
| Set | True positive | False positive | False negative | Sensitivity | Precision |
|---|---|---|---|---|---|
| SAUTE default | 2801 | 575 | 22 | 0.99 | 0.83 |
| SAUTE low | 2803 | 833 | 20 | 0.99 | 0.77 |
| SKESA | 2674 | 308 | 149 | 0.95 | 0.90 |
| SKESA + SAUTE default | 2801 | 577 | 22 | 0.99 | 0.83 |
| SKESA + SAUTE low | 2803 | 834 | 20 | 0.99 | 0.77 |
| SPAdes | 2716 | 794 | 107 | 0.96 | 0.77 |
| plasmidSPAdes | 915 | 362 | 1908 | 0.32 | 0.72 |
| SPAdes + plasmidSPAdes | 2720 | 809 | 103 | 0.96 | 0.77 |
| Cluster | 2756 | 1209 | 67 | 0.98 | 0.70 |
| SPAligner | 2738 | 925 | 85 | 0.97 | 0.75 |
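The sensitivity and precision columns are consistent with the standard definitions, sensitivity = TP / (TP + FN) and precision = TP / (TP + FP), applied to the three count columns. A minimal Python sketch that recomputes a few rows follows; the function name `sensitivity_precision` and the dictionary layout are illustrative assumptions, not part of the tooling described in the paper.

```python
def sensitivity_precision(tp, fp, fn):
    """Return (sensitivity, precision) from true/false positive and false negative counts."""
    sensitivity = tp / (tp + fn)
    precision = tp / (tp + fp)
    return sensitivity, precision


# (TP, FP, FN) counts copied from three rows of the table above.
counts = {
    "SAUTE default": (2801, 575, 22),
    "SKESA": (2674, 308, 149),
    "plasmidSPAdes": (915, 362, 1908),
}

results = {name: sensitivity_precision(*c) for name, c in counts.items()}
for name, (sens, prec) in results.items():
    print(f"{name}: sensitivity={sens:.2f} precision={prec:.2f}")
```

In every row TP + FN equals 2823, so sensitivity differences between methods come entirely from how many of those 2823 calls each assembly recovers, while precision depends on the number of false positive calls.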
Number of variants produced by graphs generated by SAUTE default on AMR set and SAUTE_PROT low on BUSCO set
| Number of variants | AMR: number of graphs (%) | BUSCO: number of graphs (%) |
|---|---|---|
| 1 | 177,185 (96.83) | 607,609 (60.16) |
| 2 | 4407 (2.41) | 204,994 (20.30) |
| 3 | 230 (0.13) | 34,205 (3.39) |
| 4 | 946 (0.52) | 63,283 (6.27) |
| 5-10 | 172 (0.09) | 53,370 (5.28) |
| 11-100 | 50 (0.03) | 41,143 (4.07) |
| 101-1000 | 5 (0) | 4316 (0.43) |
| >1000 | 0 (0) | 1054 (0.10) |
| Total | 182,995 | 1,009,974 |
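The percentages in parentheses are each bin's share of the column total. A short Python sketch, under the assumption that shares are rounded to two decimals, recomputes the AMR column; the helper name `percent_by_bin` is hypothetical, and the final row is left out because it contributes 0 graphs to the AMR column.

```python
def percent_by_bin(counts):
    """Return each bin's share of the total, in percent, rounded to two decimals."""
    total = sum(counts.values())
    return {bin_label: round(100 * n / total, 2) for bin_label, n in counts.items()}


# AMR graph counts per variant-count bin, copied from the table above.
amr_counts = {
    "1": 177185,
    "2": 4407,
    "3": 230,
    "4": 946,
    "5-10": 172,
    "11-100": 50,
    "101-1000": 5,
}

amr_total = sum(amr_counts.values())
amr_percent = percent_by_bin(amr_counts)
print(amr_total)
print(amr_percent)
```

Read this way, roughly 3% of AMR graphs yield more than one variant, against roughly 40% of BUSCO graphs.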
Species and number of read sets for the species assembled in the pathogen detection pipeline using SAUTE for antimicrobial resistance genes as of July 28, 2020
| Species | Number of read sets |
|---|---|
| Salmonella enterica | 278,133 |
| E. coli and Shigella | 89,600 |
| Campylobacter jejuni | 51,750 |
| Listeria monocytogenes | 32,124 |
| Klebsiella pneumoniae | 17,381 |
| Enterococcus faecium | 14,072 |
| Neisseria | 9308 |
| Pseudomonas aeruginosa | 4594 |
| Vibrio cholerae | 3556 |
| Acinetobacter baumannii | 3204 |
| Enterococcus faecalis | 3176 |
| Legionella pneumophila | 2848 |
| Clostridioides difficile | 1439 |
| Enterobacter | 1319 |
| Staphylococcus pseudintermedius | 1253 |
| Vibrio parahaemolyticus | 1170 |
| Candida auris | 744 |
| Serratia marcescens | 709 |
| Mycobacterium tuberculosis | 539 |
| Citrobacter freundii | 494 |
| Klebsiella oxytoca | 390 |
| Vibrio vulnificus | 365 |
| Providencia alcalifaciens | 253 |
| Clostridium perfringens | 223 |
| Cronobacter | 148 |
| Corynebacterium striatum | 98 |
| Clostridium botulinum | 95 |
| Aeromonas hydrophila | 26 |
| Morganella morganii | 20 |
| Elizabethkingia anophelis | 19 |
| Kluyvera intermedia | 1 |
| Total | 519,051 |