Charles F Crane, Jill A Nemacheck, Subhashree Subramanyam, Christie E Williams, Stephen B Goodwin.
Abstract
Although finished genomes have become more common, there is still a need for assemblies of individual genes or chromosomal regions when only unassembled reads are available. slag (Seeded Local Assembly of Genes) fulfils this need by performing iterative local assembly based on cycles of matching-read retrieval with blast and assembly with cap3, phrap, spades, canu or unicycler. The target sequence can be nucleotide or protein. Read fragmentation allows slag to use phrap or cap3 to assemble long reads at lower coverage (e.g., 5×) than is possible with canu or unicycler. In simple, nonrepetitive genomes, a slag assembly can cover a whole chromosome, but in complex genomes the growth of target-matching contigs is limited as additional reads are consumed by consensus contigs consisting of repetitive elements. Apart from genomic complexity, contig length and correctness depend on read length and accuracy. With pyrosequencing or Illumina reads, slag-assembled contigs are accurate enough to allow design of PCR primers, while contigs assembled from Oxford Nanopore or pre-HiFi Pacific Biosciences long reads are generally only accurate enough to design baiting sequences for further targeted sequencing. In an application with real reads, slag successfully extended sequences for four wheat genes, which were verified by cloning and Sanger sequencing of overlapping amplicons. slag is a robust alternative to atram2 for local assemblies, especially for read sets with less than 20× coverage. slag is freely available at https://github.com/cfcrane/SLAG. Published 2022. This article is a U.S. Government work and is in the public domain in the USA. Molecular Ecology Resources published by John Wiley & Sons Ltd.
Keywords: bioinformatics/phyloinformatics; long reads; multiple alleles; pipeline; sequence assembly
Year: 2022 PMID: 34995394 PMCID: PMC9303413 DOI: 10.1111/1755-0998.13580
Source DB: PubMed Journal: Mol Ecol Resour ISSN: 1755-098X Impact factor: 8.678
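The cycle described in the abstract (retrieve reads that match the current contigs, assemble them, repeat) can be illustrated with a toy sketch. This is not the slag implementation: shared k-mers stand in for blast, a greedy exact-overlap join stands in for cap3, phrap, spades, canu or unicycler, and the names `K`, `retrieve`, `extend` and `local_assembly` are hypothetical. The toy genome is a de Bruijn sequence, chosen only so that every overlap is unambiguous.

```python
"""Toy seeded local assembly loop. Illustrative only; NOT the slag code."""

K = 6  # assumed k-mer size for the stand-in matcher (slag itself uses blast)


def de_bruijn(alphabet, n):
    """Standard de Bruijn construction: no n-mer repeats in the linear string."""
    k = len(alphabet)
    a = [0] * (k * n)
    out = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


def kmers(seq):
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}


def retrieve(contig, reads, used):
    """Stand-in for read retrieval: unused reads sharing a k-mer with the contig."""
    target = kmers(contig)
    return [i for i, r in enumerate(reads) if i not in used and kmers(r) & target]


def extend(contig, read):
    """Stand-in for an assembler: join one read by an exact end overlap of >= K bases."""
    for n in range(min(len(contig), len(read)), K - 1, -1):
        if contig.endswith(read[:n]):      # read hangs off the right end
            return contig + read[n:]
    for n in range(min(len(contig), len(read)), K - 1, -1):
        if contig.startswith(read[-n:]):   # read hangs off the left end
            return read[:-n] + contig
    return contig


def local_assembly(seed, reads, max_cycles=21):
    """Iterate retrieval and assembly; stop when a cycle consumes no reads."""
    contig, used, history = seed, set(), [len(seed)]
    for _ in range(max_cycles):
        consumed = False
        for i in retrieve(contig, reads, used):
            grown = extend(contig, reads[i])
            if grown != contig or reads[i] in contig:
                contig, consumed = grown, True
                used.add(i)
        if not consumed:
            break
        history.append(len(contig))
    return contig, history


genome = de_bruijn("ACGT", 4)                           # 256-base toy genome
reads = [genome[i:i + 30] for i in range(0, 227, 10)]   # error-free 30-base reads
contig, history = local_assembly(genome[105:125], reads)
print(history)  # contig length after each cycle
```

Because only reads that touch the current contig are retrieved in each cycle, the contig grows outward from the seed by roughly one read length per side per cycle, which is the kind of steady growth shown for a simple genome in Figure 1a. Real reads contain errors and repeats, which this sketch deliberately ignores.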
Wheat gene models used as target sequences for local assembly
| Gene model | e‐value | Accession no. | Description |
|---|---|---|---|
| TraesCS1A01G393300.1 | 6e‐21 | EMS62516.1 | Hypothetical protein TRIUR3_20468 |
| TraesCS1A01G393800.1 | 6e‐29 | EMS54359.1 | Chaperone protein DnaJ |
| TraesCS1A01G396600.1 | 9e‐24 | XP_020158285.1 | E3 SUMO‐protein ligase MMS21 |
| TraesCS1A01G396600.2 | 2e‐21 | XP_020158285.1 | E3 SUMO‐protein ligase MMS21 |
| TraesCS1A01G397600.1 | 7e‐55 | XP_020153608.1 | Zinc finger MYM‐type protein 1‐like |
| TraesCS1A01G398400.1 | 8e‐44 | XP_020169217.1 | ras‐related protein RABA1f‐like |
| TraesCS1A01G399600.1 | No hit | ||
| TraesCS1A01G402200.2 | 4e‐29 | EMS60683.1 | General transcription factor 3C polypeptide 2 |
| TraesCS1A01G403200.1 | 2e‐28 | XP_020147766.1 | FRIGIDA‐like protein 3 |
| TraesCS1A01G404500.1 | 9e‐26 | XP_020176384.1 | Phytepsin |
| TraesCS1A01G405600.1 | 6e‐45 | XP_020176342.1 | Sugar transporter ERD6‐like 4 |
| TraesCS1A01G407000.1 | 8e‐44 | VAH10327.1 | Unnamed protein product |
| TraesCS1A01G408800.1 | 2e‐44 | XP_020186556.1 | Short‐chain dehydrogenase/reductase 2b‐like |
| TraesCS1A01G410200.1 | No hit | ||
| TraesCS1A01G411500.2 | 2e‐36 | VAH11092.1 | Unnamed protein product |
| TraesCS1A01G411700.1 | 3e‐33 | XP_020174231.1 | Nuclear transcription factor Y subunit B‐4‐like |
| TraesCS1A01G415200.1 | 1e‐32 | KAE8800973.1 | Protein CHUP1, chloroplastic |
| TraesCS1A01G417300.1 | 2e‐43 | XP_020154488.1 | Peroxisomal membrane protein PEX14‐like isoform X2 |
| TraesCS1A01G419700.1 | 6e‐27 | XP_020178694.1 | Uncharacterized protein LOC109764261 isoform X2 |
| TraesCS1A01G421800.1 | No hit | ||
| TraesCS1A01G423100.1 | 1e‐25 | EMS51643.1 | Spastin |
| TraesCS1A01G423800.1 | 1e‐59 | EMS58218.1 | Late embryogenesis abundant protein Lea14‐A |
| TraesCS1A01G426400.1 | 5e‐39 | VAH11437.1 | Unnamed protein product |
| TraesCS1A01G430700.1 | No hit | ||
| TraesCS1A01G430700.3 | 6e‐43 | XP_020175200.1 | Trimethylguanosine synthase‐like isoform X2 |
| TraesCS1A01G431300.1 | 2e‐38 | XP_020187356.1 | Proteinase inhibitor PSI‐1.2‐like |
| TraesCS1A01G433600.1 | 3e‐50 | KAE8773310.1 | Disease resistance protein RGA2 |
| TraesCS1A01G437500.2 | No hit | ||
| TraesCS1A01G439200.1 | 1e‐26 | VAH11693.1 | Unnamed protein product |
| TraesCS1A01G441500.1 | 2e‐42 | VAH11737.1 | Unnamed protein product |
| TraesCS1A01G442400.2 | 8e‐51 | XP_020190989.1 | TATA‐binding protein‐associated factor BTAF1‐like |
| TraesCS1A01G444100.1 | 4e‐40 | XP_020157674.1 | Two‐component response regulator ORR42‐like |
| TraesCS1B01G397300.1 | 9e‐50 | VAH22346.1 | Unnamed protein product |
| TraesCS1B01G400200.1 | 3e‐30 | XP_020197381.1 | 65‐kDa microtubule‐associated protein 3‐like |
| TraesCS1B01G401300.1 | 1e‐49 | XP_020187904.1 | Oligopeptide transporter 7‐like isoform X4 |
| TraesCS1B01G402400.1 | No hit | ||
| TraesCS1B01G403300.1 | No hit | ||
| TraesCS1B01G406000.1 | 1e‐26 | YP_874698.1 | Ribosomal protein S15 (chloroplast) |
| TraesCS1B01G407300.1 | 3e‐27 | XP_020166758.1 | GDSL esterase/lipase At5g45910‐like |
| TraesCS1B01G413800.1 | 1e‐53 | KAE8794788.1 | Putative sodium/metabolite cotransporter BASS1, chloroplastic |
| TraesCS1B01G417500.1 | 5e‐39 | VAH22694.1 | Unnamed protein product |
| TraesCS1B01G421200.1 | 4e‐34 | AKJ77990.1 | Endosperm transfer cell specific PR60 precursor |
| TraesCS1B01G423900.1 | 8e‐30 | VAH22812.1 | Unnamed protein product |
| TraesCS1B01G439200.1 | 2e‐38 | XP_020170913.1 | Disease resistance protein RPP13‐like |
| TraesCS1B01G451600.1 | 3e‐43 | XP_020178665.1 | Putative receptor‐like protein kinase At4g00960 |
| TraesCS1B01G472200.1 | No hit | ||
| TraesCS1B01G473100.1 | 8e‐22 | VAH23875.1 | Unnamed protein product |
| TraesCS1B01G481000.1 | 2e‐44 | VAH23967.1 | Unnamed protein product |
| TraesCS1D01G379300.1 | No hit | ||
| TraesCS1D01G398900.1 | 9e‐28 | XP_020178387.1 | Tropinone reductase homolog At5g06060‐like |
The closest meaningful blast hit in GenBank nr is reported, if there is one. Otherwise the closest blast hit is reported, or no hit is reported if there was none at 1e‐20.
Target protein accessions from GenBank nr for local unicycler assemblies of simulated long reads derived from Zymoseptoria tritici
| Accession | Description |
|---|---|
| AAD23831.1 | NAD‐dependent formate dehydrogenase |
| AAD40111.1 | 3‐Isopropylmalate dehydrogenase |
| AAL30834.1 | Anaphase‐promoting complex protein |
| ABD92790.2 | Mitogen‐activated protein kinase |
| ABD94604.1 | Nonribosomal peptide synthetase |
| ACS91347.1 | Serine/threonine‐protein kinase |
| ADU79051.1 | DNA lyase |
| AKA94181.1 | Lanosterol 14‐alpha‐demethylase |
| ALP48286.1 | RNA polymerase II second largest subunit |
| ANQ91929.1 | Eburicol 14 alpha‐demethylase |
Ten groups of maize enzyme accessions used to target local assemblies in maize and wheat
| Activity | GenBank accession nos. |
|---|---|
| Cellulose synthase | NP_001104955.2, NP_001104956.2, NP_001104959.2, NP_001105236.2, NP_001105574.1, NP_001105672.1, NP_001292792.1 |
| Ferredoxin | NP_001104851.1, NP_001136908.1, NP_001150750.1, NP_001336742.1, XP_020394593.1, XP_020405634.1 |
| Hexokinase | NP_001123599.1, XP_008672065.1, XP_008674565.1, XP_008675068.1 |
| Histone deacetylase | NP_001104901.1, NP_001105402.2, XP_008673398.1, XP_008677775.1, XP_020396306.1 |
| Isocitrate dehydrogenase | AQK53344.1, AQK89292.1, AQK97039.1, AQK88693.1, NP_001295424.1, ONM16007.1, ONM58401.1 |
| Peptidylprolylisomerase | AQK62104.1, AQK70996.1, AQL06400.1, ONM03151.1, ONM04876.1, ONM54033.1 |
| Phosphoglucoisomerase | NP_001105368.1, XP_008651420.1 |
| Phosphoglucomutase | NP_001105405.1, NP_001105703.1, XP_008675355.1, XP_020395615.1 |
| Sucrose synthase | XP_008645119.1, XP_008679107.1, XP_020399433.1, XP_023156234.1 |
| Transaminase | NP_001149818.2, NP_001278682.1, XP_008645517.1, XP_008668890.1, XP_008672129.1 |
Primers for PCR validation of local assemblies
| Target | Forward | Reverse |
|---|---|---|
| Contig 28‐1 | CGCTTGCGTCTGTACTGTGTT | CGAAAGAACTCACGAAACACG |
| Contig 28‐2 | GCTAACTTGCACTTGTTCTCG | CATATGATAAAACCCACCTCG |
| Contig 28‐3 | CAAATGCATTAAATAGCGTGC | GGTCCTTGATGCTTGTGTTCT |
| Contig 16‐1 | AGCTGAATGATAAATGCGGTA | TGGTGAGGTAGCAGGAACTACT |
| Contig 16‐2 | TGCACTCCATTGATATTTCTCG | CCAAACCCAAAAGGAAAAGTC |
| Contig 21‐1 | TGCTAAGTGCGTACAAAAGGAA | AATTGGTGCAAGAACAAGTGAC |
| Contig 21‐2 | TTGCTATTTCTAGCCCCATCC | CTTGTGAAGCGTACACGAATG |
| Contig 24‐1 | CTCGGAAGTTTATGGTAACCG | CCACCACTCAAACAACCACTA |
| Contig 24‐2 | ATCCTTGGGTCAGGTTCTCAT | ACTTGAAGAAGCGTCAGCTCT |
| Contig 24‐3 | TTTGCCTGTTGAGATGCATAG | ACGGTTGTACTTCCTCCATCA |
| Contig 24‐4 | CCACTAGCGCAAATCCCTGTA | ACTGAAGGCAAGATGGGGTCT |
| Contig 24‐5 | CTCGGTATTTTCTTGGGATTTG | CTTTGACTGGCGGTATACGAG |
| Contig 24‐6 | CGGAGCTGTACAAGGAGAGAC | AGTGTCTATCCCGAAAGCAGA |
| Hfr‐1 genomic | ACACGCACACACACAATCCT | CAACACCCAGGCACGTACTA |
| Hfr‐1 promoter | TGGTGGTCTCCAAGGTGAAAGACTGA | TTAGCTAGGATTGTGTGTGTGCGTGTGT |
| Hfr‐1 CS copy1 | TCCAGAAAACCCCAGATGCT | CAACACCCAGGCACGTACTA |
| Hfr‐2 promoter | ACTGGCCTTCATGGCTGCCCAGATCCAA | CTCTCCTCGCTCCCTGCTTGCACGCTAC |
| CA666657 #1 | CCTCTCCCGAACAATGGAAGGATTGC | GGCACGGATCTTGATGCAGAATGGAT |
| CA666657 #2 | AAGGTTCATCAAAATCAATTTCGTTGTCG | CGGAGGATGGGATGCTCTCAATGACAA |
Forward and reverse PCR primers are listed 5′–3′.
Abbreviation: CS, Chinese Spring.
Nested Genomewalker primers are listed.
FIGURE 1. Examples of changes in contig length and count over cycles of read retrieval and assembly. Solid, dashed and dotted blue lines are respectively the longest length, mean length and count of target‐matching contigs. Solid, dashed and dotted black lines are respectively the longest length, mean length and count for all generated contigs. (a) Monotonically increasing contig length with cycle number in Zymoseptoria tritici. Simulated long reads were assembled with unicycler. A single contig was produced at each cycle. (b) Relatively monotonic contig growth in wheat for simulated short reads assembled with spades. (c) Stepwise contig growth of maize transaminase for actual short reads assembled with cap3. (d) Constant contig size for singly fragmented, simulated long reads in wheat assembled with phrap. (e) Irregular, generally increasing contig size for singly fragmented, simulated long reads in wheat assembled with phrap. (f) Irregular, generally level contig size for singly fragmented simulated long reads in wheat assembled with phrap.
FIGURE 2. More examples of changes in contig length and count over cycles of read retrieval and assembly. Line types and colours conform to Figure 1. (a) Irregular contig size, generally decreasing between cycles 5 and 18, for singly fragmented simulated long reads in wheat assembled with phrap. (b) Conspicuously peaked contig length at cycle 14 for singly fragmented simulated long reads in wheat assembled with phrap. (c) Alternating contig lengths for singly fragmented simulated long reads in wheat assembled with phrap. (d) Limit cycle with three values of contig length for singly fragmented simulated long reads in wheat assembled with phrap. (e) Brief alternation of contig lengths for cycles 10–14 from singly fragmented simulated long reads in wheat assembled with phrap. (f) A second instance of greatly fluctuating length and count of contigs similar to Figure 1f.
Means over all target sequences for identities and maximum contig lengths obtained from any cycle of local assembly
| Species | Assembler | ReadL | ReadC | Comp | NMat | MatContLen | NAll | AllContLen | Identities |
|---|---|---|---|---|---|---|---|---|---|
| | | 2 × 150 bp | 60× | 49/50 | 3.66 | 1619.10 ± 28.68 | 6.18 | 1619.10 ± 28.68 | 98.76 |
| | | 2 × 150 bp | 5× | 50/50 | 2.48 | 1112.18 ± 57.38 | 7.53 | 1113.52 ± 57.13 | 97.30 |
| | | 2 × | 5× | 50/50 | 3.82 | 946.16 ± 29.53 | 7.30 | 946.16 ± 29.53 | 98.56 |
| | | 600 bp | 20× | 50/50 | 2.66 | 4681.46 ± 213.62 | 3.81 | 4681.46 ± 213.62 | 99.43 |
| | | | 20× | 50/50 | 2.77 | 4901.34 ± 188.09 | 4.35 | 4901.34 ± 188.09 | 99.77 |
| | | 7–14 kb | 60× | 6/50 | 1.66 | 27,554.75 ± 6007.7 | 2.97 | 27,607.17 ± 6021.7 | 98.69 |
| | | 7–14 kb | 60× | 17/50 | 1.21 | 115,521.1 ± 6132.5 | 1.22 | 115,521.1 ± 6132.5 | 90.05 |
| | | 7–14 kb | 100× | 21/50 | 1.26 | 31,326.21 ± 4345.4 | 1.36 | 31,554.13 ± 4330.4 | 90.00 |
| | | 7–14 kbf1A | 5× | 30/50 | 4.12 | 1425.65 ± 71.89 | 68.62 | 1818.12 ± 68.06 | 90.20 |
| | | | 5× | 47/50 | 1.89 | 27,053.51 ± 1521.38 | 313.30 | 28,880.51 ± 1351.12 | 95.42 |
| | | 7–14 kbf2 | 5× | 49/50 | 8.37 | 7396.69 ± 266.05 | 454.75 | 9026.37 ± 295.26 | 90.07 |
| | | | 5× | 49/50 | 8.15 | 9920.18 ± 451.68 | 361.15 | 14,756.82 ± 804.90 | 90.00 |
| | | 2 × 150 bp | 60× | 19/22 | 3.55 | 1662.05 ± 149.41 | 5.71 | 1662.05 ± 149.41 | 98.29 |
| | | 7–14 kb | 60× | 10/10 | 1.01 | 139,855.4 ± 3057.9 | 1.02 | 139,855.4 ± 3057.9 | 90.80 |
| | | 7–85 kbf1 | 35× | 9/10 | 12.38 | 2054.60 ± 245.53 | 734.14 | 3065.60 ± 274.16 | 90.93 |
| | | | 35× | 6/10 | 5.86 | 69,011.70 ± 4275.95 | 1576.62 | 72,174.80 ± 4206.15 | 95.39 |
| | | 7–85 kb | 35× | 1/10 | 3.20 | 63,517.33 ± 5922.12 | 5.08 | 63,731.67 ± 5807.39 | 99.54 |
| | | 2 × 150 bp | 32× | 10/10 | 29.15 | 5260.60 ± 368.88 | 70.06 | 5260.60 ± 368.88 | 99.57 |
| | | 2 × 150 bp | 7× | 10/10 | 37.87 | 2602.00 ± 354.17 | 96.52 | 2602.00 ± 354.17 | 98.92 |
| | | 2 × 150 bp | 7× | 10/10 | 14.34 | 3294.10 ± 442.11 | 54.20 | 3294.10 ± 442.11 | 97.11 |
Key to column headings and codes: Species, Triticum aestivum, Zea mays and Zymoseptoria tritici; Assembler, self‐explanatory; ReadL, read length; ReadC, read coverage; Comp, number of founding query accessions that completed all 21 cycles; NMat, mean count of contigs that matched the founding query sequence per cycle over all completed cycles; MatContLen, mean length of longest contigs that matched the founding query sequence, ±SEM; NAll, mean count of all contigs per cycle over all completed cycles; AllContLen, mean length of all longest contigs, ±SEM, taken from the file with the longest contig that matched the founding query sequence; Identities, percentage identity of founding query sequence to the closest‐matching contig. Means were taken over all founding query accessions that completed 21 cycles. Symbol P in the Assembler column designates a protein query. Symbol f1 in the read‐length column designates single fragmentation of long reads to 600‐base fragments, while f2 designates double fragmentation of long reads to 490‐ and 610‐base fragments that were gathered together to create the illusion of doubled read coverage.
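The f1 and f2 codes in the key can be made concrete with a minimal sketch of read fragmentation. The fragment sizes (600 bases for f1; 490 and 610 bases pooled for f2) come from the key above. Everything else is an assumption for illustration: the function names are hypothetical, fragments are taken as consecutive and non-overlapping, and a trailing piece shorter than the fragment size is dropped.

```python
def fragment(read, size):
    """Cut a read into consecutive, non-overlapping chunks of `size` bases.

    Assumption: a trailing remainder shorter than `size` is discarded.
    """
    return [read[i:i + size] for i in range(0, len(read) - size + 1, size)]


def single_fragmentation(read, size=600):
    """f1: one pass of fixed-size fragments."""
    return fragment(read, size)


def double_fragmentation(read, sizes=(490, 610)):
    """f2: fragments of two different sizes from the same read, pooled together."""
    pooled = []
    for size in sizes:
        pooled.extend(fragment(read, size))
    return pooled


long_read = "ACGT" * 1750  # a 7000-base stand-in for one long read
f1 = single_fragmentation(long_read)
f2 = double_fragmentation(long_read)
print(len(f1), sum(map(len, f1)))  # fragment count and total bases for f1
print(len(f2), sum(map(len, f2)))  # fragment count and total bases for f2
```

Pooling two passes with different fragment sizes roughly doubles the number of bases handed to the assembler while staggering the fragment boundaries, which is the apparent doubling of coverage that the key describes for f2.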
Length and percentage nucleotide identity to the B73 genome assembly for longest contigs produced at any cycle of local assembly
| Group | N | SRcap3 | SRphrap | SRspades | LR1cap3 | LR1phrap | LRcanu |
|---|---|---|---|---|---|---|---|
| Cellulose synthase | 7 | 4341 | 5983 | 6934 | 2435 | 80,550 | 37,544 |
| | | 0.987 | 0.959 | 0.998 | 0.913 | 0.955 | 0.994 |
| Ferredoxin | 6 | 1899 | 2534 | 3917 | 1685 | 72,297 | 67,820 |
| | | 0.995 | 0.984 | 0.998 | 0.899 | 0.916 | 0.998 |
| Hexokinase | 4 | 1817 | 2040 | 6488 | 1425 | 51,448 | 73,014 |
| | | 0.987 | 0.970 | 0.996 | 0.912 | 0.961 | 0.998 |
| Histone deacetylase | 5 | 1950 | 2325 | 4574 | 3652 | 94,221 | 37,815 |
| | | 0.993 | 0.977 | 0.996 | 0.904 | 0.963 | 0.993 |
| Isocitrate dehydrogenase | 7 | 1670 | 2070 | 6430 | 2107 | 58,619 | 68,339 |
| | | 0.988 | 0.984 | 0.995 | 0.908 | 0.963 | 0.997 |
| Peptidylprolylisomerase | 6 | 1919 | 2179 | 4493 | 2157 | 70,564 | 85,620 |
| | | 0.987 | 0.957 | 0.993 | 0.915 | 0.955 | 0.999 |
| Phosphoglucoisomerase | 2 | 2029 | 4313 | 5932 | 1108 | 78,023 | — |
| | | 0.974 | 0.960 | 0.997 | 0.916 | 0.952 | — |
| Phosphoglucomutase | 4 | 4627 | 4567 | 4851 | 1656 | 52,436 | 85,954 |
| | | 0.997 | 0.993 | 0.994 | 0.904 | 0.955 | 0.996 |
| Sucrose synthase | 4 | 3519 | 4417 | 5471 | 2904 | 72,015 | 57,554 |
| | | 0.998 | 0.977 | 0.995 | 0.908 | 0.941 | 0.993 |
| Transaminase | 5 | 2249 | 2513 | 3516 | 1417 | 59,944 | 57,996 |
| | | 0.992 | 0.985 | 0.993 | 0.911 | 0.958 | 0.995 |
| Matching contigs | 50 | 373 | 140 | 285 | 129 | 85 | 26 |
| Nonmatching contigs | | 2 | 1 | 3 | 0 | 0 | 0 |
Key to fields: Group, enzyme activity or count of contigs that did or did not match the genome of B73; N, count of target protein sequences; SRcap3, cap3 assembly of separated 2 × 150‐bp reads; SRphrap, phrap assembly of separated 2 × 150‐bp reads; SRspades, spades assembly of 2 × 150‐bp reads; LR1cap3, cap3 assembly of PacBio reads singly fragmented to 600‐bp chunks; LR1phrap, phrap assembly of PacBio reads singly fragmented to 600‐bp chunks; LRcanu, canu assembly of intact PacBio reads. Read coverage and mean contig accuracy are given in Table 5. Key to rows: for each enzyme function, the top row is the length of the longest contig obtained with the field‐designated assembler, and the bottom row is the fraction of bases that match the B73 genome over all target‐matching contigs. The nonmatching contigs match bacterial variants of isocitrate dehydrogenase and phosphoglucomutase.
canu assembly failed for phosphoglucoisomerase.
Distribution of lengthening of longest contigs for 50 seeding sequences over 21 cycles of slag operating on simulated reads from Triticum aestivum
| Program | Read length | Read depth | 1.00–1.39 | 1.40–1.79 | 1.80–2.19 | 2.20–2.59 | 2.60–2.99 | 3.00+ |
|---|---|---|---|---|---|---|---|---|
| | 2 × 150 | 5× | 7 | 22 | 10 | 5 | 2 | 4 |
| | 2 × | 5× | 9 | 16 | 12 | 6 | 3 | 4 |
| | 2 × 150 | 60× | 0 | 32 | 10 | 1 | 2 | 4 |
| | 7–14kf | 5× | 29 | 4 | 1 | 0 | 0 | 0 |
| | | 5× | 3 | 6 | 6 | 9 | 5 | 18 |
| | 7–14kd | 5× | 20 | 10 | 8 | 7 | 4 | 0 |
| | | 5× | 42 | 5 | 2 | 0 | 0 | 0 |
The six numerical columns at right are counts of seeding sequences that produced a ratio of maximum contig length to initial contig length within the stated range. Code kf indicates single fragmentation, while kd indicates double fragmentation.
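The tally in this table can be sketched as a small binning routine. The bin labels are taken from the column headings. The remaining details are assumptions for illustration: the function name is hypothetical, the ratio is rounded to two decimals before binning so that the printed bin limits are exhaustive, and the contig lengths in the example are invented rather than taken from the study.

```python
import bisect
from collections import Counter

# Lower edges and labels copied from the column headings of the table.
BIN_EDGES = [1.00, 1.40, 1.80, 2.20, 2.60, 3.00]
BIN_LABELS = ["1.00–1.39", "1.40–1.79", "1.80–2.19",
              "2.20–2.59", "2.60–2.99", "3.00+"]


def lengthening_bin(initial_len, max_len):
    """Return the bin label for the ratio of maximum to initial contig length."""
    ratio = round(max_len / initial_len, 2)  # assumed two-decimal rounding
    if ratio < 1.0:
        raise ValueError("maximum length is shorter than initial length")
    return BIN_LABELS[bisect.bisect_right(BIN_EDGES, ratio) - 1]


# Invented (initial, maximum) longest-contig lengths for four seeding sequences.
example = [(1000, 1000), (1000, 1390), (1000, 1400), (800, 2500)]
counts = Counter(lengthening_bin(first, longest) for first, longest in example)
for label in BIN_LABELS:
    print(label, counts.get(label, 0))
```

A seeding sequence whose longest contig never grows lands in the first bin, so a row dominated by the 1.00–1.39 column indicates little extension beyond the initial contig.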
FIGURE 3. Detailed alignment of a phrap assembly of simulated pyrosequencing reads to the simulated genome from which the reads were sampled. Thirteen contigs represent nine simulated alleles of a 7‐kb locus. Polymorphic nucleotide (SNP) positions are colour‐coded by allele of origin. The bottom contig in black shows superimposed all variant positions, and a central, conserved region is evident by a paucity of SNPs, as intended. This central region exceeds any read in length, it is under‐represented in the assembly, and the assembled copies of it in contigs 10 and 11 are consensuses of two different alleles. Contig 13 is a consensus sequence of multiple alleles throughout. Contigs 1–8 represent a single left or right flank of an allele, and contigs 9–12 are flanks of different alleles joined in the central region. The depicted contig depth (9) equals the number of alleles (9), which was typical for alleles that differed this much in sequence.
FIGURE 4. Effect of short‐read polishing on contig length and accuracy in singly‐fragmented long reads of wheat assembled with cap3. Solid line is contig length, dashed line is blast‐hit length, and dotted line is percentage identity vs. the Chinese Spring genome. Means are presented for all target‐matching contigs from all 21 cycles of assembly.
Comparative contig and runtime statistics for slag, atram2 and srassembler
| Property | SLAGB73 | aTRAMB73 | SRAB73 | SLAGfStanley | aTRAMfStanley | SLAGhStanley | aTRAMhStanley | SLAGqStanley | aTRAMqStanley |
|---|---|---|---|---|---|---|---|---|---|
| Enzymes completed | 10 | 10 | 4 | 10 | 10 | 10 | 9 | 10 | 1 |
| Contig count | 25.50 | 57.3 | 4.00 | 44.00 | 101.9 | 36.30 | 48.33 | 28.70 | 131.00 |
| Contig/loci ratio | 0.73 | 0.96 | 0.46 | 1.34 | 1.99 | 1.10 | 0.94 | 0.87 | 2.55 |
| Mean length | 2043.21 | 1826.45 | 1862.10 | 1084.02 | 1295.19 | 1130.38 | 987.71 | 1335.64 | 1067.95 |
| Maximum length | 5260.60 | 5252.20 | 2858.25 | 3434.70 | 3552.0 | 3125.50 | 2758.11 | 3178.80 | 4622.00 |
| Percentage matched | 99.65 | 99.47 | 100.00 | 97.86 | 99.16 | 97.83 | 98.57 | 97.50 | 98.41 |
| Number of cycles | 21.00 | 15.9 | 6.00 | 21.00 | 11.50 | 21.00 | 4.89 | 21.00 | 1.00 |
| Seconds per cycle | 2406.02 | 1469.00 | 32271.97 | 677.34 | 590.77 | 166.87 | 1930.72 | 83.88 | 962.00 |
| Resident memory (bytes) | 1.386e+10 | 8.707e+09 | 1.199e+07 | 2.520e+10 | 3.578e+09 | 1.275e+10 | 2.280e+10 | 6.382e+09 | 2.720e+09 |
| Virtual memory (bytes) | 5.813e+10 | 1.846e+11 | 5.834e+08 | 7.314e+11 | 9.964e+09 | 2.983e+10 | 7.253e+10 | 7.210e+10 | 8.776e+09 |
| Maximum page faults | 29.50 | 151.70 | 0.00 | 30.70 | 214.67 | 26.40 | 130.67 | 30.60 | 271.00 |
Column headings B73, fStanley, hStanley and qStanley refer respectively to the read sets from maize “B73” and the full, half and quarter read sets from wheat “Stanley.” The enzymes are given in Table 3. All rows below the first are means over the number of runs given in the first row. Percentage match refers to contig identity with the B73 or Chinese Spring reference genomes. Memory is the maximum at any point during the run.
Calculation of expected loci and contig/loci ratio
| Program | slag | atram2 | srassembler | slag | atram2 | slag | atram2 | slag | atram2 |
|---|---|---|---|---|---|---|---|---|---|
| Read set | B73 | B73 | B73 | CS full | CS full | CS half | CS half | CS quarter | CS quarter |
| e‐value | 1e‐20 | 1e‐10 | 1e‐50 | 1e‐20 | 1e‐10 | 1e‐20 | 1e‐10 | 1e‐20 | 1e‐10 |
| Contig count | 25.50 | 57.3 | 4.00 | 44.00 | 101.9 | 36.30 | 48.33 | 28.70 | 131.00 |
| Loci in genome | 34.9 | 59.4 | 8.75 | 32.9 | 51.3 | 32.9 | 51.3 | 32.9 | 51.3 |
| Contig:loci ratio | 0.73 | 0.96 | 0.46 | 1.34 | 1.99 | 1.10 | 0.94 | 0.87 | 2.55 |
Contig count came from Table 8. Number of loci was estimated from the distribution of blastn hits in the appropriate genome, with a minimum of 10,000 bases between loci.
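The footnote's rule for counting loci, together with the ratio in the last row of the table, can be sketched as follows. Only the 10,000-base minimum separation and the two numbers used in the ratio example come from the table. The function name, the `(chromosome, start, end)` hit format, the choice to measure the gap from the end of the current locus, and the example coordinates are assumptions for illustration.

```python
from collections import defaultdict


def count_loci(hits, min_gap=10_000):
    """Count loci from blastn hit coordinates given as (chromosome, start, end).

    Assumption: hits on one chromosome belong to the same locus unless at least
    `min_gap` bases separate a hit from the end of the current locus.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in hits:
        by_chrom[chrom].append((min(start, end), max(start, end)))

    loci = 0
    for intervals in by_chrom.values():
        intervals.sort()
        locus_end = None
        for start, end in intervals:
            if locus_end is None or start - locus_end >= min_gap:
                loci += 1          # far enough away: start a new locus
                locus_end = end
            else:
                locus_end = max(locus_end, end)  # still within the current locus
    return loci


# Invented hit coordinates, for illustration only.
hits = [("chr1", 1000, 1500), ("chr1", 3000, 3400),
        ("chr1", 50_000, 50_500), ("chr2", 900, 200)]
print(count_loci(hits))  # two loci on chr1 plus one on chr2

# Contig:loci ratio for the first column of the table (25.50 contigs, 34.9 loci).
print(round(25.50 / 34.9, 2))
```

A ratio below 1 means fewer contigs were assembled than there are estimated loci, while a ratio well above 1 suggests that loci were split across several contigs.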
Mean cycle durations for slag and atram2 subdivided by enzyme and read set
| Read set | Enzyme | slag | atram2 | Ratio |
|---|---|---|---|---|
| Zea | Cellulose synthase | 3589.90 | 855.33 | 4.197 |
| Zea | Ferredoxin | 1609.24 | 3049.90 | 0.528 |
| Zea | Hexokinase | 1720.48 | 2023.14 | 0.850 |
| Zea | Histone deacetylase | 2494.33 | 2583.86 | 0.965 |
| Zea | Isocitrate dehydrogenase | 1827.24 | 344.33 | 5.307 |
| Zea | Peptidylprolyl isomerase | 4696.67 | 496.86 | 9.453 |
| Zea | Phosphoglucoisomerase | 1658.00 | 103.11 | 16.080 |
| Zea | Phosphoglucomutase | 1750.10 | 2292.52 | 0.763 |
| Zea | Sucrose synthase | 2657.57 | 1082.00 | 2.456 |
| Zea | Transaminase | 2055.24 | 122.00 | 16.846 |
| Full Stanley | Cellulose synthase | 1312.67 | 252.43 | 5.200 |
| Full Stanley | Ferredoxin | 223.05 | 793.50 | 0.281 |
| Full Stanley | Hexokinase | 414.52 | 308.25 | 1.345 |
| Full Stanley | Histone deacetylase | 389.29 | 230.57 | 1.688 |
| Full Stanley | Isocitrate dehydrogenase | 978.90 | 460.17 | 2.127 |
| Full Stanley | Peptidylprolyl isomerase | 582.81 | 205.48 | 2.836 |
| Full Stanley | Phosphoglucoisomerase | 346.67 | 1512.31 | 0.229 |
| Full Stanley | Phosphoglucomutase | 215.52 | 1836.75 | 0.117 |
| Full Stanley | Sucrose synthase | 1777.62 | 400.89 | 4.434 |
| Full Stanley | Transaminase | 531.05 | 7966.31 | 0.067 |
| hemiStanley | Cellulose synthase | 407.29 | 1458.00 | 0.279 |
| hemiStanley | Ferredoxin | 80.62 | 155.00 | 0.520 |
| hemiStanley | Hexokinase | 164.71 | 509.00 | 0.324 |
| hemiStanley | Histone deacetylase | 90.81 | 669.00 | 0.136 |
| hemiStanley | Isocitrate dehydrogenase | 196.10 | 638.00 | 0.307 |
| hemiStanley | Peptidylprolyl isomerase | 155.67 | — | — |
| hemiStanley | Phosphoglucoisomerase | 104.33 | 143.44 | 0.727 |
| hemiStanley | Phosphoglucomutase | 71.62 | 220.00 | 0.326 |
| hemiStanley | Sucrose synthase | 217.33 | 3820.62 | 0.057 |
| hemiStanley | Transaminase | 178.95 | 667.00 | 0.268 |
Duration included the initial alignment of protein queries to the nucleotide reads database.
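The Ratio column is consistent with the first duration divided by the second, which can be checked directly from the tabulated values. The sketch below recomputes it for three maize rows. The variable names are hypothetical, and reading the first duration column as slag and the second as atram2 is an inference from the agreement between the mean of the first Zea column and the slag seconds per cycle reported for B73 in Table 8.

```python
# (read set, enzyme): (first duration, second duration, tabulated ratio)
rows = {
    ("Zea", "Cellulose synthase"): (3589.90, 855.33, 4.197),
    ("Zea", "Ferredoxin"): (1609.24, 3049.90, 0.528),
    ("Zea", "Transaminase"): (2055.24, 122.00, 16.846),
}

for (read_set, enzyme), (first, second, tabulated) in rows.items():
    ratio = round(first / second, 3)
    slower = "first" if ratio > 1 else "second"
    print(f"{read_set} {enzyme}: ratio {ratio} (tabulated {tabulated}); "
          f"{slower} program had the longer mean cycle")
```

A ratio above 1 therefore marks an enzyme for which the first program took longer per cycle, and a ratio below 1 marks one for which the second program did.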
Ending status of atram2 runs
| Outcome | B73 | Full Stanley | Half‐Stanley | Quarter‐Stanley |
|---|---|---|---|---|
| 21 completed | 6 | 2 | 1 | 0 |
| No contigs updated | 3 | 1 | 0 | 0 |
| Assembly failed after first cycle | 1 | 0 | 8 | 1 |
| Assembly failed at first cycle | 0 | 0 | 1 | 9 |
| Database locked | 0 | 6 | 0 | 0 |
| Out of time | 0 | 1 | 0 | 0 |
The wall time limit for each run was 30 hr. atram2 stopped in four instances where contigs did not grow between cycles of read retrieval. In all but one instance of assembly failure, spades’s exit status was 21, which was apparently related to insufficient read depth for the diversity of reads.