Andrey O Kislyuk, Lee S Katz, Sonia Agrawal, Matthew S Hagen, Andrew B Conley, Pushkala Jayaraman, Viswateja Nelakuditi, Jay C Humphrey, Scott A Sammons, Dhwani Govil, Raydel D Mair, Kathleen M Tatti, Maria L Tondella, Brian H Harcourt, Leonard W Mayer, I King Jordan.
Abstract
MOTIVATION: New sequencing technologies have accelerated research on prokaryotic genomes and have made genome sequencing operations outside major genome sequencing centers routine. However, no off-the-shelf solution exists for the combined assembly, gene prediction, genome annotation and data presentation necessary to interpret sequencing data. The resulting requirement to invest significant resources into custom informatics support for genome sequencing projects remains a major impediment to the accessibility of high-throughput sequence data.
Year: 2010 PMID: 20519285 PMCID: PMC2905547 DOI: 10.1093/bioinformatics/btq284
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Table 1. Summary of sequencing projects used in the pipeline development
| Strain ID | Sequence type (a) | Serogroup (b) | Geographic origin (c) | Date collected | Genome size | Closest reference (d) | Substitutions per position versus ref. (e) | Total reads | Total bases sequenced | Average read length | Coverage (f) | Instrument standard (g) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NM13220 | ST-7 | A | Philippines | 2005 | 2.2M | Z2491 | 0.076 | 197 067 | 47 569 493 | 241 | 21× | GS-20 |
| NM10699 | ST-32 | B | Oregon, USA | 2003 | 2.2M | MC58 | 0.053 | 418 751 | 81 775 264 | 195 | 37× | GS-20 |
| NM15141 | ST-11 | C | New York, USA | 2006 | 2.2M | FAM18 | 0.028 | 378 773 | 94 288 660 | 249 | 42× | GS-20 |
| NM9261 | ST-11 | W135 | Burkina Faso | 2002 | 2.2M | FAM18 | 0.030 | 206 634 | 69 957 473 | 338 | 31× | GS Ti |
| NM18575 | ST-2859 | A | Burkina Faso | 2003 | 2.2M | Z2491 | 0.033 | 283 888 | 84 013 571 | 296 | 38× | GS Ti |
| NM5178 | ST-32 | B | Oregon, USA | 1998 | 2.2M | MC58 | 0.050 | 270 332 | 88 664 981 | 328 | 40× | GS Ti |
| NM15293 | ST-32 | B | Georgia, USA | 2006 | 2.2M | MC58 | 0.054 | 276 733 | 90 951 566 | 329 | 41× | GS Ti |
| BBE001 | N/A | N/A | Georgia, USA | 1956 | 5.3M | RB50 | 0.056 | 566 834 | 229 098 141 | 404 | 43× | GS Ti |
| BBF579 | N/A | N/A | Mississippi, USA | 2007 | 5.3M | RB50 | 0.104 | 533 099 | 228 467 710 | 429 | 43× | GS Ti |
Data for each strain are presented in rows.
(a) Sequence type denotes the allelic profile assigned by multilocus sequence typing (MLST; Holmes et al., 1999; Maiden et al., 1998) on the basis of seven loci within well-conserved housekeeping genes.
(b) Neisseria meningitidis isolates are divided into serogroups by immunochemistry of polysaccharides present in their antiphagocytic capsule.
(c) The region in which each strain was originally collected.
(d) Strain ID of the closest complete genome available in GenBank, as determined by 16S rRNA phylogeny as well as whole-genome sequence identity, which agreed in all cases.
(e) Insertions, deletions and substitutions per position of the genome as compared against the closest reference.
(f) Coverage denotes the average number of sequencing reads overlapping at a given position in the genome, calculated as the total number of bases sequenced divided by the estimated length of the genome.
(g) The standard of the 454 pyrosequencing instrument and reagents used to sequence the data.
(h) Sequence typing and serotyping were not performed on B. bronchiseptica (entries marked N/A).
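The coverage definition in the footnotes (total bases sequenced divided by estimated genome length) can be checked against a table row. The following is a minimal sketch, not part of the pipeline itself; the function names are hypothetical, and the genome size of 2.2M is taken as the rounded estimate shown in the table, so the result is approximate.

```python
def coverage(total_bases, genome_size):
    """Average coverage: total bases sequenced / estimated genome length."""
    return total_bases / genome_size


def mean_read_length(total_bases, total_reads):
    """Average read length: total bases sequenced / total reads."""
    return total_bases / total_reads


# Values from the NM13220 row; 2.2M is the rounded genome size estimate.
cov = coverage(47_569_493, 2_200_000)
avg = mean_read_length(47_569_493, 197_067)
print(f"coverage ~ {cov:.1f}x, mean read length ~ {avg:.0f} nt")
```

With these inputs the ratio is between 21 and 22, which is consistent with the 21× reported in the table if the published figures are truncated rather than rounded (an inference from the numbers, not something the source states).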
Fig. 1. Chart of data flow, major components and subsystems in the pipeline. Three subsystems are presented: genome assembly, feature prediction and functional annotation. Each subsystem consists of a top-level execution script managing the input, output, format conversion and combination of results for a number of components. A hierarchy of scripts and external programs then performs the tasks required to complete each stage. The legend for the flowchart indicates the identities of the distinct pipeline components: data, pipeline component, optional component, external component and external, optional component.
Fig. 2. Comparative analysis of draft assembly with MAUVE. The top pane represents the active assembly; vertical lines indicate contig boundaries (gaps). The reference genomes are arranged in subsequent panes in order of phylogenetic distance. Blocks of synteny (LCBs) are displayed in different colors (an inversion of a large block is visible between panes 1–2 and 3–5). Most gaps within LCBs were joined in the manually assisted assembly, while considering factors such as sequence conservation on contig flanks and presence of protein-coding regions.
Table 2. Summary of assembler performance
| Strain ID | Newbler: contigs >500 nt, total size | Newbler: N50 (a), longest contig | AMOScmp: contigs >500 nt, total size | AMOScmp: N50, longest contig | Automatic combined assembly: contigs >500 nt, total size | Automatic combined assembly: N50, longest contig | Manual combined assembly: contigs >500 nt, total size | Manual combined assembly: % gapfill, longest contig |
|---|---|---|---|---|---|---|---|---|
| NM13220 | 175 2.07M | 22K 106K | 202 2.06M | 21K 77K | 195 2.25M | 31K 107K | 57 2.30M | 1.8% 398K |
| NM10699 | 102 2.10M | 52K 143K | 116 2.10M | 43K 113K | 83 2.17M | 59K 143K | 40 2.18M | 1.1% 435K |
| NM15141 | 147 2.06M | 33K 171K | 190 2.05M | 22K 115K | 139 2.21M | 36K 171K | 50 2.28M | 2.0% 759K |
| NM9261 | 99 2.09M | 51K 184K | 133 2.07M | 37K 170K | 128 2.16M | 64K 231K | 27 2.21M | 1.6% 866K |
| NM18575 | 133 2.09M | 30K 172K | 147 2.09M | 29K 88K | 220 2.40M | 53K 231K | N/A | N/A |
| NM5178 | 89 2.13M | 56K 136K | 107 2.12M | 42K 131K | 104 2.17M | 59K 136K | N/A | N/A |
| NM15293 | 92 2.08M | 52K 144K | 110 2.06M | 42K 132K | 107 2.10M | 59K 144K | N/A | N/A |
| BBE001 | 146 5.05M | 70K 212K | 178 5.04M | 61K 173K | 214 5.03M | 80K 252K | N/A | N/A |
| BBF579 | 272 4.84M | 57K 88K | 321 4.84M | 46K 94K | 272 | 57K 88K | N/A | N/A |
Data for each strain are presented in rows. Statistics from standalone assemblers (Newbler and AMOScmp) are presented together with results of the combining protocol (default output of the pipeline) and an optional, manually assisted predictive gap closure protocol.
(a) N50 is a standard quality metric for genome assemblies that summarizes the length distribution of contigs. It represents the size N such that 50% of the genome is contained in contigs of size N or greater. Greater N50 values indicate higher quality assemblies.
(b) No improvement was detected from the combined assembly in strain BBF579, and the original Newbler assembly was automatically selected.
(c) The manual combined assembly protocol was not performed for the projects marked N/A.
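The N50 definition in the footnotes can be illustrated with a short computation. This is a minimal sketch under one stated assumption: the total assembly length stands in for the genome length, which is the usual convention when the true genome size is unknown. The function name `n50` is illustrative and not taken from the pipeline.

```python
def n50(contig_lengths):
    """Return the length N such that contigs of length >= N
    together contain at least half of the total assembly length."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2  # assumption: assembly length approximates genome length
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy example: total length is 32, so half is 16; the two longest
# contigs (8 + 8) already reach it, giving an N50 of 8.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Sorting in descending order and accumulating until half the total is reached mirrors the wording of the definition directly.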
Table 3. Prediction algorithm performance comparison and statistics
| Strain ID | Gene predictions by GeneMark | Gene predictions by Glimmer3 | Gene predictions by BLAST | ORFs with full consensus (a) | ORFs with partial consensus (b) | Total gene predictions reported (c) | tRNAs predicted by tRNAscan-SE |
|---|---|---|---|---|---|---|---|
| NM13220 | 2530 | 2725 | 1353 | 1325 | 974 | 2299 | 52 |
| NM10699 | 2366 | 2494 | 1317 | 1284 | 826 | 2110 | 51 |
| NM15141 | 2411 | 2578 | 1369 | 1343 | 841 | 2184 | 57 |
| NM9261 | 2370 | 2553 | 1341 | 1308 | 802 | 2110 | 51 |
| NM18575 | 2751 | 2927 | 1495 | 1448 | 1023 | 2471 | 63 |
| NM5178 | 2377 | 2510 | 1315 | 1281 | 816 | 2097 | 52 |
| NM15293 | 2062 | 2040 | 1285 | 1261 | 802 | 2063 | 51 |
| BBE001 | 4793 | 4793 | 2744 | 2732 | 2067 | 4799 | 48 |
| BBF579 | 4649 | 4646 | 2652 | 2635 | 2021 | 4656 | 48 |
Data for each strain are presented in rows. Prediction counts from the three standalone gene prediction methods are presented. Counts of protein-coding gene predictions reported by our algorithm and tRNA genes are also shown. Data presented are based on the automatic combined assemblies from Table 2.
(a) Number of ORFs with protein-coding gene predictions where all three predictors agreed exactly or with a slight difference in the predicted start site.
(b) ORFs where only two of the three predictors made a prediction.
(c) Total protein-coding gene predictions reported by the pipeline.
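The consensus categories in the footnotes can be sketched as a simple vote count per ORF. The code below is an illustration, not the pipeline's implementation: the input layout (a mapping from a hypothetical ORF identifier to the set of predictors that called it) is an assumption, and dropping ORFs called by a single predictor is inferred from the table, where the reported total equals full plus partial consensus in every row (for example, 1325 + 974 = 2299 for NM13220).

```python
def classify_orfs(calls):
    """Split ORFs into full consensus (three predictors) and
    partial consensus (two predictors).

    calls: dict mapping an ORF identifier to the set of predictor
    names that made a prediction for it (hypothetical layout).
    """
    full, partial = [], []
    for orf_id, predictors in calls.items():
        if len(predictors) == 3:
            full.append(orf_id)
        elif len(predictors) == 2:
            partial.append(orf_id)
        # ORFs supported by one predictor are not reported (assumption).
    return full, partial


calls = {
    "orf1": {"GeneMark", "Glimmer3", "BLAST"},
    "orf2": {"GeneMark", "Glimmer3"},
    "orf3": {"Glimmer3"},
}
full, partial = classify_orfs(calls)
reported = full + partial
print(len(full), len(partial), len(reported))
```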
Fig. 3. Schematic of the combining strategy for the prediction stage. The BLAST alignment start, which may not coincide exactly with a start codon, is pinned to the closest start codon. Then, a consensus or most upstream start is selected.
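The two steps described in the Fig. 3 caption can be sketched as follows. This is a simplified illustration rather than the pipeline's code. It assumes forward-strand, 0-based coordinates, a start codon set of ATG, GTG and TTG, a search restricted to the reading frame of the alignment start, and a consensus rule of "at least two predictors agree". It also ignores intervening stop codons. All function names are hypothetical.

```python
from collections import Counter

START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: common bacterial starts


def pin_to_start_codon(seq, aln_start):
    """Return the in-frame start codon position closest to aln_start,
    or None if the frame contains no start codon."""
    frame = aln_start % 3
    candidates = [
        i for i in range(frame, len(seq) - 2, 3)
        if seq[i:i + 3] in START_CODONS
    ]
    if not candidates:
        return None
    # Ties are broken in favour of the more upstream position.
    return min(candidates, key=lambda i: (abs(i - aln_start), i))


def choose_start(starts):
    """Pick the start shared by at least two predictors,
    otherwise the most upstream (smallest) coordinate."""
    position, votes = Counter(starts).most_common(1)[0]
    return position if votes >= 2 else min(starts)


seq = "ATGAAAGTGCCCAAATAA"
blast_start = pin_to_start_codon(seq, 9)   # alignment begins inside the ORF
final_start = choose_start([0, 6, blast_start])
print(blast_start, final_start)
```

Restricting the search to one reading frame keeps the pinned start compatible with the aligned protein; a real implementation would also need to handle the reverse strand and stop codons between the candidate start and the alignment.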
Fig. 4. Example functional annotation listing of a N. meningitidis gene in the Neisseria Base. Draft genome data are shown including gene location, prediction and annotation status, peptide statistics, BLAST hits, signal peptide properties, transmembrane helix presence, DNA and protein sequence. All names, locations, functional annotations and other fields are searchable, and gene data are accessible from GBrowse genome browser tracks.
Table 4. Feature annotation statistics
| Strain ID | Total number of CDS (a) | Signal peptides (b) | Transmembrane helices (c) | Conserved hypothetical proteins | Putative uncharacterized proteins | Functional assignment inferred from homology | Virulence factors (d) |
|---|---|---|---|---|---|---|---|
| NM13220 | 2299 | 326 (14.2%) | 184 (8.0%) | 10 (0.4%) | 708 (30.8%) | 603 (26.2%) | 36 (1.6%) |
| NM10699 | 2110 | 310 (14.7%) | 180 (8.5%) | 5 (0.2%) | 652 (30.9%) | 577 (27.3%) | 45 (2.1%) |
| NM15141 | 2184 | 317 (14.5%) | 173 (7.9%) | 16 (0.7%) | 590 (27.0%) | 583 (26.7%) | 50 (2.3%) |
| NM9261 | 2110 | 303 (14.4%) | 166 (7.9%) | 13 (0.6%) | 591 (28.0%) | 558 (26.4%) | 37 (1.8%) |
| NM18575 | 2471 | 349 (14.1%) | 193 (7.8%) | 13 (0.5%) | 725 (29.3%) | 668 (27.0%) | 48 (1.9%) |
| NM5178 | 2097 | 298 (14.2%) | 177 (8.4%) | 3 (0.1%) | 646 (30.8%) | 572 (27.3%) | 45 (2.1%) |
| NM15293 | 2063 | 304 (14.7%) | 168 (8.1%) | 6 (0.3%) | 613 (29.7%) | 567 (27.5%) | 47 (2.3%) |
| BBE001 | 4799 | 977 (20.4%) | 368 (7.7%) | 9 (0.2%) | 807 (16.8%) | 1184 (24.7%) | 54 (1.1%) |
| BBF579 | 4656 | 934 (20.1%) | 339 (7.3%) | 9 (0.2%) | 739 (15.9%) | 1171 (25.2%) | 45 (1.0%) |
Data for each strain are presented in rows. Data presented are based on the automatic combined assemblies from Table 2 and the gene predictions from Table 3.
(a) Total putative protein-coding sequences analyzed.
(b) As predicted by SignalP (Bendtsen et al., 2004); percentage of total CDS indicated in parentheses.
(c) As predicted by TMHMM (Krogh et al., 2001).
(d) As predicted by BLASTp alignment against VFDB (Chen et al., 2005; Yang et al., 2008); http://www.mgc.ac.cn/VFs/.