Holly B Bratcher, Craig Corton, Keith A Jolley, Julian Parkhill, Martin C J Maiden.
Abstract
BACKGROUND: Highly parallel, 'second generation' sequencing technologies have rapidly expanded the number of bacterial whole genome sequences available for study, permitting the emergence of the discipline of population genomics. Most of these data are publicly available as unassembled short-read sequence files that require extensive processing before they can be used for analysis. The provision of data in a uniform format, which can be easily assessed for quality, linked to provenance and phenotype and used for analysis, is therefore necessary.
Year: 2014 PMID: 25523208 PMCID: PMC4377854 DOI: 10.1186/1471-2164-15-1138
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Velvet assembly statistics of 108* genomes analyzed at 1605 core meningococcal loci
| Multiplex group (number of libraries) | Sequence read length | Most common k-mer | Average contig count | Average N50 | Average longest contig | Average genome length | Average number of core loci identified | Average number of incomplete loci found |
|---|---|---|---|---|---|---|---|---|
| A (12) | 54 | 39 | 374 | 18450 | 66014 | 2088067 | 1603 | 45 |
| B (12) | 54 | 35 | 407 | 16252 | 63032 | 2073517 | 1603 | 39 |
| C (12) | 54 | 39 | 339 | 20375 | 76751 | 2087187 | 1602 | 43 |
| D (12) | 54 | 41 | 322 | 22022 | 79870 | 2086109 | 1604 | 34 |
| E (12) | 54 | 37 | 396 | 15813 | 59228 | 2108812 | 1602 | 43 |
| F (11) | 54 | 41 | 345 | 18364 | 66689 | 2085872 | 1603 | 40 |
| G (12) | 54 | 39 | 369 | 16595 | 61439 | 2107242 | 1602 | 46 |
| H (12) | 54 | 37 | 329 | 20557 | 77125 | 2095768 | 1603 | 46 |
| I (11) | 54 | 39 | 386 | 13842 | 53816 | 2085635 | 1604 | 54 |
| J (7) | 54 | 41 | 403 | 22586 | 104539 | 2109587 | 1600 | 37 |
| K (7) | 76 | 63 | 295 | 31378 | 122718 | 2143841 | 1604 | 26 |
*The total number of genomes analysed is 120 and includes: 5 genomes sequenced a second time using 54 base reads (J) and 7 genomes sequenced a second time using 76 base reads (K).
Two genomes failed to sequence in their original groups (F and I); these genomes were subsequently rerun in group J.
The increase in read length used for multiplex group K produced larger than expected assembly improvements: the relatively small increase of 22 bases per read (from 54 to 76) gave a marked drop in the number of contigs and a corresponding increase in the N50 value. Further increases in read length should therefore continue to improve the coverage of long repeat regions and reduce the number of contigs per assembly.
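The contig count, N50 and longest contig reported in the table above can be derived from a list of contig lengths. The following is a minimal Python sketch of that calculation, not the pipeline used in the study; the function names `n50` and `summarise` and the toy contig lengths are assumptions made for illustration only.

```python
def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half the assembly length."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def summarise(contig_lengths):
    """Collect the per-assembly statistics that appear as table columns."""
    return {
        "contig_count": len(contig_lengths),
        "n50": n50(contig_lengths),
        "longest_contig": max(contig_lengths),
        "total_length": sum(contig_lengths),
    }


# Toy lengths invented for illustration; they are not data from the study.
fragmented = [100, 80, 60, 40, 20]
# The same 300 bases with the two longest contigs joined across a repeat.
joined = [180, 60, 40, 20]

print(summarise(fragmented))
print(summarise(joined))
```

In the toy example, joining two contigs leaves the total length unchanged while lowering the contig count and raising the N50, which is the pattern the table shows for group K relative to the 54 base read groups.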
Figure 1. Location of eMLST and antigen genes within the meningococcal genome. CGView map of the Neisseria meningitidis reference genome, FAM18, showing the placement of the conventional and extended MLST loci and the 3 antigen genes (4 typing fragments) used to assess sequence accuracy of the de novo high-throughput assembly method across the genome.
Comparison of Sanger derived MLST and AGST loci to their respective assembled genome
| Typing locus | Original Sanger derived allele | Illumina derived allele | Retested Sanger derived allele | Number of bases, likely cause of discrepancy |
|---|---|---|---|---|
| | 1 | 10 | 10 | 9, mislabelled |
| | 8 | 34 | 34 | 1, editing error |
| | 1 | 547 | 547 | 1, editing error |
| | 14 | 60 | 60 | 1, editing error |
| | 9 | 5 | 5 | 35, mislabelled |
| | 3 | 17 | 17 | 2, editing error |
| | 12 | 1 | 1 | 19, mislabelled |
| | 2 | 1 | 1 | 6, mislabelled |
| | 4 | 19 | 19 | 2, editing error |
| | 11 | 87 | 87 | 10, mislabelled |
| | 10 | 86 | 86 | 1, editing error |
| | 33 | 42 | 42 | 1, editing error |
| | 8 | 78 | 78 | 6, mislabelled |
| | 7 | 11 | 11 | 14, mislabelled |
| | 1 | 18 | 18 | 2, editing error |
| | 4 | 56 | 56 | 2, editing error |
| | 29 | 56 | 56 | 20, mislabelled |
| | 4 | 5 | 5 | 8, mislabelled |
| | 8 | 56 | 56 | 8, mislabelled |
| | 7 | 20 | 20 | 3, editing error |
| | 2 | 6 | 6 | 21, mislabelled |
| PorA VR1 | 7 | 18-1 | 18-1 | 30, mislabelled |
| PorA VR2 | 1-1 | 1-2 | 1-2 | 3, mislabelled |
| | 16 | 3 | 3 | 30, mislabelled |
| | 15 | 15-1 | 15-1 | 1, editing error |
| | 14-1 | 14 | 14 | 3, repeat sequence* |
| fHbp | 25 | 5 | 5 | 198, mislabelled |
| | 39 | 16 | 16 | 5, mislabelled |
| | 24 | 14 | 14 | 196, mislabelled |
| | 5 | 22 | 22 | 5, mislabelled |
| | 35 | 32 | 32 | 51, mislabelled |
*repeat sequence compression during assembly; this occurred in 4 different isolates.
Re-sequenced genome comparisons: sequence differences identified among four re-sequenced genomes and their respective finished sequences
| a. | | | Missing Sequence | | Failed assembly of repeat sequence tracts | | Paralogous cross identification | | Sequencing Discrepancy | |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1876 | 1693839 | 4 (0.2%) | 5408 | 2 (0.2%) | 3039 | 5 (0.3%) | 4737 | 12 (0.6%) | 32 |
| | 1914 | 1767562 | 9 (0.5%) | 21895 | 6 (0.3%) | 9070 | 8 (0.4%) | 6990 | 9 (0.5%) | 24 |
| | 1904 | 1718346 | 7 (0.4%) | 14484 | 10 (0.5%) | 13125 | 12 (0.6%) | 4974 | 25 (1.3%) | 90 |
| | 1975 | 1784201 | 8 (0.4%) | 16716 | 7 (0.4%) | 14458 | 17 (0.9%) | 4824 | 25 (1.3%) | 76 |
| | | | | | | | | | | |
| | 1872 (99.8%) | 1801 (96.2%) | 19 (1.0%) | 51 (2.7%) | | | | | | |
| | 1905 (99.5%) | 1775 (93.2%) | 23 (1.2%) | 107 (5.6%) | | | | | | |
| | 1897 (99.6%) | 1757 (92.6%) | 47 (2.5%) | 93 (4.9%) | | | | | | |
| | 1967 (99.2%) | 1821 (92.6%) | 49 (2.5%) | 97 (4.9%) | | | | | | |
§ For each CDS that had either a failed assembly or paralogous cross-identification error the entire CDS length was counted as affected.
Sequence differences were identified using the BIGSdb Genome Comparator tool. All transposase CDS were removed from the analysis. The Z2491 and FAM18 genomes were originally sequenced and finished using ABI373 and 377 or ABI 3700 methods in 2000 and 2007 respectively, and the H44/76 and G2136 genomes were originally sequenced and finished in 2011 using Roche 454 FLX and capillary-based sequencing.
Figure 2. KEGG functional and informational processing pathways identified in the meningococcal genome. Loci from the core gene list were used to search the pathway database KEGG for functional and informational pathways. A total of 537 of the 1605 core loci have assigned Enzyme Commission (EC) numbers. The figure shows the breakdown of the genes into three main groups and specific associated pathways: orange – genetic information processing (4 pathways, 20%), purple – metabolism (12 pathways, 73%), green – environmental information processing (2 pathways, 6%).
Figure 3. 128 representative genomes from the 20th and 21st Century. The relationships between meningococcal isolates are represented by two datasets in which (a.) 1605 core meningococcal loci (cgMLST) or (b.) 53 ribosomal protein genes (rMLST), a subset of the 1605 core loci, are used. In both trees major phylogenetic groups are noted A-D. For cross-compatible identification, and where 2 or more strains per lineage are present, the major MLST derived clonal complexes (cc) are identified by colour: red – ST-1 cc, purple – ST-5 cc, pink – ST-4 cc, teal – ST-37 cc, yellow – ST-11 cc, orange – ST-8 cc, green – ST-41/44 cc, blue – ST-32 cc, grey – ST-269 cc, olive – ST-18 cc. Capsular types other than A, B, or C are noted in parentheses, except for Lineage 11 isolates, which are labelled (cps B and cps C). Unlabelled nodes are undefined lineages and currently do not have a clonal complex association; a full list of lineage and associated clonal complex nomenclature can be found in Table 4.
Proposed Whole Genome Lineage Nomenclature
| WGS nomenclature | MLST nomenclature |
|---|---|
| Lineage 11 ^ | ST-11 cc |
| Lineage 3 ^ | ST-41/44 cc |
| Lineage 23 ^ | ST-23 cc |
| Lineage 1 | ST-1 cc |
| Lineage 2 | ST-269 cc |
| Lineage 4 | ST-4 cc |
| Lineage 5 | ST-32 cc |
| Lineage 6 | ST-60 cc |
| Lineage 7 | ST-750 cc |
| Lineage 8 | ST-8 cc |
| Lineage 9 | ST-92 cc |
| Lineage 10 | ST-5 cc |
| Lineage 12 | ST-53 cc |
| Lineage 13 | ST-213 cc |
| Lineage 14 | ST-174 cc |
| Lineage 15 | ST-1157 cc |
| Lineage 16 | ST-116 cc |
| Lineage 17 | ST-175 cc |
| Lineage 18 | ST-18 cc |
| Lineage 19 | ST-198 cc |
| Lineage 20 | ST-103 cc |
| Lineage 21 | ST-212 cc |
| Lineage 22 | ST-22 cc |
| Lineage 24 | ST-106 cc |
| Lineage 25 | ST-162 cc |
| Lineage 26 | ST-167 cc |
| Lineage 27 | ST-178 cc |
| Lineage 28 | ST-181 cc |
| Lineage 29 | ST-226 cc |
| Lineage 30 | ST-231 cc |
| Lineage 31 | ST-254 cc |
| Lineage 32 | ST-282 cc |
| Lineage 33 | ST-292 cc |
| Lineage 34 | ST-334 cc |
| Lineage 35 | ST-35 cc |
| Lineage 36 | ST-364 cc |
| Lineage 37 | ST-37 cc |
| Lineage 38 | ST-376 cc |
| Lineage 39 | ST-461 cc |
| Lineage 40 | ST-549 cc |
| Lineage 41 | ST-865 cc |
| Lineage 42 | ST-1117 cc |
| Lineage 43 | ST-1136 cc |
| Lineage 44 | ST-4821 cc |
| Lineage 45 | ST-4240/6688 cc |
^ Distinct sub-lineages present; proposal to use a decimal-based system (e.g. 11.1, 11.2) for defined sub-lineages.
To simplify and differentiate between MLST typing and whole genome based typing, we propose a lineage nomenclature in which each lineage is associated with a defined clonal complex (cc). The data include all PubMLST Neisseria database isolates.
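The mapping from clonal complex to lineage, together with the proposed decimal suffix for sub-lineages, can be expressed as a simple lookup. The sketch below is illustrative only: it copies a few rows from the table above, the names `CC_TO_LINEAGE`, `HAS_SUBLINEAGES` and `lineage_label` are hypothetical, and the sub-lineage number passed in is an arbitrary example rather than a defined sub-lineage.

```python
# A few rows copied from the nomenclature table; the full table has 45 lineages.
CC_TO_LINEAGE = {
    "ST-11 cc": 11,
    "ST-41/44 cc": 3,
    "ST-23 cc": 23,
    "ST-1 cc": 1,
    "ST-269 cc": 2,
    "ST-32 cc": 5,
}

# Lineages flagged with ^ in the table as containing distinct sub-lineages.
HAS_SUBLINEAGES = {11, 3, 23}


def lineage_label(clonal_complex, sub_lineage=None):
    """Return the whole genome lineage label for an MLST clonal complex.

    sub_lineage is an optional positive integer appended after a decimal
    point, following the proposed 11.1, 11.2 style.
    """
    number = CC_TO_LINEAGE[clonal_complex]
    if sub_lineage is None:
        return f"Lineage {number}"
    if number not in HAS_SUBLINEAGES:
        raise ValueError(f"Lineage {number} has no sub-lineages flagged in the table")
    if sub_lineage < 1:
        raise ValueError("sub_lineage must be a positive integer")
    return f"Lineage {number}.{sub_lineage}"


print(lineage_label("ST-32 cc"))
print(lineage_label("ST-11 cc", 2))
```

Keeping the clonal complex as the lookup key reflects the stated aim of the proposal, which is to keep whole genome lineages cross-compatible with existing MLST designations.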
Figure 4. Location of rMLST genes within the meningococcal genome.