The importance of genome sequence quality to microbial comparative genomics.

Abstract

The quality of microbial genome sequences has been a concern ever since the emergence of genome sequencing. The quality of a genome assembly depends on the sequencing technology used and on the aims for which the sequence was generated. Novel sequencing and bioinformatics technologies are not intrinsically better than the older technologies, although they are generally more efficient. This correspondence emphasizes the importance, for comparative genomics, of additional manual assembly efforts beyond autoassembly and of careful annotation.

Keywords: Annotations; Assembly; Genomes; Sequencing; Taxonomy

Year: 2019 PMID: 31429698 PMCID: PMC6701015 DOI: 10.1186/s12864-019-6014-5

Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969

Main article

In my recent research, I have on several occasions dealt with bacterial genome sequences of low quality, here defined as genome sequence assemblies that contain many contigs, possibly with obvious misassemblies and unresolved plasmid sequences. A major problem is that the quality of these genome sequences is not indicated in the relevant databanks or in the associated literature, even though basic methods for genome quality assessment are available [1-3]. As some of the low-quality genomes can be of potential interest, we may invest considerable time only to conclude that these genomes are not of much use to us. It is my opinion that this loss of time can be avoided by simple means.
New technologies are always met with skepticism. When I was working with 454 sequencing technology, homopolymers were already a major concern [4]. The same problem was later observed with reads from IonTorrent systems [5, 6]. Assembly of short reads from technologies such as Illumina often yielded assemblies with a large number of contigs. Genome assemblies from long reads generated by PacBio SMRT sequencing or, more recently, Oxford Nanopore MinION sequencing are often superior because of the low number of resulting contigs (often complete bacterial genomes), but there are still concerns regarding the high error frequencies and reliability [7-9]. Many of these problems can be resolved by spending some time with an assembly specialist, which improves the assembly quality remarkably.
The large number of contigs after assembly is one of the major problems observed with short-read sequencing technologies. A recent publication on the intraspecies taxonomy of the plant pathogen Pseudomonas syringae included genomes with up to 5099 contigs [10].
The quality of these genome sequences may be adequate for taxonomic analysis, where most parameters, such as average nucleotide identity (ANI) [11] or genome-to-genome distance calculation (GGDC) [12], do not depend on the integrity of annotations. However, for comparative genomics that searches for individual gene sequences, such fragmented genomes are not suitable. A back-of-the-envelope calculation shows why: with a mean genome size of around 6 Mb [10], the average contig in a genome sequence with 5000 contigs would be around 1.2 kb. With an average coding density of 85% and an average gene size of 1 kb for bacteria, there is at most one full gene per contig, and more often a contig carries two fragmented genes at its boundaries. This certainly limits the use of such an assembly.
It should be stated that a large number of contig gaps often cannot be resolved, but this depends on the genome. We recently sequenced two genomes of P. syringae using 2 × 300 base paired-end Illumina sequencing and obtained a large number of contigs (214 and 246, respectively) [13]. In these genomes, many of the contig breaks are caused by insertion sequence (IS) elements. As IS elements are typically around 1.2–1.5 kb, a shotgun library with 500 bp inserts is not suitable for positioning IS elements that are present in multiple copies in the same genome. For this reason, our research group now prefers to use PacBio sequencing at high coverage to improve the quality of genome assemblies from species that harbor a large number of IS elements [14, 15]. Still, manual inspection after sequencing was required to resolve some sequence problems. On the other hand, most genomes sequenced with Illumina technology can easily be improved in quality by some additional assembly steps (Fig. 1).
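The back-of-the-envelope calculation on contig size given above can be reproduced in a few lines. This is only an illustrative sketch: the function name genes_per_contig is hypothetical, and the inputs are the round figures quoted in the text (6 Mb genome, 5000 contigs, 85% coding density, 1 kb mean gene size).

```python
def genes_per_contig(genome_size_bp, n_contigs, coding_density, mean_gene_bp):
    """Return (mean contig length in bp, coding capacity in genes per contig)."""
    mean_contig_bp = genome_size_bp / n_contigs
    coding_bp = mean_contig_bp * coding_density
    return mean_contig_bp, coding_bp / mean_gene_bp


# Round figures taken from the text; all are approximations.
contig_bp, genes = genes_per_contig(
    genome_size_bp=6_000_000,
    n_contigs=5_000,
    coding_density=0.85,
    mean_gene_bp=1_000,
)
print(f"mean contig length: {contig_bp:.0f} bp")
print(f"coding capacity: {genes:.2f} genes per contig")
```

With these inputs the mean contig is 1.2 kb and holds the coding capacity of roughly one gene, which is an upper bound: genes are unlikely to be aligned with contig ends, so many contigs will carry only gene fragments.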
Within our research group, we commonly spend up to one week per genome to reduce the number of contigs in an Illumina assembly. After autoassembly, we first map the reads against the FastA file of the de novo assembly using SeqMan NGen (DNASTAR, Madison, WI, USA). This program has a special workflow that allows reads to be mapped over the borders of the contigs, which, with 2 × 300 base reads, often adds more than 200 bp on the left and right side of each contig. Manually checking the mapped reads in SeqMan Pro (DNASTAR) will uncover assembly errors caused by false joins at repeats, as these repeats will show a higher coverage on part of a contig than the average coverage. Such contigs may be split before the next step.
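The coverage check described above is performed manually in SeqMan Pro, but the underlying idea can be sketched programmatically. The example below is not the DNASTAR workflow. It assumes that read depth has already been summarized per window along a contig, uses the median depth as a robust stand-in for the average coverage, and applies an arbitrary, hypothetical threshold ratio of 1.8 to flag windows that may contain collapsed repeats.

```python
from statistics import median


def flag_high_coverage(window_depths, ratio=1.8):
    """Return indices of windows whose depth exceeds ratio x the median depth.

    window_depths: read depth per fixed-size window along one contig.
    ratio: hypothetical cutoff; it would need tuning for real data.
    """
    baseline = median(window_depths)
    return [i for i, depth in enumerate(window_depths) if depth > ratio * baseline]


# Invented depths: windows 4 and 5 have about twice the typical coverage,
# as expected when two repeat copies were collapsed into one region.
depths = [52, 48, 50, 51, 103, 99, 49, 50]
suspect = flag_high_coverage(depths)
print(suspect)  # [4, 5]
```

Flagged windows are only candidates for inspection. Whether a contig should actually be split remains a decision made by looking at the mapped reads.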

Fig. 1

Flow diagram for high quality genome assemblies as used in the author’s institution. To follow the process described in the text, the parts involved in step 1 and step 2 are shaded, whereas all other processes belong to step 3. Black arrows: follow-up processes; blue arrows: information flow; grey arrow: potential follow-up process.

The second step is to assemble all contigs from the resulting FastA file against each other in SeqMan. Here, several contigs may already be joined based on the additional sequence information, as overlaps are generated. Additionally, this process will eliminate many of the small contigs that are included inside other contigs. These are checked to confirm that they are validly included. When a reference genome of the same species is available, this sequence can also be used to map reads against, followed by combining mapped and de novo contigs in SeqMan. However, this may introduce other problems due to misassembled regions. Afterwards, the overlaps need to be checked carefully because, in the case of contig forks, contigs may be joined erroneously. Read mapping using SeqMan NGen followed by manual analysis of the mapped reads using SeqMan Pro can resolve this kind of issue. When a complete genome that is closely enough related, as determined by ANI [11] or GGDC [12], is available, the program MAUVE [16] can be used to sort all contigs against the reference genome [17]. Using the synteny between the genomes from BLASTN analyses, several gaps may be closed. Other contigs, potentially joined erroneously in the previous step, may have to be split again. The process has to be repeated several times to yield the FastA file of a final high quality draft genome assembly, as not all gaps can be resolved (e.g. rRNA operons). After annotation, information can be derived from the contigs that could lead to improved contig assembly, e.g., when a contig represents a plasmid.
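The elimination of small contigs that are included inside other contigs, mentioned for the second step, can be illustrated with a deliberately simplified sketch. It is not what SeqMan does internally: it only detects exact containment on either strand, whereas a real assembler tolerates mismatches and partial overlaps. The function names and toy sequences are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def contained_contigs(contigs):
    """Map each contig that occurs exactly inside a longer one to its container.

    contigs: dict of contig name -> sequence.
    Only exact matches (forward or reverse strand) are reported.
    """
    hits = {}
    by_length = sorted(contigs.items(), key=lambda item: len(item[1]), reverse=True)
    for i, (name, seq) in enumerate(by_length):
        rc = reverse_complement(seq)
        for other_name, other_seq in by_length[:i]:
            if seq in other_seq or rc in other_seq:
                hits[name] = other_name
                break
    return hits


# Toy data: c2 lies inside c1 on the forward strand, c3 on the reverse strand.
toy = {
    "c1": "ATGGCGTACGTTAGC",
    "c2": "GCGTACG",
    "c3": "GCTAACG",
    "c4": "TTTTTTT",
}
redundant = contained_contigs(toy)
print(redundant)  # {'c2': 'c1', 'c3': 'c1'}
```

As in the manual workflow, a reported containment is a candidate for removal and should still be checked, since a short contig can match a repeat copy rather than being truly redundant.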
The above-mentioned process often yields closure of plasmid sequences from draft genomes [18], and also routinely reduces the total number of contigs to under 50 per genome [19-21], with near-complete removal of small contigs. Because of a thorough quality check at every assembly step by repeated read mapping and visual checking (Fig. 1), we make sure not to reduce the number of contigs too aggressively by combining contigs that do not belong together [22, 23]. As the raw reads are generally available from databanks, the workflow (Fig. 1) would be possible for submitted genome sequences as well [24], but the effort is substantial and success is not guaranteed.
The problem with long-read technologies is not the number of contigs, but the quality of the individual read sequences. By using a sufficiently large number of reads, or additional reads from a short-read technology, the quality of the assembly can be improved significantly. However, if a genome is only used for taxonomic analysis, sequence errors resulting from lower coverage are not intrinsically detected. Unfortunately, such genomes will nevertheless appear in comparative studies, influencing their quality [25]. We recently retrieved the genome sequence, generated with MinION sequencing, of a bacterium described as “Kluyvera intestini” GT-16 [26]. This genome clustered closely with the genomes of two recently described novel species in the genus Phytobacter [27]. A simple test with ANI showed that strain GT-16 belongs to the species Phytobacter diazotrophicus (T.H.M. Smits and F. Rezzonico, unpublished). After analyzing the genome sequence with the comparative genomics program EDGAR [28, 29], together with several other genomes of Phytobacter and related genera, we noticed that inclusion of the GT-16 genome sequence led to a drastic drop in the number of core genes.
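Why a single error-rich genome can shrink the core genome may be easier to see with a toy model. The sketch below treats the core genome as the plain intersection of the gene families found intact in every genome. This is a simplification and not EDGAR's orthology-based method, and the genome names and gene family identifiers are invented for illustration.

```python
def core_genes(gene_sets):
    """Return the gene families present in every genome.

    gene_sets: dict of genome name -> set of intact gene family IDs.
    """
    sets = list(gene_sets.values())
    return set.intersection(*sets) if sets else set()


# Invented example: three well-assembled genomes share four gene families.
genomes = {
    "genome_A": {"g1", "g2", "g3", "g4"},
    "genome_B": {"g1", "g2", "g3", "g4", "g5"},
    "genome_C": {"g1", "g2", "g3", "g4"},
}
core_before = core_genes(genomes)

# A fourth genome in which g2 and g3 are disrupted (for example by
# frameshifts) no longer contributes them as intact genes.
genomes["error_rich"] = {"g1", "g4"}
core_after = core_genes(genomes)

print(len(core_before), len(core_after))  # 4 2
```

Because the core is an intersection, every gene that is broken in one genome is lost from the core of the whole set, regardless of how good the other assemblies are.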
Reannotation using Prokka [30] did not improve the situation, and the summary of the annotation indicated a large number of pseudogenes. An examination of the annotation showed that these pseudogenes were caused by frameshifts, presumably originating from sequencing errors in the reads used. Interestingly enough, the same authors had previously published a draft genome of the same strain based on Illumina reads [31]. Combining the data in a hybrid assembly approach would have yielded a high-quality genome [32, 33].
In my job as section editor, but also prior to this, I have encountered many manuscripts in which the authors described only the sequencing and automatic assembly of genomes, often prior to comparative genomics. I have identified many manuscripts that are based on such work, and I have rejected some of them due to a lack of basic genome information. Investing a little time in assembly and quality control can resolve assembly mistakes, yielding a lower number of contigs, and can allow identification and closure of plasmids. This little bit of extra time helps editors and reviewers to estimate the quality of the genomes used in a comparative genomic study, and it also helps the research community to use genome sequences more effectively for various purposes. Problems based on the quality of genome assemblies, as described in this correspondence, would then be minimized. In the end, the benefit from good quality genome assemblies in databanks [34, 35] is a win-win situation for all researchers in genomics.

34 in total

1. Preserving accuracy in GenBank.

Authors: M I Bidartondo
Journal: Science Date: 2008-03-21 Impact factor: 47.728

Review 2. Next-generation DNA sequencing methods.

Authors: Elaine R Mardis
Journal: Annu Rev Genomics Hum Genet Date: 2008 Impact factor: 8.929

3. Emended description of the genus Phytobacter, its type species Phytobacter diazotrophicus (Zhang 2008) and description of Phytobacter ursingii sp. nov.

Authors: Marcelo Pillonetto; Lavinia N Arend; Helisson Faoro; Helena R S D'Espindula; Jochen Blom; Theo H M Smits; Marcelo T Mira; Fabio Rezzonico
Journal: Int J Syst Evol Microbiol Date: 2017-11-10 Impact factor: 2.747

4. A novel plasmid pEA68 of Erwinia amylovora and the description of a new family of plasmids.

Authors: Emadeldeen Ismail; Jochen Blom; Alain Bultreys; Milan Ivanović; Aleksa Obradović; Joop van Doorn; Maria Bergsma-Vlami; Martine Maes; Anne Willems; Brion Duffy; Virginia O Stockwell; Theo H M Smits; Joanna Puławska
Journal: Arch Microbiol Date: 2014-09-02 Impact factor: 2.552

5. Clarification of Taxonomic Status within the Pseudomonas syringae Species Group Based on a Phylogenomic Analysis.

Authors: Margarita Gomila; Antonio Busquets; Magdalena Mulet; Elena García-Valdés; Jorge Lalucat
Journal: Front Microbiol Date: 2017-12-07 Impact factor: 5.640

6. Complete Genome Sequence of Pseudomonas viridiflava CFBP 1590, Isolated from Diseased Cherry in France.

Authors: Michela Ruinelli; Jochen Blom; Joël F Pothier
Journal: Genome Announc Date: 2017-07-27

7. Comparative genomics and pathogenicity potential of members of the Pseudomonas syringae species complex on Prunus spp.

Authors: Michela Ruinelli; Jochen Blom; Theo H M Smits; Joël F Pothier
Journal: BMC Genomics Date: 2019-03-05 Impact factor: 3.969

8. Reducing assembly complexity of microbial genomes with single-molecule sequencing.

Authors: Sergey Koren; Gregory P Harhay; Timothy P L Smith; James L Bono; Dayna M Harhay; Scott D Mcvey; Diana Radune; Nicholas H Bergman; Adam M Phillippy
Journal: Genome Biol Date: 2013 Impact factor: 13.583

Review 9. The long reads ahead: de novo genome assembly using the MinION.

Authors: Carlos de Lannoy; Dick de Ridder; Judith Risse
Journal: F1000Res Date: 2017-07-07

10. Complete Genome Sequence of Kluyvera intestini sp. nov., Isolated from the Stomach of a Patient with Gastric Cancer.

Authors: George Tetz; Maria Vecherkovskaya; Paul Zappile; Igor Dolgalev; Aristotelis Tsirigos; Adriana Heguy; Victor Tetz
Journal: Genome Announc Date: 2017-10-26
