
gVolante for standardizing completeness assessment of genome and transcriptome assemblies.

Osamu Nishimura, Yuichiro Hara, Shigehiro Kuraku.

Abstract

MOTIVATION: With the increasing accessibility of comprehensive sequence information, such as whole genomes and transcriptomes, the demand for assessing its quality has multiplied. To this end, metrics based on sequence lengths, such as N50, have become a standard, but they evaluate only one aspect of assembly quality. In contrast, analyzing the coverage of pre-selected reference protein-coding genes provides essential content-based quality assessment, but the currently available pipelines for this purpose, CEGMA and BUSCO, do not have a user-friendly interface to serve as a uniform environment for assembly completeness assessment.
RESULTS: Here, we introduce a brand-new web server, gVolante, which provides an online tool for (i) on-demand completeness assessment of sequence sets by means of the previously developed pipelines CEGMA and BUSCO and (ii) browsing pre-computed completeness scores for publicly available data in its database section. Completeness assessments performed on gVolante report scores based on not just the coverage of reference genes but also on sequence lengths (e.g. N50 scaffold length), allowing quality control in multiple aspects. Using gVolante, one can compare the quality of original assemblies between their multiple versions (obtained through program choice and parameter tweaking, for example) and evaluate them in comparison to the scores of public resources found in the database section.
AVAILABILITY AND IMPLEMENTATION: gVolante is freely available at https://gvolante.riken.jp/.
CONTACT: shigehiro.kuraku@riken.jp.
© The Author 2017. Published by Oxford University Press.

Year:  2017        PMID: 29036533      PMCID: PMC5870689          DOI: 10.1093/bioinformatics/btx445

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

As comprehensive sequence information becomes more accessible, the demand for assessing its continuity and completeness has multiplied. Most of the comprehensive sequence information now emerging is prepared by de novo sequence assembly. Products of genome and transcriptome assemblies are often not thoroughly assessed because assembly programs are time-consuming to run, and it is not straightforward to assess them by a uniform criterion. Metrics based on sequence lengths, such as N50, have become a standard for assessing de novo assemblies, but they can overestimate completeness when sequences are overassembled, and they cannot evaluate compositional aspects. The program pipeline CEGMA (Parra ) was recommended by the Assemblathon 2 project (Bradnam ) for completeness assessment of genome assemblies based on the coverage of pre-selected reference protein-coding genes. More recently, the program pipeline BUSCO was introduced for the same purpose (Simao ). These pipelines are increasingly used, but they can be executed only from the Unix command line and do not have a user-friendly interface.

2 Implementation

CEGMA ver. 2.5 and BUSCO ver. 1.22 and ver. 2.0.1 (equivalent to v3.0.1, as the latest modification was only a refactoring), together with all the programs required by these pipelines, were built on a Linux server that provides a web interface with Secure Sockets Layer (SSL) encryption. The user information and the file uploaded by the user are used only for completeness assessment and are erased after a minimal time to ensure anonymity. The web server also hosts user instruction pages, including a step-by-step manual. Base compositions and length-based metrics are reported, based on the script assemblathon_stats.pl, available at https://github.com/ucdavis-bioinformatics/assemblathon2-analysis/.
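
The kind of base-composition report mentioned above can be illustrated with a minimal Python sketch. It is not a port of assemblathon_stats.pl: the function names `base_composition` and `gc_content` are hypothetical, and computing GC content over A, C, G and T only (ignoring N and other symbols) is an assumption made for this example.

```python
from collections import Counter


def base_composition(sequences):
    """Return the fraction of each symbol (A, C, G, T, N, ...) over all sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {base: n / total for base, n in counts.items()}


def gc_content(sequences):
    """Return the G+C fraction among unambiguous bases (assumption: N is ignored)."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    gc = counts["G"] + counts["C"]
    acgt = gc + counts["A"] + counts["T"]
    return gc / acgt if acgt else 0.0


if __name__ == "__main__":
    scaffolds = ["GGCC", "AATT", "NNNN"]  # toy sequences, not real data
    print(base_composition(scaffolds))
    print(gc_content(scaffolds))
```

Reporting the fraction of N alongside GC content is useful because gap-filled scaffolds can look long by length-based metrics while containing little actual sequence.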

3 Functions

3.1 Executing an assessment

gVolante is a web server designed to make the best use of the reference gene set CVG, previously introduced for more accurate completeness assessment for vertebrates (Hara ). It enables convenient analysis execution without command-line operation and allows standardized scoring of completeness in a uniform computational environment. In addition, an analysis on gVolante gives a concise report of length-based metrics and base compositions (Fig. 1).
Fig. 1. Functions of gVolante. The web server provides two functions, ‘Analysis’ in the upper row and ‘Database’ in the lower row. Using gVolante, one can compare the quality of original assemblies and evaluate them in comparison to the scores of public resources found in the database section, for content-based decision-making for more comprehensive downstream analyses.

In the ‘Analysis’ page, the user is guided to first upload a sequence file and enter an arbitrary project name and an email address. The file to be uploaded is expected to be, but does not have to be, a de novo assembly product of no more than 10 GB. Compressed files are also accepted. Additional inputs include a cut-off length for computing N50 lengths. Because typical de novo assembly products contain a number of short sequences, e.g. shorter than 500 bp, the length cut-off can have a large impact on N50 values and should thus be set deliberately.
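
To illustrate why the cut-off matters, the following minimal sketch computes N50 with and without a length cut-off. The function name `n50`, the parameter `min_length`, and the convention that sequences shorter than the cut-off are excluded before the total length is summed are assumptions made for illustration, not a description of how gVolante computes the value.

```python
def n50(lengths, min_length=0):
    """Return the N50 of the sequence lengths that are at least min_length.

    N50 is the length L such that sequences of length >= L together
    account for at least half of the total length of the kept sequences.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    running = 0
    for n in kept:
        running += n
        if running * 2 >= total:
            return n
    return 0  # no sequences passed the cut-off


if __name__ == "__main__":
    # Toy assembly: two long scaffolds plus forty 100-bp fragments.
    lengths = [5000, 3000] + [100] * 40
    print(n50(lengths))                  # 3000
    print(n50(lengths, min_length=500))  # 5000
```

In this toy case the short fragments make up a third of the total length, so discarding them raises the N50 from 3000 to 5000 even though the two long scaffolds are unchanged.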

3.2 Gene search pipeline: CEGMA or BUSCO

Only BUSCO is designed to assess transcriptome assemblies and peptide sequences; such assessments are usually completed within an hour. For genome assemblies, however, the search pipeline needs to be chosen carefully. An assessment of a genome assembly with CEGMA or BUSCO can take longer than a day. When CEGMA is selected, the user needs to choose a taxonomic property of the species of interest (mammal, vertebrate or other). This specifies the maximum lengths of intronic and flanking sequences of the candidate genic regions in the CEGMA pipeline (Parra ).

3.3 Choice of a reference gene set

Completeness assessment should be performed with careful consideration of the compatibility of a reference gene set with the taxonomic position of the species of interest. gVolante provides a choice between the reference gene set ‘CEG’ associated with CEGMA, our original gene set for vertebrates ‘CVG’ (Hara ), and some of the gene sets provided with BUSCO; the last option applies only when BUSCO is chosen as the gene search pipeline. CVG is designed to prevent (i) overestimation of assembly completeness, which can be caused by an expanded gene repertoire owing to whole genome duplication in the vertebrate lineage, and (ii) underestimation caused by confusing lineage-specific gene loss with genes missing from an assembly (Hara ).
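
The constraints described in Sections 3.2 and 3.3 can be summarized as a small consistency check. The sketch below is only an illustration of those stated rules; the class `AnalysisRequest`, its field names and the function `validate` are hypothetical and do not reflect gVolante's actual implementation. Combinations that the text does not address are deliberately left unchecked.

```python
from dataclasses import dataclass
from typing import List, Optional

# Taxonomic properties offered when CEGMA is selected (from Section 3.2).
CEGMA_TAXA = {"mammal", "vertebrate", "other"}


@dataclass
class AnalysisRequest:
    sequence_type: str                 # "genome", "transcriptome" or "peptide"
    pipeline: str                      # "CEGMA" or "BUSCO"
    gene_set: str                      # e.g. "CEG", "CVG" or a BUSCO-provided set
    gene_set_from_busco: bool = False  # True for gene sets provided with BUSCO
    cegma_taxon: Optional[str] = None  # needed only when pipeline == "CEGMA"


def validate(request: AnalysisRequest) -> List[str]:
    """Return a list of problems; an empty list means no rule was violated."""
    problems = []
    if request.pipeline not in ("CEGMA", "BUSCO"):
        problems.append(f"unknown pipeline: {request.pipeline}")
    if request.sequence_type != "genome" and request.pipeline != "BUSCO":
        problems.append(
            "only BUSCO assesses transcriptome assemblies and peptide sequences"
        )
    if request.pipeline == "CEGMA" and request.cegma_taxon not in CEGMA_TAXA:
        problems.append("CEGMA needs a taxon: mammal, vertebrate or other")
    if request.gene_set_from_busco and request.pipeline != "BUSCO":
        problems.append("BUSCO-provided gene sets require the BUSCO pipeline")
    return problems


if __name__ == "__main__":
    ok = AnalysisRequest("genome", "CEGMA", "CVG", cegma_taxon="vertebrate")
    bad = AnalysisRequest("transcriptome", "CEGMA", "CVG", cegma_taxon="vertebrate")
    print(validate(ok))
    print(validate(bad))
```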

3.4 Browsing pre-computed completeness scores

We uniformly employed CEGMA and the reference gene set CVG, which showed more accurate assessment than other configurations (Hara ), and executed completeness assessments on selected public sequence resources. This combination was necessitated by the vast range of targets covering the whole taxon Vertebrata. The resources consisted of 73 genome and 17 transcriptome nucleotide sequence sets of diverse vertebrates, including cyclostomes and cartilaginous fishes. The obtained completeness scores are tabulated in the ‘Database’ page on the web server. One can click on the row of an individual analysis result to view it in detail (Fig. 1) and further explore which CVG components were identified or missing in the assessment. Users who need to carefully verify the presence or absence of a particular CVG component are guided to perform molecular phylogeny inference using the aLeaves-MAFFT online suite (Kuraku ).
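
The idea of a content-based completeness score, and of listing which reference genes were identified or missing, can be sketched as follows. This is a deliberately simplified illustration: the function name `completeness` and the gene identifiers are hypothetical, and the actual pipelines report finer categories than detected versus missing (CEGMA distinguishes complete from partial matches, and BUSCO reports complete, duplicated, fragmented and missing genes).

```python
def completeness(reference_genes, detected_genes):
    """Return (percentage of reference genes detected, sorted list of missing genes)."""
    reference = set(reference_genes)
    found = reference & set(detected_genes)  # ignore hits outside the reference set
    missing = sorted(reference - found)
    percent = 100.0 * len(found) / len(reference) if reference else 0.0
    return percent, missing


if __name__ == "__main__":
    # Hypothetical identifiers; a real reference set would hold its own gene IDs.
    reference = ["gene1", "gene2", "gene3", "gene4"]
    assembly_a = ["gene1", "gene3"]
    assembly_b = ["gene1", "gene2", "gene3"]
    for name, hits in (("assembly_a", assembly_a), ("assembly_b", assembly_b)):
        score, missing = completeness(reference, hits)
        print(f"{name}: {score:.1f}% complete, missing: {missing}")
```

Scoring several versions of an assembly against the same reference set, as in the loop above, is what makes the resulting percentages directly comparable.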
References

1.  CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes.

Authors:  Genis Parra; Keith Bradnam; Ian Korf
Journal:  Bioinformatics       Date:  2007-03-01       Impact factor: 6.937

2.  BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs.

Authors:  Felipe A Simão; Robert M Waterhouse; Panagiotis Ioannidis; Evgenia V Kriventseva; Evgeny M Zdobnov
Journal:  Bioinformatics       Date:  2015-06-09       Impact factor: 6.937

3.  Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species.

Authors:  Keith R Bradnam; Joseph N Fass; Anton Alexandrov; Paul Baranay; Michael Bechner; Inanç Birol; Sébastien Boisvert; Jarrod A Chapman; Guillaume Chapuis; Rayan Chikhi; Hamidreza Chitsaz; Wen-Chi Chou; Jacques Corbeil; Cristian Del Fabbro; T Roderick Docking; Richard Durbin; Dent Earl; Scott Emrich; Pavel Fedotov; Nuno A Fonseca; Ganeshkumar Ganapathy; Richard A Gibbs; Sante Gnerre; Elénie Godzaridis; Steve Goldstein; Matthias Haimel; Giles Hall; David Haussler; Joseph B Hiatt; Isaac Y Ho; Jason Howard; Martin Hunt; Shaun D Jackman; David B Jaffe; Erich D Jarvis; Huaiyang Jiang; Sergey Kazakov; Paul J Kersey; Jacob O Kitzman; James R Knight; Sergey Koren; Tak-Wah Lam; Dominique Lavenier; François Laviolette; Yingrui Li; Zhenyu Li; Binghang Liu; Yue Liu; Ruibang Luo; Iain Maccallum; Matthew D Macmanes; Nicolas Maillet; Sergey Melnikov; Delphine Naquin; Zemin Ning; Thomas D Otto; Benedict Paten; Octávio S Paulo; Adam M Phillippy; Francisco Pina-Martins; Michael Place; Dariusz Przybylski; Xiang Qin; Carson Qu; Filipe J Ribeiro; Stephen Richards; Daniel S Rokhsar; J Graham Ruby; Simone Scalabrin; Michael C Schatz; David C Schwartz; Alexey Sergushichev; Ted Sharpe; Timothy I Shaw; Jay Shendure; Yujian Shi; Jared T Simpson; Henry Song; Fedor Tsarev; Francesco Vezzi; Riccardo Vicedomini; Bruno M Vieira; Jun Wang; Kim C Worley; Shuangye Yin; Siu-Ming Yiu; Jianying Yuan; Guojie Zhang; Hao Zhang; Shiguo Zhou; Ian F Korf
Journal:  Gigascience       Date:  2013-07-22       Impact factor: 6.524

4.  Assessing the gene space in draft genomes.

Authors:  Genis Parra; Keith Bradnam; Zemin Ning; Thomas Keane; Ian Korf
Journal:  Nucleic Acids Res       Date:  2008-11-28       Impact factor: 16.971

5.  aLeaves facilitates on-demand exploration of metazoan gene family trees on MAFFT sequence alignment server with enhanced interactivity.

Authors:  Shigehiro Kuraku; Christian M Zmasek; Osamu Nishimura; Kazutaka Katoh
Journal:  Nucleic Acids Res       Date:  2013-05-15       Impact factor: 16.971

6.  Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation.

Authors:  Yuichiro Hara; Kaori Tatsumi; Michio Yoshida; Eriko Kajikawa; Hiroshi Kiyonari; Shigehiro Kuraku
Journal:  BMC Genomics       Date:  2015-11-18       Impact factor: 3.969
