Hyungtaek Jung, Tomer Ventura, J Sook Chung, Woo-Jin Kim, Bo-Hye Nam, Hee Jeong Kong, Young-Ok Kim, Min-Seung Jeon, Seong-Il Eyun.
Abstract
Eukaryotic genome sequencing and de novo assembly, once the exclusive domain of well-funded international consortia, have become increasingly affordable, thus fitting the budgets of individual research groups. Third-generation long-read DNA sequencing technologies are increasingly used, providing extensive genomic toolkits that were once reserved for a few select model organisms. Generating high-quality genome assemblies and annotations for many aquatic species still presents significant challenges due to their large genome sizes, complexity, and high chromosome numbers. Indeed, selecting the most appropriate sequencing and software platforms and annotation pipelines for a new genome project can be daunting because tools often only work in limited contexts. In genomics, a high-quality genome assembly and annotation has become an indispensable tool for better understanding the biology of any species. Herein, we present 12 steps to help researchers get started in genome projects, with guidelines that are broadly applicable (to any species), sustainable over time, and cover all aspects of genome assembly and annotation projects from start to finish. We review some commonly used approaches, including practical methods to extract high-quality DNA and choices for the best sequencing platforms and library preparations. In addition, we discuss the range of potential bioinformatics pipelines, including structural and functional annotations (e.g., transposable elements and repetitive sequences). This paper also includes information on how to build a wide community for a genome project, the importance of data management, and how to make the data and results Findable, Accessible, Interoperable, and Reusable (FAIR) by submitting them to a public repository and sharing them with the research community.
Year: 2020 PMID: 33180771 PMCID: PMC7660529 DOI: 10.1371/journal.pcbi.1008325
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
Summary of recently published chromosome-level genome assemblies in aquaculture species using long-read sequences.
| Scientific name | GS (Gb) | Final AGS (Gb) | Final FSN | Final N50 (Mb) | IM depth (×) | PacBio depth (×) | ONT depth (×) | 10xGC depth (×) | Hi-C depth (×) | BAs | Reference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Fish | |||||||||||
| 0.83/DP | 0.88 | 24 | 1.1 | 63 | 109 | 233 | Sex determination genes and chromosomes | [ | |||
| 0.81/DP | 0.79 | 26 | 29.85 | 50 | 76 | 20 | Chromosome rearrangement and spawning time | [ | |||
| 1.11/DP | 1.14 | 24 | 46.03 | 49 | 96 | 100 | Chromosome-level reference genome | [ | |||
| 1.07/DP | 1.09 | 24 | 46.2 | 134 | 0.2 | Innate immunity and growth | [ | ||||
| 0.87/DP | 0.81 | 24 | 3.85 | 132 | 66 | 189 | Maternal reproductive system | [ | |||
| 0.65/DP | 0.67 | 24 | 22.34 | 321 | 109 | Chromosome-level reference genome | [ | ||||
| 0.78/DP | 0.77 | 24 | 33.5 | 116 | 80 | 118 | Chromosome-level reference genome | [ | |||
| 0.72/DP | 0.73 | 26 | 25.8 | 70 | 53 | 200 | Chromosome-level reference genome | [ | |||
| Shellfish | |||||||||||
| 1.33/DP | 1.22 | 19 | 65.93 | 148 | 148 | 136 | 123 | 154 | Chromosome-level reference genome | [ | |
This table represents a selection of recent aquaculture genome studies, published since 2018, that focus on whole-genome assemblies using BioNano and/or Hi-C data (at least 1 of the 2 technologies). The table does not include pure TGS, SGS, or hybrid genome assemblies without BioNano/Hi-C data, single-cell sequencing, or transcriptomes. If the original report gave no estimated input depth, the depth was calculated from the raw data. For the most recent global statistics, we highly recommend visiting the associated GenBank BioProject.
Abbreviations: AGS, assembled genome size; BAs, biological applications; DP, diploid; FSN, final scaffold number (pseudochromosome number); GS, genome size; IM, Illumina (combined paired-end [PE] and mate-pair [MP] reads); ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; SGS, second-generation sequencing; TGS, third-generation sequencing; 10xGC, 10x Genomics Chromium.
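The N50 and input depth (×) columns above follow standard definitions, which can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by the authors: the function names `n50` and `sequencing_depth` are hypothetical, and depth is assumed here to be total sequenced bases divided by the estimated genome size.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a set of contig or scaffold lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly length.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths


def sequencing_depth(total_bases: float, genome_size: float) -> float:
    """Estimate depth (x) as total sequenced bases / genome size (assumed definition)."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from the table.
    toy_lengths = [2, 3, 4, 5, 6]
    print("N50:", n50(toy_lengths))
    print("Depth:", sequencing_depth(total_bases=50e9, genome_size=1e9))
```

Sorting in descending order and accumulating until half the total is reached keeps the sketch short; for very large assemblies the same logic applies unchanged because only the list of lengths is needed.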
Commonly used tools and programs for genome assembly.
| Name | Main feature | Official link |
|---|---|---|
| De novo genome assemblers for TGS reads | ||
| Falcon/HGAP | Diploid-aware mode including trim, correction, and consensus for PacBio reads | |
| Canu | A fork of the Celera Assembler including trimming, correction, and consensus for TGS reads | |
| SMARTdenovo | De novo assembler including all-vs.-all raw read alignments without an error correction stage for TGS reads | |
| MECAT | Ultrafast mapping, error correction, and de novo assembly tools for single-molecule sequencing reads | |
| Flye | A repeat graph mode including trim, correction, and consensus with polishing for TGS reads | |
| Shasta | A run-length representation of ONT reads | |
| De novo genome assemblers for SGS reads | ||
| ABySS2 | An assembler intended for SGS PE and linked-reads | |
| ALLPATHS-LG | Uses a unipath graph from the | |
| MEGAHIT | An ultrafast and memory-efficient assembler for SGS reads | |
| SOAPdenovo | De Bruijn graph assembler with an error correction stage | |
| De novo genome assemblers for hybrid reads | ||
| MaSuRCA | An assembler combining the benefits of the de Bruijn and Overlap-Layout-Consensus assembly approaches for SGS and TGS reads | |
| Reference-guided/assistance assemblers | ||
| Ragout | Chromosome-level scaffolding | |
| RaGOO | Pseudochromosome construction | |
| RGAAT | Genome assembly and annotation | |
| Haplotype/phase assemblers | ||
| Falcon-Unzip | PacBio reads | |
| Falcon-Phase | PacBio reads | |
| Triobinning | ONT reads | |
| Platanus-allee | SGS and TGS reads | |
| WhatsHap | SGS and TGS reads | |
| IntegratedPhasing | SGS and TGS reads | |
| HaploConduct | SGS and TGS reads | |
| HaplotypeAssembler | SGS and TGS reads | |
ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; PE, paired-end; SGS, second-generation sequencing; TGS, third-generation sequencing.
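Several entries in the table above refer to de Bruijn graphs. As a rough illustration of the idea only, the sketch below builds a toy de Bruijn graph in which nodes are (k-1)-mers and each k-mer in a read contributes one edge. The function name `de_bruijn_graph` is hypothetical, and the sketch deliberately ignores everything a real assembler must handle (sequencing errors, reverse complements, k-mer counting, and graph simplification); it does not reflect how any tool listed here is implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def de_bruijn_graph(reads: Iterable[str], k: int) -> Dict[str, List[str]]:
    """Build a toy de Bruijn graph from reads.

    Each k-mer becomes a directed edge from its (k-1)-mer prefix
    to its (k-1)-mer suffix. Repeated k-mers yield repeated edges.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    graph: Dict[str, List[str]] = defaultdict(list)
    for read in reads:
        # Slide a window of width k along the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


if __name__ == "__main__":
    # Toy read for illustration only.
    for node, successors in de_bruijn_graph(["ACGTAC"], k=3).items():
        print(node, "->", successors)
```

An assembler would then look for paths through such a graph to reconstruct longer sequences; the choice of k trades off repeat resolution against sensitivity to errors and coverage gaps.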
Fig 1. Recommended flowchart for genome assembly and annotation.
NGS, next-generation sequencing.
Commonly used genome annotation tools and programs.
| Name | Main feature | Official link |
|---|---|---|
| Online pipeline | ||
| NCBI | Eukaryotic genome annotation. An automatic pipeline with flexibility and speed. Good for beginners and easy to use. | |
| NCBI | Prokaryotic genome annotation. An automatic pipeline with flexibility and speed. Good for beginners. | |
| Ensembl | Genome annotation. An automatic pipeline for importing external data or using predictive algorithms. Good for beginners and easy to use. | |
| GenSAS | Integrates with JBrowse and Apollo. An automatic platform and pipeline for genome structural and functional annotation. A user-friendly interactive portal that includes visualization and editing. Good for beginners and easy to use. | |
| GO FEAT | Genome and transcriptome. A rapid automatic platform for functional annotation and enrichment. A user-friendly portal that can export results in different output formats. Good for beginners and easy to use. | |
| Blast2GO | Functional annotation. An automatic platform as a standalone application that has high throughput and is interactive. A user-friendly program with easy start-up and low maintenance. Good for beginners, but the pro version requires a commercial license. | |
| AmiGO | GO and GO enrichment analysis. A user-friendly web-based platform. Requires some configuration of public databases with Perl, JavaScript, and Linux for the standalone application. A good web resource for beginners, but local installation requires bioinformatics support. | |
| eggNOG | Database of orthologous groups and functional annotation. An automatic platform and pipeline for any genome that scales with speed and flexibility (15 and 2.5 times faster than BLAST and InterProScan, respectively). Requires some configuration of public databases with various computer languages for a standalone application. A good web resource for beginners, but local installation requires bioinformatics support. | |
| KAAS | Ortholog assignment and pathway mapping. An automatic platform but has a limited number of query sequences. A good web resource for beginners, but local installation requires bioinformatics support. | |
| Augustus | Gene/genome structure and annotation using ab initio and transcript-based prediction. An automatic platform and pipeline for eukaryotic genomes. Requires some configuration of public databases with various computer languages and dependencies for a standalone application. A good web resource for beginners, but local installation requires bioinformatics support. | |
| GAAP | A semiautomated genome assembly and annotation pipeline. | |
| Command line interface | ||
| BRAKER | Gene/genome structure and annotation using a combination of GeneMark-ET, Augustus, and RNA-seq evidence. A fully automated training platform for novel eukaryotic genomes. Requires 2 input files: an RNA-seq alignment file in BAM format and a corresponding genome file in fasta format. Good for intermediate and advanced users due to the requirement of several semi-unsupervised pipelines and dependencies in local installation. | |
| MAKER | Gene/genome structure and annotation pipeline. An easy-to-use semiautomatic pipeline for the de novo annotation of newly sequenced genomes, for updating existing annotations to reflect new evidence, or simply for combining annotations, evidence, and quality control statistics for use with other GMOD programs such as GBrowse, JBrowse, Chado, and Apollo. Good for intermediate and advanced users due to the requirement of several semi-unsupervised pipelines and dependencies in local installation. | |
| Cufflinks | Transcriptome assembly and differential expression analysis of RNA-seq. A semiautomatic pipeline that includes TopHat (read mapping) and CummeRbund (visualization and exploration). Good for intermediate and advanced users due to the requirement of several pipelines and dependencies in local installation. | |
| StringTie | A fast and highly efficient assembler of RNA-seq alignment that allows users to quantitate full-length transcripts representing multiple splice variants for each gene locus. A semiautomatic pipeline using a BAM alignment input file with RNA-seq read mappings (produced and converted by TopHat, HISAT2, and Samtools). Good for intermediate and advanced users due to the requirement of several pipelines and dependencies in local installation. | |
| GLEAN | An unsupervised learning system for gene structure prediction. A semiautomatic pipeline without prior training. Lacks proper documentation and resources to run programs. Might be good for advanced users due to the requirement of several pipelines and dependencies in local installation. | |
| BLAST | A specialized algorithm to find regions of local similarity between sequences. A semiautomatic pipeline for understanding biological sequences. A good web resource for beginners, but local installation requires bioinformatics support. | |
| EVidenceModeler | Software combining ab initio gene predictions and protein/transcript evidence into weighted consensus gene structures. A semiautomatic pipeline with a flexible and intuitive framework for gene structure annotation. Good for intermediate and advanced users due to the requirement of several pipelines and dependencies in local installation. | |
| GSNAP | A genomic mapping and alignment program for mRNA and ESTs. A semiautomatic pipeline for gene structure annotation. Good for intermediate and advanced users due to the requirement of several pipelines, configurations, and dependencies in local installation. | |
| SNAP | Semi-HMM-based nucleic acid parser gene prediction tool. A semiautomatic pipeline for gene structure annotation. Good for intermediate and advanced users due to the requirement of several pipelines, configurations, and dependencies in local installation. | |
| TopHat | A fast splice junction mapper for RNA-seq. A semiautomatic pipeline that includes Bowtie and HISAT2 (read aligner). Good for intermediate and advanced users due to the requirement of several pipelines and dependencies in local installation. | |
| PASA | Program for assembling spliced alignments for genome annotation and gene structures. A semiautomatic pipeline for gene structure annotation but useful for genome-guided and de novo RNA-seq assemblies to generate a comprehensive transcript database. Good for intermediate and advanced users due to the requirement of several pipelines and dependencies in local installation. | |
| Evigan | Predicts genes by integrating multiple evidence sources. An automated annotation program that employs a Dynamic Bayesian Network. Model parameters are estimated by the Expectation–Maximization algorithm, thus eliminating the need to curate training data. Good for intermediate users due to the local installation requirement. | |
| Noncoding RNAs | ||
| Ensembl | Automatic annotation of noncoding genes but requires registration. A good web resource for beginners. | |
| LncFunTK | Functional annotation of long noncoding RNAs. An easy-to-use automatic pipeline for newly assembled genomes but requires several input files such as expression profiles (GTF format), TF binding profiles (BED format), and miRNA-binding profiles. This is a good web resource for beginners but might be better for intermediate and advanced users due to the requirement of several input files, pipelines, configurations, and dependencies in local installation. | |
| NONCODE | Database for noncoding RNAs except tRNAs and rRNAs. An automatic pipeline with 6 steps: format normalization (BED or GTF), combination, filtering of protein-coding RNA, information retrieval, advanced annotation, and web presentation. This has a good user-friendly web interface for beginners, but it might be better for intermediate and advanced users due to the requirement of several pipelines, configurations, and dependencies in local installation. | |
| deepBase | Small RNAs, lncRNAs, and circular RNAs | |
| lncRNAdb | A database that provides comprehensive annotations of eukaryotic long noncoding RNAs. An easy-to-use open public resource. An automatic pipeline for single sequences and a semiautomatic pipeline for multiple sequences with bioinformatic scripts. This has a good user-friendly web interface for beginners but it might be better for intermediate and advanced users due to the requirement of several pipelines, configurations, and dependencies in local installation. | |
| Repeat element | ||
| RepeatMasker | A program to screen for interspersed repeats and low-complexity DNA sequences. A fast and sensitive semiautomatic pipeline for assembled genomes. Good for intermediate and advanced users due to the requirement of several databases, pipelines, and dependencies in local installation. | |
| RepeatRunner | A CGL-based program that integrates RepeatMasker with blastx to identify repetitive elements. A semiautomatic pipeline for assembled genomes. Good for intermediate and advanced users due to the requirement of several databases, configurations, pipelines, and dependencies in local installation. | |
| Repbase | A database of prototypic sequences representing repetitive DNA from different eukaryotic species. A semiautomatic pipeline for genome sequencing projects. This has a good user-friendly web interface for beginners, but it might be better for intermediate and advanced users due to the requirement of several pipelines, configurations, and dependencies in local installation. | |
BAM, binary alignment map; BED, browser extensible data; ESTs, expressed sequence tags; GO, gene ontology; GTF, gene transfer format; HMM, hidden Markov model; RNA-seq, RNA sequencing; TF, transcription factor.
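Several tools in the table above accept or normalize BED and GTF files. One practical detail when moving between the two is that BED intervals are 0-based and half-open, whereas GTF intervals are 1-based and inclusive. The sketch below shows the coordinate conversion only; the function names are hypothetical, zero-length BED features are assumed to be out of scope, and a real converter would also need to map the remaining columns and attributes.

```python
from typing import Tuple


def bed_to_gtf_interval(bed_start: int, bed_end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to 1-based, inclusive GTF."""
    if bed_start < 0 or bed_end <= bed_start:
        raise ValueError("expected 0 <= start < end for a BED interval")
    return bed_start + 1, bed_end


def gtf_to_bed_interval(gtf_start: int, gtf_end: int) -> Tuple[int, int]:
    """Convert a 1-based, inclusive GTF interval to 0-based, half-open BED."""
    if gtf_start < 1 or gtf_end < gtf_start:
        raise ValueError("expected 1 <= start <= end for a GTF interval")
    return gtf_start - 1, gtf_end


if __name__ == "__main__":
    # Toy coordinates for illustration only.
    start, end = bed_to_gtf_interval(0, 100)
    print(f"BED 0-100 -> GTF {start}-{end}")
```

In both conventions the feature length is preserved: `bed_end - bed_start` equals `gtf_end - gtf_start + 1`, which is a convenient sanity check after any format normalization step.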