Victoria Dominguez Del Angel1, Erik Hjerde2, Lieven Sterck3,4, Salvadors Capella-Gutierrez5,6, Cederic Notredame7,8, Olga Vinnere Pettersson9, Joelle Amselem10, Laurent Bouri1, Stephanie Bocs11,12,13, Christophe Klopp14, Jean-Francois Gibrat1,15, Anna Vlasova8, Brane L Leskosek16, Lucile Soler17, Mahesh Binzer-Panchal17, Henrik Lantz17.
Abstract
As part of the ELIXIR-EXCELERATE efforts in capacity building, we present here 10 steps to help researchers get started in genome assembly and genome annotation. The guidelines given are broadly applicable, intended to be stable over time, and cover all aspects from start to finish of a general assembly and annotation project. Intrinsic properties of genomes are discussed, as is the importance of using high quality DNA. Different sequencing technologies and generally applicable workflows for genome assembly are also detailed. We cover structural and functional annotation and encourage readers to also annotate transposable elements, something that is often omitted from annotation workflows. The importance of data management is stressed, and we give advice on where to submit data and how to make your results Findable, Accessible, Interoperable, and Reusable (FAIR).
Keywords: Annotation; Assembly; DNA; FAIR; Genome; NGS; Workflows
Year: 2018 PMID: 29568489 PMCID: PMC5850084 DOI: 10.12688/f1000research.13598.1
Source DB: PubMed Journal: F1000Res ISSN: 2046-1402
Figure 1. Timeline and comparison of different sequencing technologies.
The data is based on the throughput metrics for the different platforms since their first instrument version came out. The figure visualises the results by plotting throughput in raw bases versus read length. Data released under CC BY 4.0 International license. doi 10.6084/m9.figshare.100940.
Examples of time and computer resources used by software dedicated to assembly and annotation.
SPAdes is an assembler designed for the assembly of small genomes using short reads. Smartdenovo is a de novo assembler for PacBio and Oxford Nanopore (ONT) data. The REPET package is a software suite dedicated to detecting, classifying, and annotating repeats. EuGene is an open integrative gene finder for eukaryotic and prokaryotic genomes. Processing time and RAM usage depend on the amount of input data, the complexity of the data, and the genome size.
| Reference Genome | Size | Software | Input | CPU/RAM Available | Real time | Max RAM |
|---|---|---|---|---|---|---|
| | 4 972 754 bp | SPAdes v3.10 | 200x Illumina reads (760 MB) | 4 CPU/16 GB RAM | 2h17m3s | 2.94 GB |
| | 4 972 754 bp | SPAdes v3.10 | 200x Illumina reads (760 MB) | 12 CPU/256 GB RAM | 38m8s | 9.37 GB |
| | 100 272 607 bp | Smartdenovo | 20x PacBio P6C4 | 8 CPU/16 GB RAM | 24m47s | 1.92 GB |
| | 100 272 607 bp | Smartdenovo | 80x PacBio P6C4 | 8 CPU/16 GB RAM | 5h38m16s | 7.29 GB |
| | 100 272 607 bp | REPET v2.5 | | 8 CPU/16 GB RAM | 1h53m11s | 8.96 GB |
| | 100 272 607 bp | EuGene v4.2a | | 8 CPU/32 GB RAM | 5h2m30s | 16.94 GB |
| | 134 634 692 bp | Smartdenovo | 20x PacBio P5C3 | 8 CPU/16 GB RAM | 1h16m20s | 2.4 GB |
| | 134 634 692 bp | REPET v2.5 | | 8 CPU/16 GB RAM | 5h6m23s | 10.25 GB |
| | 134 634 692 bp | EuGene v4.2a | | 8 CPU/32 GB RAM | 6h17m18s | 17.25 GB |
| | 324 761 211 bp | EuGene v4.2a | | 8 CPU/188 GB RAM | 41h27m13s | 72.5 GB |
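The wall-clock times in the table use a compact `XhYmZs` notation, which makes runs hard to compare at a glance. The following minimal sketch converts two of the reported values to seconds and computes their ratio. The function name `parse_duration` is an illustrative choice, not part of any of the tools listed.

```python
import re


def parse_duration(text: str) -> int:
    """Convert a duration such as '2h17m3s' or '38m8s' into seconds."""
    match = re.fullmatch(r"(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# SPAdes v3.10 on the 4 972 754 bp genome, values copied from the table.
spades_4cpu = parse_duration("2h17m3s")   # 4 CPU / 16 GB RAM
spades_12cpu = parse_duration("38m8s")    # 12 CPU / 256 GB RAM

speedup = spades_4cpu / spades_12cpu
print(f"{spades_4cpu} s vs {spades_12cpu} s: {speedup:.2f}x faster")
```

Note that the two SPAdes runs differ in both CPU count and available RAM, so the ratio cannot be attributed to the additional cores alone.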
Figure 2. General steps in a genome assembly workflow.
Input and output data are indicated for each step.
Figure 3. Simplified illustration of a structural genome annotation using combiners.
The left side of the diagram shows a typical assembly process, which yields scaffolds or chromosomes ready to be annotated. These scaffolds are then annotated using two different methods. The first method, ab initio prediction, requires a known set of training genes. Once the ab initio tool has been trained, it can be used to predict other similarly structured genes. The second, similarity-based approach relies on experimental evidence such as CDSs, ESTs, or RNA-seq to build gene models. Combiners (such as Maker or EuGene) can then incorporate all of these results, eliminate incongruences, and present the gene models best supported by all methods.
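The idea of a combiner can be illustrated with a toy sketch: candidate gene models from several sources are grouped into overlapping loci, and for each locus the coordinates supported by the most sources are kept. This is a deliberate simplification for illustration only. It is not how Maker or EuGene work internally, and the `GeneModel` type, the `combine` function, the source labels, and the coordinates are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class GeneModel(NamedTuple):
    scaffold: str
    start: int
    end: int
    source: str  # hypothetical labels, e.g. "ab_initio", "rna_seq"


def overlaps(a: GeneModel, b: GeneModel) -> bool:
    """True if both models lie on the same scaffold and share a position."""
    return a.scaffold == b.scaffold and a.start <= b.end and b.start <= a.end


def combine(models):
    """Group overlapping models into loci and keep, per locus, the
    coordinates backed by the largest number of distinct sources."""
    loci = []
    for model in sorted(models, key=lambda m: (m.scaffold, m.start, m.end)):
        if loci and any(overlaps(model, other) for other in loci[-1]):
            loci[-1].append(model)
        else:
            loci.append([model])

    consensus = []
    for locus in loci:
        support = defaultdict(set)
        for m in locus:
            support[(m.scaffold, m.start, m.end)].add(m.source)
        # Ties are broken in favour of the longer model (an arbitrary choice).
        best = max(support, key=lambda k: (len(support[k]), k[2] - k[1]))
        consensus.append((best, sorted(support[best])))
    return consensus


# Made-up example input: three models at one locus, one at another.
models = [
    GeneModel("scaffold_1", 100, 900, "ab_initio"),
    GeneModel("scaffold_1", 100, 900, "rna_seq"),
    GeneModel("scaffold_1", 150, 900, "protein"),
    GeneModel("scaffold_1", 2000, 2600, "ab_initio"),
]
result = combine(models)
for coords, sources in result:
    print(coords, sources)
```

Real combiners weigh evidence types differently and reason about exon-intron structure rather than voting on whole-gene coordinates, so this sketch conveys only the principle of reconciling several predictions.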
Figure 4. Functional Annotation Pipelines.
This schema shows a typical functional annotation pipeline, in which functional roles are assigned to the coding sequences (CDSs) inferred during gene prediction. The pipeline follows three parallel routes to define function: the first identifies protein domains and motifs, the second performs an orthology search, and the third performs a homology search. Finally, the output from the three sources is combined to produce more informative predictions.
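The final merging step of such a pipeline can be sketched as follows, assuming each of the three routes has already produced a mapping from CDS identifier to its result. The function `merge_annotations`, the identifiers, and the annotation values are placeholders invented for this example and do not come from any specific tool.

```python
def merge_annotations(cds_ids, domains, orthologs, homologs):
    """Collect the results of the three routes into one record per CDS."""
    merged = {}
    for cds in cds_ids:
        record = {
            "domains": domains.get(cds, []),   # protein domains and motifs
            "orthology": orthologs.get(cds),   # orthology search result
            "homology": homologs.get(cds),     # homology search result
        }
        # Number of routes that returned something for this CDS.
        record["n_sources"] = sum(bool(value) for value in record.values())
        merged[cds] = record
    return merged


# Placeholder inputs; real pipelines would parse tool output files instead.
cds_ids = ["cds_1", "cds_2", "cds_3"]
domains = {"cds_1": ["domain_A", "domain_B"], "cds_2": ["domain_C"]}
orthologs = {"cds_1": "ortholog_group_1"}
homologs = {"cds_1": "homolog_X", "cds_2": "homolog_Y"}

annotations = merge_annotations(cds_ids, domains, orthologs, homologs)
for cds, record in annotations.items():
    print(cds, record)
```

Keeping the per-route results side by side, together with a count of supporting routes, makes it easy to flag CDSs such as `cds_3` above for which no route found any evidence.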