Robert Ekblom, Jochen B W Wolf.
Abstract
Genome sequencing projects were long confined to biomedical model organisms and required the concerted effort of large consortia. Rapid progress in high-throughput sequencing technology and the simultaneous development of bioinformatic tools have democratized the field. It is now within reach for individual research groups in the eco-evolutionary and conservation community to generate de novo draft genome sequences for any organism of choice. Because of the cost and considerable effort involved in such an endeavour, the important first step is to thoroughly consider whether a genome sequence is necessary for addressing the biological question at hand. Once this decision is taken, a genome project requires careful planning with respect to the organism involved and the intended quality of the genome draft. Here, we briefly review the state of the art within this field and provide a step-by-step introduction to the workflow involved in genome sequencing, assembly and annotation with particular reference to large and complex genomes. This tutorial is targeted at scientists with a background in conservation genetics, but more generally, provides useful practical guidance for researchers engaging in whole-genome sequencing projects.
Keywords: bioinformatics; conservation genomics; genome assembly; next generation sequencing; vertebrates; whole-genome sequencing.
Year: 2014 PMID: 25553065 PMCID: PMC4231593 DOI: 10.1111/eva.12178
Source DB: PubMed Journal: Evol Appl ISSN: 1752-4571 Impact factor: 5.183
Figure 1. Workflow of a typical de novo whole-genome sequencing project. Black boxes with white text indicate genomic resources becoming available during the course of the project. From the top: wet-lab procedures, de novo assembly bioinformatic pipeline, postassembly analyses of additional population-wide sampling (population genomics), conservation genomic questions to address and analyses to perform (conservation genomic applications). Bullet points within the white star in the bottom part of the figure represent ultimate goals in conservation biology that can be addressed using genomic information combined with high-quality ecological data.
| Alignment | Similarity-based arrangement of DNA, RNA or protein sequences. In this context, subject and query sequence should be orthologous and reflect evolutionary, not functional or structural relationships |
| Annotation | Computational process of attaching biologically relevant information to genome sequence data |
| Assembly | Computational reconstruction of a longer sequence from smaller sequence reads |
| Barcode | Short-sequence identifier for individual labelling (barcoding) of sequencing libraries |
| BAC | (Bacterial artificial chromosome) DNA construct of variable length (150–350 kb) |
| cDNA | Complementary DNA synthesized from an mRNA template |
| Contig | A contiguous linear stretch of DNA or RNA consensus sequence. Constructed from a number of smaller, partially overlapping, sequence fragments (reads) |
| Coverage | Also known as ‘sequencing depth’. |
| De novo assembly | Refers to the reconstruction of contiguous sequences without making use of any reference sequence |
| EST library | Expressed sequence tag library. A short subsequence of cDNA transcript sequence |
| Fosmid | A vector for bacterial cloning of genomic DNA fragments that usually holds inserts of around 40 kb |
| GC content | The proportion of guanine and cytosine bases in a DNA/RNA sequence |
| Gene ontology (GO) | Structured, controlled vocabularies and classifications of gene function across species and research areas |
| InDel | Insertion/deletion polymorphism |
| Insert size | Length of randomly sheared fragments (from the genome or transcriptome) sequenced from both ends |
| K-mer | Short, unique element of DNA sequence of length k, used by many assembly algorithms |
| Library | Collection of DNA (or RNA) fragments modified in a way that is appropriate for downstream analyses, such as high-throughput sequencing in this case |
| Mapping | A term routinely used to describe alignment of short sequence reads to a longer reference sequence |
| Masking | Converting a DNA sequence [A,C,G,T] (usually repetitive or of low quality) to the uninformative character state N or to lower case characters [a,c,g,t] |
| Massively parallel (or next generation) sequencing | High-throughput sequencing nano-technology used to determine the base-pair sequence of DNA/RNA molecules in much larger quantities than previous chain-termination (e.g. Sanger sequencing) based sequencing techniques |
| Mate-pair | Sequence information from two ends of a DNA fragment, usually several thousand base-pairs long |
| N50 | A statistic of a set of contigs (or scaffolds). It is defined as the length for which the collection of all contigs of that length or longer contains at least half of the total of the lengths of the contigs |
| N90 | Equivalent to the N50 statistic describing the length for which the collection of all contigs of that length or longer contains at least 90% of the total of the lengths of the contigs |
| Optical map | Genomewide, ordered, high-resolution restriction map derived from single, stained DNA molecules. It can be used to improve a genome assembly by matching it to the genomewide pattern of expected restriction sites, as inferred from the genome sequence |
| Paired-end sequencing | Sequence information from two ends of a short DNA fragment, usually a few hundred base pairs long |
| Read | Short base-pair sequence inferred from the DNA/RNA template by sequencing |
| RNA-Seq | High-throughput shotgun transcriptome (cDNA) sequencing. Usually not used synonymously with RNA sequencing, which implies direct sequencing of RNA molecules, skipping the cDNA generation step |
| Scaffold | Two or more contigs joined together using read-pair information |
| Transcriptome | Set of all RNA molecules transcribed from a DNA template |
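The N50 and N90 definitions in the glossary above can be made concrete with a small calculation. The following is a minimal sketch, not the procedure of any particular assembler or evaluation tool: the function name `nx_statistic` and the contig lengths are hypothetical, chosen only so that the arithmetic is easy to follow by hand.

```python
def nx_statistic(lengths, percent):
    """Return the contig length L such that all contigs of length >= L
    together contain at least `percent` % of the total assembly length.

    percent=50 gives N50, percent=90 gives N90.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length


# Hypothetical contig lengths (bp); total length is 300.
contig_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx_statistic(contig_lengths, 50)
n90 = nx_statistic(contig_lengths, 90)
print(n50, n90)
```

With these made-up lengths, the two longest contigs (80 + 70 = 150) already hold half of the 300 bp total, so the N50 is 70; reaching 90% requires the five longest contigs, so the N90 is 30. The same function applies unchanged to scaffold lengths.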
| • Availability of appropriate computational resources | |
| • Collaboration with sequencing facility and bioinformatics groups | |
| • Plan for amount and type of sequencing data needed | |
| • Does funding allow sufficient sequence coverage to be produced? If not, alternative approaches should be considered rather than producing a poor, low-coverage assembly | |
| • Familiarization with data handling pipelines and file formats (see below) | |
| • High-quality DNA sample (with individual metadata) | |
| • Plan for analyses and publication | |
| • Library preparation and Sequencing: Mardis ( | |
| • Quality filtering/preprocessing: Patel and Jain ( | |
| • Genome assembly: Nagarajan and Pop ( | |
| • Assembly evaluation: Earl et al. ( | |
| • Genome annotation: Yandell and Ence ( | |
| • Mapping: Li and Durbin ( | |
| • Data handling: Li et al. ( | |
| • Variant calling: Nielsen et al. ( | |
| • Haplotype-based approaches: Browning and Browning ( | |
| • Population genomic summary statistics: Nielsen et al. ( | |
| • Galaxy ( | |
| • Amazon cloud ( | |
| • Windows Azure ( | |
| • Magellan: Cloud Computing for Science ( | |
| • Web Apollo ( | |
| • NCBI BioProject ( | |
| • Genomes OnLine Database ( | |
| • ENSEMBL genome database ( | |
| • UCSC Genome Browser ( | |
| • fastQCtoolkit for data preprocessing ( | |
| • Plants: | |
| • Animals: | |
| • FASTA | Nucleotide sequence (file extension .fas or .fa) |
| • FASTQ | Nucleotide sequence including quality scores |
| • SAM | Sequence alignment |
| • BAM | Binary version of SAM |
| • GFF3 | Annotation |
| • GTF | Annotation |
| • BED | Annotation |
| • VCF | Variant calling |
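As an illustration of the FASTQ entry in the list above (nucleotide sequence including quality scores), the sketch below reads records and converts the quality string to numeric scores. It assumes the common four-lines-per-record layout and a Phred+33 quality encoding; both are assumptions about the input rather than something stated in the text, and the function names and example reads are hypothetical.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality_string) tuples from a FASTQ stream.

    Assumes exactly four lines per record: '@name', sequence, '+', qualities.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored here
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


# Two made-up reads, held in memory instead of a file.
example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
records = list(read_fastq(io.StringIO(example)))
mean_quality = {
    name: sum(phred_scores(qual)) / len(qual) for name, _, qual in records
}
print(mean_quality)
```

For real projects an established parser is usually preferable to hand-written code; this sketch only shows what the format contains.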
Figure 2. Simplified illustration of the assembly process and terminology. Shotgun sequencing: short fragments of DNA from the target organism are sequenced at random positions across the genome to a given depth of coverage. Fragments can consist of single reads (typically 50–1000 bp) or of paired-end reads of varying insert size (note that paired-end reads can even overlap). Mate-pair libraries span larger genomic regions (∼2–20 kb inserts) with reads generally facing outwards and can be complemented with fosmid-end libraries (∼40 kb inserts). Genome assembly: (A) short-read de novo assemblers extend the dispersed sequence information from the reads into continuous stretches called contigs. Contigs usually reflect the consensus sequence and do not contain any polymorphisms. (B) Paired-end reads provide additional information on whether a read is supported for a given contig. (C) Some assemblers such as ALLPATHS-LG work with overlapping read pairs that are joined into a virtual longer read prior to the assembly. Read pairs from mate-pair or fosmid-end libraries can be used to order and orient contigs into scaffolds. Gap size between contigs is estimated from the expected length of mate-pairs and marked with ‘N’s (indicated by hatched grey boxes). Long reads from single molecule sequencing provide an alternative. Annotation: gene models can be inferred in silico by prediction algorithms, by lifting over information from genomes of related organisms and by using transcriptome data (RNA-seq, expressed sequence tag) from the target organism itself. Spliced reads from RNA-seq data as indicated at the bottom of the figure provide valuable evidence for splice junctions and various isoforms of a gene.
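The scaffolding step described in the caption, where the gap between two contigs is estimated from the expected mate-pair length and filled with N characters, can be sketched as follows. This is a deliberately simplified illustration under stated assumptions: it uses a single read pair with a known expected insert size, ignores contig orientation, and the function names and sequences are hypothetical. Real scaffolders combine evidence from many read pairs.

```python
def estimate_gap(insert_size, span_in_left, span_in_right):
    """Estimate the gap between two contigs from one mate-pair.

    insert_size:   expected length of the sequenced fragment (bp)
    span_in_left:  bases of the fragment lying on the left contig
    span_in_right: bases of the fragment lying on the right contig
    A negative estimate (apparent overlap) is clipped to zero here.
    """
    return max(insert_size - span_in_left - span_in_right, 0)


def build_scaffold(contigs, gaps):
    """Join ordered, oriented contigs, marking each gap with a run of N."""
    if not contigs:
        raise ValueError("need at least one contig")
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap size between adjacent contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        parts.append("N" * gap)  # unknown sequence of estimated length
        parts.append(contig)
    return "".join(parts)


# Made-up example: a 20 bp insert covering 8 bp of one contig and 7 bp of the next.
left_contig = "ACGTACGTAC"
right_contig = "TTGACCA"
gap = estimate_gap(insert_size=20, span_in_left=8, span_in_right=7)
scaffold = build_scaffold([left_contig, right_contig], [gap])
print(gap, scaffold)
```

Here the estimated gap is 20 − 8 − 7 = 5 bp, so the scaffold consists of the first contig, five N characters and the second contig.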
Some recently sequenced vertebrate genomes in species of conservation concern.
| Species | Red list category | Sequencing technology | Assembly algorithm | Contig N50 (bp) | Sequencing coverage | Number of authors | References |
|---|---|---|---|---|---|---|---|
| Chimpanzee | EN | Sanger | PCAP | 53000 | 6× | 67 | Consortium ( |
| Mammoth | EX | Roche 454 | NA | NA | <1× | 22 | Miller et al. ( |
| Panda | EN | Illumina GA | SOAPdenovo | 39886 | 56× | 123 | Li et al. ( |
| Orang-utan | CR | Sanger | PCAP | 15654 | 6× | 101 | Locke et al. ( |
| Cod | VU | Roche 454 | Newbler | 2778 | 40× | 42 | Star et al. ( |
| Tasmanian devil | EN | Roche 454/Illumina GAIIx | Newbler/CABOG | 9495 | 14× | 30 | Miller et al. ( |
| African elephant | VU | Sanger (ABI3730) | ARACHNE (reference assisted) | 2900 | 2× | 60 | Lindblad-Toh et al. ( |
| Tarsier | NT | Sanger (ABI3730) | ARACHNE (reference assisted) | 2900 | 2× | 60 | Lindblad-Toh et al. ( |
| Polar bear | VU | Illumina HiSeq 2000 | SOAPdenovo | 3596 | 100× | 26 | Miller et al. ( |
| Puerto Rican parrot | CR | Illumina HiSeq 2000 | Ray | 6983 | 27× | 14 | Oleksyk et al. ( |
| Gorilla | CR | Sanger/Illumina | Phusion assembler/ABySS | 11800 | 50× | 71 | Scally et al. ( |
| Bonobo | EN | Roche 454 | Celera Assembler | 67000 | 25× | 41 | Prufer et al. ( |
| Yak | VU | Illumina HiSeq 2000 | SOAPdenovo | 20400 | 65× | 48 | Qiu et al. ( |
| Aye-aye | NT | Illumina GAIIx | CLC bio Assembler | 3650 | 38× | 10 | Perry et al. ( |
| Coelacanth | CR | Illumina HiSeq 2000 | ALLPATHS-LG | 12700 | 61× | 91 | Amemiya et al. ( |
| Saker falcon | EN | Illumina HiSeq 2000 | SOAPdenovo | 31200 | 113× | 25 | Zhan et al. ( |
| Tibetan antelope | EN | Illumina GAIIx | SOAPdenovo (reference assisted) | NA | Not reported | 11 | Kim et al. ( |
| Bluefin tuna | LC | Roche 454/Illumina GAIIx | Newbler/Bowtie | 7588 | 54× | 24 | Nakamura et al. ( |
| Darwin's finch | LC | Roche 454 | Newbler | Not reported | 4× | 19 | Rands et al. ( |
| Straw coloured fruit bat | NT | Illumina HiSeq 2000 | CLC bio | 27140 | 17× | 7 | Parker et al. ( |
| King cobra | VU | Illumina GAIIx | CLC/SSPACE | 3980 | 40× | 36 | Vonk et al. ( |
| Burmese python | VU | Roche 454/Illumina HiSeq 2000 | Newbler/SOAPdenovo | 10700 | 49× | 39 | Castoe et al. ( |
| Chinese softshell turtle | VU | Illumina HiSeq 2000 | SOAPdenovo | 22000 | 106× | 34 | Wang et al. ( |
| Tiger | EN | Illumina HiSeq 2000 | SOAPdenovo | 29800 | 118× | 58 | Cho et al. ( |
| Minke whale | LC | Illumina HiSeq 2000 | SOAPdenovo | 22571 | 128× | 55 | Yim et al. ( |
| Northern bobwhite | NT | Illumina HiSeq 2000 | CLC | 45400 | 142× | 12 | Halley et al. ( |
| Black grouse | LC | SOLiD 5500xl | SOAPdenovo (reference assisted) | 1238 | 127× | 5 | Wang et al. ( |
| White rhinoceros | NT | Illumina HiSeq 2000 | ALLPATHS-LG | 93000 | 91× | 10 | Di Palma et al. unpublished data |
Red list categories: EX, extinct; CR, critically endangered; EN, endangered; VU, vulnerable; NT, near threatened; LC, least concern.
Not red-listed, but likely to be affected by overfishing.
Not red-listed, but endemic to a small geographic region.
Not currently red-listed, but subject to extensive exploitation or within a group of endangered taxa.
Not globally red-listed, but with several small and isolated regional populations.