Wenming Xiao, Leihong Wu, Gokhan Yavas, Vahan Simonyan, Baitang Ning, Huixiao Hong.
Abstract
Even though each of us shares more than 99% of the DNA sequences in our genome, there are millions of sequence codes or structures in small regions that differ between individuals, giving us different characteristics of appearance or responsiveness to medical treatments. Currently, genetic variants in diseased tissues, such as tumors, are uncovered by exploring the differences between the reference genome and the sequences detected in the diseased tissue. However, the public reference genome was derived from the DNA of multiple individuals. As a result, the reference genome is incomplete and may misrepresent the sequence variants of the general population. A more reliable solution is to compare sequences of diseased tissue with the individual's own genome sequence derived from tissue in a normal state. As the price of sequencing a human genome has dropped dramatically to around $1000, documenting a personal genome for every individual looks increasingly feasible. However, de novo assembly of individual genomes at an affordable cost is still challenging, and so far only a few human genomes have been fully assembled. In this review, we introduce the history of human genome sequencing and the evolution of sequencing platforms, from Sanger sequencing to emerging "third generation sequencing" technologies. We present the currently available de novo assembly and post-assembly software packages for human genome assembly and their requirements for computational infrastructure. We recommend a combined hybrid assembly with long and short reads as a promising way to generate good-quality human genome assemblies, and we specify parameters for the quality assessment of assembly outcomes. We provide a perspective on the benefits of using personal genomes as references and suggestions for obtaining a quality personal genome. Finally, we discuss the use of the personal genome in aiding vaccine design and development, monitoring host immune response, tailoring drug therapy, and detecting tumors. We believe that precision medicine would benefit greatly from bioinformatics solutions, particularly for personal genome assembly.
Keywords: assembly; genome; personal genome; quality metrics; sequencing
Year: 2016 PMID: 27110816 PMCID: PMC4932478 DOI: 10.3390/pharmaceutics8020015
Source DB: PubMed Journal: Pharmaceutics ISSN: 1999-4923 Impact factor: 6.321
Basic statistics for recent releases of the human reference genome build.
| Genome Build # | Release Year | Total Genome Length | Total Non-N Bases | N50 | Number of Gaps | # of Scaffolds | # Unplaced Scaffolds |
|---|---|---|---|---|---|---|---|
| 35 | 2004 | 3,091,649,889 | 2,866,200,199 | 38,509,590 | 292 | 377 | 86 |
| 36 | 2006 | 3,104,054,490 | 2,881,649,121 | 38,509,590 | 292 | 367 | 88 |
| 37 | 2009 | 3,137,144,693 | 2,897,299,566 | 46,395,641 | 357 | 249 | 59 |
| 38 | 2013 | 3,209,286,105 | 3,049,316,098 | 67,794,873 | 875 | 473 | 169 |
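
The "Total Genome Length" and "Total Non-N Bases" columns can be combined into a single completeness figure per build. The following is a minimal sketch, not part of the original review: the function and variable names are illustrative, and the numbers are copied from the table above.

```python
# (total genome length, total non-N bases) per build, copied from the table above.
BUILDS = {
    35: (3_091_649_889, 2_866_200_199),
    36: (3_104_054_490, 2_881_649_121),
    37: (3_137_144_693, 2_897_299_566),
    38: (3_209_286_105, 3_049_316_098),
}


def non_n_fraction(total_length, non_n_bases):
    """Fraction of the build made of resolved (non-N) bases."""
    return non_n_bases / total_length


def n_bases(total_length, non_n_bases):
    """Number of unresolved (N) bases left in the build."""
    return total_length - non_n_bases


def summarize(builds):
    """Map each build number to (non-N fraction, N base count)."""
    return {
        build: (non_n_fraction(total, non_n), n_bases(total, non_n))
        for build, (total, non_n) in builds.items()
    }


if __name__ == "__main__":
    for build, (fraction, unresolved) in summarize(BUILDS).items():
        print(f"build {build}: {fraction:.2%} resolved, {unresolved:,} N bases")
```

Under these numbers, build 38 has the highest resolved fraction of the four builds even though it also lists the most gaps, which suggests that gap count alone is a weak proxy for completeness.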
Comparison of current common NGS platforms.
| Platform | Mode | Read-Length | Reads Passing Filter per Run | Output | Run Time | Quality | Cost/Run | Instrument Price |
|---|---|---|---|---|---|---|---|---|
| Illumina HiSeq 2000/2500 | High-Output | 1 × 36–2 × 125 | 4 B | 128 GB–1 TB | 1–6 days | Q30 ≥ 80% | ~$29K | $740K |
| | Rapid | 1 × 36–2 × 150 | 600 M | 18 GB–300 GB | 7–60 h | Q30 ≥ 75% | ~$8K | |
| Illumina HiSeq X ten | X ten | 2 × 150 | 5.3–6 B | 1.6–1.8 TB | <3 days | Q30 ≥ 75% | ~$12K | $1M* |
| Roche 454 FLX system | Titanium XL+ | 700 | 1 M | 700 MB | 23 h | 99.997% | ~$6K | ~$500K |
| Life Technologies Ion Torrent | Proton I | 200 | 165 M | ~10 GB | 2–4 h | | ~$1000 | $149K |
| | Proton II | 100 | 660 M | ~32 GB | 2–4 h | | | |
| Intelligent Biosystems (Qiagen) | MAX-Seq | 2 × 55 | 75 M/lane | 132 GB | 2.5 days | | ~$1200 | ~$270K |
| | Mini-20 | 2 × 100 | 20 M/lane | 80 GB | | | ~$150–300/sample | $125K |
| PacBio RS | RS II | 10–15 KB | 50 K | 500 MB–1 GB | 4 h | >99.999% | ~$400 | ~$700K |
| Oxford Nanopore | MinION | >200 KB | | | no fixed run time (~1 bp per nanosecond) | | ≤$900 | ~$1000 |
* K: thousand; M: million; B: billion; KB: kilobase; MB: megabase; GB: gigabase; TB: terabase; h: hour.
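
The "Output" column translates into sequencing depth once a genome size is fixed. The sketch below is illustrative only: it assumes that per-run output is spread uniformly over the genome, uses the build 38 total length from the reference-genome table as the genome size, and treats 30-fold coverage as an arbitrary example target rather than a recommendation from the review. The function names are hypothetical.

```python
import math

# Assumed genome size: total length of build 38 from the reference-genome table.
GENOME_LENGTH = 3_209_286_105


def fold_coverage(output_bases, genome_length=GENOME_LENGTH):
    """Average depth obtained if output_bases were spread evenly over the genome."""
    return output_bases / genome_length


def runs_needed(target_coverage, output_bases_per_run, genome_length=GENOME_LENGTH):
    """Smallest whole number of runs whose combined output reaches the target depth."""
    return math.ceil(target_coverage * genome_length / output_bases_per_run)


if __name__ == "__main__":
    # Upper and lower ends of two "Output" ranges in the table (1 GB = 1e9 bases).
    print(f"1.6 TB per run: about {fold_coverage(1.6e12):.0f}-fold coverage")
    print(f"1 GB per run: {runs_needed(30, 1e9)} runs for 30-fold coverage")
```

This ignores reads that fail to map, duplicates, and uneven coverage, so real projects would need more output than the estimate suggests.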
Genome de novo assembly and post-assembly approaches.
| Approaches | Commonly Used Tools | Notes |
|---|---|---|
| De Bruijn graph (DBG) | EULER, ALLPATHS, Velvet, ABySS, SOAPdenovo | For shorter reads (25–100 bp) assembly |
| Overlap-layout-consensus (OLC) | SSAKE, SHARCGS, VCAKE, Celera Assembler, Arachne, PCAP, HGAP | For longer reads (100–800 bp) and long reads assembly |
| Contigs orientation and visualization | AlignGraph, ABACAS, CONTIGuator, Projector2, OSLay and r2cat | |
| Extending contigs and filling gaps | IMAGE, GAA program, Reconciliator, GAPFiller, Pilon | |
| Reads error correction | ICORN, AutoEditor, REAPR | |
| Unmapped reads annotation | RATT, Ensembl, GARSA and SABIA | |
Figure 1. Common flowchart of hybrid assembly to integrate short and long reads. The combination can be at the reads level, i.e., using short reads to correct the errors in long reads. Alternatively, long reads or their derived contigs could be used as bridges to join or fill in gaps of contigs assembled with short reads.
Parameters for quality assessment of assembled genome.
| Parameters | Notes | |
|---|---|---|
| Number of contigs | total number of assembled contigs | |
| Max length of contigs | the longest contig | |
| Min length of contigs | the shortest contig | |
| Total length of contigs | sum of the length of all contigs | |
| Nx_plot | contig length L such that contigs of length ≥ L contain at least x% of the bases of the assembled contigs, where 0 < x < 100 | |
| NGx_plot | contig length L such that contigs of length ≥ L contain at least x% of the bases of the reference genome, where 0 < x < 100 | |
| NAx_plot | Nx computed on the assembled contigs after correction, where 0 < x < 100 | |
| NGAx_plot | NGx computed on the assembled contigs after correction, where 0 < x < 100 | |
| Number of misassemblies | total number of assembly errors, including misjoins, base errors, and false indels | |
| misjoins | number of misjoins | |
| base errors | number of base errors | |
| false indels | number of false indels | |
| Number of misassembled contigs (parsimony) | number of contigs with assembly errors | |
| Total length of misassembled contigs | sum of the length of misassembled contigs | |
| Unaligned contigs | total number of contigs that could not be mapped to the reference genome | |
| alternative human reference | unaligned contigs that could be mapped to alternative human reference genomes | |
| nonhuman primate genome references | unaligned contigs that could be mapped to nonhuman primate reference genomes | |
| Ambiguously mapped contigs | contigs mapped to multiple locations on the reference genome | |
| Fragment coverage distribution (FCD) | local assembly error detected by fragment coverage of assembled contigs by sequence reads | |
| Genome coverage fraction | percentage of the reference genome covered by assemblies | |
| Known gene complete coverage fraction | percentage of known genes covered completely by assemblies | |
| Known gene partial coverage fraction | percentage of known genes covered partially by assemblies | |
| Known exon complete coverage fraction | percentage of known exons covered completely by assemblies | |
| Known exon partial coverage fraction | percentage of known exons covered partially by assemblies | |
| Duplication ratio (multiplicity) | ratio of total length of aligned contigs | |
| Alignable ratio (validity) | ratio of total aligned contigs | |
| GC content | percentage of GC content in assembled contigs | |
| Number of SNVs | total number of single nucleotide variations (SNVs) detected in assembled contigs | |
| Number of SNPs | total number of single nucleotide polymorphisms (SNPs) detected in assembled contigs | |
| Number of small indels | total number of small indels detected in assembled contigs | |
| Number of inversions | total number of inversions detected in assembled contigs | |
| Number of translocations | total number of translocations detected in assembled contigs | |
| SNVs/100 kb | number of SNVs per 100 kb block | |
| SNPs/100 kb | number of SNPs per 100 kb block | |
| indels/100 kb | number of small indels per 100 kb block | |
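
The contig-length rows of the table (number, max, min, total, Nx, NGx) can be computed directly from a list of contig lengths. The sketch below uses the common convention that Nx is the length of the shortest contig in the smallest set of longest contigs covering at least x% of the bases; the review does not prescribe an implementation, so the function names and the choice to return `None` when the assembly is too short to reach the target are assumptions made for illustration.

```python
def basic_stats(contig_lengths):
    """Number, max, min, and total length of contigs."""
    return {
        "number": len(contig_lengths),
        "max": max(contig_lengths),
        "min": min(contig_lengths),
        "total": sum(contig_lengths),
    }


def nx(contig_lengths, x, total=None):
    """Contig length L such that contigs of length >= L hold at least x% of `total`.

    `total` defaults to the summed contig length (Nx). Passing a reference
    genome length instead gives NGx. Returns None if the contigs together
    never reach x% of `total`.
    """
    if not 0 < x < 100:
        raise ValueError("x must satisfy 0 < x < 100")
    lengths = sorted(contig_lengths, reverse=True)
    if total is None:
        total = sum(lengths)
    target = total * x / 100
    running = 0
    for length in lengths:
        running += length
        if running >= target:
            return length
    return None


def ngx(contig_lengths, x, genome_length):
    """NGx: same as Nx, but measured against the reference genome length."""
    return nx(contig_lengths, x, total=genome_length)


if __name__ == "__main__":
    toy_contigs = [80, 70, 50, 40, 30, 20, 10]  # made-up lengths for illustration
    print(basic_stats(toy_contigs))
    print("N50:", nx(toy_contigs, 50))
    print("NG50 against a 400-base reference:", ngx(toy_contigs, 50, 400))
```

Evaluating `nx` over a range of x values gives the points of an Nx plot. NAx and NGAx would reuse the same functions on the lengths of the corrected contigs, which requires alignment to a reference and is outside this sketch.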
Figure 2. Potential use of a personal genome in future clinical settings.