Georgios A Pavlopoulos, Anastasis Oulas, Ernesto Iacucci, Alejandro Sifrim, Yves Moreau, Reinhard Schneider, Jan Aerts, Ioannis Iliopoulos.
Abstract
Elucidating the content of a DNA sequence is critical to understanding and decoding the genetic information of any biological system. As next generation sequencing (NGS) techniques have become cheaper and higher in throughput over time, they have led to major innovations and breakthrough findings in various biological areas. A few of the areas shaped by these technological advances are the evolution of species, microbial mapping, population genetics, genome-wide association studies (GWAS), comparative genomics, variant analysis, gene expression, gene regulation, epigenetics and personalized medicine. While NGS techniques stand as key players in modern biological research, the analysis and interpretation of the vast amount of data they produce is not a trivial task and remains a great challenge in the field of bioinformatics. Therefore, efficient tools that cope with information overload, tackle the high complexity and provide meaningful visualizations to ease knowledge extraction are essential. In this article, we briefly review the sequencing methodologies and the equipment available for these analyses, and we describe the data formats of the files they produce. We conclude with a thorough review of tools developed to efficiently store, analyze and visualize such data, with emphasis on structural variation analysis and comparative genomics. Finally, we comment on their functionality, strengths and weaknesses, and we discuss how future applications could further develop in this field.
Year: 2013 PMID: 23885890 PMCID: PMC3726446 DOI: 10.1186/1756-0381-6-13
Source DB: PubMed Journal: BioData Min ISSN: 1756-0381 Impact factor: 2.522
Figure 1. DNA sequencing. 1st step: The DNA of interest is purified and extracted. 2nd step: Multiple copies of the DNA are created. 3rd step: The DNA is shattered into smaller pieces. 4th step: The DNA fragments are sequenced. 5th step: A computer maps the small pieces to an already known reference genome.
Figure 2. DNA assembly. 1st step: The DNA is purified and extracted. 2nd step: The DNA is fragmented into smaller pieces. 3rd step: The DNA fragments are sequenced. 4th step: A computer matches the overlapping parts of the fragments to get a continuous sequence. 5th step: The whole sequence is reassembled. No prior knowledge about the DNA sequence is necessary.
Figure 3. SNP example. A difference in a single nucleotide between two DNA fragments from different individuals. In this case we say that there are two alleles: C and T.
Figure 4. Structural variations. The basic structural variations: A) Inversion. B) Translocation within the same chromosome. C) Translocation across different chromosomes. D) Duplication. E) Deletion.
Figure 5. PEM signatures. Basic PEM signatures: A) Insertion. B) Deletion. C) Inversion. More PEM signatures are visually presented in [47].
Figure 6. Read depth. A) Fragments of DNA (reads) are mapped to the original reference genome. B) Plot of how often each position of the reference genome is covered by a mapped read.
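The read-depth idea in Figure 6 can be sketched in a few lines of Python. This is a minimal illustration, not the method of any tool reviewed here. It assumes each mapped read is given as a hypothetical `(start, length)` pair in 0-based reference coordinates, and that alignments contain no gaps. The function name `read_depth` is likewise an assumption made for this example.

```python
from typing import Iterable, List, Tuple


def read_depth(reference_length: int,
               alignments: Iterable[Tuple[int, int]]) -> List[int]:
    """Count how many reads cover each reference position.

    `alignments` holds (start, length) pairs, 0-based, gap-free
    (an assumption of this sketch, not a property of real aligners).
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (reference_length + 1)
    for start, length in alignments:
        if start < 0 or start >= reference_length or length <= 0:
            continue  # skip reads that fall outside the reference
        end = min(start + length, reference_length)
        diff[start] += 1
        diff[end] -= 1

    # A running sum of the difference array gives the per-position depth.
    depth = []
    running = 0
    for delta in diff[:reference_length]:
        running += delta
        depth.append(running)
    return depth


# Three overlapping reads on a 10-base reference.
depth = read_depth(10, [(0, 4), (2, 4), (3, 5)])
print(depth)
```

Plotting `depth` against the reference position would give a profile like the one in panel B. Read-depth methods look for regions where this profile is unusually low or high.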
Figure 7. FASTQ file. The 1st line always starts with the symbol ‘@’ followed by the sequence identifier. The 2nd line contains the sequence. The 3rd line starts with the symbol ‘+’, which is optionally followed by the same sequence identifier and any description. It indicates the end of the sequence and the beginning of the quality score string. The 4th line contains the quality score (QS) in ASCII format. The current example shows an Illumina representation.
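The four-line layout described for Figure 7 can be read with a short parser. The sketch below is illustrative only and makes two assumptions: every record occupies exactly four lines (no wrapped sequences), and quality characters are decoded by subtracting an ASCII offset. The default of 33 is an assumption. Some older Illumina pipelines used 64, so `phred_offset` is left as a parameter. The names `FastqRecord` and `parse_fastq` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str
    sequence: str
    quality: list  # one integer quality score per base


def parse_fastq(lines: Iterable[str],
                phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text (assumed layout)."""
    stripped = (line.rstrip("\r\n") for line in lines)
    for header in stripped:
        if not header:
            continue  # tolerate blank lines between records
        sequence = next(stripped, None)
        separator = next(stripped, None)
        quality = next(stripped, None)
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record")
        # Line 1 must start with '@', line 3 with '+'.
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        scores = [ord(char) - phred_offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# A made-up record for illustration.
example = ["@read1 sample description", "ACGT", "+", "IIII"]
records = list(parse_fastq(example))
print(records[0])
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.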
Figure 8. BAM/SAM files. Example of an alignment to the reference sequence (pileup). A) Reads r001/1 and r001/2 constitute a read pair; r003 is a chimeric read; r004 represents a split alignment. B) The corresponding SAM file and the tags for each field.
Figure 9. VCF file. This figure demonstrates an example of a VCF file. A) Different types of variations and polymorphisms that can be stored in VCF format. B) Example of a VCF file and its fields.
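To make the VCF fields concrete, the following sketch splits one tab-delimited VCF data line into its eight fixed columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). It is a simplified illustration rather than a full parser: header lines, per-sample genotype columns and type conversion of INFO values are ignored. The function name `parse_vcf_line` and the example line are assumptions made for this example.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of one VCF data line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < len(VCF_COLUMNS):
        raise ValueError("a VCF data line needs at least 8 columns")
    record = dict(zip(VCF_COLUMNS, fields[:8]))

    record["POS"] = int(record["POS"])        # 1-based position
    record["ALT"] = record["ALT"].split(",")  # several alternate alleles allowed

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info = {}
    for item in record["INFO"].split(";"):
        if not item or item == ".":
            continue
        key, has_value, value = item.partition("=")
        info[key] = value if has_value else True
    record["INFO"] = info
    return record


# Illustrative line: a G>A substitution at position 14370 of chromosome 20.
line = "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;DB"
variant = parse_vcf_line(line)
print(variant["CHROM"], variant["POS"], variant["REF"], variant["ALT"])
```

Keeping `ALT` as a list reflects that a single site can carry more than one alternate allele, which matters when the file stores indels or multi-allelic SNPs.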
Software for predicting structural variations
| BreakDancer [ | | X | | X | X | X | X | X | BAM, SAM | |
| CNV-seq [ | X | | X | X | X | | | | Map locations from a BAM file (by SAMtools) | |
| GASV [ | | X | | X | X | X | X | X | BAM | |
| HyDRa [ | | X | | X | X | X | X | X | Tab-delimited discordant paired-end mappings | |
| MoDIL [ | | X | | X | X | | | | Software specific | |
| MrFast [ | X | | | X | X | | | | FASTA, FASTQ | |
| NovelSeq [ | | X | | X | | | | | Software specific | |
| PEMer [ | | X | | X | X | X | X | X | SVdB API | |
| Pindel [ | | X | X | X | X | | | | BAM, SAM, FASTA, FASTQ | |
| rSW-seq [ | X | | X | X | X | | | | Tab-delimited file denoting the tumor/normal status for each aligned read position | |
| VariationHunter [ | | X | | X | X | X | | | Software specific | |
| VarScan [ | X | | X | X | X | | | | Pileup, VCF | |
Variant annotators
| Annotate-it [ | SNPs, miRNA, Gene, Custom | OMIM, dbSNP, 200 Danish genomes, NHLBI Exomes, 1000 Genomes |
| KGGSeq [ | Indels, SNPs, Gene | dbSNP, 1000 Genomes |
| ANNOVAR [ | Indels, SNPs, miRNAs, Gene, Custom | dbSNP, NHLBI Exomes, 1000 Genomes |
| Anntools [ | Indels, SNPs, miRNAs, Gene, Custom | dbSNP, 1000 Genomes |
| SeqAnt [ | Indels, SNPs, Gene | dbSNP, 1000 Genomes |
| SVA [ | Indels, SNPs, Gene, Custom | OMIM, dbSNP, 1000 Genomes |
| TREAT [ | Indels, SNPs, Gene | OMIM, dbSNP, 1000 Genomes |
| VAAST [ | Indels, SNPs | - |
| VarioWatch [ | SNPs, Gene | OMIM, dbSNP, 1000 Genomes |
| Var-MD [ | SNPs | - |
| VarSifter [ | Indels, SNPs | - |
Alignment tools
| ABySS Explorer [ | ||||
| CLC Genomics workbench | ||||
| EagleView [ | ||||
| Hawkeye [ | ||||
| LookSeq [ | ||||
| MagicViewer [ | ||||
| MapView [ | ||||
Genome browsers
| AnnoJ [ | ||||
| Argo | ||||
| CGView [ | Implemented in Java and it comes with its own API | |||
| Combo [ | ||||
| Ensembl [ | ||||
| GBrowse [ | ||||
| Genome Projector [ | ||||
| IGB [ | ||||
| IGV [ | ||||
| UCSC Cancer Genomics Browser [ | ||||
| UCSC Genome Browser [ | ||||
| X:map [ |
Comparative genomics
| Cinteny [ | Fast identification of syntenic regions | ||
| ggbio [ | Views of particular genomic regions and genome-wide overviews | ||
| GenomeComp [ | A tool for summarizing, parsing and visualizing a genome-wide sequence comparison | ||
| •BLAST output file | |||
| Circos [ | Developed to identify and analyze similarities and differences between larger genomes | ||
| DHPC [ | Visualization of large-scale genome sequences by mapping them into two dimensions using the space-filling function of Hilbert-Peano mapping | ||
| HilbertVis [ | Functions to visualize long vectors of integer data by means of Hilbert curves | ||
| In-GAVsv [ | Detection and visualization of structural variation from paired-end mapping data and detection of larger insertions and complex variants with lower false discovery rate | ||
| Meander [ | Developed mainly to visually discover and explore structural variations in a genome based on read-depth and paired-end information | ||
| MEDEA [ | Genomic feature densities and genome alignments of circular genomes | ||
| MizBee [ | Synteny browser for exploring conservation relationships in comparative genomics data | ||
| Seevolution [ | Interactive 3D environment that enables visualization of diverse genome evolution processes | ||
| •Simultaneous visualization of multiple organisms related by a phylogeny. | |||
| •3D models of circular and linear chromosomes | |||
| Sybil [ | Comparative genome data, with particular emphasis on protein and gene cluster data | ||
| VISTA [ | Global DNA sequence alignments of arbitrary length |