| Literature DB >> 26106619 |
A Mesut Erzurumluoglu1, Santiago Rodriguez1, Hashem A Shihab2, Denis Baird1, Tom G Richardson3, Ian N M Day1, Tom R Gaunt3.
Abstract
Recent technological advances have created challenges for geneticists and a need to adapt to a wide range of new bioinformatics tools and an expanding wealth of publicly available data (e.g., mutation databases and software). The wide range of methods and the diversity of file formats used in sequence analysis are a significant issue, with a considerable amount of time spent before anyone can even attempt to analyse the genetic basis of human disorders. Another point to consider is that although many possess "just enough" knowledge to analyse their data, they do not make full use of the tools and databases that are available and also do not fully understand how their data were created. The primary aim of this review is to document some of the key approaches and provide an analysis schema to make the analysis process more efficient and reliable in the context of discovering highly penetrant causal mutations/genes. This review will also compare the methods used to identify highly penetrant variants when data are obtained from consanguineous individuals as opposed to nonconsanguineous ones, and when Mendelian disorders are analysed as opposed to common-complex disorders.
Year: 2015 PMID: 26106619 PMCID: PMC4461748 DOI: 10.1155/2015/923491
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Figure 1 Steps in whole-exome sequencing. Understanding how the VCF file was created is important, as it can indicate where something may have gone wrong. The stages proceed from top to bottom, and we have proposed “consideration points” for each step (below each title).
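The top-to-bottom stages of Figure 1 can be sketched as the shell commands each step would run. This is a minimal sketch assuming a BWA/SAMtools/BCFtools toolchain (one of several options discussed in the tables below); the file names are placeholders, and a real run needs an indexed reference FASTA and FASTQ reads.

```python
# Minimal sketch of a read-alignment -> variant-calling pipeline (cf. Figure 1),
# expressed as the shell commands each stage would run. File names are
# placeholders, and the BWA/SAMtools/BCFtools toolchain is one choice among
# the aligners and variant callers listed in the tables below.

def wes_pipeline(ref, fq1, fq2, sample):
    """Return the ordered commands taking raw paired-end reads to a VCF."""
    bam = f"{sample}.sorted.bam"
    vcf = f"{sample}.vcf"
    return [
        # 1. Align reads to the reference genome (consideration point:
        #    the aligner chosen affects mapping-quality annotations).
        f"bwa mem {ref} {fq1} {fq2} > {sample}.sam",
        # 2. Coordinate-sort and index the alignments.
        f"samtools sort -o {bam} {sample}.sam",
        f"samtools index {bam}",
        # 3. Pile up bases and call variants into a VCF.
        f"bcftools mpileup -f {ref} {bam} | bcftools call -mv -o {vcf}",
    ]

steps = wes_pipeline("hg19.fa", "reads_1.fq", "reads_2.fq", "proband")
```

Keeping the commands explicit like this makes it easier to answer the caption's question — where something may have gone wrong — because each intermediate file can be inspected.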
Tools for aligning reads to a reference genome.
| Name | References | Comment |
|---|---|---|
| BFAST | [ | These aligners use similar algorithms to determine contiguous sequences; however, MAQ and BWA are widely used and have been praised for their computational efficiency and multiplatform compatibility [ |
| BWA | [ | |
These are some of the many tools built for aligning reads produced by high-throughput sequencing. Some have made speed their main purpose, whereas others have paid more attention to annotating the files produced (such as with mapping quality).
Tools for identifying variation from a reference genome using NGS reads.
| Name | References | URL | Comment |
|---|---|---|---|
| GATK | [ | | (i) Arguably the most established genome analysis toolkit |
| QCALL | [ | | (i) Theoretically calls “high quality” SNPs even from low-coverage sequencing data |
| PyroBayes | [ | | (i) Theoretically makes “confident” base calls even in shallow read coverage for reads produced by pyrosequencing machines |
| SAMTools | [ | | (i) Computes genotype likelihoods |
| SOAPsnp | [ | | (i) Part of the reliable SOAP family of bioinformatics tools |
| Control-FREEC | [ | | (i) Identifies copy number variations (CNVs) between cases and controls from sequencing data |
| Atlas2 | [ | | (i) Calls SNPs and indels for WES data |
GATK, SOAPsnp, and SAMTools have consistently been cited in large genetic association projects, indicating their ease of use, reliability, and functionality. However, this is also helped by the fact that they offer additional features. Other tools such as Beagle [68], IMPUTE2 [86], and MaCH [87] have modules for SNP and genotype calling but are mostly used for their primary purposes, such as imputation and haplotype phasing.
Figure 2 Post-VCF file procedures (example for sequencing data). Every step here can be automated through pipelines and bioinformatics tools. Whilst performing the steps listed above, one must always bear in mind the assumptions behind the procedures. Where feasible, ranking of rare SNVs is advised over filtering, as it allows the researcher to observe all variants as a continuum from most likely to least likely.
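The "ranking over filtering" idea can be sketched in a few lines. The variant records and the `score` field below are illustrative stand-ins for real VCF annotations (e.g., a predictor score from Table 3), not a real VCF parser.

```python
# Sketch of "ranking over filtering" for rare SNVs (cf. Figure 2): rather than
# discarding variants below a hard prediction-score threshold, sort them so
# every rare variant remains visible as a continuum from most to least likely
# causal. Records and the 'score' field are illustrative only.

variants = [
    {"id": "rs_a", "maf": 0.0001, "score": 0.91},  # rare, predicted damaging
    {"id": "rs_b", "maf": 0.0400, "score": 0.88},  # too common for a rare disorder
    {"id": "rs_c", "maf": 0.0002, "score": 0.35},  # rare but predicted benign
]

def rank_rare(variants, maf_cutoff=0.01):
    """Keep rare variants but rank (rather than drop) them by prediction score."""
    rare = [v for v in variants if v["maf"] < maf_cutoff]
    return sorted(rare, key=lambda v: v["score"], reverse=True)

ranking = rank_rare(variants)
```

The low-scoring rare variant stays at the bottom of the list instead of disappearing, so a wrong predictor call does not silently remove the true causal variant.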
Tools for predicting variant effects, identifying neutral and pathogenic mutations.
| Name | Reference | MCC | Comments |
|---|---|---|---|
| *SIFT | [ | 0.30 (unweighted) | It is highly cited, with many projects using and citing it since 2001; uses available evolutionary information and is continually updated; is easy to use through VEP; and provides two classifications: “deleterious” and “tolerated.” |
| *PolyPhen-2 | [ | 0.43 | It provides a high quality multiple sequence alignment pipeline and is optimized for high-throughput analysis of NGS data, is cited and used by many projects of different types, is easy to use through VEP, and provides three classifications: “probably damaging,” “possibly damaging,” and “benign.” |
| *FATHMM | [ | 0.72 | It is a high performing prediction tool. Clear examples are available on the website. It offers flexibility to the user for weighted (trained using inherited disease causing mutations) and unweighted (conservation-based approach) predictions, offers protein domain-phenotype association information, and has options for cancer-specific predictions (FATHMM-Cancer) and predictions for noncoding variants (FATHMM-MKL). |
| GERP++ (and GERP) | [ | N/A | It determines constrained elements within the human genome; variants within them are therefore likely to induce functional changes. It can provide unique details about the candidate variant(s). |
| PhyloP | [ | N/A | It helps detect nonneutral substitutions, with a similar aim to GERP. |
| CADD | [ | — | It provides annotation and scores for all variants in the genome, considering a wide range of biological features. |
| GWAVA | [ | — | It provides predictions for variants in the noncoding part of the genome. |
| *SNAP | [ | 0.47 | It predicts the effects of nonsynonymous polymorphisms, is widely cited and used, and should be used to check whether the predicted effect is matched by the putative causal variant. However, it was labelled “too slow” for high-throughput analyses by [ |
| PupaSuite | [ | — | It identifies functional SNPs using the SNPeffect [ |
| Mutation Assessor-2 | [ | — | It predicts the impact of protein mutations, has a user-friendly website, and accepts many formats. |
| *PANTHER | [ | 0.53 (unweighted) | It predicts the effect of an amino acid change based on protein evolutionary relationships. It provides a number ranging from 0 (neutral) to −10 (most likely deleterious) and allows the user to decide on the “deleteriousness” threshold. It is constantly updated, making it a very reliable tool. |
| CONDEL-2 | [ | — | It combines FATHMM and Mutation Assessor (as of version 2) in order to improve prediction. It theoretically outperforms the individual tools it combines. |
| *MutPred | [ | 0.63 | It predicts whether a missense mutation is harmful based on a variety of features such as sequence conservation, protein structure, and functional annotations, and is praised in a recent comparative study by [ |
| *SNPs&GO | [ | 0.65 | It is reported to have performed best amongst many prediction tools in [ |
| Human Splicing Finder | [ | N/A | It predicts the effect of noncoding variants in terms of alteration of splicing. Useful for compound heterozygotes if one allele is intronic. |
| Others | [ | 0.19 | *nsSNPAnalyzer (requires 3D structure coordinates), *PhD-SNP, *PolyPhen (no longer supported), and PMUT |
Many methods have been developed to predict the functional effects of variants in the genome, and the tools listed above use different features and datasets to do so. This is not an exhaustive list of all prediction tools but a collection of the most used/cited ones.
*Comprehensive information about the prediction tools, including accuracy, specificity, and sensitivity, is available in [43, 46]. N/A: not applicable. MCC: Matthews correlation coefficient. MCCs are obtained from [43].
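The MCC values in the table summarise each predictor's performance on a binary (pathogenic vs. neutral) benchmark. For reference, the Matthews correlation coefficient computed from a 2×2 confusion matrix:

```python
import math

# Matthews correlation coefficient from a binary confusion matrix:
# +1 = perfect prediction, 0 = no better than random, -1 = total disagreement.
# Unlike raw accuracy, MCC stays informative when the pathogenic and neutral
# classes are imbalanced, which is why it is used to compare these predictors.

def mcc(tp, tn, fp, fn):
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```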
Figure 3 Finding “the one” in Mendelian disorders: searching for the causal variant (using a WES example). After potentially causal variants are identified, one must put into practice what past literature suggests about the disorder and decide which path in Figure 3 to follow. Familial (very rare) disorders are more likely to follow a recessive mode of inheritance; thus family data are crucial (to rule out the possibility of de novo mutations), and it is important to include as many family members as possible. For common Mendelian disorders following a recessive inheritance model, the possibility of compound heterozygotes should be taken into account when fitting the data to a recessive model. Finally, functional postanalysis of candidate variant(s), especially in mouse knockouts, can be crucial. This figure serves as an example and by no means reflects an exhaustive model; there are alternative routes researchers can take to identify Mendelian causal variants. ∗If the family is consanguineous, identify regions with long runs of homozygosity (LRoH) for each individual and, amongst these regions, those shared by the affected but not the unaffected individuals.
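The starred LRoH step can be sketched directly: find runs of homozygous calls per individual, then keep runs present in the affected but absent from the unaffected. The "hom"/"het" encoding and the minimum run length are illustrative choices, not part of any specific tool's output.

```python
# Sketch of the LRoH logic from Figure 3: given per-sample genotype calls
# ("hom"/"het") at ordered sites, find runs of homozygosity, then keep runs
# shared by affected but not unaffected family members. Encoding and the
# min_len threshold are illustrative; real LRoH calling works on genomic
# coordinates and tolerates occasional genotyping errors.

def homozygous_runs(genotypes, min_len=3):
    """Return (start, end) index ranges of homozygous runs of >= min_len sites."""
    runs, start = [], None
    for i, g in enumerate(genotypes + ["het"]):  # sentinel closes a final run
        if g == "hom" and start is None:
            start = i
        elif g != "hom" and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs

affected = ["hom", "hom", "hom", "hom", "het", "hom"]
unaffected = ["het", "hom", "hom", "het", "het", "hom"]
shared = set(homozygous_runs(affected)) - set(homozygous_runs(unaffected))
```

In this toy family, only the run covering the first four sites survives, which is exactly the region the figure says to prioritise in a consanguineous pedigree.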
Figure 5 Filtering steps applied to all mutations in the exome (primary ciliary dyskinesia example). After all the filtering steps in the figure are applied, the total is reduced to a single candidate. The numbers here are for illustration purposes only (adapted from [39]). The homozygosity step is added because PCD is an autosomal recessive disorder. Φ mutations are “predicted high impact” mutations as proposed by Alsaadi et al. [39] (see PHI_SO_terms.txt in Supplementary data).
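The successive-filter logic of Figure 5 can be sketched with made-up variant records: each step narrows the candidate list, and the homozygosity filter encodes the autosomal recessive model. The gene names, field names, and thresholds below are illustrative only, not the actual PCD data from [39].

```python
# Sketch of the sequential exome-filtering scheme of Figure 5 on toy records.
# Filters are applied in order; each one shrinks the candidate set, mirroring
# the funnel in the figure. All values are illustrative.

variants = [
    {"gene": "DNAH5", "maf": 0.0001, "impact": "high", "genotype": "hom"},
    {"gene": "GENE2", "maf": 0.2000, "impact": "high", "genotype": "hom"},
    {"gene": "GENE3", "maf": 0.0003, "impact": "low",  "genotype": "hom"},
    {"gene": "GENE4", "maf": 0.0002, "impact": "high", "genotype": "het"},
]

filters = [
    lambda v: v["maf"] < 0.01,        # rare in population variation databases
    lambda v: v["impact"] == "high",  # "predicted high impact" (Φ) mutations
    lambda v: v["genotype"] == "hom", # autosomal recessive inheritance model
]

candidates = variants
for f in filters:
    candidates = [v for v in candidates if f(v)]
```

Each lambda is one row of the funnel; in this toy example, a single candidate survives, as in the figure.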
Figure 4 Summary of the whole analysis process, from DNA sample to identification of the variant. The tools mentioned here are the ones we prefer to use for reasons such as user-friendly documentation, ease of use, performance, multiplatform compatibility, and speed. See Supplementary Material and Methods for examples of parameters/commands to use where applicable.
What is needed for a genetic study?
| Material | Notes |
|---|---|
| “Sufficient” number of high-quality sequencing/genotype data | Amount needed can vary from one proband and a few family members (for very rare Mendelian disorders) to thousands of cases and controls (for certain common complex disorders/traits) |
| List of candidate genes | Websites such as |
| Identification of variant calling tool | Such as in |
| Identification of variant effect predictor tool | Such as in |
| Knowledge of human population variation databases | That is, HapMap, 1000 Genomes Project, EVS, dbSNP, and internal databases |
| Knowledge of databases storing information about genes and their products | That is, OMIM, Gene (NCBI), GeneCards, UniGene (NCBI), GEO Profiles (NCBI), HomoloGene (NCBI), and mouse knockout databases (such as |
The most important factors when carrying out a genetic association study are (i) the availability of reliable data, (ii) bioinformatics and biological expertise, and (iii) careful planning.