Gabriel A Leiva-Torres, Nestor Nebesio, Silvia M Vidal.
Abstract
The clinical course of any viral infection differs greatly among individuals. This variation results from various viral, host, and environmental factors. The identification of host genetic factors influencing inter-individual variation in susceptibility to several pathogenic viruses has tremendously increased our understanding of the mechanisms and pathways required for immunity. Next-generation sequencing of whole exomes represents a powerful tool in biomedical research. In this chapter, we briefly introduce whole-exome sequencing in the context of genetic approaches to identify host susceptibility genes to viral infections. We then describe general aspects of the workflow for whole-exome sequence analysis, together with the tools and online resources that can be used to identify and annotate variant calls and then prioritize them for their potential association with phenotypes of interest.
Keywords: Antiviral immunity; Exome; Gene annotation; Host genetics; Read depth; Sequence alignment; Variant annotation; Variant calling; Whole-exome sequencing
Year: 2017 PMID: 28808973 PMCID: PMC7120756 DOI: 10.1007/978-1-4939-7237-1_14
Source DB: PubMed Journal: Methods Mol Biol ISSN: 1064-3745
Definition of terms (in alphabetic order)
| Term | Meaning |
|---|---|
| Haplotype | A set of alleles that commonly segregate together and are defined as regions of extended linkage disequilibrium, which in humans is often up to 100 kb in length. |
| Indel | Insertions and deletions in a genome; the second most common type of variation after SNPs. |
| Minor allele frequency (MAF) | The frequency at which the second most common allele occurs in a given population. |
| Penetrance | The proportion of individuals with a mutation or risk variant who have the disease. Penetrance is incomplete when some individuals carrying a pathogenic mutation manifest no disease phenotype. |
| Rare allele | Allele present with MAF <1% (PMID: 19293820) |
| SNP | Single nucleotide polymorphism. Variation of a single nucleotide base, with the minor allele present in at least 1% of alleles in the population. |
| SNV | Single nucleotide variant. Minor allele frequency undefined. |
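
The MAF and rare-allele definitions above can be made concrete with a short calculation. The following is a minimal sketch, not part of the chapter: the function names and the allele counts in the example are hypothetical, and the 1% cut-off is taken from the table.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Return the frequency of the second most common allele.

    `alleles` is a list with one entry per observed chromosome
    (two entries per diploid individual). Returns 0.0 when fewer
    than two distinct alleles are observed.
    """
    counts = Counter(alleles)
    if len(counts) < 2:
        return 0.0
    # most_common(2) is sorted by count; index 1 is the second most common allele.
    second_count = counts.most_common(2)[1][1]
    return second_count / len(alleles)


def classify_by_maf(maf, rare_threshold=0.01):
    """Label a site using the table's 1% threshold (rare allele: MAF < 1%)."""
    if maf == 0.0:
        return "monomorphic"
    return "rare" if maf < rare_threshold else "common"


# Made-up counts: 100 diploid individuals give 200 observed alleles.
site_a = ["A"] * 199 + ["G"] * 1   # MAF = 1/200
site_b = ["A"] * 197 + ["G"] * 3   # MAF = 3/200

for name, site in (("site_a", site_a), ("site_b", site_b)):
    maf = minor_allele_frequency(site)
    print(name, maf, classify_by_maf(maf))
```

In this sketch `site_a` falls below the 1% threshold and is labelled rare, while `site_b` meets the frequency criterion used in the SNP definition.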
Commonly used tools and weblinks for the whole-exome sequence data analysis pipeline
| Tool | Weblink |
|---|---|
| Ensembl | |
| UCSC | http://genome.ucsc.edu |
| | |
| FastQC | http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ |
| | |
| Bowtie | |
| Bfast | http://bfast.sourceforge.net |
| Mosaik | https://github.com/wanpinglee/MOSAIK |
| BWA | http://bio-bwa.sourceforge.net/ |
| | |
| Picard tools | |
| SAMTools | |
| | |
| GATK | |
| SAMTools | |
| | |
| SnpEff | |
| VEP | |
| SIFT | |
| PolyPhen2 | |
| | |
| PhyloP | |
| GERP++ | |
| CADD | |
| | |
| MSC | |
| GAVIN | |
| | |
| ANNOVAR | |
| | |
| HGPS | |
| KEGG | www.genome.jp/kegg/ |
| REACTOME | |
| MPO | |
| GEO | |
| GXA | |
| BioGPS | |
| STRING | |
| ToppGene | |
| GeneMania | |
Fig. 1 Basic workflow and tools for a whole-exome sequencing project. Following sequencing, reads undergo quality assessment and alignment against a reference genome, followed by variant identification. The detected variants are annotated to infer their biological relevance. Variants are then filtered based on read quality and frequency in the population, and prioritized based on the genetic hypothesis for the trait under study and knowledge about the candidate gene/protein. Ultimately, experimental validation is required to confirm the discovered variants. The output file formats are indicated on the right.
Description of commonly used file formats in WES workflows
| Format | Characteristics |
|---|---|
| FASTQ file (.fastq) | Text file that stores nucleotide sequences and their quality scores for downstream analysis. Each record typically spans four lines: (1) a sequence identifier beginning with “@”; (2) the nucleotide sequence of the read (A, C, G, T); (3) a separator line beginning with “+”, optionally followed by the sequence identifier; (4) the quality scores of the corresponding read, encoded as ASCII characters. |
| Sequence alignment/map (SAM) file (.sam) | Text file that stores the alignment of short reads to a reference genome. The SAM file contains header lines beginning with “@”, followed by one line per sequence alignment. |
| Binary alignment/map (BAM) file (.bam) | Binary file (machine-readable only) containing the same information as the SAM file, compressed to reduce disk usage and improve processing performance. |
| Browser extensible data (BED) file (.bed) | Tab-delimited text file that consists of several lines each representing a single genomic region, such as an exon. BED files provide the coordinates of those regions including chromosome, start and end positions, and additional fields can be added. |
| Variant call format (VCF) file (.vcf) | Text file containing meta-information lines (e.g., file format, date, or other information about the overall experiment), a header line naming the columns (chromosome, position, ID, reference allele, alternative allele, quality, filter, info), and then data lines, each containing information about a position in the genome. It is a standardized text file format for representing SNP, indel, and structural variation calls. |
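
The four-line FASTQ record layout described in the table can be illustrated with a small parser. This is a minimal sketch rather than a replacement for established tools such as FastQC. It assumes the Phred+33 quality encoding (ASCII code minus 33), which the table does not specify, and it assumes unwrapped records of exactly four lines each. The function name `read_fastq` and the example read are made up for illustration.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_scores) for each four-line record.

    Assumes Phred+33 encoding and no blank lines between records;
    parsing stops at end of file or at the first empty line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        # Decode each ASCII character into a numeric Phred quality score.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores


# Made-up single-read example: 'I' -> 40, '5' -> 20, '!' -> 0 under Phred+33.
example = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(io.StringIO(example)))
print(records)  # [('read1', 'ACGT', [40, 40, 20, 0])]
```

Passing an open file object instead of `io.StringIO` would process a real file in the same way, one record at a time.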
Databases of human genetic variation
| Database | Weblink and description |
|---|---|
| Combined annotation dependent depletion database (CADD) | Catalog of precomputed scores for all possible SNPs or small indels of the reference genome and the 1000 Genomes, obtained by combining 63 annotations (e.g., SIFT, GERP, others) through a machine-learning framework. |
| Single nucleotide polymorphism database (dbSNP) | Broad collection of SNPs and indels submitted by investigators worldwide and curated by NCBI. |
| Human gene mutation database (HGMD) | Catalog of all published gene lesions responsible for human inherited disease. |
| Exome aggregation consortium (ExAC) | Catalog of exome variation in 60,706 individuals, some with adult-onset diseases (type 2 diabetes, schizophrenia); patients presenting with severe pediatric diseases have been excluded. |
| 1000 Genomes project | Catalog of genome variation with at least 1% frequency in the population, based on whole-genome sequencing of 2504 individuals from 26 populations (including study cohorts for adult-onset diseases). |
| NHLBI exome sequencing project (ESP6500) | Catalog of variation within 6500 exomes from well-phenotyped populations from various projects, e.g., Severe Asthma Research Project; Pulmonary Arterial Hypertension population; Acute Lung Injury cohort; Cystic Fibrosis cohort. |
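
Population frequencies from databases such as these are typically used in the filtering step of the workflow, where variants are retained based on call quality and frequency in the population. The sketch below shows one way this could look on VCF data lines. It is illustrative only: the INFO key `POP_AF`, the thresholds, the function names, and the four example variants are all hypothetical, and in practice the frequency annotation would be added by an annotation tool before this step.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its eight fixed columns."""
    chrom, pos, var_id, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
    info_dict = {}
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        info_dict[key] = value
    return {
        "chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref, "alt": alt,
        "qual": None if qual == "." else float(qual),
        "filter": flt, "info": info_dict,
    }


def keep_variant(variant, min_qual=30.0, max_pop_af=0.01, af_key="POP_AF"):
    """Keep well-supported variants that are rare in the reference population.

    `af_key` is a hypothetical INFO key holding an annotated population
    allele frequency; variants without it are treated as absent from the
    database (frequency 0), which is an assumption of this sketch.
    """
    if variant["qual"] is None or variant["qual"] < min_qual:
        return False
    if variant["filter"] not in ("PASS", "."):
        return False
    pop_af = float(variant["info"].get(af_key, 0.0))
    return pop_af < max_pop_af


def filter_vcf(lines, **thresholds):
    """Return parsed variants passing the filters, skipping '#' header lines."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        variant = parse_vcf_line(line)
        if keep_variant(variant, **thresholds):
            kept.append(variant)
    return kept


# Made-up VCF content for illustration only.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]),
    "\t".join(["1", "1000", ".", "A", "G", "50", "PASS", "POP_AF=0.0005"]),
    "\t".join(["1", "2000", ".", "C", "T", "60", "PASS", "POP_AF=0.25"]),
    "\t".join(["2", "3000", ".", "G", "A", "10", "PASS", "POP_AF=0.0001"]),
    "\t".join(["3", "4000", ".", "T", "C", "45", "PASS", "DP=20"]),
]

kept = filter_vcf(vcf_lines)
print([(v["chrom"], v["pos"]) for v in kept])  # [('1', 1000), ('3', 4000)]
```

The variant at position 2000 is dropped as too common and the one at position 3000 as low quality. Appropriate thresholds depend on the genetic hypothesis for the trait under study, so the defaults here should be read as placeholders.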