Min He, Thomas N Person, Scott J Hebbring, Ethan Heinzen, Zhan Ye, Steven J Schrodi, Elizabeth W McPherson, Simon M Lin, Peggy L Peissig, Murray H Brilliant, Jason O'Rawe, Reid J Robison, Gholson J Lyon, Kai Wang.
Abstract
BACKGROUND: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. Processing such data can be a significant challenge, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset that efficiently manipulates genome-wide variants, functional annotations and coverage, and that supports family based sequencing data analysis.
Keywords: big data; de novo mutations; inherited homozygous or compound heterozygous mutations; whole-exome sequencing; whole-genome sequencing
Year: 2015 PMID: 25587064 PMCID: PMC4382803 DOI: 10.1136/jmedgenet-2014-102907
Source DB: PubMed Journal: J Med Genet ISSN: 0022-2593 Impact factor: 6.318
Figure 1 The basic framework of SeqHBase. The sequencing data include annotated variants, the genetic variations of every whole-genome sequencing/whole-exome sequencing (WGS/WES) sample, and the coverage (read depth) of each site of every WGS/WES sample. Users can load these three types of sequencing data into HBase by providing CSV files for variants, VCF or vcf.gz files for variations, and BAM or pileup files generated by SAMtools for coverage. SeqHBase then uses a MapReduce model to split the input data set into independent chunks that are processed by the map tasks in a completely parallel manner. Given a pedigree file for analysing a data set, SeqHBase extracts variant, variation and coverage information for each sample using reduce tasks that run in parallel. Finally, SeqHBase uses inheritance information to detect de novo, inherited homozygous or compound heterozygous mutations that may be disease-contributing in trios, nuclear families and/or extended families.
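The flow in Figure 1 can be illustrated with a small in-memory sketch. This is not SeqHBase's code and it uses neither HBase nor Hadoop: the record layout, the function names (`map_record`, `shuffle`, `classify_trio`) and the genotype coding (count of alternate alleles: 0, 1 or 2) are all assumptions made for illustration. A sample with no record at a site is treated as homozygous reference here, which a real analysis would confirm against the coverage data.

```python
from collections import defaultdict

# Hypothetical records: (sample_id, chrom, pos, ref, alt, alt_allele_count)
RECORDS = [
    ("child", "1", 1000, "A", "G", 1),
    ("child", "2", 2000, "C", "T", 2),
    ("father", "2", 2000, "C", "T", 1),
    ("mother", "2", 2000, "C", "T", 1),
    ("child", "3", 3000, "G", "A", 1),
    ("mother", "3", 3000, "G", "A", 1),
]


def map_record(record):
    """Map step: key each record by its variant so samples can be grouped."""
    sample, chrom, pos, ref, alt, count = record
    return (chrom, pos, ref, alt), (sample, count)


def shuffle(pairs):
    """Group mapped pairs by variant key, as a MapReduce shuffle would."""
    grouped = defaultdict(dict)
    for key, (sample, count) in pairs:
        grouped[key][sample] = count
    return grouped


def classify_trio(genotypes, child="child", father="father", mother="mother"):
    """Reduce step: label one variant using the trio's genotypes.

    Assumption: a missing genotype means homozygous reference (0).
    """
    c = genotypes.get(child, 0)
    f = genotypes.get(father, 0)
    m = genotypes.get(mother, 0)
    if c >= 1 and f == 0 and m == 0:
        return "de novo"
    if c == 2 and f == 1 and m == 1:
        return "inherited homozygous"
    return "other"


def run(records):
    grouped = shuffle(map_record(r) for r in records)
    return {key: classify_trio(gts) for key, gts in grouped.items()}


results = run(RECORDS)
for variant, label in sorted(results.items()):
    print(variant, label)
```

Compound heterozygous detection is left out of the sketch because it requires grouping variants by gene and checking that the two heterozygous variants come from different parents.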
Extracted information from three types of input files
| Data source | Data type | Extracted information |
|---|---|---|
| Annotated variant files | Annotation | Chromosome, start position, end position, reference allele, alternative allele, allele frequency in the 1000 Genomes Project and the NHLBI-ESP6500 project, ClinVar, functional prediction scores (such as SIFT, PolyPhen and CADD) and many others |
| VCF files | Variation | Sample family ID, individual ID, called variant genotypes, read depths and Phred quality scores |
| BAM files | Coverage (read depth) | Coverage of each site of every sequencing sample (∼3 billion sites in a WGS) |
WGS, whole-genome sequencing.
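As a rough illustration of the VCF row in the table above, the following sketch pulls per-sample genotype, read depth and Phred-scaled genotype quality out of one VCF data line. It assumes the standard tab-separated VCF column order with a `FORMAT` column containing `GT`, `DP` and `GQ`; the example header and line are made up, and `parse_vcf_line` is a hypothetical helper, not part of SeqHBase. Family and individual IDs are not stored in VCF data lines, so only the sample name from the header is returned.

```python
# Made-up header and data line following the standard VCF column order.
VCF_HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
VCF_LINE = "1\t12345\t.\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:35:99\t0/0:28:60"


def to_int(value):
    """Convert a VCF field to int, mapping missing values ('.' or None) to None."""
    if value is None or value == ".":
        return None
    return int(value)


def parse_vcf_line(header, line):
    """Return one dict per sample with genotype, depth and quality."""
    sample_names = header.rstrip("\n").split("\t")[9:]
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    format_keys = fields[8].split(":")
    rows = []
    for name, sample_field in zip(sample_names, fields[9:]):
        # Pair each FORMAT key with this sample's value at the same position.
        values = dict(zip(format_keys, sample_field.split(":")))
        rows.append({
            "sample": name,
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alt,
            "genotype": values.get("GT"),
            "depth": to_int(values.get("DP")),
            "quality": to_int(values.get("GQ")),
        })
    return rows


rows = parse_vcf_line(VCF_HEADER, VCF_LINE)
for row in rows:
    print(row)
```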
Figure 2 Description of the three families used in our benchmarking study. (1) Family 1 is a five-member nuclear family in which the affected individual has Rodriguez syndrome. One plausible de novo mutation and one possible compound heterozygous mutation were detected. (2) Family 2 is a four-member nuclear family in which the affected individual has idiopathic haemolytic anaemia. One plausible gene with compound heterozygous mutations was detected. (3) Family 3 is a 10-member extended family spanning three generations in which the two affected individuals have an undiagnosed disease manifesting with intellectual disability, autism, attention deficit hyperactivity disorder and other symptoms. An X linked de novo mutation was detected in the mother of the two affected individuals, both of whom inherited the mutation.
Figure 3 SeqHBase running time in seconds when run on different numbers of data nodes. The data set for Family 1 in figure 2 was used to evaluate the performance of SeqHBase. Each data node was configured with 6 GB memory, two CPUs (2.6 GHz) and 1 TB hard disk space. Note that the performance of SeqHBase on a single data node was not evaluated because one virtual machine lacked the disk space needed to manipulate five WGS data sets.
Brief results of family based sequencing data analysis*
| | Family 1 | Family 2 | Family 3 |
|---|---|---|---|
| Phenotype(s) | Rodriguez syndrome | Idiopathic haemolytic anaemia | Severe intellectual disability, autistic behaviours, attention deficit hyperactivity disorder and very distinctive facial features |
| Sequencing type | WGS | WES | WGS |
| Family members | 5 | 4 | 10 |
| # of affected(s) | 1 | 1 | 2 |
| # of de novo | 6 | 16 | 18 |
| # of autosomal recessive | 1 | 0 | 1 |
| # of X linked | 0 | 0 | 1 |
| # of comp het | 2 | 2 | 2 |
| Likely disease-contributing gene |
*Analysis criteria: variants with coverage of ≥20× for every individual, variant frequencies (minor allele frequency, MAF) ≤0.01 in the 1000 Genomes Project and ESP6500 populations, and variants that were annotated as being non-synonymous, stop-gain, stop-loss, splicing or frame-shift changes. Results were obtained following the filtering processes.
WES, whole-exome sequencing; WGS, whole-genome sequencing.
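The analysis criteria in the table footnote can be expressed as a simple predicate. This is a minimal sketch under stated assumptions: the dictionary keys (`maf_1000g`, `maf_esp6500`, `function`) and the function name `passes_filters` are hypothetical, and a variant with no recorded frequency in a database is treated as passing the MAF threshold, which the source does not specify.

```python
MIN_COVERAGE = 20   # minimum read depth required for every individual
MAX_MAF = 0.01      # maximum minor allele frequency in each population database
FUNCTIONAL_CLASSES = {
    "non-synonymous", "stop-gain", "stop-loss", "splicing", "frame-shift",
}


def passes_filters(variant, coverage_by_sample):
    """Return True if a variant meets the coverage, MAF and function criteria.

    variant: dict with hypothetical keys 'maf_1000g', 'maf_esp6500', 'function'.
    coverage_by_sample: dict mapping sample ID to read depth at the site.
    """
    # Every individual must be covered at >= 20x.
    if any(depth < MIN_COVERAGE for depth in coverage_by_sample.values()):
        return False
    # Assumption: a missing frequency (None) counts as rare and passes.
    for key in ("maf_1000g", "maf_esp6500"):
        maf = variant.get(key)
        if maf is not None and maf > MAX_MAF:
            return False
    return variant.get("function") in FUNCTIONAL_CLASSES


rare_missense = {"maf_1000g": 0.002, "maf_esp6500": None, "function": "non-synonymous"}
common_missense = {"maf_1000g": 0.15, "maf_esp6500": 0.12, "function": "non-synonymous"}
rare_synonymous = {"maf_1000g": 0.001, "maf_esp6500": 0.001, "function": "synonymous"}
good_coverage = {"child": 32, "father": 25, "mother": 41}
low_coverage = {"child": 32, "father": 12, "mother": 41}

print(passes_filters(rare_missense, good_coverage))
print(passes_filters(common_missense, good_coverage))
```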