
SpeedSeq: ultra-fast personal genome analysis and interpretation.

Colby Chiang^1,2, Ryan M Layer^3,4, Gregory G Faust^2, Michael R Lindberg^2, David B Rose^2, Erik P Garrison^5, Gabor T Marth^3,4, Aaron R Quinlan^3,4, Ira M Hall^1,6.

Abstract

SpeedSeq is an open-source genome analysis platform that accomplishes alignment, variant detection and functional annotation of a 50× human genome in 13 h on a low-cost server and alleviates a bioinformatics bottleneck that typically demands weeks of computation with extensive hands-on expert involvement. SpeedSeq offers performance competitive with or superior to current methods for detecting germline and somatic single-nucleotide variants, structural variants, insertions and deletions, and it includes novel functionality for streamlined interpretation.


Year: 2015 PMID: 26258291 PMCID: PMC4589466 DOI: 10.1038/nmeth.3505

Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547

Technical advances in second-generation DNA sequencing technologies have reduced both the cost and time required to generate whole-genome sequencing (WGS) data, creating opportunities in healthcare and academic research to survey the human genome with unprecedented depth and scope. However, bottlenecks in computational processing and variant interpretation have hindered adoption of these technologies for time-sensitive and large-scale projects. Using a standard pipeline based on BWA[1], GATK[2], SAMtools[3], and Picard, processing a 50× human genome from raw sequence data to variant calls on a 32-thread server requires 60-70 hours (Supplementary Note 1). Furthermore, distinguishing pathogenic from benign mutations is a labor-intensive process that can take hours or days of manual curation per patient[4]. SpeedSeq is a suite of open-source software designed for rapid whole-genome variant detection and interpretation. Its modular architecture and universal formats confer adaptability to a variety of experimental designs and compatibility with standard industry software (Fig. 1a). It achieves superior processing efficiency through rapid duplicate marking with SAMBLASTER[5], through balanced parallelization of external applications, and by executing non-dependent pipeline components simultaneously. SpeedSeq translates raw 50× WGS data into prioritized single nucleotide variants (SNVs), short insertions and deletions (indels), and structural variants (SVs) in 13 hours on a single 16-core server with 128 GB of RAM (current cost: <$10,000), and our accelerated implementations show little to no difference compared to original software (Supplementary Note 1). This represents at minimum a several-fold speed increase over current practices using typical computing resources.

Figure 1

SpeedSeq workflow

(a) SpeedSeq converts raw reads into prioritized variants in 13 hours for a 50× human dataset. (b) Germline SNV (N=2,803,144) and (c) indel (N=364,031) receiver operating characteristic (ROC) curves over the Genome in a Bottle (GIAB) truth set for SpeedSeq, GATK Unified Genotyper (GATK-UG) and Haplotype Caller (GATK-HC). (d) Somatic SNV detection ROC curves for a simulated 50× tumor-normal pair using SpeedSeq and three other tools (N=875,206). Open circles (b-d) denote the data points reported in the main text. (e) SpeedSeq’s SV detection performance by quality score of all SVs (black), those with split-read and paired-end support (blue), and those with read-depth support from CNVnator (red), as validated by either PacBio/Moleculo long-reads or the 1000 Genomes Project. (f) Schematic of haplotype-based SV validation showing undetected (open circles), consistently segregating (black circles), and inconsistently segregating (red circles) SVs through the CEPH 1463 pedigree.

We assessed the accuracy of SpeedSeq’s SNV and indel calls against the Genome in a Bottle Consortium (GIAB) truth set in the well-studied human sample, NA12878 (2,803,144 SNVs and 364,031 indels)[6]. SpeedSeq achieved sensitivities of 99.9% and 89.9% for germline SNVs and indels respectively, with acceptably low false discovery rates (FDR) (0.4% and 1.1%) (Fig. 1b,c). These detection rates exceeded the GATK Unified Genotyper’s (GATK-UG) sensitivity (SNVs: 99.7%, indels: 89.0%) with a similar FDR (SNVs: 0.5%, indels: 1.0%). The GATK Haplotype Caller (GATK-HC) showed superior indel detection sensitivity (SNVs: 99.8%, indels: 95.7%) with lower FDR for both variant types (SNVs: 0.2%, indels: 0.6%). SpeedSeq’s implementation of FreeBayes therefore exhibits comparable – albeit slightly inferior – performance to GATK-HC when tested on the GIAB callset[7]. However, the GIAB truth set is biased towards GATK because it was primarily derived from GATK-based analyses. We therefore assessed SpeedSeq’s performance against an unbiased truth set of 689,788 SNVs at 2,177,040 sites (Illumina Omni 2.5) in which SpeedSeq attained the highest sensitivity at the minor expense of specificity compared to GATK-UG or GATK-HC (Supplementary Fig. 1). Miscalled variants were enriched in repetitive regions of the genome and adjacent to assembly gaps (Supplementary Note 2 and Supplementary Table 1). SpeedSeq also supports joint multi-sample variant calling and de novo germline mutation detection in families (Supplementary Note 3), which is crucial for clinical applications such as rapid newborn diagnosis[8].

Cancer genome analysis is a common WGS application in research and clinical environments, and can be a time-sensitive component of patient care. To emulate a WGS dataset from a heterogeneous tumor-normal pair, we defined NA12877 as the “normal” sample and pooled raw data from his 11 children in equal proportions to generate a single 50× “tumor” sample. 
The 875,206 SNVs present in the mother (NA12878) but absent from the father (NA12877) were defined as somatic mutations, with variant allele frequencies (VAFs) ranging from 0.05 to 0.5 (Supplementary Fig. 2a). Using this evaluation paradigm, we compared SpeedSeq’s performance to three other leading somatic variant calling tools: MuTect[9], SomaticSniper[10], and VarScan 2[11]. SpeedSeq recalled 96.6% of the somatic variants in the “tumor” with an FDR of 3.3%, outperforming SomaticSniper in both sensitivity and specificity, and delivering competitive performance against MuTect and VarScan 2 (Fig. 1d, Supplementary Fig. 2b,c). To test SpeedSeq’s performance on real cancer data, we obtained WGS reads (50× tumor, 30× normal) from five tumor-normal pairs with validated somatic mutations ascertained by deep exome sequencing from The Cancer Genome Atlas (TCGA). SpeedSeq recalled 96.4% of 2,746 orthogonally validated mutations across all five datasets including 98.8% of mutations in genes that have been causally implicated in cancer[12] (Supplementary Table 2).

Ascertainment of structural variants – copy number variants (CNVs), balanced rearrangements and mobile element insertions – is a critical component of comprehensive genome analysis. SV detection poses two key technical challenges. First, SVs are extremely difficult to detect reliably[13]. Second, functional interpretation of SVs requires specialized logic due to their variable size and diverse configurations, and because SV breakpoints are often mapped imprecisely. Due to these challenges, few established genome analysis pipelines attempt to rigorously detect and interpret SVs. SpeedSeq achieves comprehensive SV analysis with a suite of three complementary tools that are sensitive to a range of SV signals. At its core is LUMPY, a state-of-the-art breakpoint detection tool that integrates split-read and discordant paired-end data[14]. 
Next, a custom parallelized implementation of CNVnator uses read-depth analysis to detect CNVs that may be invisible to LUMPY due to unmappable or repetitive sequence at their breakpoints[15]. Finally, SpeedSeq genotypes SVs with SVTyper, a novel Bayesian likelihood algorithm that can operate on copy-neutral events such as inversions and translocations as well as CNVs. This step produces SV genotypes that are crucial for meaningful variant interpretation, as well as quantitative estimates of breakpoint allele frequencies that allow inference of the fraction of tumor cells that carry a particular variant.

Measuring SV detection performance on real data is difficult due to the lack of established truth sets. If we accept the 1000 Genomes Project (1KGP) deletion callset for NA12878 as ground truth[16,17], SpeedSeq achieves a sensitivity of 61.9% (2089/3376) and positive predictive value of 60.8% (2089/3438) for detecting deletions, which is consistent with our recent comparative performance tests for LUMPY[14] and by inference shows that SpeedSeq achieves state-of-the-art SV detection relative to other tools. However, this test underestimates absolute performance due to known false positives and negatives in the 1KGP callset. We therefore developed a composite strategy in which SVs in NA12878 could be validated either by overlap with split-read mapping of deep (30×) long-read data from PacBio and Illumina Moleculo platforms or by overlap with 1KGP. Based on this hybrid approach, SVs with quality scores of 100 or greater show a positive predictive value of 86.0% (2823/3282) (Fig. 1e, Supplementary Fig. 3). Virtually none of these SVs are likely to have been validated by random chance, as 100 permutations of the callset resulted in a validation rate of 0.073% (±6.1E-3, 95% CI). Moreover, SVTyper’s quality scores provide a tunable parameter for refining callsets to a desired confidence threshold. 
By requiring both paired-end and split-read support, users may generate an extremely high confidence callset of 1,663 SVs with a 97.8% validation rate. As an independent measure of SV detection and genotyping performance, we developed a haplotype-based test that exploits the structure of the CEPH 1463 pedigree. First, we phased the pedigree by SNV transmission to produce haplotype lineage maps, allowing us to attribute an average of 63.0% of the mappable genome of each F2 individual to a particular founding grandparent (Fig. 1f). Next, we performed joint SV detection on the pedigree to generate 1,722 high-confidence autosomal SVs that could be assigned to a founding grandparent by transmission, resulting in a truth set of 8,397 predicted SV observations across the 11 grandchildren with known genotypes. SpeedSeq showed a detection sensitivity of 90.2% (7,578/8,397) for these predicted SVs, encompassing 1,660 of the 1,722 unique variants (Supplementary Table 3). Among the SVs that were detected, SVTyper reported the correct genotype at 96.6% (6,845/7,083) of heterozygous variants and 72.3% (358/495) of homozygous variants. Moreover, the high specificity of this callset is apparent from the infrequency of Mendelian violations (5.0%) and the consistent co-segregation of SVs with SNV-based haplotypes (93.8%) (Supplementary Table 4). Results from SpeedSeq seamlessly integrate into the GEMINI variant interpretation framework, which annotates calls with information from external databases including dbSNP, ENCODE, ClinVar, CADD, ESP, and ExAC for efficient filtering with command line queries or a graphical browser interface[18]. In concert with SpeedSeq, we have made numerous enhancements to GEMINI, particularly in handling structural variants and interpreting somatic mutations. 
Users can rapidly prioritize somatic mutations through queries on two newly added databases: the COSMIC catalogue of somatic mutations in cancer[12] and DGIdb, the Drug-Gene Interaction database[19]. In addition, GEMINI can now identify structural variants that alter gene dosage or interrupt transcripts, as well as putative somatic gene fusions affecting COSMIC cancer genes. Finally, to provide an example of a typical cancer analysis interpretation, we performed somatic variant calling on the tumor-normal pair of an invasive breast carcinoma from TCGA that carries a known gene fusion[20]. With four concise commands and less than an hour of computation, we loaded the VCF file into GEMINI, filtered variant calls for high-confidence, clinically informative somatic mutations, and predicted gene fusion events (Fig. 2). These analyses demonstrate the ease with which high impact somatic point mutations and genomic rearrangements can be identified using the SpeedSeq framework.
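The sensitivity and positive predictive value (PPV) figures quoted for the NA12878 SV benchmarks follow directly from the reported counts. As a minimal sketch in Python (the helper names are illustrative and not part of SpeedSeq), the arithmetic can be reproduced as follows; the FDR used elsewhere in the text is simply 1 − PPV.

```python
def sensitivity(true_positives: int, truth_total: int) -> float:
    """Fraction of truth-set variants that were detected."""
    return true_positives / truth_total


def ppv(true_positives: int, calls_total: int) -> float:
    """Positive predictive value: fraction of calls that were validated."""
    return true_positives / calls_total


# Counts reported in the text for NA12878 deletions against the 1KGP callset.
DELETION_SENSITIVITY = sensitivity(2089, 3376)
DELETION_PPV = ppv(2089, 3438)
# Hybrid long-read / 1KGP validation of SVs with quality scores >= 100.
HYBRID_PPV = ppv(2823, 3282)
# Pedigree-based test: detected SV observations among those predicted by transmission.
PEDIGREE_SENSITIVITY = sensitivity(7578, 8397)

if __name__ == "__main__":
    for label, value in [
        ("1KGP sensitivity", DELETION_SENSITIVITY),
        ("1KGP PPV", DELETION_PPV),
        ("hybrid PPV", HYBRID_PPV),
        ("pedigree sensitivity", PEDIGREE_SENSITIVITY),
    ]:
        print(f"{label}: {value:.1%}")
```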

Figure 2

Case study in a tumor-normal pair

A SpeedSeq workflow demonstrating the seven succinct commands required to process a tumor-normal pair (TCGA-E2-A14P) from raw FASTQ reads to clinically actionable somatic mutations with predicted damaging consequences. In this tumor, SpeedSeq detected a previously reported somatic gene fusion product between exon 1 of TBL1XR1 and exon 2 of PIK3CA[20].

Online methods

Software availability

The SpeedSeq v0.0.3a source code, documentation, and example data files are available in Supplementary Software, as well as at https://github.com/cc2qe/speedseq.

Hardware

All timings reported herein were performed on a single machine with 128 GB RAM and two Intel Xeon E5-2670 processors, each with 16 threads.

Data

We benchmarked SpeedSeq’s processing time using the NA12878 genome from the Illumina Platinum Genomes dataset (European Nucleotide Archive: ERP001960), which comprises 50× WGS datasets for each of the 17 members of the three-generation CEPH 1463 pedigree (Supplementary Fig. 4). Whole-genome sequencing data from five matched tumor-normal pairs and their orthogonally validated somatic mutations were obtained from The Cancer Genome Atlas (TCGA). These included three colorectal tumors (TCGA-A6-6141, TCGA-CA-6718, TCGA-D5-6540), one ovarian tumor (TCGA-13-0751), and one breast tumor (TCGA-B6-A0I6). Raw FASTQ reads were down-sampled to 50× coverage in the tumor and 30× coverage in the normal sample. Samples were processed with SpeedSeq for alignment, somatic mutations, and structural variants using default parameters and then loaded into GEMINI for variant interpretation. We also analyzed WGS data from a tumor-normal pair (63× tumor, 49× normal coverage) of a patient with an invasive breast carcinoma (TCGA-E2-A14P) containing a previously reported gene fusion between TBL1XR1 and PIK3CA[20].

FASTQ alignment and BAM processing

SpeedSeq aligns paired-end FASTQ files to the human GRCh37 reference genome with BWA-MEM, using the “-M” flag to mark shorter alignments as secondary. Aligned reads are streamed directly into SAMBLASTER[5], which seizes idle CPU cycles that are periodically liberated each time BWA reads a FASTQ data chunk into the buffer. Marking duplicates on the pre-sorted BAM file allows simultaneous extraction of discordant read-pairs and split-read alignments, followed by rapid sorting and BAM compression with Sambamba[21].
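The streaming arrangement described above can be sketched as a single shell pipeline. The Python helper below only assembles the command string; it is an illustration under stated assumptions, not SpeedSeq's actual invocation. The function name, file names, and thread count are hypothetical, and the sorting step that follows in the real workflow is omitted.

```python
import shlex
from typing import List


def alignment_pipeline(ref: str, fq1: str, fq2: str, prefix: str, threads: int = 16) -> str:
    """Build a shell pipeline string: BWA-MEM -> SAMBLASTER -> Sambamba (BAM conversion)."""
    stages: List[List[str]] = [
        # -M marks shorter split hits as secondary, as stated in the text.
        ["bwa", "mem", "-M", "-t", str(threads), ref, fq1, fq2],
        # SAMBLASTER marks duplicates on the unsorted stream and writes
        # discordant pairs (-d) and split reads (-s) to side files.
        ["samblaster", "-d", f"{prefix}.discordants.sam", "-s", f"{prefix}.splitters.sam"],
        # Convert the SAM stream to BAM; coordinate sorting would be a later step.
        ["sambamba", "view", "-S", "-f", "bam", "-o", f"{prefix}.unsorted.bam", "/dev/stdin"],
    ]
    return " | ".join(shlex.join(stage) for stage in stages)


if __name__ == "__main__":
    print(alignment_pipeline("GRCh37.fa", "reads_1.fq.gz", "reads_2.fq.gz", "NA12878"))
```

Because every stage reads from the previous stage's standard output, no intermediate SAM file is written before duplicate marking, which is the property the text attributes to the design.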

SNV and indel detection strategy

SpeedSeq runs FreeBayes version 0.9.16 with “--min-repeat-entropy 1” and “--experimental-gls” parameters for germline variant calling[7]. To increase specificity, SpeedSeq also requires at least one read on both the left and the right to support the variant allele. For somatic variant detection, SpeedSeq uses parameters tuned to increase sensitivity over low frequency variants (--pooled-discrete --genotype-qualities --min-alternate-fraction 0.05 --min-alternate-count 2 --min-repeat-entropy 1), and reports a somatic score (SSC) to estimate the confidence of each variant. The somatic score is the sum of the log odds ratios of the tumor (LOD_T) and normal (LOD_N) based on the genotype likelihood probabilities from FreeBayes (P_T and P_N for tumor and normal genotype probabilities respectively). The SSC is the preferred tuning parameter since it is robust to sequencing depth by design; however, the minimum alternate fraction and minimum alternate count may also be adjusted on the SpeedSeq command line.

SpeedSeq’s implementation of FreeBayes is parallelized over 34,123 windowed regions of the GRCh37 genome using GNU Parallel[22]. We generated these regions, which average 84 kb in length, by partitioning the genome into bins of approximately equal numbers of reads based upon the aggregate coverage depth of all 17 members of the CEPH 1463 family pedigree and excluding high depth sequences (Supplementary Note 4 and Supplementary Fig. 5). This binning scheme balances the computational load over the FreeBayes instances by allocating processors based on the quantity of expected input data. It is 13.3-fold faster than the single-threaded version and 34.9% faster than naïve parallelization over each chromosome (Supplementary Note 1).
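The idea behind read-balanced binning can be illustrated with a short greedy sketch. This is not the procedure from Supplementary Note 4, which is not reproduced here; it assumes a hypothetical input of aggregate depth per fixed-size window on one chromosome and does not model the exclusion of high-depth sequence.

```python
from typing import List, Tuple


def equal_read_bins(
    window_depths: List[float], window_size: int, n_bins: int
) -> List[Tuple[int, int]]:
    """Greedily group consecutive windows into bins of roughly equal total depth.

    window_depths[i] is the aggregate coverage of window i (hypothetical input).
    Returns half-open (start, end) coordinates in base pairs.
    """
    target = sum(window_depths) / n_bins  # ideal amount of data per bin
    bins: List[Tuple[int, int]] = []
    start, running = 0, 0.0
    for i, depth in enumerate(window_depths):
        running += depth
        # Close the current bin once it holds its share; the last bin takes the rest.
        if running >= target and len(bins) < n_bins - 1:
            bins.append((start * window_size, (i + 1) * window_size))
            start, running = i + 1, 0.0
    if start < len(window_depths):
        bins.append((start * window_size, len(window_depths) * window_size))
    return bins


if __name__ == "__main__":
    depths = [30, 10, 5, 5, 20, 10, 40]  # made-up depths for seven 1 kb windows
    print(equal_read_bins(depths, window_size=1000, n_bins=3))
```

In this toy input each of the three bins carries the same total depth even though their lengths are 2 kb, 4 kb, and 1 kb, which is why deeply covered regions end up in shorter bins and each parallel worker receives a similar amount of data.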

Structural variation detection and genotyping strategy

SpeedSeq runs LUMPY with “-msw 4 -tt 0 min_clip 20 min_non_overlap 101 min_mapping_threshold 20 discordant_z 5 back_distance 10”, and weights of 1 for both paired-end and split-read evidence. SpeedSeq’s implementation of CNVnator parallelizes the genome by chromosome and performs copy number segmentation with a window size of 100 bp. SVTyper is a maximum-likelihood Bayesian classification algorithm that infers an underlying genotype at each SV. Alignments at SV breakpoints either support the alternate allele with discordant or split-reads, or they support the reference allele with concordant reads/read-pairs that span the breakpoint. The ratio and quantity of these observations allow probabilistic inference of genotype likelihood. Under the assumption of diploidy, the set of possible genotypes at any locus is G = {reference, heterozygous, homozygous}. We defined the function S, where S(g) is the prior probability of observing a variant read in a single trial given a genotype g at any locus. These priors were set to 0.1, 0.4, 0.8 for reference, heterozygous, and homozygous deletions respectively. Assuming a random sampling of reads, the number of observed alternate (A) and reference (R) reads (scaled by mapping quality, 10^(-mapq/10)) will follow a binomial distribution B(A+R, S(g’)), where g’ ∈ G is the true underlying genotype. Using Bayes’s theorem we can derive the conditional probability of each underlying genotype state from the observed read counts (Eq. 4), assuming an a priori probability P(g) of 1/3 for each genotype. Finally, we calculate ĝ as the inferred genotype for the variant. Since the algorithm only interrogates SVs in the VCF file that have passed LUMPY filters as non-reference, it reports the more likely genotype of heterozygous or homozygous alternate states.
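The genotyping model described above can be sketched in a few lines of Python. This is a minimal illustration built only from the quantities stated in the text (the per-read probabilities S(g), the binomial likelihood, and the uniform prior of 1/3); it is not SVTyper's source code, and the function names are hypothetical. The binomial coefficient is the same for every genotype and cancels on normalization, so it is omitted, which also allows non-integer, quality-scaled counts.

```python
import math
from typing import Dict

# S(g): probability that a single read supports the alternate allele, as given in the text.
S = {"reference": 0.1, "heterozygous": 0.4, "homozygous": 0.8}


def genotype_posteriors(alt: float, ref: float) -> Dict[str, float]:
    """Posterior P(g | A, R) under a binomial likelihood and a uniform prior of 1/3."""
    prior = 1.0 / len(S)
    log_joint = {
        g: alt * math.log(p) + ref * math.log(1.0 - p) + math.log(prior)
        for g, p in S.items()
    }
    # Normalize in log space so that high read depth does not underflow.
    peak = max(log_joint.values())
    unnorm = {g: math.exp(v - peak) for g, v in log_joint.items()}
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}


def call_genotype(alt: float, ref: float) -> str:
    """Report the more likely of the two non-reference states, as the text describes."""
    post = genotype_posteriors(alt, ref)
    return max(("heterozygous", "homozygous"), key=lambda g: post[g])


if __name__ == "__main__":
    print(call_genotype(alt=10, ref=12))
    print(call_genotype(alt=20, ref=1))
```

With roughly balanced alternate and reference support the heterozygous state dominates, whereas a breakpoint covered almost entirely by alternate-supporting reads is called homozygous.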

SNV and indel evaluation

We compared SpeedSeq’s germline SNV and indel variant calling against two independent truth sets for NA12878, one derived from the Genome in a Bottle (GIAB) NA12878 gold standard calls and the other based on Omni microarray data from the 1000 Genomes Project (1KGP). The GIAB 2.17 truth set contained 2,803,144 SNVs and 364,031 indels within highly confident regions (excluding segmental duplications, simple repeats, decoy sequence, and CNVs), spanning 2.2 Gb (77.6% of the mappable genome) for which non-variant sites could be confidently considered homozygous reference. The Omni microarray truth set contained 2,177,040 informative SNVs of which 689,788 were non-reference in NA12878, excluding markers within 50 bp of known indels. We aligned NA12878 raw reads from the Illumina Platinum data with SpeedSeq, and then called germline SNVs and indels using SpeedSeq default parameters. To evaluate SpeedSeq’s performance against other standard tools, we also processed the aligned BAM files according to the Genome Analysis Toolkit (version 3.2-2-gec30cee) best practices workflow, including realignment around indels, base recalibration, and variant calling with Unified Genotyper (GATK-UG) and Haplotype Caller (GATK-HC). Variant quality score recalibration was performed on the GATK results using a passing tranche filter of <99%. We normalized and compared variant calls according to the GIAB protocol, with vcfallelicprimitives, GATK’s LeftAlignAndTrimVariants, and VcfComparator[2,6]. We filtered variants for sensitivity and FDR against the GIAB truth set using a minimum quality score of 100 for GATK tools, and 1 for SpeedSeq (open circles, Fig. 1b,c).

To evaluate performance in detecting somatic variants, we generated a simulated tumor-normal matched pair from the CEPH 1463 family Illumina Platinum data. The “tumor” dataset was an equal mixture of all 11 members of the F2 generation, down-sampled to 50× coverage and aligned with SpeedSeq. 
The father of the F2 generation (NA12877) represented the 50× matched normal sample. For inclusion in the somatic SNV truth set, we required a variant to be diallelic, autosomal, in the NA12878 GIAB truth set, and called by Real Time Genomics (RTG) as non-reference in NA12878 and reference in NA12877[6,23]. Additionally, variants were disqualified from the truth set if they violated Mendelian inheritance patterns. These criteria resulted in a set of 875,206 high confidence SNVs covering 77.6% of the mappable genome. The truth set of variants in the chimeric tumor followed the expected binomial pattern of inheritance in her children, with a peak at 0.5 VAF from homozygous SNVs in NA12878 (Supplementary Fig. 2a). We processed the simulated tumor data with SpeedSeq, MuTect 1.1.4, SomaticSniper, and VarScan 2 using parameters designed to target variants as low as 5% variant allele fraction. Receiver operating characteristic (ROC) curves were generated by varying somatic score (SSC) for SpeedSeq, SomaticSniper, and VarScan 2. For MuTect, which does not produce a single quality score for somatic variants, we varied the t_lod_fstar value to construct the ROC curve.
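Constructing a curve by varying a score threshold amounts to recomputing detection metrics at each cutoff. The sketch below shows one way to do this in Python for a generic score such as SSC; the variant identifiers and scores are made up, and sensitivity and FDR are reported here only because those are the quantities quoted in the text, which is an assumption about the axes rather than a description of the published figure.

```python
from typing import Iterable, List, Set, Tuple


def roc_points(
    calls: Iterable[Tuple[str, float]], truth: Set[str], thresholds: Iterable[float]
) -> List[Tuple[float, float, float]]:
    """Return (threshold, sensitivity, FDR) for each score cutoff.

    calls: (variant_id, score) pairs; truth: identifiers of true variants.
    """
    calls = list(calls)
    points = []
    for t in thresholds:
        passed = {vid for vid, score in calls if score >= t}
        tp = len(passed & truth)   # calls present in the truth set
        fp = len(passed - truth)   # calls absent from the truth set
        sens = tp / len(truth) if truth else 0.0
        fdr = fp / len(passed) if passed else 0.0
        points.append((t, sens, fdr))
    return points


if __name__ == "__main__":
    toy_calls = [("a", 30), ("b", 25), ("c", 12), ("d", 8), ("e", 3)]
    toy_truth = {"a", "b", "d", "f"}
    for threshold, sens, fdr in roc_points(toy_calls, toy_truth, [0, 10, 20]):
        print(f"score >= {threshold}: sensitivity={sens:.2f}, FDR={fdr:.2f}")
```

Raising the cutoff trades sensitivity for a lower FDR, which is the tuning behavior the text attributes to the somatic score.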

Structural variant evaluation

We constructed the 1KGP truth set by integrating deletions from the Pilot and Phase 1 callsets[16,17]. For long-read validation of SV breakpoints, we obtained 30× PacBio (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20131209_na12878_pacbio/Schadt) and Moleculo (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/working/20131209_na12878_moleculo/) data from 1KGP. We realigned the PacBio data with BWA-MEM 0.7.10 using the -x pacbio flag for consistency with the Moleculo alignments. Validations were performed according to previously published methods[14]. Custom scripts for this analysis are available at https://github.com/hall-lab/long-read-validation.

To construct haplotype maps of the CEPH 1463 F2 genomes, we called SNVs with SpeedSeq on the entire 17-member pedigree, and phased SNVs by transmission at polymorphic sites in the parents. We smoothed the chromosomes for contiguous blocks of inheritance by selecting informative bases where 95% of each run of 101 SNVs reported a consistent parent-of-origin. We then merged regions that shared inheritance and were within 100 kb of each other. This allowed us to trace an average of 1.8 Gb (63.4%) of each F2 chromosome back to a particular grandparent, encapsulating meiotic crossovers that occurred in the F1 germline (Fig. 1f). We then used SpeedSeq to jointly call structural variants on the entire pedigree, filtering for deletions that had at least seven pieces of support in at least one member of the pedigree, had legal Mendelian transmission, and whose origin could be unambiguously attributed to a single grandparent. Variants for which the founding grandparent by SV transmission agreed with the founding grandparent by SNV phasing were considered to be concordant, with strong supporting evidence for their authenticity. 
To test whether the 1,722 informative SVs were representative of the dataset as a whole, and not of misleadingly high quality due to their ascertainment criteria, we assessed their validation rate as above using the 1KGP callset and long-read sequencing (Supplementary Table 4). The 1,722 informative SVs had a validation rate similar to that of the remaining 6,734 SVs, suggesting that they are representative of overall callset quality.
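The merging step used when building the haplotype maps (joining regions that share inheritance and lie within 100 kb of each other) can be sketched as a simple interval merge. The code below is an illustration in Python, not the authors' script: the block tuples and grandparent labels are hypothetical, it handles a single chromosome, and it only merges blocks that are adjacent in coordinate order.

```python
from typing import List, Tuple

Block = Tuple[int, int, str]  # (start, end, grandparent-of-origin label)


def merge_blocks(blocks: List[Block], max_gap: int = 100_000) -> List[Block]:
    """Merge neighbouring blocks that share an origin and are at most max_gap bases apart."""
    merged: List[Block] = []
    for start, end, origin in sorted(blocks):
        if merged:
            prev_start, prev_end, prev_origin = merged[-1]
            if origin == prev_origin and start - prev_end <= max_gap:
                # Extend the previous block instead of starting a new one.
                merged[-1] = (prev_start, max(prev_end, end), origin)
                continue
        merged.append((start, end, origin))
    return merged


if __name__ == "__main__":
    toy = [
        (0, 500_000, "GM1"),
        (550_000, 900_000, "GM1"),      # 50 kb gap, same origin: merged
        (1_200_000, 1_500_000, "GM1"),  # 300 kb gap: kept separate
        (1_520_000, 2_000_000, "GF1"),  # different origin: kept separate
    ]
    print(merge_blocks(toy))
```

A switch of label between adjacent merged blocks then marks a candidate meiotic crossover of the kind shown in Fig. 1f.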

20 in total

1. Deep sequencing of patient genomes for disease diagnosis: when will it become routine?

Authors: Stephen F Kingsmore; Carol J Saunders
Journal: Sci Transl Med Date: 2011-06-15 Impact factor: 17.956

2. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Authors: Justin M Zook; Brad Chapman; Jason Wang; David Mittelman; Oliver Hofmann; Winston Hide; Marc Salit
Journal: Nat Biotechnol Date: 2014-02-16 Impact factor: 54.908

3. Joint variant and de novo mutation identification on pedigrees from high-throughput sequencing data.

Authors: John G Cleary; Ross Braithwaite; Kurt Gaastra; Brian S Hilbush; Stuart Inglis; Sean A Irvine; Alan Jackson; Richard Littin; Sahar Nohzadeh-Malakshah; Mehul Rathod; David Ware; Len Trigg; Francisco M De La Vega
Journal: J Comput Biol Date: 2014-06 Impact factor: 1.479

4. Clinical interpretation and implications of whole-genome sequencing.

Authors: Frederick E Dewey; Megan E Grove; Cuiping Pan; Benjamin A Goldstein; Jonathan A Bernstein; Hassan Chaib; Jason D Merker; Rachel L Goldfeder; Gregory M Enns; Sean P David; Neda Pakdaman; Kelly E Ormond; Colleen Caleshu; Kerry Kingham; Teri E Klein; Michelle Whirl-Carrillo; Kenneth Sakamoto; Matthew T Wheeler; Atul J Butte; James M Ford; Linda Boxer; John P A Ioannidis; Alan C Yeung; Russ B Altman; Themistocles L Assimes; Michael Snyder; Euan A Ashley; Thomas Quertermous
Journal: JAMA Date: 2014-03-12 Impact factor: 56.272

5. SomaticSniper: identification of somatic point mutations in whole genome sequencing data.

Authors: David E Larson; Christopher C Harris; Ken Chen; Daniel C Koboldt; Travis E Abbott; David J Dooling; Timothy J Ley; Elaine R Mardis; Richard K Wilson; Li Ding
Journal: Bioinformatics Date: 2011-12-06 Impact factor: 6.937

6. GEMINI: integrative exploration of genetic variation and genome annotations.

Authors: Umadevi Paila; Brad A Chapman; Rory Kirchner; Aaron R Quinlan
Journal: PLoS Comput Biol Date: 2013-07-18 Impact factor: 4.475

7. DGIdb: mining the druggable genome.

Authors: Malachi Griffith; Obi L Griffith; Adam C Coffman; James V Weible; Josh F McMichael; Nicholas C Spies; James Koval; Indraniel Das; Matthew B Callaway; James M Eldred; Christopher A Miller; Janakiraman Subramanian; Ramaswamy Govindan; Runjun D Kumar; Ron Bose; Li Ding; Jason R Walker; David E Larson; David J Dooling; Scott M Smith; Timothy J Ley; Elaine R Mardis; Richard K Wilson
Journal: Nat Methods Date: 2013-10-13 Impact factor: 28.547

8. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

9. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples.

Authors: Kristian Cibulskis; Michael S Lawrence; Scott L Carter; Andrey Sivachenko; David Jaffe; Carrie Sougnez; Stacey Gabriel; Matthew Meyerson; Eric S Lander; Gad Getz
Journal: Nat Biotechnol Date: 2013-02-10 Impact factor: 54.908

10. LUMPY: a probabilistic framework for structural variant discovery.

Authors: Ryan M Layer; Colby Chiang; Aaron R Quinlan; Ira M Hall
Journal: Genome Biol Date: 2014-06-26 Impact factor: 13.583
