
novoBreak: local assembly for breakpoint detection in cancer genomes.

Zechen Chong¹, Jue Ruan², Min Gao³, Wanding Zhou¹, Tenghui Chen¹, Xian Fan¹, Li Ding⁴, Anna Y Lee⁵, Paul Boutros⁵,⁶,⁷, Junjie Chen³, Ken Chen¹.

Abstract

We present novoBreak, a genome-wide local assembly algorithm that discovers somatic and germline structural variation breakpoints in whole-genome sequencing data. novoBreak consistently outperformed existing algorithms on real cancer genome data and on synthetic tumors in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge primarily because it more effectively utilized reads spanning breakpoints. novoBreak also demonstrated great sensitivity in identifying short insertions and deletions.


Year: 2016 PMID: 27892959 PMCID: PMC5199621 DOI: 10.1038/nmeth.4084

Source DB: PubMed Journal: Nat Methods ISSN: 1548-7091 Impact factor: 28.547

Somatic structural variations (SVs) are major driving forces of tumor development and progression. Sporadic and recurrent chromosomal aberrations have been observed in most cancer types[1]. Many are desirable therapeutic targets. The advent of high-throughput next generation sequencing (NGS) technologies has made it possible to perform genome-wide detection of SVs at base pair resolution. As a result, unprecedented landscapes of SVs have been unveiled in various cancer genomes[2]. However, computational approaches[3] for detecting SVs from NGS data are limited in sensitivity and comprehensiveness[4]. One approach is to align short paired-end reads to a reference genome and identify signals in discordant read pairs[5], read depths[6], split reads[7], or their combinations[8]. Another approach is through targeted local assembly of aligned and partially aligned reads in candidate SV regions discovered a priori[9]. These approaches depend heavily on the accuracy of read alignments, which is often limited for reads spanning breakpoints or substantially different from the reference. In comparison, whole genome assembly approaches[10] are less biased. However, assembling a whole genome is computationally intensive[11] and results are often affected by repeats, polyploidy, read length and sequencing coverage. We developed a novel method, novoBreak, which obtains genome-wide local (glocal) assembly of breakpoints from clusters of reads sharing a set of k-mers (contiguous nucleotide sequences of length k) uniquely present in a subject genome (e.g., a tumor genome) but not in the reference genome or any control data (e.g., a matched normal genome) (Fig. 1 and Online Methods). When applied to somatic breakpoint detection from matched tumor and normal tissue data, novoBreak first constructs a hash table from the tumor reads, containing all the k-mers, their host reads and frequencies in the set (Online Methods and Supplementary Note 1). 
Next, it filters out k-mers representing reference alleles or sequencing errors, and retains those representing variants or novel sequences not present in the reference genome. It then queries the normal reads and further classifies the k-mers into 1) germline k-mers, those present in both the tumor and the normal genome, and 2) somatic k-mers, those present in the tumor but not the normal genome. Then, novoBreak identifies clusters of read pairs spanning each somatic breakpoint, and assembles each cluster of reads into contigs (Online Methods and Supplementary Note 2). By comparing the resulting high-quality contigs with the reference, novoBreak identifies breakpoints and associated SVs. Finally, novoBreak quantifies the amount of supporting evidence at each breakpoint and outputs a final report.
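The classification of k-mers into reference, germline and somatic sets can be illustrated with a minimal Python sketch. It assumes toy sequences held in in-memory sets; the names `kmers` and `classify_kmers` and the `min_count` threshold are illustrative and are not taken from novoBreak, which relies on hash tables and Bloom filters to work at genome scale.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every k-mer (length-k substring) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def classify_kmers(tumor_reads, normal_reads, reference, k, min_count=2):
    """Split non-reference tumor k-mers into germline and somatic sets.

    min_count is a hypothetical threshold used here to drop k-mers that
    are likely sequencing errors.
    """
    ref_kmers = set(kmers(reference, k))
    tumor_counts = Counter(km for r in tumor_reads for km in kmers(r, k))
    normal_kmers = {km for r in normal_reads for km in kmers(r, k)}

    # Keep tumor k-mers that are frequent enough and absent from the reference.
    novel = {km for km, c in tumor_counts.items()
             if c >= min_count and km not in ref_kmers}
    germline = novel & normal_kmers   # also seen in the normal reads
    somatic = novel - normal_kmers    # seen only in the tumor reads
    return germline, somatic

reference = "AAAAACCCCCGGGGGTTTTT"
# Tumor reads carry a deletion of CCCCCGGGGG; normal reads match the reference.
tumor_reads = ["AAAAATTTTT", "AAAAATTTTT"]
normal_reads = ["AAAAACCCCC", "GGGGGTTTTT"]
germline, somatic = classify_kmers(tumor_reads, normal_reads, reference, k=4)
print(sorted(somatic))   # ['AAAT', 'AATT', 'ATTT']
```

In this toy case the three somatic k-mers are exactly those that span the deletion junction.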

Figure 1

The workflow of novoBreak algorithm

(a) Short paired-end tumor reads (pairs of grey and black bars connected by dashed lines) are dissected into constituent k-mers. Indexed k-mers are compared against the reference sequence and normal reads. Only somatic novo k-mers (red bars) unique in the tumor genome are kept, while germline (green bar) and reference k-mers (grey bars) are filtered out. (b) The cluster of reads spanning a breakpoint i are found in conjunction with a set of shared novo k-mers. A long contig (grey bar) containing a unique breakpoint sequence in the middle (highlighted in red) is assembled from the cluster of reads. (c) This assembled contig is aligned (dashed line) against the reference sequence to infer exact breakpoint and associated SV. (d) Each SV breakpoint is scored, ranked and output in a standard variant call format (VCF) file.

We examined the performance of novoBreak in the ICGC-TCGA DREAM 8.5 Somatic Mutation Calling Challenge (https://www.synapse.org/#!Synapse:syn312572), which aimed to identify the best algorithms for detecting somatic mutations in NGS data[12]. In each of the synthetic sub-challenges, a high coverage (60–80×) whole genome sequencing (WGS) bam file produced from a cell line or a patient tissue was divided into two parts (30–40× each). One part was treated as the normal data and the other as the tumor data containing mutations spiked-in by BAMSurgeon[13]. A series of in silico sub-challenges (IS) were implemented with increasing numbers and types of variants and increasing cellular complexity. In total, 204 submissions were made by 27 teams that developed the most widely used SV detection tools such as BreakDancer[5], DELLY[8], Pindel[7], and Manta (https://github.com/StructuralVariants/manta). NovoBreak consistently achieved the best balanced accuracy (sensitivity and precision) in IS2, IS3 and IS4 (Supplementary Table 1). Almost all the top performing tools achieved high precision (>0.98) after stringent filtering. NovoBreak excelled in sensitivity, which was particularly evident when insertions were introduced in IS2 and IS3. IS3 was probably the most difficult because it contained not only SNVs and four types of SVs (deletions, duplications, inversions and insertions of mobile elements), but also INDELs (insertions and deletions shorter than 100 bp). It also contained subclones at respective cellular fractions of 50%, 33%, and 20%. NovoBreak achieved the highest balanced accuracy of 0.892 (sensitivity: 0.801 and precision: 0.984) due mainly to its higher sensitivity in detecting insertions (Fig. 2a). It discovered 100 (4.3%) and 120 (5.1%) more insertions in the ground truth than DELLY and Manta, respectively. Compared to alignment-based approaches, novoBreak more effectively utilized reads spanning insertion breakpoints (Supplementary Fig. 1).
Further analysis of the SVs missed by DELLY and Manta indicates that novoBreak performs better in low coverage regions with few discordantly paired or split reads (Supplementary Note 3). Breakpoints identified by novoBreak also had the highest precision: 98.9% are within −2bp to 2bp relative to the ground truth (Fig. 2b).
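The balanced accuracy figures quoted in this paper are consistent with the arithmetic mean of sensitivity and precision. The short Python check below makes that explicit; the function name and the mean-based definition are assumptions inferred from the reported numbers rather than a definition stated in the text.

```python
def balanced_accuracy(sensitivity, precision):
    """Arithmetic mean of sensitivity and precision (assumed definition)."""
    return (sensitivity + precision) / 2

# Sensitivity and precision reported for novoBreak on the IS3 SV sub-challenge.
score = balanced_accuracy(0.801, 0.984)
print(f"{score:.4f}")   # 0.8925, reported as 0.892
```

The same calculation reproduces the INDEL result quoted for IS4 (sensitivity 0.788, precision 0.926, balanced accuracy 0.857).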

Figure 2

novoBreak performance in the IS3 data

(a) Precision and recall comparison among 3 top-performing tools: novoBreak (green), DELLY (blue) and Manta (red). Star indicates the best scoring results of each tool. (b) Comparison of breakpoint precision among the 3 tools. X-axis is the offset (in base pairs) between the true and predicted breakpoint coordinates. Y-axis is the fraction of predicted breakpoints at each of the offset values. (c) INDEL detection sensitivity of GATK-HaplotypeCaller, Strelka and novoBreak in the IS4 data. (d) Summary of SV breakpoints detected in COLO-829 data by novoBreak, BreakDancer, DELLY and Fermi.

Detection of INDELs, particularly insertions, is challenging because of difficulties in achieving accurate short-read alignment. NovoBreak ranked 2nd and 1st in IS3 and IS4, respectively (Supplementary Table 2). IS4 was particularly difficult for INDELs and SNVs due to three times more simulated events than in the previous challenges, including sub-clonal events at relatively low allelic frequencies (15%). Encouragingly, novoBreak achieved a balanced INDEL detection accuracy of 0.857 (sensitivity: 0.788 and precision: 0.926), close to the best SNV detection accuracy on the leaderboard. After comparison with the ground truth, we found that novoBreak discovered a higher fraction of INDELs in almost every size range than GATK-HaplotypeCaller[14] (balanced accuracy: 0.364, sensitivity: 0.499 and precision: 0.229) and Strelka[15] (balanced accuracy: 0.626, sensitivity: 0.601 and precision: 0.650) under the default parameters and filters (Fig. 2c). GATK-HaplotypeCaller had significantly lower sensitivity in detecting 1, 2, and 3 bp INDELs, due likely to limitations of aligning short-reads and stringent filtering. In contrast, Strelka demonstrated reduced sensitivity as INDEL size increased. It did not report any insertion longer than 25 bp. We compared novoBreak with BreakDancer[5] (v1.1.2), DELLY[8] (v0.6.3) and Fermi[16] (v1.1-r751-beta) using the WGS data from COLO-829, a melanoma tumor cell line[17]. Fermi is a string-graph-based whole genome assembler that retains contigs containing SNPs, INDELs and SVs. Because Fermi does not come with a ready-to-use tool to call SV breakpoints, we used the SV-calling steps of novoBreak to evaluate its assembly results. These data were previously analyzed by a read pair approach[17] and CREST[18]. In total, 48 SV breakpoints have been previously validated via polymerase chain reaction (PCR) and Sanger sequencing (Supplementary Table 3). We used these 48 breakpoints as ground truth to benchmark these tools. 
Under default parameters, BreakDancer identified 37 true positives (TPs), with a total of 14,340 predicted; DELLY, 34 TPs, with 1,113 predicted; and Fermi, 40 TPs, with 16,849 predicted. A large fraction of SVs reported by these tools were likely germline, instead of false SVs. In contrast, novoBreak identified 44 TPs with 78 breakpoints predicted (Fig. 2d and Supplementary Table 4). Of the 4 missing TPs, two were missed by all the tools and 2 could be recovered by novoBreak at less stringent settings. We designed PCR primers around the 34 novel breakpoints and validated 9 (Supplementary Table 5). The remaining ones were not necessarily false calls and could be attributed to deficiency in validation experiments or evolution of the cultured cell line. Indeed, 19 (57.6%) of the 34 calls were also predicted by at least one other tool. These results demonstrated novoBreak’s high sensitivity and specificity in analyzing real tumor data under default settings. Users can adjust the filtering parameters to obtain different sensitivity and specificity tradeoff in different applications. To further evaluate the sensitivity of novoBreak on cancer patient data, we analyzed the WGS data of a patient with low-grade glioma (Supplementary Note 4) and those of 22 invasive breast carcinoma samples in The Cancer Genome Atlas (TCGA). This set of TCGA samples was analyzed previously by INTEGRATE[19], which integrates matched whole genome and whole transcriptome sequencing (WTS) data to discover gene fusions. Overall, novoBreak identified 1,628 deletions, 1,724 duplications, 2,335 inversions and 1,982 translocations, equivalent to 349 SVs per sample (Supplementary Table 6 and 7). It identified 104 (86.7%) of the 120 known high-confidence gene fusions[19] (Supplementary Table 8). The true sensitivity was probably higher because 19% of the known SVs were likely false positives[19]. In addition, they were identified using both WGS and WTS data; whereas novoBreak only examined the WGS data. 
We present a new algorithm, novoBreak, for detecting structural variation breakpoints in subject genomes. The most significant improvement of novoBreak compared to other approaches is the k-mer identification, filtering and classification strategy, which substantially narrows down the number of putative SV breakpoints and focuses computational power on the most informative portion of the data. By clustering and performing local assembly around breakpoints, novoBreak takes full advantage of unmapped and partially mapped reads. The scoring and filtering strategy of novoBreak provides high precision in the final results. The k-mer targeted assembly framework exemplified by novoBreak will facilitate comprehensive, sensitive, efficient, and accurate identification of novel sequence alterations in genomic, exomic and transcriptomic sequencing data. A caveat of novoBreak is that it misses SV breakpoints in repetitive sequences longer than 2k-1 bp. Future versions of novoBreak with increased k will alleviate this limitation. The source code of novoBreak (Supplementary Software) is freely available for academic use at http://sourceforge.net/projects/novobreak/.

Online Methods

The novoBreak pipeline

novoBreak is developed to comprehensively discover exact chromosomal breakpoints introduced by structural variations in genomes or transcriptomes. It is based on 1) a genome-wide classification and filtering strategy, which identifies specific nucleotide signatures (novo k-mers) and 2) a local assembly approach, which constructs breakpoint sequences from reads containing the novo k-mers. The workflow of novoBreak consists of the following steps (Figure 1): (1) novoBreak begins with an indexing and filtering procedure to obtain “novo k-mers” and associated short reads, described in the section “Indexing and filtering k-mers”. (2) Paired-end reads containing the same set of novo k-mers are clustered together. Each cluster contains read pairs covering the same breakpoint. An assembly algorithm is then applied to each cluster to construct a breakpoint-spanning sequence. The clustering and local assembly step is described later in the section “Clustering and local assembly”. (3) After short reads are assembled in each cluster, the resulting contigs are aligned to the reference using BWA-MEM[20] (Supplementary Note 2) with the ‘-M’ option to obtain secondary alignments. The alignment results are parsed to infer breakpoints and the associated SVs. For short SVs, such as INDELs, novoBreak directly parses the Compact Idiosyncratic Gapped Alignment Report (CIGAR) strings of the aligned contigs. For large SVs, novoBreak considers both the primary and the secondary alignments of each contig. In the current implementation, novoBreak predicts deletions, insertions, inversions, duplications and translocations at base pair resolution. (4) To achieve a high precision, novoBreak employs a scoring and filtering module, as described in the section “Somatic SV Scoring methods”.
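Step (3) mentions parsing CIGAR strings of aligned contigs to call short INDELs. The Python sketch below shows one way such parsing could look. The function name `indels_from_cigar` and the coordinate conventions are assumptions made for illustration and do not describe novoBreak's actual parser.

```python
import re

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def indels_from_cigar(ref_start, cigar):
    """Return (type, reference_position, length) for each I or D operation.

    ref_start is the 1-based leftmost reference position of the alignment
    (the POS field of a SAM record).
    """
    events = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # The insertion lies between ref_pos - 1 and ref_pos.
            events.append(("INS", ref_pos - 1, length))
        elif op == "D":
            # The deletion removes reference bases starting at ref_pos.
            events.append(("DEL", ref_pos, length))
        if op in "MDN=X":
            # Only these operations consume reference bases.
            ref_pos += length
    return events

print(indels_from_cigar(1000, "50M3I20M12D30M"))
# [('INS', 1049, 3), ('DEL', 1070, 12)]
```

Soft-clipped and hard-clipped segments (S, H) do not advance the reference coordinate, which is why they are ignored in the position update.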

Indexing and filtering k-mers

Given a sequence S of length L, a k-mer is a length k (k < L) substring of sequence S. We notice that if a read R contains a breakpoint of a structural change with respect to the reference or the normal genome of a cancer patient, there are at most k-1 k-mers (k < |R|) covering the breakpoint. The default k is 31 (Supplementary Note 1) in novoBreak. We define these k-mers as “novo k-mers” because they contain novel sequence information specific to the subject. In a tumor-normal paired cancer genome sequencing study, the novo k-mers contain the somatic breakpoints that specifically exist in the tumor but not in the paired normal sample. The first critical step of novoBreak is to obtain the novo k-mers. An effective approach is to implement a hash table that first indexes and loads all the k-mers in all the reads in the tumor sample into memory, and then eliminates k-mers that are present in the reference or the normal genome. The remaining high frequency k-mers should contain genuine somatic breakpoints, including SNVs, small indels and large SVs. This approach is computationally feasible for whole exome or whole transcriptome analysis. For high coverage whole genome analysis, however, the memory cost is extremely high (usually a few hundred gigabytes), due mainly to the presence of sequencing errors. A critical component of novoBreak is therefore reducing memory consumption. For whole genome sequencing data, instead of indexing the sequenced reads, novoBreak starts by hashing all the k-mers in the reference genome. Then, it adopts a two-pass approach to calculate novo k-mers in the sequenced genomes. The first pass scans every read and marks the status (presence/absence) of each constituent k-mer in the reference genome using the pre-constructed hash table. In the process, novoBreak automatically trims off error-prone ends in low quality reads (Supplementary Note 1). novoBreak uses a bit array data structure to mark a read.
If a k-mer in a read is in the hash table, it will be marked as 1 (otherwise 0) in the corresponding bit in the bit array. When all the reads are processed, the hash table for the reference k-mers is released. Next, novoBreak goes through the reads containing at least one 0 bit to obtain the minimal occurrence of the non-reference k-mers. novoBreak adopts a Bloom filter[21], a probabilistic data structure that tests whether a given element is in a set. A Bloom filter is a bit array of m bits, initialized to 0. k different hash functions (here k denotes the number of hash functions, not the k-mer length) are applied to an element and map the element to k different positions in the array. To add an element, these k positions are set to 1. To test whether an element is in the set, each of the k positions is examined. If there is a 0 at any of the k positions, the element is definitely not in the set. If all the k positions are 1, then either the element is in the set or the positions were coincidentally set to 1 by other elements. Such false positive (FP) errors can happen because different elements can be coincidentally hashed to the same positions in the bit array. Fortunately, the chance of having an error is very small, less than $\left(1 - e^{-k(n+0.5)/(m-1)}\right)^{k}$, where n is the total number of elements, m is the size of the bit array of the Bloom filter and k is the number of hash functions. Note that these rare FP errors do not hurt sensitivity and have a negligible chance of introducing false positive breakpoints, due to the subsequent read clustering, assembly, alignment and variant calling steps. We expand the above standard Bloom filter from one bit to two or more bits per position (default 3 bits in novoBreak) to count whether a k-mer has occurred more than a minimal number of times (default 3 in novoBreak) (Supplementary Note 1) in the dataset. Thus, k-mers introduced by sequencing errors are automatically disregarded and the remaining ones are novo k-mers from the variant alleles.
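The multi-bit Bloom filter described above can be sketched in Python as a counting Bloom filter with saturating cells. This is a simplified illustration, not novoBreak's C implementation: the class name, the use of salted BLAKE2b digests as hash functions, and the one-byte-per-cell storage are all assumptions chosen for readability.

```python
import hashlib

class CountingBloomFilter:
    """Bloom filter whose cells are small saturating counters."""

    def __init__(self, num_cells, num_hashes, bits_per_cell=3):
        self.num_cells = num_cells
        self.num_hashes = num_hashes
        self.max_count = (1 << bits_per_cell) - 1   # 7 for 3-bit cells
        # One byte per cell for readability; a compact version would pack bits.
        self.cells = bytearray(num_cells)

    def _positions(self, item):
        # Derive several positions by salting a single hash function (assumption).
        for salt in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), salt=salt.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_cells

    def add(self, item):
        for pos in self._positions(item):
            if self.cells[pos] < self.max_count:
                self.cells[pos] += 1

    def count(self, item):
        """Estimated count: never below the true count until cells saturate."""
        return min(self.cells[pos] for pos in self._positions(item))

cbf = CountingBloomFilter(num_cells=1 << 16, num_hashes=4)
for kmer in ["ACGTACGTAC"] * 3 + ["TTTTTTTTTT"]:
    cbf.add(kmer)

min_occurrence = 3   # hypothetical threshold mirroring the default quoted in the text
print(cbf.count("ACGTACGTAC") >= min_occurrence)   # True
```

Because collisions can only inflate a cell, the estimate may overcount but does not undercount below saturation, so a k-mer that truly reaches the threshold is not lost at this step.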
For somatic analysis, novoBreak further scans the normal control reads using a hash table and counts the occurrences of these k-mers in the normal reads. Based on these counts, candidate somatic k-mers (i.e., k-mers present in the tumor but not in the normal sample) can be identified, with the effect of cross-contamination between the samples being accounted for. Finally, novoBreak loads read pairs containing the candidate somatic k-mers and automatically removes duplicated read pairs that have identical sequences in both reads.
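The first-pass marking of reads against the reference k-mers, and the observation that a breakpoint is covered by at most k-1 k-mers, can be checked on a toy example. In this Python sketch a set stands in for the reference hash table and a list of integers stands in for the bit array; both are simplifications, and the function name `mark_read` is hypothetical.

```python
def mark_read(read, ref_kmers, k):
    """Return one bit per k-mer of the read: 1 if the k-mer is in the reference."""
    return [1 if read[i:i + k] in ref_kmers else 0
            for i in range(len(read) - k + 1)]

k = 4
reference = "AAAAACCCCCGGGGGTTTTT"
ref_kmers = {reference[i:i + k] for i in range(len(reference) - k + 1)}

read = "AAAAATTTTT"   # spans a deletion of CCCCCGGGGG
bits = mark_read(read, ref_kmers, k)
print(bits)            # [1, 1, 0, 0, 0, 1, 1]
print(bits.count(0))   # 3, i.e. k - 1
```

Reads whose bit arrays contain at least one 0 are the ones carried forward to the counting step.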

Clustering and local assembly

With novo k-mers and the associated read pairs identified, a straightforward method is to assemble all the read pairs directly. However, the cost of assembly is still very high due to the large number of reads. In addition, the presence of alternative alleles, repeats and sequencing errors can easily cause misassemblies. Note that, as shown in Supplementary Fig. 2, at each breakpoint there are k-1 novo k-mers, each covered by many reads. Reads covering the same breakpoint share a subset of the k-1 novo k-mers. Based on this pair-wise relationship between k-mers and reads, we can find the set of read pairs covering a breakpoint using a union-find algorithm[22], which identifies all the connected components in an undirected graph consisting of reads and k-mers (as nodes) and their connections (as edges). To avoid having large clusters with many reads due to repeats or sequencing errors, novoBreak trims the connected components based on read and k-mer statistics. For the purpose of detecting SVs, the computational cost is further reduced by directly reading from bam files and correcting base errors based on high quality aligned reads. After clustering, it is relatively easy to locally assemble the read pairs in each cluster, since the number of read pairs is small and they originate from the same locus of an allele. Almost every modern assembler can be applied to such a task. The novoBreak pipeline uses SSAKE[23] (Supplementary Note 2) to assemble read pairs into contigs. The setting of SSAKE in novoBreak is “-p 1 -k 2 -n 1 -m 16 -x 3 -w 1 -z 30 -o 1”. SSAKE can generate multiple contigs from each cluster. Each contig is aligned by BWA-MEM and analyzed independently. After all the candidate breakpoints are generated, novoBreak merges them and creates a unique set of SVs.
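The clustering idea, connected components over a graph whose nodes are reads and novo k-mers, can be sketched with a small union-find structure in Python. The read identifiers and k-mers below are made up, and the trimming of oversized components mentioned in the text is not modeled.

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def cluster_reads(read_to_kmers):
    """Group read IDs that are connected through shared novo k-mers."""
    uf = UnionFind()
    for read_id, kmer_set in read_to_kmers.items():
        for kmer in kmer_set:
            # Tag nodes so a read ID can never collide with a k-mer string.
            uf.union(("read", read_id), ("kmer", kmer))
    clusters = defaultdict(set)
    for read_id in read_to_kmers:
        clusters[uf.find(("read", read_id))].add(read_id)
    return sorted(sorted(c) for c in clusters.values())

reads = {
    "r1": {"AAAT", "AATT"},
    "r2": {"AATT", "ATTT"},
    "r3": {"GGCA"},
    "r4": {"GGCA", "GCAT"},
}
print(cluster_reads(reads))   # [['r1', 'r2'], ['r3', 'r4']]
```

Each resulting cluster would then be handed to a local assembler independently.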

Somatic SV Scoring methods

novoBreak scores and ranks each predicted breakpoint based on assembly and mapping results. At a given locus, novoBreak calculates a statistical quality score from the data $D$, where $D = \{D_{I,R}\}$ comprises the counts of read pairs supporting the reference allele ($R = r$) and those supporting the variant allele ($R = v$) from the tumor ($I = t$) and the normal ($I = n$) data, respectively; $G \in \{0, 1, 2\}$ indicates whether the locus has a reference (no SV in either tumor or normal), somatic (SV only in tumor) or germline (SV in both tumor and normal) status. We can compute the likelihood of the data, given the status of a locus. For example, the likelihood of the somatic status $G = 1$ can be estimated as: $P(D \mid G = 1) = P(D_t \mid G = 1)\, P(D_n \mid G = 1)$. Because the variant allele fraction in the tumor is unknown, novoBreak uses a beta-binomial distribution to estimate the likelihood of the observed read counts. For example, the number of read pairs supporting a breakpoint in the tumor sample is $D_{t,v} \sim \mathrm{BetaBinomial}(D'_t, \alpha_{t,G}, \beta_{t,G})$, where $\alpha_{I,G}$ and $\beta_{I,G}$ indicate the parameters used for the combinations among $I \in \{t, n\}$ and $G \in \{0, 1, 2\}$. For the somatic status $G = 1$, novoBreak initializes $\alpha_{n,1} = 1$, $\beta_{n,1} = 10$ and $\alpha_{t,1} = 1$, $\beta_{t,1} = 1$ to reflect the concept that SV signal in the normal sample is largely due to noise. For the germline status $G = 2$, novoBreak uses a uniform distribution $\alpha_{t,2} = \beta_{t,2} = \alpha_{n,2} = \beta_{n,2} = 1$. For the reference status $G = 0$, $\alpha_{t,0} = \alpha_{n,0} = 10$ and $\beta_{t,0} = \beta_{n,0} = 1$ were used. $D'_I$ represents the total number of read pairs in sample $I$ at the locus with non-zero chances of spanning the breakpoint. Minor empirical adjustment of these scores and parameters was applied to account for variations introduced by assembly quality, mapping quality, tissue purity, sampling bias, and SV size and type.
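To make the beta-binomial likelihood more concrete, the following Python sketch compares the somatic (G = 1) and germline (G = 2) hypotheses for one hypothetical locus. It rests on several assumptions: that tumor and normal counts are independent given the status, that the (1, 10) parameter pair belongs to the normal sample and the (1, 1) pair to the tumor sample under G = 1, and that the read counts shown are invented. The reference status (G = 0) and the empirical adjustments mentioned in the text are left out.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(x, n, alpha, beta):
    """P(X = x) for X ~ BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(x + 1) - lgamma(n - x + 1)
    return exp(log_choose
               + log_beta(x + alpha, n - x + beta)
               - log_beta(alpha, beta))

# (alpha, beta) per sample for the somatic (1) and germline (2) states.
# The assignment of each pair to tumor ("t") or normal ("n") is assumed.
PARAMS = {
    1: {"t": (1, 1), "n": (1, 10)},
    2: {"t": (1, 1), "n": (1, 1)},
}

def likelihood(status, tumor, normal):
    """tumor and normal are (variant_pairs, total_pairs) tuples."""
    result = 1.0
    for sample, (variant, total) in (("t", tumor), ("n", normal)):
        alpha, beta = PARAMS[status][sample]
        result *= beta_binomial_pmf(variant, total, alpha, beta)
    return result

# Hypothetical locus: 8 of 30 tumor pairs and 0 of 30 normal pairs support the SV.
somatic = likelihood(1, tumor=(8, 30), normal=(0, 30))
germline = likelihood(2, tumor=(8, 30), normal=(0, 30))
print(somatic > germline)   # True
```

With no variant-supporting pairs in the normal sample, the somatic hypothesis receives the higher likelihood, which matches the intent of giving the normal sample a prior concentrated near zero.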

Indel analysis

Indel detection on the IS4 data was performed with novoBreak, GATK-HaplotypeCaller and Strelka as follows.

novoBreak

novoBreak (v1.03) was run under the parameters ‘-k31 -m2’. All the assembled contigs and unassembled short read pairs containing the novo k-mers were mapped to the reference using BWA-MEM[20]. The alignment results were sorted and the coordinates of indels were adjusted using SortSam of Picard (v1.107) (http://broadinstitute.github.io/picard/) and LeftAlignIndels of GATK[14, 24, 25] (v2.8-1), respectively. The CIGAR strings of the alignment results were parsed to generate an indel list (in VCFv4.1 format). Indels were further filtered using Database of Single Nucleotide Polymorphisms dbSNP (Build ID: 138, Available from: http://www.ncbi.nlm.nih.gov/SNP/) and low complexity regions identified with the mdust program (http://compbio.dfci.harvard.edu/tgi/). Finally, only indels with allele fraction greater than 1% were selected.

GATK-HaplotypeCaller

GATK v2.8-1 was run on the same data. First, tumor and normal bam files were realigned using IndelRealigner and left-aligned using LeftAlignIndels. HaplotypeCaller was run separately on the tumor and normal bam files with the parameters ‘--genotyping_mode DISCOVERY -stand_emit_conf 10 -stand_call_conf 30’. Then, SelectVariants of GATK was run with parameter ‘-selectType INDEL’ to generate an indel VCF file for the tumor and the normal. Indels from the tumor and the normal samples were further filtered using VariantFiltration with parameters ‘--filterExpression “QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0”’. Finally, only indels with “PASS” labels that were in the tumor VCF file but not in the normal VCF file were evaluated as somatic indels.

Strelka

Strelka[15] (v1.0.14) was also tested on the dataset. Files and directories were generated and configured according to the documentation (https://sites.google.com/site/strelkasomaticvariantcaller/). Strelka was run with the default parameters. Since the number of somatic indels with the FILTER field “PASS” was too few, we also selected “QSI_ref” field for evaluation.

Experimental Validation

The COLO-829 and COLO-829BL cell lines were purchased from the American Type Culture Collection (ATCC). Primers for genomic PCR were designed using Primer3 (http://biotools.umassmed.edu/bioapps/primer3_www.cgi). The COLO-829 and COLO-829BL cells were cultured in RPMI 1640 (Sigma) supplemented with 10% FBS and 1% penicillin and streptomycin. Genomic DNA was extracted from COLO-829 and COLO-829BL cells using a genomic DNA kit (Invitrogen) and the PCR was performed using GoTaq DNA Polymerase (Promega). Thermal cycling conditions were one cycle of 95 °C for 2 min, followed by 30 cycles of 95 °C for 30 s, 65 °C for 30 s and 72 °C for 1 min, followed by a final extension step of 72 °C for 10 min. PCR products were electrophoresed on 1% agarose gels with ethidium bromide and visualized using UV light illumination.

Data

ICGC-TCGA DREAM Challenge data[13] [SRA:SRP042948] was downloaded from https://www.synapse.org/#!Synapse:syn2280639 with public token (in silico 1, 2 and 3) or approval access with private token from ICGC (in silico 4). The whole genome sequencing data[17] [EGAD00000000055] of the immortal melanoma cancer cell line COLO-829 and lymphoblastoid cell line derived from the same patient COLO-829BL was requested from the European Genome-phenome Archive. The TCGA breast cancer WGS data were obtained through dbGAP [accession number phs000178.v7.p6]. The low grade glioma sample (SJLGG039) is available at European Genome-phenome Archive under accession EGAS00001000255.

System requirements and software availability

novoBreak is written in C and Perl. The source code is freely available at https://sourceforge.net/projects/novobreak/?source=updater. For a 40× 2×101 bp whole-genome tumor and normal pair, novoBreak needs less than 40 GB of main memory and a running time of less than about 6 hours with 10 CPU cores.

22 in total

1. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors: Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal: Genome Res Date: 2010-07-19 Impact factor: 9.043

2. Assemblathon 1: a competitive assessment of de novo short read assembly methods.

Authors: Dent Earl; Keith Bradnam; John St John; Aaron Darling; Dawei Lin; Joseph Fass; Hung On Ken Yu; Vince Buffalo; Daniel R Zerbino; Mark Diekhans; Ngan Nguyen; Pramila Nuwantha Ariyaratne; Wing-Kin Sung; Zemin Ning; Matthias Haimel; Jared T Simpson; Nuno A Fonseca; İnanç Birol; T Roderick Docking; Isaac Y Ho; Daniel S Rokhsar; Rayan Chikhi; Dominique Lavenier; Guillaume Chapuis; Delphine Naquin; Nicolas Maillet; Michael C Schatz; David R Kelley; Adam M Phillippy; Sergey Koren; Shiaw-Pyng Yang; Wei Wu; Wen-Chi Chou; Anuj Srivastava; Timothy I Shaw; J Graham Ruby; Peter Skewes-Cox; Miguel Betegon; Michelle T Dimon; Victor Solovyev; Igor Seledtsov; Petr Kosarev; Denis Vorobyev; Ricardo Ramirez-Gonzalez; Richard Leggett; Dan MacLean; Fangfang Xia; Ruibang Luo; Zhenyu Li; Yinlong Xie; Binghang Liu; Sante Gnerre; Iain MacCallum; Dariusz Przybylski; Filipe J Ribeiro; Shuangye Yin; Ted Sharpe; Giles Hall; Paul J Kersey; Richard Durbin; Shaun D Jackman; Jarrod A Chapman; Xiaoqiu Huang; Joseph L DeRisi; Mario Caccamo; Yingrui Li; David B Jaffe; Richard E Green; David Haussler; Ian Korf; Benedict Paten
Journal: Genome Res Date: 2011-09-16 Impact factor: 9.043

3. CREST maps somatic structural variation in cancer genomes with base-pair resolution.

Authors: Jianmin Wang; Charles G Mullighan; John Easton; Stefan Roberts; Sue L Heatley; Jing Ma; Michael C Rusch; Ken Chen; Christopher C Harris; Li Ding; Linda Holmfeldt; Debbie Payne-Turner; Xian Fan; Lei Wei; David Zhao; John C Obenauer; Clayton Naeve; Elaine R Mardis; Richard K Wilson; James R Downing; Jinghui Zhang
Journal: Nat Methods Date: 2011-06-12 Impact factor: 28.547

4. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

Authors: Geraldine A Van der Auwera; Mauricio O Carneiro; Christopher Hartl; Ryan Poplin; Guillermo Del Angel; Ami Levy-Moonshine; Tadeusz Jordan; Khalid Shakir; David Roazen; Joel Thibault; Eric Banks; Kiran V Garimella; David Altshuler; Stacey Gabriel; Mark A DePristo
Journal: Curr Protoc Bioinformatics Date: 2013

5. Structural variation in two human genomes mapped at single-nucleotide resolution by whole genome de novo assembly.

Authors: Yingrui Li; Hancheng Zheng; Ruibang Luo; Honglong Wu; Hongmei Zhu; Ruiqiang Li; Hongzhi Cao; Boxin Wu; Shujia Huang; Haojing Shao; Hanzhou Ma; Fan Zhang; Shuijian Feng; Wei Zhang; Hongli Du; Geng Tian; Jingxiang Li; Xiuqing Zhang; Songgang Li; Lars Bolund; Karsten Kristiansen; Adam J de Smith; Alexandra I F Blakemore; Lachlan J M Coin; Huanming Yang; Jian Wang; Jun Wang
Journal: Nat Biotechnol Date: 2011-07-24 Impact factor: 54.908

6. A comprehensive catalogue of somatic mutations from a human cancer genome.

Authors: Erin D Pleasance; R Keira Cheetham; Philip J Stephens; David J McBride; Sean J Humphray; Chris D Greenman; Ignacio Varela; Meng-Lay Lin; Gonzalo R Ordóñez; Graham R Bignell; Kai Ye; Julie Alipaz; Markus J Bauer; David Beare; Adam Butler; Richard J Carter; Lina Chen; Anthony J Cox; Sarah Edkins; Paula I Kokko-Gonzales; Niall A Gormley; Russell J Grocock; Christian D Haudenschild; Matthew M Hims; Terena James; Mingming Jia; Zoya Kingsbury; Catherine Leroy; John Marshall; Andrew Menzies; Laura J Mudie; Zemin Ning; Tom Royce; Ole B Schulz-Trieglaff; Anastassia Spiridou; Lucy A Stebbings; Lukasz Szajkowski; Jon Teague; David Williamson; Lynda Chin; Mark T Ross; Peter J Campbell; David R Bentley; P Andrew Futreal; Michael R Stratton
Journal: Nature Date: 2009-12-16 Impact factor: 49.962

Review 7. The impact of translocations and gene fusions on cancer causation.

Authors: Felix Mitelman; Bertil Johansson; Fredrik Mertens
Journal: Nat Rev Cancer Date: 2007-03-15 Impact factor: 60.716

8. Limitations of next-generation genome sequence assembly.

Authors: Can Alkan; Saba Sajjadian; Evan E Eichler
Journal: Nat Methods Date: 2010-11-21 Impact factor: 28.547

9. TIGRA: a targeted iterative graph routing assembler for breakpoint assembly.

Authors: Ken Chen; Lei Chen; Xian Fan; John Wallis; Li Ding; George Weinstock
Journal: Genome Res Date: 2013-12-04 Impact factor: 9.043

10. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection.

Authors: Adam D Ewing; Kathleen E Houlahan; Yin Hu; Kyle Ellrott; Cristian Caloian; Takafumi N Yamaguchi; J Christopher Bare; Christine P'ng; Daryl Waggott; Veronica Y Sabelnykova; Michael R Kellen; Thea C Norman; David Haussler; Stephen H Friend; Gustavo Stolovitzky; Adam A Margolin; Joshua M Stuart; Paul C Boutros
Journal: Nat Methods Date: 2015-05-18 Impact factor: 28.547
