
forestSV: structural variant discovery through statistical learning.

Jacob J Michaelson, Jonathan Sebat.

Abstract

Detecting genomic structural variants from high-throughput sequencing data is a complex and unresolved challenge. We have developed a statistical learning approach, based on Random Forests, that integrates prior knowledge about the characteristics of structural variants and leads to improved discovery in high-throughput sequencing data. The implementation of this technique, forestSV, offers high sensitivity and specificity coupled with the flexibility of a data-driven approach.


Year:  2012        PMID: 22751202      PMCID: PMC3427657          DOI: 10.1038/nmeth.2085

Source DB:  PubMed          Journal:  Nat Methods        ISSN: 1548-7091            Impact factor:   28.547


Structural variants are a major form of genetic variation[1,2]. The investigation of structural variants has provided key insights into the genetic basis of common human disease[3], most notably in neuropsychiatric disorders, where it has been well established that rare and de novo mutations confer significant risk[4]. A variety of tools have been developed to detect structural variation using genomic sequence data. Some approaches rely on information gleaned from changes in read depth[5,6], while others also include signals from outlier read pairs[7,8]. Still other methods rely on split-read signals[9] (reads that span a breakpoint). Recently, the 1000 Genomes Project undertook a thorough evaluation[10] of these and other structural variant detection methods and found that there was wide variability in terms of sensitivity and specificity among the methods. While some of the tested methods were clear leaders (notably GenomeSTRiP[7] for its low false discovery rate), no single method produced results with a sensitivity comparable to a careful merging of calls from all the methods. The fact that the fusion of structural variant calls from multiple discovery methods yields better results than any single method suggests that existing tools may be individually too narrow in terms of the signals they consider or the methodology they employ. If a single method could effectively capture the collective insight provided by the variety of available tools, it might obviate the need to run all of them and carefully merge the results, saving valuable time and resources. In this work, we present a first step in this direction by drawing on machine learning ideas to facilitate better structural variant discovery. We used previous studies as a basis for extracting the multidimensional patterns that discriminate calls that validate experimentally from those that do not. 
Beyond the typical read-depth and read-pair signatures used by most structural variant discovery approaches, we constructed additional features from the sequencing data (described in Online Methods), thereby representing the genome in a higher-dimensional space (the feature matrix X). We trained a Random Forest classifier[11] to partition this space in a way that optimizes the classification of deletions, duplications, and false positives, whose identities are stored in the class label vector Y. For training, the features in X were constructed from data available in the BAM files for the high-coverage trios (a total of six individuals) in the 1000 Genomes Project[12] (1KG). The structural variant class labels in Y were assigned based on an integration of the calls from refs. 10, 13, 14. The seven class labels ("deletion", "duplication", "deletion-flanking", "duplication-flanking", "false positive deletion", "false positive duplication", and "invariant") are defined in the supplementary information. The rows of X (and consequently the elements of Y) correspond to overlapping 100 bp windows of the genome that we call "subjects" here. When presented with new data (X'), the trained classifier assigns a predicted class (Y') to each subject. The final variant calls are then generated by a simple segmentation routine that merges consecutive subjects of the same predicted class into a single event. These events are ranked by a prediction confidence score produced by the classifier (the Random Forest vote proportion). The performance of forestSV was first assessed on the 1KG data and compared with calls of other widely used methods[10]. We trained the classifier on five of the six 1KG genomes and then made structural variant calls on the left-out genome, doing this for each of the six individuals in turn.
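The windowing scheme behind the "subjects" (100 bp windows offset by 50 bp, as detailed in Online Methods) can be sketched as follows; the function name and toy chromosome length are illustrative, not part of the forestSV package:

```python
def subject_windows(chrom_length, width=100, step=50):
    """Enumerate the overlapping 100 bp windows ("subjects") along a
    chromosome; each window starts 50 bp after the previous one, so
    consecutive subjects overlap by half their width."""
    return [(start, start + width)
            for start in range(0, chrom_length - width + 1, step)]

# A 400 bp toy chromosome yields seven half-overlapping subjects.
windows = subject_windows(400)
```

Each tuple is the genomic span of one row of the feature matrix X; the half-window overlap is what later limits breakpoint resolution to 50 bp increments.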
We then combined the scored calls and constructed sensitivity-specificity curves that reflect forestSV's ability to correctly prioritize the calls in a gold standard set. For comparison, we also marked the sensitivity and specificity of the "donor" call sets (method/group designations are carried over from ref. 10). For both deletion and duplication events, forestSV provided a combination of sensitivity and specificity that, in the gold standard set we examined, was superior to all of the donor call sets and matched the sensitivity and specificity of the merged and genotyped call set provided in ref. 10. This suggests that integrative methods, such as forestSV, outperform approaches that rely on a single methodology or sequence feature for structural variant discovery. Lastly, we examined whether the inclusion of related individuals in the training set of this leave-one-out scheme might lead to a disproportionate boost in the performance of forestSV, and we found no evidence for such an effect. The characteristics of the calls generated by forestSV are in line with the consensus genotyped calls in ref. 10. There is a relationship between event size and forestSV event score for duplications, while for deletions scores are unbiased across a range of event sizes. We found that, although forestSV was trained on high-coverage data, it also performs reasonably well on low-coverage (5×) data. The accumulation of calls as the event score threshold is relaxed differs characteristically between deletions and duplications. As an independent evaluation of our method, we applied forestSV to high-coverage genome sequence data from a five-person family currently under investigation in our laboratory, consisting of a father, a mother, monozygotic (MZ) twins concordant for autism, and an unaffected sibling. Performance of forestSV was evaluated based on patterns of Mendelian inheritance in the family and identity of genotypes between the MZ twins.
"Mendelian inconsistencies" were defined as calls present in any of the children but absent (at > 50% reciprocal overlap) from both parents. "MZ discordances" were defined similarly as events detected in one twin and not found with > 50% reciprocal overlap in the other. We then plotted how rates of Mendelian inconsistency and MZ discordance varied with the event scores in the children. By both measures, error rates are very low among events scoring > 0.65, suggesting that this set is enriched for true structural variants. These results suggest that an event score threshold in the range of 0.65-0.70 will keep error rates below 5% for deletions and ~10% for duplications. At a threshold of 0.65, forestSV discovered 2,300-2,593 unique deletions and 53-72 unique duplications per individual. Clearly, sensitivity for detecting duplications is lower than for deletions, so a more relaxed threshold may be needed. Because this data set has not yet been as thoroughly characterized as the 1KG high-coverage trio samples, we are unable to reliably estimate sensitivity directly. forestSV is particularly well suited to the detection of rare variants because it does not rely on finding variant support in multiple individuals: it can call structural variants effectively in a single genome. This contrasts with earlier methods developed by our group[5] and others[7] that require multiple genomes to be analyzed simultaneously and that favor variants with support in multiple individuals. This is a key advantage given that rare structural variants, with population frequencies of ≤ 10⁻⁴, play an important role in common disease, particularly in diseases of the brain[15]. We implemented forestSV as an R package; it includes an executable that, with a single command, takes BAM files as input and produces deletion and duplication call sets.
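The > 50% reciprocal-overlap criterion underlying both error measures can be made concrete with a short sketch (function names are illustrative, not from the forestSV package):

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two (start, end) intervals: the smaller of
    the two overlap fractions, so a value > 0.5 means each event covers
    more than half of the other."""
    ov = min(a[1], b[1]) - max(a[0], b[0])
    if ov <= 0:
        return 0.0
    return min(ov / (a[1] - a[0]), ov / (b[1] - b[0]))

def is_mendelian_inconsistency(child_call, parent_calls, threshold=0.5):
    """A child call with no > 50% reciprocal-overlap match in either parent."""
    return not any(reciprocal_overlap(child_call, p) > threshold
                   for p in parent_calls)
```

MZ discordance would be checked the same way, comparing one twin's calls against the other's instead of against the parents'.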
The package source code and documentation, together with a thorough technical tutorial describing its use, are available for download from our website at http://sebatlab.ucsd.edu/software. forestSV will continue to improve its discovery abilities as additional training data become available, and we will periodically provide updates to the trained classifier on our website.

Online Methods

The methodology described in this work has been assembled as an R package called forestSV, which is available for download at http://sebatlab.ucsd.edu/software. This package is distributed with a tutorial that will enable the user to reproduce the type of analysis we have demonstrated.

Features

A coverage vector is a one-dimensional data structure whose elements map to positions in the genome and whose entries indicate the number of times the corresponding position is covered by a read. Such a vector gives an indication of read depth. The features that the Random Forest classifier uses to predict the class identity of a genomic window are weighted coverage vectors. As the name implies, a weighted coverage vector has weights applied to the individual reads before the coverage is computed. The weights may be logical (binary) or continuous. In the work described here, weights are extracted from the BAM files and convey information on attributes such as outlier read pairs, base content, mapping quality, CIGAR operations, and mapping flags. Descriptions of the weighting procedures for these features are given below, and their explicit implementation can be found in the source code of the featMat() function in the forestSV package. To construct features, the weighted coverage vectors are evaluated at three scales: local (the 100 bp window itself), left flank (the 1 kb window to the left of the local window), and right flank (the 1 kb window to the right of the local window). The mean value of each weighted coverage vector is taken at these three scales, and the position defined by the local window constitutes a row in the feature matrix. Throughout the text, we refer to these row entries in the feature matrix as "subjects". The columns of the matrix are the 15 principal features, evaluated at each of the three scales (for a total of 45 columns). The next row in the matrix is offset by 50 bp, where the features are evaluated again, and so on. The process of constructing a feature matrix from a BAM[16] file is implemented in the featMat() function provided in the forestSV package. Not all features are equally informative, and a feature's information content varies by event class, i.e.
some features may be useful for discerning duplications but less useful for deletions. We summarized the contribution of each feature using the permutation importance measures available in Random Forests[11]. These measures quantify the degradation in predictive accuracy of the classifier when the values of a feature are randomly permuted; the degree of degradation reflects the importance of that feature. Unsurprisingly, the most important features are those already used by other structural variant discovery algorithms: read depth and paired-end signatures. However, several of the additional features we introduced did in fact contribute to improved discovery. For instance, false duplication calls are more accurately identified when the mapq feature is included in the model, and identification of false deletion calls is improved by information contained in the strand, munm, and mrem features. Random Forest classifiers are generally robust against the inclusion of uninformative features, so the less informative features (such as CIGARd and CIGARi) do not adversely impact the accuracy of the classifier; we retain them for the possibility that they may prove useful with future data and structural variant types. It is common practice in structural variant discovery methods to impose a global correction for GC content[5,6], since regions of extreme GC content are known to have biased read-depth signals[17]. Here we do not impose a global correction, but rather include base content as a predictor in the training process. This has the advantage of avoiding the assumption that extreme GC content leads universally to false positives, independent of all other variables.
If GC content is found to have predictive power in discriminating true variants from false positives in the training set, it is incorporated into the classifier in a way that allows for (though does not require) its effect to be conditional on the other variables used. Such an allowance for dependence among the predictive features cannot be achieved with a single, genome-wide GC content correction. To demonstrate that high-confidence calls generated by forestSV take GC content into account, we compared the distribution of GC content among high-confidence calls (event score > 0.65) with that of low-confidence calls (event score < 0.65). GC content does indeed vary with the event score: higher-scoring calls tend toward intermediate GC content (mean: 0.43), with particular depletion in the high-GC tail. The two distributions differ significantly (P < 10⁻¹⁰ by the Kolmogorov-Smirnov test).
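As a rough illustration of the feature construction described above (not the actual featMat() implementation, and with illustrative function names), a weighted coverage vector and its three-scale summary might look like:

```python
import numpy as np

def weighted_coverage(chrom_len, reads, weights):
    """Weighted coverage vector: each read adds its weight (e.g. a
    mapping-quality value or a 0/1 outlier-pair flag) to every genomic
    position it covers."""
    cov = np.zeros(chrom_len)
    for (start, end), w in zip(reads, weights):
        cov[start:end] += w
    return cov

def feature_row(cov, local_start, local=100, flank=1000):
    """Mean of one weighted coverage vector at the three scales around a
    subject: the local 100 bp window and the 1 kb flanks on either side.
    Fifteen such vectors give the 45 columns of a feature-matrix row."""
    local_mean = cov[local_start:local_start + local].mean()
    left_mean = cov[max(0, local_start - flank):local_start].mean()
    right_mean = cov[local_start + local:local_start + local + flank].mean()
    return local_mean, left_mean, right_mean
```

With unit weights this reduces to plain read depth; swapping in a per-read quality or flag reproduces the weighted variants.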

Training data

Training examples were defined using experimentally validated (or invalidated) structural variant calls for individuals NA12878, NA12891, NA12892, NA19240, NA19238, and NA19239 from refs. 10, 13, 14. Subjects were given one of seven class labels: deletion, duplication, deletion-flanking, duplication-flanking, false positive deletion, false positive duplication, or invariant. Descriptions of these classes, size distributions and event counts for each class in the gold standard set, and the regions used as the basis for the training set are given in the supplementary information. Only Illumina data acquired from the 1000 Genomes Project website were used for construction of the training set features. MAQ-aligned 1KG data were used in the performance comparison with the calls from ref. 10, and BWA-aligned 1KG data were used to train the classifier that was used for structural variant discovery in the BWA-aligned autism data. forestSV offers an open-ended framework that could be re-trained to identify additional classes of structural variation that are not currently well ascertained with any available method. The detection of a new structural variant class could be facilitated by adding new features, indicative of that class, to the matrix X. Training examples of the new class could then be included in the training data and a new classifier constructed. Such extension of forestSV is discussed in greater detail in the tutorial that accompanies the forestSV package.

Donor call sets

The call sets that constitute the work in ref. 10 are referred to throughout this work by the two-letter abbreviations originally given there. These are: Applied Biosystems/Life Technologies (AB), Boston College (BC), BGI-Shenzhen (BG), Broad Institute (BI), Leiden University Medical Center (LN), University of Oxford (OX), University of California, San Diego (SD), Wellcome Trust Sanger Institute (SI), University of Washington (UW), Washington University, St. Louis (WU), and Yale University (YL).

Random Forest classifier

The R reference implementation[18] of Random Forests[11] was used to build the classifier. Forests of 50 trees were grown, and the bootstrap resampling was stratified by class, with a maximum within-class sample size of 500,000 subjects. All other arguments took their default values.
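As a rough analogue of this configuration (the original uses the R randomForest package, not this code), scikit-learn can grow a 50-tree forest on synthetic stand-in data; class_weight="balanced_subsample" is only a loose substitute for the stratified, capped within-class sampling:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the 45-column feature matrix X and 7-class labels Y.
X = rng.normal(size=(700, 45))
Y = rng.integers(0, 7, size=700)

clf = RandomForestClassifier(
    n_estimators=50,                   # 50 trees, as in the paper
    class_weight="balanced_subsample"  # loose analogue of per-class sampsize
).fit(X, Y)

# predict_proba returns the per-class vote proportions used for scoring.
votes = clf.predict_proba(X[:5])
```

Each row of votes sums to 1 and plays the role of the seven class vote proportions described in the segmentation step below.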

Segmentation and scoring of structural variant calls

When presented with new data, the trained classifier outputs seven numeric values for every subject, one for each class, that together sum to 1. These are the class vote proportions, and they indicate the predicted identity of the subject: a higher value indicates higher confidence that the subject belongs to the corresponding class. Since these seven scores are distributed in space, that is, along the length of a chromosome, the way in which they co-vary along the chromosome can be used to identify regions of structural variation. For instance, a region where the vote proportion for the deletion class is consistently higher than the vote proportions of all the other classes is indicative of a deletion event, and the extent of the event is defined by the extent over which the vote proportion for deletion exceeds that of all the other classes (i.e., simple majority vote). More concretely, events (deletions and duplications) are defined as chromosomal regions where contiguous subjects have deletion or duplication as their maximally scoring class. Events of the same class are merged if separated by 1 kb or less. The final score of a merged event is the mean score of the majority class over the subjects within the event boundaries. This process is implemented in the svScore() and svCall() functions provided by the forestSV package. The breakpoint resolution of forestSV calls is limited by the fact that subjects are offset by 50 bp, meaning that all event predictions have boundaries in increments of 50 bp. This is the result of design decisions that favored efficient discovery over high breakpoint resolution, but it can be addressed by supplementing forestSV calls with breakpoint assembly methods such as TIGRA_SV (Chen, L. et al., unpublished, http://genome.wustl.edu/software/tigra_sv).
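The majority-vote segmentation and scoring described above can be sketched as follows; this is an illustrative simplification of what svScore() and svCall() do, not the package code:

```python
def segment_events(classes, scores, step=50, window=100, max_gap=1000):
    """Merge consecutive subjects whose maximally voted class is deletion
    or duplication into events, merging same-class events separated by
    <= max_gap bp. Subject i starts at position i * step.

    classes: per-subject argmax class label; scores: its vote proportion."""
    events = []
    for i, (c, s) in enumerate(zip(classes, scores)):
        if c not in ("deletion", "duplication"):
            continue
        start = i * step
        if (events and events[-1]["class"] == c
                and start - events[-1]["end"] <= max_gap):
            events[-1]["end"] = start + window   # extend the open event
            events[-1]["votes"].append(s)
        else:
            events.append({"class": c, "start": start,
                           "end": start + window, "votes": [s]})
    for e in events:
        e["score"] = sum(e["votes"]) / len(e["votes"])  # mean majority vote
    return events

# Toy example: five consecutive subjects at positions 0, 50, ..., 200 bp.
demo = segment_events(
    ["invariant", "deletion", "deletion", "invariant", "deletion"],
    [0.9, 0.6, 0.8, 0.5, 0.7])
```

In the toy example the three deletion subjects, despite the intervening invariant one, merge into a single event spanning 50-300 bp with a mean majority-class score of 0.7.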

Performance assessment

For each of the six 1KG individuals (NA12878, NA12891, NA12892, NA19240, NA19238, and NA19239), a classifier was trained with the training data from that individual omitted. Structural variants were then called and scored, as described above, for the left-out individual. Sensitivity-specificity curves for deletions and duplications were then calculated on the gold standard set, which consisted of the previously described gold standard deletions and duplications (positives) as well as all false positive deletion and duplication calls (true negatives). Assessed calls were required to have > 50% reciprocal overlap with gold standard calls (both positives and negatives) to be used in the performance calculation. For example, if none of the calls from a method had > 50% reciprocal overlap with any of the gold standard negatives, this would yield 100% specificity; likewise, if none of a method's calls had > 50% reciprocal overlap with the gold standard positives, this would yield 0% sensitivity. The curves represent the pooled performance of forestSV over all six individuals. Because the calls released in ref. 10 do not have scores associated with them, we were only able to calculate single points of sensitivity and specificity for each method, rather than entire curves. We also showed that inclusion of related individuals in the training set does not lead to inflated estimates of predictive performance. For each of the six 1KG individuals, we trained a classifier on 1, 2, 3, 4, and 5 of the remaining genomes, then assessed performance (area under the ROC curve, AUC) using the gold standard deletions as described previously. In every case, related individuals (parents or offspring) were added to the training set last. We performed a within-individual Z-transformation to normalize the AUC values and then regressed these values on the number of genomes used in the training set.
Only data from training sets consisting entirely of unrelated individuals were used in the regression. Prediction intervals for this linear model were computed at the 90%, 95%, and 99% levels. The observed performance values for training sets that include related individuals all fell well within the 90% prediction interval, suggesting that including related individuals during training does not lead to a disproportionate increase in performance compared with including unrelated individuals. This means that our cross-validated performance estimates are not unduly influenced by having related individuals in the training set. Annotation information used to calculate the genomic-feature metrics was derived from hg18 tracks downloaded from http://genome.ucsc.edu. Proportions of total called sequence occupied by genomic features were calculated by first summing the total unique genomic sequence (in bp) called by each method, then assessing the percentage of that sequence covered by the genome feature annotations (segmental duplications, Alu, SINE, LINE, LTR). For centromeres and telomeres, we extended their boundaries by 500 kb and calculated the percentage of calls in each set that intersected these regions. For the application to the autism data, we used a classifier trained on BWA-aligned Illumina sequence data from all six 1KG individuals. We used the same events (with their class labels) as in the previous 1KG validation, but lifted the coordinates over from hg18 to hg19, since the training BWA BAM files were aligned to the latter reference assembly. Using the forestSV framework, we produced genome-wide calls in family 74-0352, which consists of a mother, a father, MZ twins, and a sibling, sequenced to ~40× depth (Illumina, paired-end, BWA-aligned). Mendelian inconsistencies were considered to be calls in a child for which no call in the parents had greater than 50% reciprocal overlap.
Similarly, discordant events between the twins were those that failed to achieve > 50% reciprocal overlap.
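Given scored calls already matched to gold-standard positives and negatives by reciprocal overlap, the pooled sensitivity-specificity curve can be traced by sweeping the event-score threshold, as in this illustrative sketch (names and data are hypothetical):

```python
def sens_spec_curve(scores, labels, thresholds):
    """Sensitivity and specificity of scored calls at each score threshold.

    labels: True for calls matching gold-standard positives, False for
    calls matching gold-standard negatives (validated false positives).
    A call is accepted when its event score is >= the threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    curve = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        curve.append((t, tp / pos, (neg - fp) / neg))
    return curve

# Toy call set: three true events and one validated false positive.
demo_curve = sens_spec_curve([0.9, 0.8, 0.4, 0.3],
                             [True, True, True, False], [0.5])
```

Because the donor call sets carry no scores, each of them contributes only a single (sensitivity, specificity) point rather than a curve, exactly as described above.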

Computational considerations

We found that processing a single genome (~40× coverage) in a single thread (resulting in both deletion and duplication calls) took about 8 GB of RAM and 8 hours on an Intel Core i7 980X processor. Much of this time represents disk I/O while accessing BAM files. The run time can be significantly reduced if the BAM files are on a local (rather than a network-attached) disk.
References (16 in total; first 10 shown)

1.  Detection of large-scale variation in the human genome.

Authors:  A John Iafrate; Lars Feuk; Miguel N Rivera; Marc L Listewnik; Patricia K Donahoe; Ying Qi; Stephen W Scherer; Charles Lee
Journal:  Nat Genet       Date:  2004-08-01       Impact factor: 38.330

Review 2.  CNVs: harbingers of a rare variant revolution in psychiatric genetics.

Authors:  Dheeraj Malhotra; Jonathan Sebat
Journal:  Cell       Date:  2012-03-16       Impact factor: 41.582

Review 3.  Structural variation in the human genome and its role in disease.

Authors:  Paweł Stankiewicz; James R Lupski
Journal:  Annu Rev Med       Date:  2010       Impact factor: 13.739

4.  Integrated detection and population-genetic analysis of SNPs and copy number variation.

Authors:  Steven A McCarroll; Finny G Kuruvilla; Joshua M Korn; Simon Cawley; James Nemesh; Alec Wysoker; Michael H Shapero; Paul I W de Bakker; Julian B Maller; Andrew Kirby; Amanda L Elliott; Melissa Parkin; Earl Hubbell; Teresa Webster; Rui Mei; James Veitch; Patrick J Collins; Robert Handsaker; Steve Lincoln; Marcia Nizzari; John Blume; Keith W Jones; Rich Rava; Mark J Daly; Stacey B Gabriel; David Altshuler
Journal:  Nat Genet       Date:  2008-09-07       Impact factor: 38.330

5.  CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing.

Authors:  Alexej Abyzov; Alexander E Urban; Michael Snyder; Mark Gerstein
Journal:  Genome Res       Date:  2011-02-07       Impact factor: 9.043

6.  Large-scale copy number polymorphism in the human genome.

Authors:  Jonathan Sebat; B Lakshmi; Jennifer Troge; Joan Alexander; Janet Young; Pär Lundin; Susanne Månér; Hillary Massa; Megan Walker; Maoyen Chi; Nicholas Navin; Robert Lucito; John Healy; James Hicks; Kenny Ye; Andrew Reiner; T Conrad Gilliam; Barbara Trask; Nick Patterson; Anders Zetterberg; Michael Wigler
Journal:  Science       Date:  2004-07-23       Impact factor: 47.728

7.  Origins and functional impact of copy number variation in the human genome.

Authors:  Donald F Conrad; Dalila Pinto; Richard Redon; Lars Feuk; Omer Gokcumen; Yujun Zhang; Jan Aerts; T Daniel Andrews; Chris Barnes; Peter Campbell; Tomas Fitzgerald; Min Hu; Chun Hwa Ihm; Kati Kristiansson; Daniel G Macarthur; Jeffrey R Macdonald; Ifejinelo Onyiah; Andy Wing Chun Pang; Sam Robson; Kathy Stirrups; Armand Valsesia; Klaudia Walter; John Wei; Chris Tyler-Smith; Nigel P Carter; Charles Lee; Stephen W Scherer; Matthew E Hurles
Journal:  Nature       Date:  2009-10-07       Impact factor: 49.962

8.  Mapping copy number variation by population-scale genome sequencing.

Authors:  Ryan E Mills; Klaudia Walter; Chip Stewart; Robert E Handsaker; Ken Chen; Can Alkan; Alexej Abyzov; Seungtai Chris Yoon; Kai Ye; R Keira Cheetham; Asif Chinwalla; Donald F Conrad; Yutao Fu; Fabian Grubert; Iman Hajirasouliha; Fereydoun Hormozdiari; Lilia M Iakoucheva; Zamin Iqbal; Shuli Kang; Jeffrey M Kidd; Miriam K Konkel; Joshua Korn; Ekta Khurana; Deniz Kural; Hugo Y K Lam; Jing Leng; Ruiqiang Li; Yingrui Li; Chang-Yun Lin; Ruibang Luo; Xinmeng Jasmine Mu; James Nemesh; Heather E Peckham; Tobias Rausch; Aylwyn Scally; Xinghua Shi; Michael P Stromberg; Adrian M Stütz; Alexander Eckehart Urban; Jerilyn A Walker; Jiantao Wu; Yujun Zhang; Zhengdong D Zhang; Mark A Batzer; Li Ding; Gabor T Marth; Gil McVean; Jonathan Sebat; Michael Snyder; Jun Wang; Kenny Ye; Evan E Eichler; Mark B Gerstein; Matthew E Hurles; Charles Lee; Steven A McCarroll; Jan O Korbel
Journal:  Nature       Date:  2011-02-03       Impact factor: 49.962

9.  Discovery and genotyping of genome structural polymorphism by sequencing on a population scale.

Authors:  Robert E Handsaker; Joshua M Korn; James Nemesh; Steven A McCarroll
Journal:  Nat Genet       Date:  2011-02-13       Impact factor: 38.330

10.  Accurate whole human genome sequencing using reversible terminator chemistry.

Authors:  David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962

