Viola: a structural variant signature extractor with user-defined classifications.

Itsuki Sugita1,2, Shohei Matsuyama2, Hiroki Dobashi2, Daisuke Komura1, Shumpei Ishikawa1.   

Abstract

Here, we present Viola, a Python package that provides structural variant (SV; large-scale genomic DNA variations that can result in disease, e.g. cancer) signature analysis functions and utilities for custom SV classification, merging of output files from multiple SV callers, and SV annotation. We demonstrate that Viola can extract biologically meaningful SV signatures from publicly available SV data for cancer, and we evaluate the computational time required to annotate the data. AVAILABILITY: Viola is available on pip (https://pypi.org/project/Viola-SV/) and the source code is on GitHub (https://github.com/dermasugita/Viola-SV). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2021. Published by Oxford University Press.

Year:  2021        PMID: 34534268      PMCID: PMC8723148          DOI: 10.1093/bioinformatics/btab662

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

Somatic mutations in cancer are the cumulative result of DNA aberrations caused by diverse mutational processes. Recently, large-scale studies of human cancer have revealed characteristic patterns of mutation types, i.e. mutational signatures, arising from specific processes of single nucleotide variant formation. These studies often provide theoretical explanations for known mutational processes and their consequences, e.g. C>A substitutions and CC>TT alterations caused by smoking and ultraviolet light exposure, respectively.

Structural variants (SVs) are another type of DNA mutation, defined as events larger than 50 bp in size or involving multiple chromosomes, and they account for a non-negligible proportion of mutations in cancer cells (Mills et al.; Yi and Ju, 2018). Signature analysis of SVs may therefore provide novel insights into carcinogenesis. The development of high-throughput sequencing technologies and powerful SV callers has improved the accuracy of SV event identification. Several mechanisms of SV formation have also been identified (Yi and Ju, 2018). Therefore, research on SV signatures is gradually becoming feasible.

To date, several attempts have been made to decompose SV patterns into SV signatures, but no established method exists yet. Previous studies have mainly classified SVs according to segment size and revealed an association between small tandem duplications and BRCA1 mutations (Li et al.; Nik-Zainal et al.). However, no consensus has been reached on a precise SV classification method. SVs can be classified by metrics other than length: Li et al. also used replication timing and common fragile sites (CFSs). Interestingly, both replication timing and CFSs have been reported to be biologically meaningful, e.g. the signatures of medium-sized (50–500 kb) tandem duplications occurring at sites of late replication timing have been associated with CDK12 driver mutations, whereas CFS signatures have been associated with gastrointestinal cancer.
Other SV classification methods, such as microhomology and association with transposons, have yet to be considered in detail; therefore, further analysis is required to identify a suitable SV classification method for signature analysis.

At present, very few tools are available for SV signature analysis. To the best of our knowledge, pyCancerSig (Thutkawkorapin et al., 2020), the first tool that can handle SVs for cancer mutation signature analysis, is the only SV signature analysis tool currently available. However, pyCancerSig is limited in its SV classification, as it only supports the traditional SV classes, i.e. deletion, duplication, inversion and translocation, and length-based classification.

The time-consuming nature of parsing variant call format (VCF) files is also an obstacle to SV analysis. VCF is the de facto standard format for recording genetic variant data with high human readability. However, from a data management perspective, VCF can be a bottleneck for analysis owing to its complex structure. For SVs in particular, accurate interpretation of VCF records at the single nucleotide level has a steep learning curve. Difficulties with VCF interpretation cannot be ignored because even a 1 bp error in positioning SVs can have critical consequences, e.g. in microhomology analysis.

Merging SV calls from different callers is also an issue in SV analysis. The precision of SV detection can be improved by merging the results of multiple SV callers (Cameron et al.; Kuzniar et al.); however, different SV callers represent SVs differently in their VCF files, which makes integration challenging.

Here, we present Viola, a highly customizable and flexible Python package that supports SV signature analysis with user-defined SV classification, matrix-generation functions, and a file export system that is compatible with external statistical utilities and facilitates interpretation of results.
Viola accepts VCF files from four popular SV callers, namely Manta, Delly, Lumpy and Gridss, and can also read the BEDPE format (Cameron et al., 2017; Chen et al.; Layer et al.; Rausch et al.). Viola also provides an intuitive VCF file manager for filtering, annotating, converting VCF to BEDPE and multicaller merging.

2 Implementation

2.1 Data structure

Viola converts input SV data files, such as VCF and BEDPE files, into instances of its own Python classes. These instances store SV data as a set of tidy rectangular tables linked via identifiers, such as the SV ID output by the SV callers (Supplementary Fig. S1). The tables follow the principles of tidy data, i.e. each SV record is a row, each variable is a column and each type of observational unit is a table (Wickham, 2014). Consequently, no element stores multiple values, in contrast to the INFO and FORMAT columns of a VCF file. Hence, a specific single value can be accessed by simply specifying the row and column of the table of interest; this provides freedom in data handling without the need for cumbersome code.
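The tidy layout described above can be illustrated with a minimal sketch that uses only the Python standard library. It is not Viola's internal implementation or API; the function names, the record tuples and the table layout (`id`, value index, value) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def parse_info(info: str) -> Dict[str, List[str]]:
    """Split a VCF INFO string into {key: [values]}; flags map to an empty list."""
    parsed: Dict[str, List[str]] = {}
    for entry in info.split(";"):
        if not entry or entry == ".":
            continue
        key, sep, value = entry.partition("=")
        parsed[key] = value.split(",") if sep else []
    return parsed


def to_tidy_tables(records):
    """Turn (sv_id, chrom, pos, info) tuples into one positions table
    plus one long table per INFO key, all linked by the SV ID."""
    positions = []
    info_tables: Dict[str, List[Tuple[str, int, str]]] = {}
    for sv_id, chrom, pos, info in records:
        positions.append({"id": sv_id, "chrom": chrom, "pos": pos})
        for key, values in parse_info(info).items():
            table = info_tables.setdefault(key, [])
            if not values:  # a flag such as IMPRECISE carries no value
                table.append((sv_id, 0, "True"))
            for idx, value in enumerate(values):
                # one row per value, so no cell holds more than one value
                table.append((sv_id, idx, value))
    return positions, info_tables


# Hypothetical records, not taken from any real call set.
records = [
    ("sv1", "chr1", 10000, "SVTYPE=DEL;END=15000;SVLEN=-5000;CIPOS=-10,10"),
    ("sv2", "chr2", 20000, "SVTYPE=DUP;END=90000;SVLEN=70000;IMPRECISE"),
]
positions, info_tables = to_tidy_tables(records)
print(info_tables["CIPOS"])
```

Because multi-valued entries such as CIPOS are spread over several rows, each value can be addressed by SV ID and value index alone.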

2.2 User interface

Viola is written in the Python programming language. Although it is intended for use within Python scripts, some features are available from the command line. Viola supports SV signature analysis with user-defined SV classes (Fig. 1A and Supplementary Fig. S1A and B). A simple feature matrix based on the traditional SV types and SV lengths output by the SV caller can be generated from the command line. Advanced uses such as annotation, filtering and multicaller intersection, which are required to generate a complex feature matrix, are supported within Python scripts. By combining these functions, it is possible to define a wide variety of SV classes, such as ‘duplications located on CFS sites’ and ‘deletions <50 kb in size, located on the early replication timing zones’. These operations can be implemented with simple syntax and are designed so that the SV classification can be refined by trial and error (Supplementary Fig. S2B).
Fig. 1. Visualization of the data flow in the main analysis scenarios. (A) Process of feature matrix generation from multiple samples. (B) Overview of VCF merging system.

From an internal data structure perspective, user-defined SV classes are interpreted as new INFO entries of the VCF file. Hence, users can output new VCF or BEDPE files annotated with the novel SV classes, as well as generate a signature-analysis-ready feature matrix according to these additional SV classes.

Alongside signature analysis, Viola has the following features:

(i) Support of well-known SV callers, including Manta, Delly, Lumpy and Gridss. The notation has been unified as much as possible to facilitate subsequent processing, including merging (Fig. 1B).
(ii) Fast annotation methods that utilize the interval tree algorithm. Source files in BED format are accepted; thus, information such as gene names, CFSs, replication timing and copy number can be annotated if it can be expressed in BED format.
(iii) An intuitive method for filtering SV records. In addition to filtering on genomic coordinates and INFO fields, filtering on FORMAT fields is possible.
(iv) Estimation of the length and sequence of microhomology from SV breakpoint positions. Where SV callers do not return microhomology information, or publicly available SV data do not contain such information, Viola can estimate microhomology using the reference sequence.

The use of these features is described in detail in the official Viola documentation, which is available online (https://dermasugita.github.io/ViolaDocs/docs/html/index.html).
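As an illustration of how microhomology can be estimated from breakpoint positions and a reference sequence, the following sketch handles the simplest case, a deletion. It is a simplified stand-in rather than Viola's algorithm: the function name, the 0-based half-open coordinates and the toy reference string are assumptions, and other SV types or tandem-repeat contexts are not handled.

```python
from typing import Tuple


def deletion_microhomology(reference: str, start: int, end: int) -> Tuple[int, str]:
    """Estimate microhomology for a deletion of reference[start:end]
    (0-based, half-open). Returns (length, sequence).

    The breakpoint of a deletion is ambiguous whenever bases at one edge of
    the deleted segment equal the bases just outside the other edge; the
    number of such bases is the microhomology length.
    """
    if not 0 <= start < end <= len(reference):
        raise ValueError("deletion must lie within the reference")
    ref = reference.upper()
    span = end - start

    # Bases at the start of the deleted segment that match the right flank.
    forward = 0
    while (forward < span and end + forward < len(ref)
           and ref[start + forward] == ref[end + forward]):
        forward += 1

    # Bases at the end of the deleted segment that match the left flank.
    backward = 0
    while (backward < span and start - 1 - backward >= 0
           and ref[start - 1 - backward] == ref[end - 1 - backward]):
        backward += 1

    sequence = ref[start - backward:start + forward]
    return len(sequence), sequence


# Toy reference: left flank TTGA, deleted CTGAAAA, right flank CTGCC.
toy_reference = "TTGACTGAAAACTGCC"
print(deletion_microhomology(toy_reference, 4, 11))
```

In the toy example, shifting the deletion one base to the left yields the same junction sequence, which is why the base before the reported start is counted as part of the microhomology. This also shows why a 1 bp error in breakpoint position changes the result.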

2.3 Custom SV classification overview

With Viola, any information in the INFO field of the VCF can be used for SV classification. Many SV callers write the SV type and length in the INFO field by default, making it easy to classify by these variables. For BEDPE files, which do not define a field corresponding to the INFO field of a VCF file, Viola automatically generates INFO fields such as SV length and type. Additionally, new INFO fields can be added using BED file annotation and microhomology prediction. BED files can be used to annotate genes, CFSs, replication timing, copy numbers, etc., which individually or in combination can be used to classify SVs. For usability, two SV classifications are available as default settings of the Viola function. One is a simple length-based classification, and the other is the classification used in the analysis in Section 2.5 (Supplementary Table S1A and B).
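The idea of combining INFO-like variables into user-defined classes can be sketched as an ordered list of named rules. This is a conceptual illustration only and does not use Viola's API; the dictionary keys (`svtype`, `svlen`, `on_cfs`, `rep_timing`) and the class names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

SVRecord = Dict[str, object]
Rule = Tuple[str, Callable[[SVRecord], bool]]

# Rules are checked in order; the first matching rule names the class.
RULES: List[Rule] = [
    ("DUP on CFS",
     lambda sv: sv["svtype"] == "DUP" and sv["on_cfs"]),
    ("DEL <50 kb, early replication",
     lambda sv: sv["svtype"] == "DEL" and sv["svlen"] < 50_000
     and sv["rep_timing"] == "early"),
    ("DEL <50 kb",
     lambda sv: sv["svtype"] == "DEL" and sv["svlen"] < 50_000),
]


def classify(sv: SVRecord, rules: List[Rule] = RULES, default: str = "others") -> str:
    """Return the name of the first rule the SV satisfies, else `default`."""
    for name, predicate in rules:
        if predicate(sv):
            return name
    return default


# Hypothetical annotated SV records.
example_svs = [
    {"svtype": "DUP", "svlen": 120_000, "on_cfs": True, "rep_timing": "late"},
    {"svtype": "DEL", "svlen": 8_000, "on_cfs": False, "rep_timing": "early"},
    {"svtype": "DEL", "svlen": 8_000, "on_cfs": False, "rep_timing": "late"},
    {"svtype": "INV", "svlen": 2_000, "on_cfs": False, "rep_timing": "early"},
]
labels = [classify(sv) for sv in example_svs]
print(labels)
```

Keeping the rules in an ordered list makes the trial-and-error refinement mentioned in Section 2.2 a matter of editing or reordering entries, with every SV still assigned to exactly one class.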

2.4 VCF merging strategy

Viola provides multicaller merging systems. SVs from different callers are merged into a single SV when the following conditions are satisfied: (i) the genomic coordinates of the SV break-ends are close to each other according to the user-specified criterion (proximity-based or confidence interval-based, as described below); (ii) the strands of the SV break-ends are concordant; and (iii) the SVs overlap each other by at least 1 bp. The latter two conditions are included to avoid merging discordant SV types and small non-overlapping SVs. Currently, two criteria for evaluating genomic coordinates have been implemented. The proximity-based criterion uses the representative genomic coordinates, e.g. the POS field and the END entry of the INFO field; SV records lying within a user-defined threshold of each other are merged. The confidence interval-based criterion uses the confidence intervals reported by SV callers in the CIPOS and CIEND entries of the INFO field; SV records are considered a single event when their confidence intervals share at least 1 bp of genomic coordinates.
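The three merging conditions can be written as a pairwise predicate. The sketch below is an assumption-laden illustration, not Viola's code: the class and function names are hypothetical, the proximity threshold is treated as inclusive, coordinates are taken as 1-based and inclusive, and condition (iii) is treated as satisfied for interchromosomal SVs because span overlap is undefined for them.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Breakend:
    chrom: str
    pos: int                      # representative coordinate (POS or END)
    strand: str                   # "+" or "-"
    ci: Tuple[int, int] = (0, 0)  # offsets as in CIPOS/CIEND, e.g. (-30, 30)


@dataclass(frozen=True)
class SV:
    caller: str
    bp1: Breakend
    bp2: Breakend


def close_by_proximity(a: Breakend, b: Breakend, threshold: int) -> bool:
    return a.chrom == b.chrom and abs(a.pos - b.pos) <= threshold


def close_by_ci(a: Breakend, b: Breakend) -> bool:
    """True when the two confidence intervals share at least 1 bp."""
    if a.chrom != b.chrom:
        return False
    lo = max(a.pos + a.ci[0], b.pos + b.ci[0])
    hi = min(a.pos + a.ci[1], b.pos + b.ci[1])
    return lo <= hi


def spans_overlap(x: SV, y: SV) -> bool:
    # Assumption: overlap is only checked for intrachromosomal SVs.
    if x.bp1.chrom != x.bp2.chrom or y.bp1.chrom != y.bp2.chrom:
        return True
    return max(x.bp1.pos, y.bp1.pos) <= min(x.bp2.pos, y.bp2.pos)


def mergeable(x: SV, y: SV, mode: str = "proximity", threshold: int = 100) -> bool:
    if mode == "proximity":
        close = (close_by_proximity(x.bp1, y.bp1, threshold)
                 and close_by_proximity(x.bp2, y.bp2, threshold))
    elif mode == "confidence_interval":
        close = close_by_ci(x.bp1, y.bp1) and close_by_ci(x.bp2, y.bp2)
    else:
        raise ValueError(f"unknown mode: {mode}")
    strands_ok = (x.bp1.strand == y.bp1.strand and x.bp2.strand == y.bp2.strand)
    return close and strands_ok and spans_overlap(x, y)


# Two hypothetical calls of the same deletion by different callers.
call_a = SV("manta", Breakend("chr1", 1000, "+", (-30, 30)),
            Breakend("chr1", 5000, "-", (-30, 30)))
call_b = SV("delly", Breakend("chr1", 1040, "+", (-20, 20)),
            Breakend("chr1", 5060, "-", (-40, 40)))
print(mergeable(call_a, call_b, mode="proximity", threshold=100))
print(mergeable(call_a, call_b, mode="confidence_interval"))
```

A full merger would additionally group all pairwise-mergeable records across callers and assign them a shared ID; that grouping step is not shown here.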

2.5 Application

2.5.1 Matrix generation with simple code

We ran Viola to generate an SV feature matrix using public BEDPE files reported in a PCAWG study (Li et al.). First, we downloaded 2748 BEDPE files from the ICGC data portal (https://dcc.icgc.org/releases/PCAWG/consensus_sv) and used Viola to read the 2605 non-empty files as a MultiBedpe instance. Second, the instance was annotated using CFS and replication timing BED files that we built according to the PCAWG study. We defined 25 SV classes according to CFSs, replication timing and SV length and then generated a 2605 × 25 feature matrix. These operations were written in only 11 lines of Python code, excluding the code for custom SV definitions (Supplementary Fig. S2A). The matrix generated here can be easily reproduced by following the tutorial in the official Viola documentation.
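The shape of the resulting feature matrix (samples as rows, SV classes as columns, counts as values) can be illustrated with a short standard-library sketch. It does not reproduce the 11-line Viola script referred to above; the sample IDs, class names and the `feature_matrix` helper are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def feature_matrix(labels_by_sample: Dict[str, List[str]],
                   classes: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Count SVs per class for every sample.

    Returns the row order (sample IDs) and a samples x classes count matrix
    whose columns follow the order given in `classes`.
    """
    sample_ids = sorted(labels_by_sample)
    matrix = []
    for sample_id in sample_ids:
        counts = Counter(labels_by_sample[sample_id])
        matrix.append([counts.get(name, 0) for name in classes])
    return sample_ids, matrix


# Hypothetical class labels per SV for three samples.
classes = ["DEL <50 kb", "DUP <50 kb", "DUP on CFS", "others"]
labels_by_sample = {
    "sample_B": ["DUP on CFS", "others", "DUP on CFS"],
    "sample_A": ["DEL <50 kb", "DEL <50 kb", "DUP <50 kb"],
    "sample_C": [],
}
sample_ids, matrix = feature_matrix(labels_by_sample, classes)
for sample_id, row in zip(sample_ids, matrix):
    print(sample_id, row)
```

With 2605 samples and 25 classes, the same procedure would give the 2605 × 25 matrix described above.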

2.5.2 Signature extraction analysis

We extracted nine SV signatures from the generated matrix using a Viola function that simultaneously performs non-negative matrix factorization and cluster stability evaluation (Supplementary Figs S3 and S4). Several signatures, including the signatures of CFSs, small deletions (<50 kb) and small duplications (<50 kb), were comparable to those in the PCAWG study (Li et al.). We further explored the association between each of the nine signatures and driver mutations of three well-known DNA repair genes: BRCA1, BRCA2 and CDK12 (Supplementary Table S2). These genes were significantly associated with the small duplication signature, small deletion signature and medium–large duplication signature, respectively, as expected from previous studies (Li et al.; Menghi et al.; Nik-Zainal et al.; Popova et al.) (Supplementary Table S1).
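For readers unfamiliar with non-negative matrix factorization (NMF), the sketch below factorizes a small count matrix V into exposures W and signatures H using the classical multiplicative update rules for the Frobenius norm. It is written in pure Python for self-containment and is not the Viola function mentioned above: the function names, the toy matrix and the fixed iteration count are assumptions, and the cluster stability evaluation is not covered.

```python
import random


def matmul(a, b):
    b_columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_columns] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0]))) ** 0.5


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate v (n x m, non-negative) as w (n x k) times h (k x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + eps for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return w, h


# Toy 4-sample x 3-class count matrix built from two underlying patterns.
counts = [[4, 0, 1], [0, 3, 1], [4, 3, 2], [8, 0, 2]]
exposures, signatures = nmf(counts, k=2)
print(round(frobenius_error(counts, exposures, signatures), 4))
```

In practice the number of signatures k is not known in advance, which is why a stability evaluation across repeated runs and candidate values of k is usually performed alongside the factorization.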

2.5.3 Multicaller VCF merging

We synthesized VCF files mimicking the outputs of Manta, Delly, Lumpy and Gridss. These files shared several SVs whose recorded positions differed by up to 100 bp. In addition, they were designed so that the confidence intervals of the shared SV break-ends overlapped each other. The four VCF files were merged by Viola using the two user-selectable criteria, proximity-based and confidence interval-based. First, we tested proximity-based merging with the proximity threshold set to 100 bp. SV events located within 100 bp of each other were given the same ID. We removed SV records called by only a single SV caller. All shared SVs were merged as expected and successfully exported as a VCF file (Supplementary Data S1). Second, we examined confidence interval-based merging. SV events whose confidence intervals shared at least 1 bp of genomic coordinates were merged and given the same ID. SV records supported by only a single SV caller were filtered out. The resulting VCF file matched the expected output.

2.5.4 Annotation performance

We tested the performance of annotation on the 2605 BEDPE files using 18 lines of CFS annotations in BED format. In total, 618 492 break-ends were annotated according to whether each was located on a CFS. On average, this took 7.5 min to complete using a single thread on an Ubuntu x86_64 server (Intel Core i7-8700K CPU at 3.70 GHz).
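The presence/absence annotation of break-ends against BED regions can be sketched as follows. Note that this example deliberately swaps in a simpler technique, binary search over sorted start coordinates, instead of the interval tree used by Viola; it therefore assumes that the regions on each chromosome do not overlap. The region names and coordinates are hypothetical, BED intervals are taken as 0-based half-open and break-end positions as 1-based.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(bed_rows):
    """Group (chrom, start, end, name) rows by chromosome and sort by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name in bed_rows:
        by_chrom[chrom].append((start, end, name))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([start for start, _, _ in intervals], intervals)
    return index


def annotate(index, chrom, pos_1based):
    """Return the name of the region containing the break-end, else None."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    pos0 = pos_1based - 1                  # convert to 0-based
    i = bisect_right(starts, pos0) - 1     # last region starting at or before pos0
    if i >= 0 and pos0 < intervals[i][1]:  # half-open end coordinate
        return intervals[i][2]
    return None


# Hypothetical fragile-site regions, not real CFS coordinates.
bed_rows = [
    ("chr3", 60000, 61000, "CFS_A"),
    ("chr16", 78000, 79000, "CFS_B"),
]
index = build_index(bed_rows)
print(annotate(index, "chr3", 60500))
print(annotate(index, "chr3", 61001))
```

Each lookup costs a binary search over the regions of one chromosome, so annotating many break-ends against a small BED file is dominated by reading and writing the records rather than by the search itself.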

3 Future works

Although Viola provides useful functions for SV data manipulation, especially SV signature analysis, further enhancements would make the software more useful and allow it to cover a wider range of research questions. First, functions for handling more complex SV events, such as chromothripsis and chromoplexy, which result from multiple DNA damage occurring in a single event, would be desirable (Li et al.; Stephens et al.). Such features may lead to the discovery of new SV signatures and the elucidation of the mechanistic basis of SV formation. Second, a more detailed annotation system would facilitate a more specific characterization of each SV event: the current version of Viola does not support annotation based on nucleotide sequence-level analysis, e.g. the frameshift status of affected genes or the impact of putative fusion genes, with the exception of microhomology inference. Finally, more SV callers need to be supported, including those for long-read sequencing technology.

4 Conclusion

We developed Viola, a tool for SV signature analysis that allows highly customizable SV classification. This tool also overcomes the difficulty of parsing current VCF files as well as the problem of different notations derived from different callers. Viola will help stimulate cancer genome research to better understand the biological significance of SVs.
  15 in total

1.  Ovarian Cancers Harboring Inactivating Mutations in CDK12 Display a Distinct Genomic Instability Pattern Characterized by Large Tandem Duplications.

Authors:  Tatiana Popova; Elodie Manié; Valentina Boeva; Aude Battistella; Oumou Goundiam; Nicholas K Smith; Christopher R Mueller; Virginie Raynal; Odette Mariani; Xavier Sastre-Garau; Marc-Henri Stern
Journal:  Cancer Res       Date:  2016-01-19       Impact factor: 12.701

2.  Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications.

Authors:  Xiaoyu Chen; Ole Schulz-Trieglaff; Richard Shaw; Bret Barnes; Felix Schlesinger; Morten Källberg; Anthony J Cox; Semyon Kruglyak; Christopher T Saunders
Journal:  Bioinformatics       Date:  2015-12-08       Impact factor: 6.937

3.  Massive genomic rearrangement acquired in a single catastrophic event during cancer development.

Authors:  Philip J Stephens; Chris D Greenman; Beiyuan Fu; Fengtang Yang; Graham R Bignell; Laura J Mudie; Erin D Pleasance; King Wai Lau; David Beare; Lucy A Stebbings; Stuart McLaren; Meng-Lay Lin; David J McBride; Ignacio Varela; Serena Nik-Zainal; Catherine Leroy; Mingming Jia; Andrew Menzies; Adam P Butler; Jon W Teague; Michael A Quail; John Burton; Harold Swerdlow; Nigel P Carter; Laura A Morsberger; Christine Iacobuzio-Donahue; George A Follows; Anthony R Green; Adrienne M Flanagan; Michael R Stratton; P Andrew Futreal; Peter J Campbell
Journal:  Cell       Date:  2011-01-07       Impact factor: 41.582

4.  Patterns of somatic structural variation in human cancer genomes.

Authors:  Yilong Li; Nicola D Roberts; Jeremiah A Wala; Ofer Shapira; Steven E Schumacher; Kiran Kumar; Ekta Khurana; Sebastian Waszak; Jan O Korbel; James E Haber; Marcin Imielinski; Joachim Weischenfeldt; Rameen Beroukhim; Peter J Campbell
Journal:  Nature       Date:  2020-02-05       Impact factor: 49.962

5.  sv-callers: a highly portable parallel workflow for structural variant detection in whole-genome sequence data.

Authors:  Arnold Kuzniar; Jason Maassen; Stefan Verhoeven; Luca Santuari; Carl Shneider; Wigard P Kloosterman; Jeroen de Ridder
Journal:  PeerJ       Date:  2020-01-06       Impact factor: 2.984

6.  DELLY: structural variant discovery by integrated paired-end and split-read analysis.

Authors:  Tobias Rausch; Thomas Zichner; Andreas Schlattl; Adrian M Stütz; Vladimir Benes; Jan O Korbel
Journal:  Bioinformatics       Date:  2012-09-15       Impact factor: 6.937

7.  LUMPY: a probabilistic framework for structural variant discovery.

Authors:  Ryan M Layer; Colby Chiang; Aaron R Quinlan; Ira M Hall
Journal:  Genome Biol       Date:  2014-06-26       Impact factor: 13.583

8.  Landscape of somatic mutations in 560 breast cancer whole-genome sequences.

Authors:  Serena Nik-Zainal; Helen Davies; Johan Staaf; Manasa Ramakrishna; Dominik Glodzik; Xueqing Zou; Inigo Martincorena; Ludmil B Alexandrov; Sancha Martin; David C Wedge; Peter Van Loo; Young Seok Ju; Marcel Smid; Arie B Brinkman; Sandro Morganella; Miriam R Aure; Ole Christian Lingjærde; Anita Langerød; Markus Ringnér; Sung-Min Ahn; Sandrine Boyault; Jane E Brock; Annegien Broeks; Adam Butler; Christine Desmedt; Luc Dirix; Serge Dronov; Aquila Fatima; John A Foekens; Moritz Gerstung; Gerrit K J Hooijer; Se Jin Jang; David R Jones; Hyung-Yong Kim; Tari A King; Savitri Krishnamurthy; Hee Jin Lee; Jeong-Yeon Lee; Yilong Li; Stuart McLaren; Andrew Menzies; Ville Mustonen; Sarah O'Meara; Iris Pauporté; Xavier Pivot; Colin A Purdie; Keiran Raine; Kamna Ramakrishnan; F Germán Rodríguez-González; Gilles Romieu; Anieta M Sieuwerts; Peter T Simpson; Rebecca Shepherd; Lucy Stebbings; Olafur A Stefansson; Jon Teague; Stefania Tommasi; Isabelle Treilleux; Gert G Van den Eynden; Peter Vermeulen; Anne Vincent-Salomon; Lucy Yates; Carlos Caldas; Laura van't Veer; Andrew Tutt; Stian Knappskog; Benita Kiat Tee Tan; Jos Jonkers; Åke Borg; Naoto T Ueno; Christos Sotiriou; Alain Viari; P Andrew Futreal; Peter J Campbell; Paul N Span; Steven Van Laere; Sunil R Lakhani; Jorunn E Eyfjord; Alastair M Thompson; Ewan Birney; Hendrik G Stunnenberg; Marc J van de Vijver; John W M Martens; Anne-Lise Børresen-Dale; Andrea L Richardson; Gu Kong; Gilles Thomas; Michael R Stratton
Journal:  Nature       Date:  2016-05-02       Impact factor: 49.962

9.  Patterns and mechanisms of structural variations in human cancer.

Authors:  Kijong Yi; Young Seok Ju
Journal:  Exp Mol Med       Date:  2018-08-07       Impact factor: 8.718

10.  pyCancerSig: subclassifying human cancer with comprehensive single nucleotide, structural and microsatellite mutational signature deconstruction from whole genome sequencing.

Authors:  Jessada Thutkawkorapin; Jesper Eisfeldt; Emma Tham; Daniel Nilsson
Journal:  BMC Bioinformatics       Date:  2020-04-03       Impact factor: 3.169
