
Common integration sites of published datasets identified using a graph-based framework.

Alessandro Vasciaveo¹, Ivana Velevska², Gianfranco Politano³, Alessandro Savino³, Manfred Schmidt², Raffaele Fronza².

Abstract

With next-generation sequencing, the genomic data available for the characterization of integration sites (IS) have increased dramatically. At present, several thousand viral integration targets in the genome can be investigated in a single experiment to define genomic hot spots. In a previous article, we recast the formal analysis of common integration sites (CIS), which was based on a rigid, fixed-window demarcation, into a more flexible definition grounded on graphs. Here, we present a selection of supporting data related to the graph-based framework (GBF) from that article, in which a collection of CIS was identified in six published datasets. In this work, we focus on two of these datasets, ISRTCGD and ISHIV, which have been discussed previously. Moreover, we show in more detail the design of the workflow that generates the datasets.


Year: 2015 PMID: 27257471 PMCID: PMC4874583 DOI: 10.1016/j.csbj.2015.11.004

Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271

Value of the data

The analyzed dataset provided here can be used as a benchmark to compare the results of the graph modeling approach for CIS identification and analysis implemented in software tools.
Graph modeling approach to the identification of common integration sites.
Validation of the graph-based framework (GBF) against well-known datasets.
Detailed illustrated procedure for the identification of CIS via GBF.

Data

The dataset containing the identified CIS from the Retroviral Tagged Cancer Gene Database (RTCGD) [6] is provided in Table 1 Appendix A and it is obtained by using a Cytoscape 2.8 plugin, which implements some of the features of the GBF method (see how to retrieve the code in [1]). The other datasets are collected using a normal Internet browser. Fig. 1 shows a Venn diagram in which two datasets are compared. The first dataset is the collection of all the genes found with the GBF method, while the second dataset is the list of genes provided by RTCGD which uses the standard window method (SWM) to identify CIS and the next gene approach (NGA) to discover and associate an annotated gene to the identified CIS. For further details about the two approaches, see [1]. With the GBF method, it is possible to discover 1421 genes which are not present in the RTCGD dataset. Only 142 genes were not discovered by the GBF method while they are present in the RTCGD gene list, and 404 of the genes can be found by both methods.
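The three counts in this comparison follow from plain set operations on the two gene lists. A minimal sketch in Python, using tiny hypothetical gene lists (the gene names and the function name `venn_counts` are placeholders for illustration, not taken from the datasets or from the plugin):

```python
def venn_counts(gbf_genes, rtcgd_genes):
    """Return the sizes (only in GBF, only in RTCGD, shared) of two gene lists."""
    gbf, rtcgd = set(gbf_genes), set(rtcgd_genes)
    return len(gbf - rtcgd), len(rtcgd - gbf), len(gbf & rtcgd)


# Hypothetical toy lists; the real lists would come from Table 1 Appendix A
# and from the RTCGD gene list.
toy_gbf = ["Myc", "Gfi1", "Pim1", "Evi1"]
toy_rtcgd = ["Myc", "Pim1", "Notch1"]
print(venn_counts(toy_gbf, toy_rtcgd))  # (2, 1, 2)
```

With the figures reported above, the GBF list would hold 1421 + 404 = 1825 genes and the RTCGD list 142 + 404 = 546 genes.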

Fig. 1

Venn diagram of the gene atmosphere of all identified CIS from the RTCGD dataset using the GBF (graph-based framework) [1] and using the SWM (standard window method) [2].

Experimental design, materials and methods

Experiment workflow

The workflow of the analysis is depicted in Fig. 2. The input is a dataset composed of a list of integration sites (IS). The graph-based framework (GBF) presented in [1] is adopted to perform all the following analyses. The first step is the CIS identification and the computation of some statistics for every CIS. The further steps are optional, but they must be performed in the order given. The second step consists of enhancing the CIS dataset with information from annotated genomic data. This step generates the gene atmosphere (GA) dataset, as shown in Table 2 Appendix A. Using the GA dataset, the next step is the functional analysis, as shown in Table 3 Appendix A.

Fig. 2

Workflow of the full analysis process: starting from the raw dataset to the functional analysis.

Data preparation

The dataset used for the analysis must contain a few attributes in order to be analyzed properly by the GBF method. The mandatory attributes for CIS identification are shown in Table 1; those for the CIS enhancing phase are shown in Table 2.

Table 1

Mandatory attributes of the input dataset for the identification of CIS using the GBF method.

Attributes	Description
Chromosome number	The ordinal number of the chromosome in which the integration event was found
Insertion site position	The position on the genome: a very long integer number representing the base pair where the virus was integrated
Entropy label (e.g. kind of tumor, virus type)	Meta-information used for the computation of the CIS entropy. It is a label that represents a factor of the experiment. For example, it could be the tumor model or type with which the IS is associated
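The three mandatory attributes map naturally onto a small record type. The sketch below assumes a tab-separated input whose column names (`chromosome`, `position`, `label`) are hypothetical, and assumes chromosomes are given as ordinal integers as described in Table 1; it is an illustration, not the input format of the plugin.

```python
import csv
import io
from typing import NamedTuple


class IntegrationSite(NamedTuple):
    chromosome: int  # ordinal number of the chromosome
    position: int    # base pair where the virus was integrated
    label: str       # entropy label, e.g. tumor type


def load_sites(handle):
    """Read IS records from a tab-separated file and sort them by position."""
    reader = csv.DictReader(handle, delimiter="\t")
    sites = [
        IntegrationSite(int(row["chromosome"]), int(row["position"]), row["label"])
        for row in reader
    ]
    # Tuples compare field by field: chromosome first, then position.
    return sorted(sites)


# Hypothetical three-row input used only to exercise the loader.
example = "chromosome\tposition\tlabel\n2\t500\tB-cell\n1\t900\tT-cell\n1\t100\tT-cell\n"
for site in load_sites(io.StringIO(example)):
    print(site)
```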

Table 2

Mandatory attributes of the input dataset for the enhancing analysis, which uses annotated genomic data with the GBF method.

Attributes	Description
Chromosome number	The ordinal number of the chromosome in which the TSS of the gene is located
Transcription start site	The position on the genome: a very long integer number representing the base pair where transcription starts at the 5′-end of a gene sequence

Common integration sites identification

The method presented in [1] allows the identification of CIS on the basis of very few attributes found in the dataset under analysis (see Table 1). Fig. 3 shows the flowchart of the global method that builds the model and identifies the CIS with their statistics.

Fig. 3

Flowchart of the main method for the identification and enhancing of CIS using the graph-based framework.

Starting from the dataset containing the integration sites (IS dataset), it is convenient to sort the dataset by integration position to improve the efficiency of the algorithm. This is the data preparation part (Table 1). Afterwards, as depicted in Fig. 3, model building starts by creating an empty graph. For every IS in the dataset, a node is created and added to the graph. A nested loop then checks, for each vertex already in the graph, whether its distance from the IS just added is below a certain threshold. If it is, an edge connecting the two nodes of the same type (i.e. two IS nodes) is created and added to the graph. When all the IS in the dataset have been processed, the main loop terminates and the graph is ready to be analyzed by the main algorithm for CIS identification. This algorithm can be implemented in different ways (e.g. as an algorithm that extracts the connected components (CC) of an undirected, disconnected graph). An efficient version of this algorithm is presented in [3].
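The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plugin code: the function names, the restriction of edges to IS on the same chromosome, and the `min_order` filter (the smallest component reported as a CIS) are all choices made for this example, and the component extraction is a plain depth-first search rather than the algorithm of [3].

```python
def build_is_graph(sites, threshold):
    """Link IS on the same chromosome whose distance is below `threshold`.

    `sites` is a list of (chromosome, position) tuples; the result maps each
    list index to the set of indices it shares an edge with.
    """
    order = sorted(range(len(sites)), key=lambda i: sites[i])
    adjacency = {i: set() for i in range(len(sites))}
    for a in range(len(order)):
        i = order[a]
        for b in range(a + 1, len(order)):
            j = order[b]
            if sites[j][0] != sites[i][0] or sites[j][1] - sites[i][1] >= threshold:
                break  # input is sorted, so no later IS can be close enough
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Extract the connected components with an iterative depth-first search."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(sorted(component))
    return components


def find_cis(sites, threshold, min_order=2):
    """Return the components with at least `min_order` IS (assumed cutoff)."""
    graph = build_is_graph(sites, threshold)
    return [c for c in connected_components(graph) if len(c) >= min_order]


# Hypothetical IS list: (chromosome, position).
sites = [(1, 100), (1, 150), (1, 5000), (2, 120), (1, 180)]
print(find_cis(sites, threshold=100))  # [[0, 1, 4]]
```

Because components are built from chains of edges, three IS at positions 0, 60 and 120 end up in one component with a threshold of 100, even though the first and last are 120 bp apart. This is the behaviour that distinguishes a graph-based definition from a fixed window.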

Common integration sites statistics computation

Once the CIS have been identified, a set of statistics is computed. The most interesting statistics are presented in Table 3. For further details about how the statistics have been computed, see Paragraph 2.6 in [1].

Table 3

Computed statistics for CIS.

Statistic	Description
CIS order	The total number of IS present in the CIS
CIS dimension	The number of base pairs that contain all the IS belonging to a single CIS (see Section 2.7 for details)
CIS p-value	The p-value associated with the CIS. See Paragraph 3.6 in [1] for a comprehensive explanation
CIS entropy	The entropy of the CIS based on the label from the input dataset (e.g. tumor type, virus type). See Paragraph 3.6 in [1] and Section 2.7

Common integration sites enhancing

Optionally, an enhancing of the CIS dataset can follow. The purpose is to link each IS with its neighborhood on the genome by retrieving annotations present in online databases. Here, we used a normal Internet browser to perform queries accessing annotated data provided online by the BioMart database [4]. The dataset resulting from this step is shown in Table 2 Appendix A, which provides a list of transcriptional elements (TE) composing the GA of all CIS identified in the previous step. As shown in the flowchart in Fig. 3, the process that builds the GA is similar to the process that builds the IS graph. The IS nodes in the graph are linked with the TE nodes if the distance on the genome is below a certain threshold.
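The linking rule for the gene atmosphere can be illustrated as follows. The sketch assumes the annotation has already been downloaded as a list of (chromosome, transcription start site, name) records, as in Table 2; the function name `gene_atmosphere`, the gene names and the threshold value are hypothetical, and the binary search is only one possible way to find nearby TSS.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def gene_atmosphere(cis_sites, tss_records, threshold):
    """Return names of elements whose TSS is closer than `threshold` to any IS.

    `cis_sites` holds the (chromosome, position) pairs of one CIS;
    `tss_records` holds (chromosome, tss_position, name) annotation rows.
    """
    by_chrom = defaultdict(list)
    for chrom, tss, name in tss_records:
        by_chrom[chrom].append((tss, name))
    positions, names = {}, {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        positions[chrom] = [tss for tss, _ in entries]
        names[chrom] = [name for _, name in entries]

    linked = set()
    for chrom, pos in cis_sites:
        if chrom not in positions:
            continue
        # Indices of TSS strictly inside (pos - threshold, pos + threshold).
        lo = bisect_right(positions[chrom], pos - threshold)
        hi = bisect_left(positions[chrom], pos + threshold)
        linked.update(names[chrom][lo:hi])
    return sorted(linked)


# Hypothetical CIS and annotation rows.
cis = [(1, 100), (1, 180)]
annotation = [(1, 150, "GeneA"), (1, 400, "GeneB"), (2, 120, "GeneC"), (1, 279, "GeneD")]
print(gene_atmosphere(cis, annotation, threshold=100))  # ['GeneA', 'GeneD']
```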

Functional annotation using a GA list

If the previous step is performed, a functional annotation using DAVID [5] may follow. This is the last step of the main workflow shown in Fig. 2. Here, we perform this step using the RTCGD dataset; the output is shown in Table 3 Appendix A.

CIS properties computed in the Cytoscape prototype

Property	Description
CIS number	Integer value given to a CIS by the plugin
CIS name	Name of the CIS as it appears in the tabular exported file. It is a composition of the chromosome and the CIS number
CIS order	Number of IS that compose the CIS
CIS average position	Approximate CIS position $p$ calculated as $p = (IS_{first} + IS_{last})/2$; $IS_{first}$ and $IS_{last}$ are the positions on the chromosome of the first and last IS in the CIS
CIS median position	Approximate CIS position $p$ calculated by sorting the $n$ IS as they appear on the chromosome: $p = IS_{(n+1)/2}$ if $n$ is odd, or $p = (IS_{n/2} + IS_{n/2+1})/2$ if $n$ is even. $IS_i$ is the position of the $i$th IS of the CIS. For CIS with an asymmetric distribution of the IS, this approximation gives a more precise estimation
CIS entropy	If the number of different labels (entropy label) found in the CIS is $n$ and the order is $O$, the entropy value is computed as $E = -\sum_{i=1}^{n} \frac{n_i}{O} \log \frac{n_i}{O}$, where $n_i$ is the number of IS labelled with $i$
Normalized entropy	If the number of different labels (entropy label) found in the entire dataset is $N$ and the order of the CIS is $O$, the entropy value is computed as $E_{norm} = -\sum_{i=1}^{N} \frac{n_i}{O} \log_N \frac{n_i}{O}$, where $n_i$ is the number of IS labelled with the label $i$
CIS p value	See the subsection “Statistical model, p-value and log-likelihood ratio test” in [1]
CIS loglike ratio	See the subsection “Statistical model, p-value and log-likelihood ratio test” in [1]
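The positional and entropy properties listed above can be computed from the IS positions and labels of one CIS. The sketch below is an illustration under explicit assumptions: the entropy is taken to be the Shannon entropy of the label fractions with the natural logarithm, and the normalized entropy is taken to be that value divided by the logarithm of the number of labels in the whole dataset. Reference [1] remains authoritative for the exact definitions; the function and key names are hypothetical.

```python
import math
from collections import Counter
from statistics import median


def cis_properties(positions, labels, n_dataset_labels):
    """Compute order, average/median position and entropies for one CIS."""
    positions = sorted(positions)
    order = len(positions)
    counts = Counter(labels)
    # Assumption: Shannon entropy of the label fractions n_i / O, natural log.
    entropy = sum(-(c / order) * math.log(c / order) for c in counts.values())
    # Assumption: normalization by log(N), N = labels in the entire dataset.
    if n_dataset_labels > 1:
        normalized = entropy / math.log(n_dataset_labels)
    else:
        normalized = 0.0
    return {
        "order": order,
        "average_position": (positions[0] + positions[-1]) / 2,
        "median_position": median(positions),
        "entropy": entropy,
        "normalized_entropy": normalized,
    }


# Hypothetical CIS with four IS and two labels.
props = cis_properties(
    positions=[100, 150, 180, 400],
    labels=["T-cell", "T-cell", "B-cell", "B-cell"],
    n_dataset_labels=2,
)
print(props["order"], props["average_position"], props["median_position"])  # 4 250.0 165.0
```

In this toy CIS the IS at position 400 pulls the average position to 250, while the median stays at 165, close to where most of the IS lie.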

Specifications Table

Subject area	Computational biology, systems biology
More specific subject area	Gene therapy, integrational mutagenesis analysis
Type of data	Table, image, dataset
How data was acquired	In silico experiments
Data format	Analyzed datasets, analyzed Excel tables, PNG files
Experimental factors	Integration sites datasets were analyzed with a new computational method for common integration sites identification
Experimental features	A proposed set of common integration sites from two published integration sites datasets (see [1]). A pathway enrichment analysis is also reported
Data source location	Heidelberg, Germany
Data accessibility	Data is with this article and in ref. [1]

References

[1] Raffaele Fronza, Alessandro Vasciaveo, Alfredo Benso, Manfred Schmidt. A Graph Based Framework to Model Virus Integration Sites. Comput Struct Biotechnol J, 2015-11-30.

[2] Ulrich Abel, Annette Deichmann, Ali Nowrouzi, Richard Gabriel, Cynthia C Bartholomae, Hanno Glimm, Christof von Kalle, Manfred Schmidt. Analyzing the number of common integration sites of viral vectors--new methods and computer programs. PLoS One, 2011-10-14.

[4] Syed Haider, Benoit Ballester, Damian Smedley, Junjun Zhang, Peter Rice, Arek Kasprzyk. BioMart Central Portal--unified access to biological data. Nucleic Acids Res, 2009-05-06.

[5] Glynn Dennis, Brad T Sherman, Douglas A Hosack, Jun Yang, Wei Gao, H Clifford Lane, Richard A Lempicki. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol, 2003-04-03.

[6] Keiko Akagi, Takeshi Suzuki, Robert M Stephens, Nancy A Jenkins, Neal G Copeland. RTCGD: retroviral tagged cancer gene database. Nucleic Acids Res, 2004-01-01.
