Alessandro Vasciaveo, Ivana Velevska, Gianfranco Politano, Alessandro Savino, Manfred Schmidt, Raffaele Fronza.
Abstract
Next-generation sequencing has dramatically increased the genomic data available for characterizing integration sites (IS). A single experiment can now investigate several thousand viral integration targets in the genome to define genomic hot spots. In a previous article, we recast the formal analysis of common integration sites (CIS), which had relied on a rigid fixed-window demarcation, into a more flexible definition grounded in graphs. Here, we present a selection of supporting data related to the graph-based framework (GBF) from that article, in which a collection of CIS was identified in six published datasets. This work focuses on two of those datasets, ISRTCGD and ISHIV, which have been discussed previously. We also show in more detail the workflow design used to generate the datasets.
Year: 2015 PMID: 27257471 PMCID: PMC4874583 DOI: 10.1016/j.csbj.2015.11.004
Source DB: PubMed Journal: Comput Struct Biotechnol J ISSN: 2001-0370 Impact factor: 7.271
Fig. 1. Venn diagram of the gene atmosphere of all identified CIS from the RTCGD dataset using the GBF (graph-based framework) [1] and using the SWM (standard window method) [2].
Fig. 2. Workflow of the full analysis process, from the raw dataset to the functional analysis.
Mandatory attributes of the input dataset for the identification of CIS using the GBF method.
| Attributes | Description |
|---|---|
| Chromosome number | The ordinal number of the chromosome in which the integration event was found |
| Insertion site position | The position on the genome: a very long integer number representing the base pair where the virus was integrated |
| Entropy label (e.g. kind of tumor, virus type) | Meta-information used to compute the CIS entropy. It is a label that represents a factor of the experiment, for example the tumor model or tumor type with which the IS is associated |
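
The three mandatory attributes above can be held in a small record type. The following is a minimal sketch, not the authors' input format: the tab-separated layout, the column order, the `IntegrationSite` name, and the choice of a string for the chromosome (so that identifiers such as "X" also fit) are all assumptions made for illustration.

```python
import csv
import io
from dataclasses import dataclass


@dataclass(frozen=True)
class IntegrationSite:
    chromosome: str  # chromosome identifier (kept as text; an assumption)
    position: int    # base-pair coordinate of the integration event
    label: str       # entropy label, e.g. tumor type or virus type


def read_integration_sites(handle):
    """Parse rows of chromosome, position, label from a tab-separated stream.

    The column order is hypothetical; adapt it to the actual dataset.
    """
    sites = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        sites.append(IntegrationSite(row[0], int(row[1]), row[2]))
    return sites


# Made-up example rows, only to show the expected shape of the input.
example = io.StringIO(
    "# chromosome\tposition\tlabel\n"
    "1\t1500\tB-cell lymphoma\n"
    "1\t2300\tT-cell lymphoma\n"
)
sites = read_integration_sites(example)
```

Python integers have arbitrary precision, so the "very long integer" positions need no special handling.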
Mandatory attributes of the annotated genomic dataset used to enhance the analysis with the GBF method.
| Attributes | Description |
|---|---|
| Chromosome number | The ordinal number of the chromosome in which the TSS of the gene is located |
| Transcription start site | The position on the genome: a very long integer number representing the base pair where transcription starts at the 5′-end of a gene sequence |
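
One plausible use of these two attributes is to relate each IS to the closest transcription start site (TSS) on the same chromosome. The sketch below assumes that reading; the function name `nearest_tss`, the dictionary layout, and the coordinates are hypothetical and are not taken from the GBF implementation.

```python
from bisect import bisect_left


def nearest_tss(position, sorted_tss):
    """Return the TSS coordinate closest to `position`.

    `sorted_tss` must be sorted in ascending order. On a tie, the lower
    coordinate is returned.
    """
    if not sorted_tss:
        raise ValueError("no TSS available for this chromosome")
    i = bisect_left(sorted_tss, position)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda tss: abs(tss - position))


# Made-up TSS coordinates, grouped by chromosome and sorted once up front.
tss_by_chrom = {"1": sorted([1000, 5000, 9000])}
closest = nearest_tss(2900, tss_by_chrom["1"])
distance = abs(closest - 2900)
```

Sorting once per chromosome and using binary search keeps each lookup logarithmic in the number of annotated TSS, which matters when several thousand IS are annotated.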
Fig. 3. Flowchart of the main method for identifying and enhancing CIS using the graph-based framework.
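
The text here does not spell out how the graph is built, so the following is only an illustrative sketch under a stated assumption: IS are nodes, two IS on the same chromosome are joined by an edge when they lie within a distance threshold, and candidate CIS are the connected components. The function name and the threshold value are hypothetical. For such a one-dimensional threshold graph, connected components coincide with maximal runs of sorted positions whose consecutive gaps do not exceed the threshold, which is what the code exploits.

```python
from collections import defaultdict


def connected_components_by_distance(sites, threshold):
    """Group (chromosome, position) pairs into components of the threshold graph."""
    by_chrom = defaultdict(list)
    for chrom, pos in sites:
        by_chrom[chrom].append(pos)

    components = []
    for chrom, positions in by_chrom.items():
        positions.sort()
        current = [positions[0]]
        for pos in positions[1:]:
            if pos - current[-1] <= threshold:
                current.append(pos)  # still chained to the previous site
            else:
                components.append((chrom, current))
                current = [pos]      # gap too large: start a new component
        components.append((chrom, current))
    return components


# Made-up coordinates; the threshold of 100 bp is purely illustrative.
sites = [("1", 100), ("1", 180), ("1", 260), ("1", 5000), ("2", 150)]
components = connected_components_by_distance(sites, threshold=100)
```

Unlike a fixed window, the first component spans 160 bp even though the threshold is 100 bp, because membership is decided by chaining neighbours rather than by a window boundary. Whether isolated sites count as CIS is left open here.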
Computed statistics for CIS.
| Statistic | Description |
|---|---|
| CIS order | The total number of IS present in the CIS |
| CIS dimension | The number of base pairs that contain all the IS belonging to a single CIS (see |
| CIS | The |
| CIS entropy | The entropy of the CIS based on the label from the input dataset (e.g. tumor type, virus type). See paragraph 3.6 in |
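
Three of the statistics above can be sketched for a single CIS as follows. This is a sketch under assumptions, not the article's definitions: the dimension is taken as the inclusive span between the outermost IS, and the entropy is taken as the Shannon entropy (in bits) of the label distribution. The exact formulas are given in the referenced article and may differ.

```python
from collections import Counter
from math import log2


def cis_order(positions):
    """Total number of IS in the CIS."""
    return len(positions)


def cis_dimension(positions):
    """Base pairs spanned by the CIS, counting both end positions (assumption)."""
    return max(positions) - min(positions) + 1


def cis_entropy(labels):
    """Shannon entropy in bits of the label distribution (assumed definition)."""
    total = len(labels)
    counts = Counter(labels)
    return sum((n / total) * log2(total / n) for n in counts.values())


# Made-up CIS with three IS and a separate made-up label list.
positions = [100, 180, 260]
order = cis_order(positions)
dimension = cis_dimension(positions)
balanced = cis_entropy(["tumor A", "tumor A", "tumor B", "tumor B"])
single = cis_entropy(["tumor A", "tumor A", "tumor A"])
```

Under this definition, a CIS whose IS all carry the same label has entropy 0, and the value grows as the IS are spread more evenly across labels.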

Specifications table.
| Item | Description |
|---|---|
| Subject area | Computational biology, systems biology |
| More specific subject area | Gene therapy, integrational mutagenesis analysis |
| Type of data | Table, image, dataset |
| How data was acquired | In silico experiments |
| Data format | Analyzed datasets, analyzed Excel tables, PNG files |
| Experimental factors | Integration sites datasets were analyzed with a new computational method for common integration sites identification |
| Experimental features | A proposed set of common integration sites from two published integration sites datasets (see |
| | A pathway enrichment analysis is also reported |
| Data source location | Heidelberg, Germany |
| Data accessibility | Data is with this article and in ref. |