Dario Strbenac, Nicola J Armstrong, Jean Y H Yang.
Abstract
BACKGROUND: The large-scale sequencing of 5' cap enriched cDNA promises to reveal the diversity of transcription initiation across entire genomes. The process of transcription is noisy, and there is often no single, exact start site. This creates the need for a fast and simple method of identifying transcription start peaks based on this type of data. Due to both biological and technical noise, many of the peaks seen are not real transcription initiation events. Classification of the observed peaks is an essential filtering step in the discovery of genuine initiation locations.
Year: 2013 PMID: 24564843 PMCID: PMC3852351 DOI: 10.1186/1471-2164-14-S5-S9
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Key steps in the bioinformatic workflow for analysing CAGE sequencing data. The reads from the sequencer are aligned to the genome. Only the first position of each read is used, and the positions are clustered into peaks. Lastly, a classification algorithm is used to label the peaks as being from transcription initiation or not.
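The "first position of each read" step in the workflow above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the read tuple layout, the function names, and the example coordinates are all hypothetical. It relies only on the general fact that the 5' end of a reverse-strand alignment is its rightmost coordinate.

```python
from collections import Counter

def five_prime_position(start, end, strand):
    """Return the 5' end of an aligned read (1-based inclusive coordinates).

    For a forward-strand read the 5' end is the leftmost base; for a
    reverse-strand read it is the rightmost base.
    """
    return start if strand == "+" else end

def tally_start_sites(reads):
    """Count reads per (chromosome, strand, 5' position).

    `reads` is an iterable of hypothetical (chrom, start, end, strand) tuples.
    """
    counts = Counter()
    for chrom, start, end, strand in reads:
        counts[(chrom, strand, five_prime_position(start, end, strand))] += 1
    return counts

# Toy alignments, invented for illustration only.
reads = [
    ("chr1", 100, 135, "+"),
    ("chr1", 100, 140, "+"),
    ("chr1", 102, 137, "+"),
    ("chr1", 480, 515, "-"),
]
start_counts = tally_start_sites(reads)
```

The resulting per-base, per-strand counts are the natural input to the peak clustering step.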
Summary of features and how they are calculated.
| Name | Summarisation | Location | Type |
|---|---|---|---|
| Kurtosis | Directly used | Peak | Internal |
| Read Density | Directly used | Peak | Internal |
| 4-mer Counts | Count | 500 bases upstream and downstream of peak summit | Internal |
| TFBS | Maximum | Peak and 100-base extension of boundaries | External |
| DNase I | Maximum | Peak and 100-base extension of boundaries | External |
| H3K4me3 | Maximum | Peak and 100-base extension of boundaries | External |
| Mammalian | Average | Peak | External |
| RNA-seq Difference | Distribution function probability | 100-base flanks adjacent to peak boundaries | External |
For each feature, the summarisation procedure, location of data points summarised, and the feature categorisation are shown.
Number of peaks found by Poisson thresholding of sliding window method.
| Cell Line | Total Peaks Detected |
|---|---|
| GM12878 | 43161 |
| H1-ESC | 111945 |
| HeLa-S3 | 41195 |
| HepG2 | 59390 |
| HUVEC | 40420 |
| K562 | 35622 |
For the six ENCODE cell lines used, the total number of peaks found by the sliding window approach is shown.
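One way a sliding window with Poisson thresholding could work is sketched below. The record does not give the window width, the significance cutoff, or how the background rate is estimated, so `width`, `alpha`, and the uniform-background assumption are placeholders chosen for illustration, as are the function names.

```python
import math

def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):     # accumulate P(X = 0) ... P(X = k - 1)
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def call_peaks(counts, width=5, alpha=1e-3):
    """Call peaks from per-base 5' read counts on one strand.

    Windows whose read total is improbably high under a uniform Poisson
    background are kept, trimmed to their covered bases, and merged when
    they touch. Returns 0-based inclusive (start, end) intervals.
    """
    n = len(counts)
    lam = width * sum(counts) / n  # expected reads per window (assumed background)
    peaks = []
    for i in range(n - width + 1):
        total = sum(counts[i:i + width])
        if poisson_sf(total, lam) >= alpha:
            continue
        covered = [j for j in range(i, i + width) if counts[j] > 0]
        lo, hi = covered[0], covered[-1]
        if peaks and lo <= peaks[-1][1] + 1:
            peaks[-1] = (peaks[-1][0], max(peaks[-1][1], hi))
        else:
            peaks.append((lo, hi))
    return peaks

# Toy count vector: one strong two-base cluster and one isolated read.
counts = [0] * 200
counts[50], counts[51], counts[120] = 30, 12, 1
peaks = call_peaks(counts)
```

In this toy example the isolated read at position 120 does not pass the threshold, while the cluster at positions 50 to 51 is reported as a single narrow peak.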
Figure 2. Peaks found by the algorithm in six cell lines. CAGE-seq strand-specific read coverage is shown along with tracks that contain boxes representing the areas that are found to be peaks. A narrow peak of between 1 and 3 bases wide is at the left side of the figure. A wider peak that varies between 76 bases and 207 bases is located at the right side.
Figure 3. Density plots of single feature scores for all peaks in all cell lines. Red line is for TSS class. Blue line is for non-TSS class. Kurtosis and read density are internal features. The other four features are external features.
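The two internal features named in the caption can be computed directly from a peak's per-base 5' read counts. The definitions below are assumptions, since the record does not spell them out: kurtosis is taken as the fourth standardised moment of read start positions weighted by read count, and read density as reads per base of peak width.

```python
def peak_kurtosis(counts):
    """Fourth standardised moment of 5' positions, weighted by read count.

    `counts[i]` is the number of reads starting at offset i within the peak.
    Returns NaN for peaks with zero positional variance (e.g. one base wide).
    """
    total = sum(counts)
    mean = sum(i * c for i, c in enumerate(counts)) / total
    var = sum(c * (i - mean) ** 2 for i, c in enumerate(counts)) / total
    if var == 0:
        return float("nan")
    fourth = sum(c * (i - mean) ** 4 for i, c in enumerate(counts)) / total
    return fourth / var ** 2

def read_density(counts):
    """Reads per base across the peak width (assumed definition)."""
    return sum(counts) / len(counts)

# Toy peak, three bases wide, with four reads in total.
toy_peak = [1, 2, 1]
kurt = peak_kurtosis(toy_peak)
density = read_density(toy_peak)
```

Single-base peaks have no spread, so kurtosis is undefined for them under this definition; a real pipeline would need an explicit convention for that case.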
Figure 4. Precision and recall for three feature scenarios. Precision and recall are calculated at each cost parameter value based on a LOOCV scheme. Blue lines are precision. Red lines are recall. Horizontal bars or dots represent the minimum and maximum value of all cell lines. Points on the line are averages across all cell lines. A. Internal features for six cell lines. B. Internal features and pooled external features for six cell lines. C. Internal features and matched RNA-seq data for two cell lines.
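The quantities plotted in the figure can be illustrated with a short sketch: precision and recall per cell line, then the minimum, average, and maximum across cell lines. The labels and cell line names below are invented toy data, and the SVM training and cross-validation loop that would produce the predictions are not shown.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for boolean TSS / non-TSS labels."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

def summarise(values):
    """Return (minimum, mean, maximum), matching the bars and points plotted."""
    return min(values), sum(values) / len(values), max(values)

# Hypothetical predictions for two made-up cell lines at one cost value.
results = {
    "line_A": precision_recall([True, True, False, True],
                               [True, False, False, True]),
    "line_B": precision_recall([True, False, False, True],
                               [True, True, False, True]),
}
precision_summary = summarise([p for p, _ in results.values()])
recall_summary = summarise([r for _, r in results.values()])
```

Repeating this at each cost parameter value would give one point with a range bar per value, which is the layout the caption describes.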
Precision and recall of publicly available classifications.
| Cell Line | Segway Precision | Segway Recall | ENCODE HMM Precision | ENCODE HMM Recall | Proposed Method Precision | Proposed Method Recall |
|---|---|---|---|---|---|---|
| GM12878 | 0.7 | 0.64 | 0.25 | 0.92 | 0.77 | 0.81 |
| H1-ESC | 0.59 | 0.71 | 0.27 | 0.89 | 0.61 | 0.81 |
| HeLa-S3 | 0.79 | 0.66 | 0.32 | 0.91 | 0.76 | 0.87 |
| HepG2 | 0.59 | 0.59 | 0.23 | 0.93 | 0.69 | 0.79 |
| HUVEC | 0.82 | 0.67 | 0.26 | 0.94 | 0.81 | 0.85 |
| K562 | 0.77 | 0.62 | 0.27 | 0.93 | 0.71 | 0.88 |
The reference labelling is the collection of segments that overlap a GENCODE transcript with at least 2 supporting CAGE reads within 500 bases either side of the annotated start location. An example of the proposed method's performance is presented alongside the two public methods, for an SVM cost parameter value of 0.1 and using only internal features.
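The reference labelling rule can be read as a two-stage filter, sketched below for a single chromosome and strand. This is one interpretation of the sentence above, not the authors' code: the tuple layouts, function names, and coordinates are hypothetical, and "overlap a GENCODE transcript" is taken literally as overlap with the transcript's full extent.

```python
def supported_transcripts(transcripts, cage_positions, flank=500, min_reads=2):
    """Keep transcripts with enough CAGE read starts near the annotated start.

    `transcripts` holds (start, end, strand) tuples; the annotated start is
    `start` on the forward strand and `end` on the reverse strand.
    """
    kept = []
    for start, end, strand in transcripts:
        tss = start if strand == "+" else end
        support = sum(1 for pos in cage_positions if abs(pos - tss) <= flank)
        if support >= min_reads:
            kept.append((start, end, strand))
    return kept

def label_segments(segments, transcripts, cage_positions):
    """Label each (start, end) segment True if it overlaps a supported transcript."""
    supported = supported_transcripts(transcripts, cage_positions)
    return [
        any(seg_start <= t_end and t_start <= seg_end
            for t_start, t_end, _ in supported)
        for seg_start, seg_end in segments
    ]

# Toy annotation: only the first transcript has CAGE support near its start.
transcripts = [(1000, 5000, "+"), (20000, 26000, "+")]
cage_positions = [990, 1010, 1200, 23000]
segments = [(900, 1100), (21000, 21500), (7000, 7100)]
labels = label_segments(segments, transcripts, cage_positions)
```

Here the second segment overlaps a transcript, but that transcript has no CAGE reads within 500 bases of its start, so the segment is not part of the reference set.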