Imad Abugessaisa, Shuhei Noguchi, Akira Hasegawa, Jayson Harshbarger, Atsushi Kondo, Marina Lizio, Jessica Severin, Piero Carninci, Hideya Kawaji, Takeya Kasukawa.
Abstract
The FANTOM5 consortium described the promoter-level expression atlas of human and mouse by using CAGE (Cap Analysis of Gene Expression) with single-molecule sequencing. In the original publications, the GRCh37/hg19 and NCBI37/mm9 assemblies were used as the reference genomes of human and mouse, respectively; later, the Genome Reference Consortium released the newer genome assemblies GRCh38/hg38 and GRCm38/mm10. To increase the utility of the atlas in future research, we reprocessed the data to make them available on the recent genome assemblies. The data include observed frequencies of transcription start sites (TSSs) based on the realignment of CAGE reads, and TSS peaks converted from those defined on the previous reference assemblies. Annotations of the peak names were also updated based on the latest public databases. The reprocessed results make it possible to examine frequencies of transcription initiation on the recent genome assemblies and to refer to promoters with updated information consistently across the genome assemblies.
Year: 2017 PMID: 28850105 PMCID: PMC5574367 DOI: 10.1038/sdata.2017.107
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Figure 1. Workflow of FANTOM5 data reprocessing.
The figure describes the reprocessing of the FANTOM5 data. The workflow encompasses three processes: CAGE read realignment (1), CAGE peak liftOver (2) and CAGE peak calling (3). The source assemblies are GRCh37/hg19 and NCBI37/mm9; the target assemblies are GRCh38/hg38 and GRCm38/mm10. CAGE read realignment results in mapped CAGE peaks, CAGE peak liftOver results in two sets of CAGE peaks (mapped and unmapped), and CAGE peak calling results in new CAGE peaks on the latest genomes. Processes (1) and (2) are followed by quality checking (QC), which separates the mapped CAGE peaks into fair and problematic CAGE peaks. The problematic and dropped CAGE peak regions are investigated and manually curated. The new CAGE peaks from (3) are intersected with the fair CAGE peaks using bedtools (intersectBed) to define non-overlapping CAGE peaks (new CAGE peaks). The fair and new CAGE peaks are annotated with the latest gene and transcript models, and their expression tables are calculated.
CAGE peak counts.
Categories of the CAGE peaks converted to the current genome assemblies.
| Category | Human peaks | Human (%) | Mouse peaks | Mouse (%) |
|---|---|---|---|---|
| Fair | 201,295 | 99.75% | 158,878 | 99.94% |
| Problematic | 339 | 0.17% | 76 | 0.05% |
| Dropped | 168 | 0.08% | 12 | 0.01% |
| Total | 201,802 | — | 158,966 | — |
Manual correction of the genomic coordinates.
This table lists the CAGE peaks and the issues detected during genomic coordinate conversion. It shows the workaround applied to each issue and additional notes.
| CAGE peak | Issue | Workaround | Notes |
|---|---|---|---|
| hg19::chr1:145176389..145176406,+;hg_14198.1 | overlapping (1st group)/changing length | kept in 'problematic' | due to the unintentional conversion of the CAGE peak |
| hg19::chr1:146369648..146369656,−;hg_14199.1 | overlapping (1st group) | rescued to 'fair' | |
| hg19::chr1:146544055..146544062,−;hg_14200.1 | overlapping (1st group) | rescued to 'fair' | |
| hg19::chr1:146556295..146556310,−;hg_14201.1 | overlapping (1st group) | rescued to 'fair' | |
| hg19::chr1:120905986..120906002,+;hg_14114.1 | overlapping (2nd group) | rescued to 'fair' | due to the merge of two genes in hg38, chose the longer peak |
| hg19::chr1:149399224..149399230,−;hg_14115.1 | overlapping (2nd group) | kept in 'problematic' | |
| hg19::chr1:120838328..120838358,−;hg_4940.1 | overlapping (3rd group) | kept in 'problematic' | |
| hg19::chr1:143913790..143913841,+;hg_4941.1 | overlapping (3rd group) | rescued to 'fair' | due to the merge of two genes in hg38, chose the longer peak |
| hg19::chrX:52112158..52112165,+;hg_196395.1 | overlapping (4th group) | kept in 'problematic' | |
| hg19::chrX:52386980..52387013,−;hg_196396.1 | overlapping (4th group) | rescued to 'fair' | due to the merge of two regions of single genes in hg38, chose the longer peak |
| hg19::chr3:124646690..124646794,−;hg_44259.1 | changing length | kept as is | |
| hg19::chr7:101930149..101930221,+;hg_80245.1 | changing length | kept as is | |
| hg19::chr8:143857402..143857446,−;hg_95429.1 | changing length | kept as is | |
| hg19::chr10:61122262..61122358,−;hg_109451.1 | changing length | kept as is | |
| hg19::chr14:22689737..22689741,−;hg_142211.1 | changing length | changed CAGE peak regions | 3 'T' nucleotides inserted at the start of the CAGE peak, which would be removed |
| hg19::chr17:26684480..26684571,−;hg_164987.1 | changing length | kept as is | |
| hg19::chrX:114690490..114690493,−;hg_200246.1 | changing length | changed CAGE peak regions | 17 'T' nucleotides inserted at the start of the CAGE peak, which would be removed |
| hg19::chrX:148713374..148713438,−;hg_200825.1 | changing length | kept as is | |
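The two issue types listed in the table above (overlapping and changing length) suggest how a first-pass QC over converted peaks could be automated before manual curation. The sketch below is an illustration only: the source does not state the exact QC rules, so the criteria, the function name `classify_converted_peaks` and the toy coordinates are all assumptions.

```python
from typing import Dict, Optional, Tuple

# (chromosome, start, end, strand), BED-style half-open coordinates
Interval = Tuple[str, int, int, str]


def _overlap(a: Interval, b: Interval) -> bool:
    """True if two intervals share a base on the same chromosome and strand."""
    return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]


def classify_converted_peaks(
    original: Dict[str, Interval],
    converted: Dict[str, Optional[Interval]],
) -> Dict[str, str]:
    """Label each peak 'fair', 'problematic' or 'dropped'.

    Assumed rules (not confirmed by the source):
      - dropped:     the peak has no converted interval
      - problematic: its length changed, or it overlaps another converted peak
      - fair:        everything else
    """
    mapped = {pid: iv for pid, iv in converted.items() if iv is not None}
    labels: Dict[str, str] = {}
    for pid, old in original.items():
        new = converted.get(pid)
        if new is None:
            labels[pid] = "dropped"
            continue
        length_changed = (new[2] - new[1]) != (old[2] - old[1])
        overlapping = any(
            other_id != pid and _overlap(new, other)
            for other_id, other in mapped.items()
        )
        labels[pid] = "problematic" if length_changed or overlapping else "fair"
    return labels


# Hypothetical toy input, not taken from the FANTOM5 data.
original = {
    "peak_a": ("chr1", 100, 110, "+"),
    "peak_b": ("chr1", 200, 210, "+"),
    "peak_c": ("chr1", 300, 310, "+"),
    "peak_d": ("chr2", 50, 60, "-"),
    "peak_e": ("chr1", 400, 410, "+"),
}
converted = {
    "peak_a": ("chr1", 1100, 1110, "+"),  # same length, no overlap
    "peak_b": ("chr1", 1200, 1213, "+"),  # length changed from 10 to 13
    "peak_c": ("chr1", 1300, 1310, "+"),  # overlaps converted peak_e
    "peak_d": None,                       # could not be converted
    "peak_e": ("chr1", 1305, 1315, "+"),  # overlaps converted peak_c
}
labels = classify_converted_peaks(original, converted)
```

As the table shows, automated labels of this kind would only be a starting point: several flagged peaks were rescued to 'fair' or kept as is after manual inspection.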
The naming scheme of the CAGE peaks before and after reprocessing.
The table shows the naming rules of the CAGE peaks used in the published FANTOM5 human and mouse datasets and the newly assigned CAGE peak IDs after liftOver.
| Item | Published FANTOM5 (before reprocessing) | After reprocessing |
|---|---|---|
| Genomic coordinate | From 564639 bp to 564649 bp | From 629259 bp to 629269 bp |
| CAGE peak ID | chr1:564639..564649,+ | hg19::chr1:564639..564649,+;hg_2.1 |
| Accession | — | hg_2.1 |
| Short Description | p3@MTND1P23 | p3@MTND1P23 |
| Full description | CAGE_peak_3_at_MTND1P23_5end | — |
*These positions are based on the coordinates in BED format
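The new CAGE peak ID in the table combines the original assembly name, the original peak coordinates and the accession. A minimal sketch of building and parsing IDs of that shape follows; the function names and the `PeakId` type are hypothetical, and the pattern is inferred only from the examples shown in this document.

```python
import re
from typing import NamedTuple


class PeakId(NamedTuple):
    assembly: str
    chrom: str
    start: int
    end: int
    strand: str
    accession: str


# Pattern inferred from IDs such as "hg19::chr1:564639..564649,+;hg_2.1".
# The minus strand may appear as ASCII "-" or as the typographic minus U+2212.
_PEAK_ID = re.compile(
    r"^(?P<assembly>[^:]+)::(?P<chrom>[^:]+):(?P<start>\d+)\.\.(?P<end>\d+),"
    r"(?P<strand>[+\-\u2212]);(?P<accession>\S+)$"
)


def make_peak_id(assembly: str, chrom: str, start: int, end: int,
                 strand: str, accession: str) -> str:
    """Build an ID in the '<assembly>::<chrom>:<start>..<end>,<strand>;<accession>' form."""
    return f"{assembly}::{chrom}:{start}..{end},{strand};{accession}"


def parse_peak_id(peak_id: str) -> PeakId:
    """Split an ID into its parts, normalising the strand to '+' or '-'."""
    m = _PEAK_ID.match(peak_id)
    if m is None:
        raise ValueError(f"unrecognised CAGE peak ID: {peak_id!r}")
    strand = "-" if m["strand"] == "\u2212" else m["strand"]
    return PeakId(m["assembly"], m["chrom"], int(m["start"]), int(m["end"]),
                  strand, m["accession"])


example = make_peak_id("hg19", "chr1", 564639, 564649, "+", "hg_2.1")
parsed = parse_peak_id(example)
```

Because the ID keeps the hg19 coordinates, it stays stable across assemblies, while the actual position on the new assembly is carried separately in the genomic coordinate.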
CAGE read counts.
Counts of CAGE reads successfully aligned to the genome assemblies, and of those falling within the CAGE peaks.
| Species | Assembly | Aligned CAGE reads | Ratio to previous assembly | CAGE reads within CAGE peaks | Ratio to previous assembly |
|---|---|---|---|---|---|
| Human | GRCh37/hg19 | 7,002,308,021 | — | 5,288,118,024 | — |
| | GRCh38/hg38 | 6,846,664,897 | 97.8% | 5,158,308,820 | 97.5% |
| Mouse | NCBI37/mm9 | 4,694,137,744 | — | 3,491,906,982 | — |
| | GRCm38/mm10 | 4,687,916,697 | 99.9% | 3,509,420,580 | 100.5% |
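The percentages in the table are the counts on the new assembly divided by the counts on the previous assembly. The short sketch below recomputes them from the values in the table; the dictionary layout and the helper name `percent_of_previous` are illustrative choices.

```python
# Read counts copied from the table: (previous assembly, new assembly).
aligned_reads = {
    "human": (7_002_308_021, 6_846_664_897),  # GRCh37/hg19, GRCh38/hg38
    "mouse": (4_694_137_744, 4_687_916_697),  # NCBI37/mm9, GRCm38/mm10
}
reads_in_peaks = {
    "human": (5_288_118_024, 5_158_308_820),
    "mouse": (3_491_906_982, 3_509_420_580),
}


def percent_of_previous(previous: int, new: int) -> float:
    """Count on the new assembly as a percentage of the previous one."""
    return round(100.0 * new / previous, 1)


aligned_ratio = {sp: percent_of_previous(*c) for sp, c in aligned_reads.items()}
in_peak_ratio = {sp: percent_of_previous(*c) for sp, c in reads_in_peaks.items()}
```

A ratio above 100%, as for mouse reads within CAGE peaks, simply means that more reads fall within peaks on the new assembly than on the previous one.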
The problematic peaks in hg38.
This table lists the total number of problematic CAGE peak regions per chromosome in hg38.
| Chromosome | Problematic CAGE peak regions |
|---|---|
| chr1 | 118 |
| chr2 | 14 |
| chr3 | 2 |
| chr4 | 6 |
| chr5 | 2 |
| chr6 | 27 |
| chr7 | 13 |
| chr8 | 1 |
| chr9 | 7 |
| chr10 | 8 |
| chr11 | 8 |
| chr12 | 4 |
| chr13 | 1 |
| chr14 | 4 |
| chr15 | 5 |
| chr16 | 8 |
| chr17 | 7 |
| chr18 | 5 |
| chr19 | 1 |
| chr21 | 90 |
| chrX | 5 |
| chrY | 3 |
| Total | 339 |
The problematic peaks in mm10.
This table lists the total number of problematic CAGE peak regions per chromosome in mm10.
| Chromosome | Problematic CAGE peak regions |
|---|---|
| chr3 | 1 |
| chr4 | 1 |
| chr5 | 3 |
| chr9 | 3 |
| chr13 | 1 |
| chr15 | 1 |
| chr16 | 2 |
| Total | 12 |
The dropped peaks from hg19.
This table lists the total number of CAGE peak regions dropped during genomic coordinate conversion, per chromosome in hg19.
| Chromosome | Dropped CAGE peak regions |
|---|---|
| chr1 | 14 |
| chr2 | 2 |
| chr3 | 1 |
| chr6 | 2 |
| chr7 | 105 |
| chr8 | 3 |
| chr11 | 1 |
| chr14 | 8 |
| chr17 | 2 |
| chr19 | 9 |
| chr22 | 8 |
| chrM | 8 |
| chrX | 5 |
| Total | 168 |
The dropped peaks from mm9.
This table lists the total number of CAGE peak regions dropped during genomic coordinate conversion, per chromosome in mm10.
| Chromosome | Dropped CAGE peak regions |
|---|---|
| chr1 | 7 |
| chr2 | 1 |
| chr3 | 1 |
| chr4 | 5 |
| chr5 | 3 |
| chr7 | 2 |
| chr8 | 10 |
| chr10 | 2 |
| chr12 | 2 |
| chr13 | 1 |
| chr14 | 25 |
| chr16 | 1 |
| chrX | 6 |
| chrY | 10 |
| Total | 76 |
Number of new CAGE peaks identified by peak calling and their overlap with the converted CAGE peaks.
The table shows the total number and the ratio of overlapping and non-overlapping CAGE peaks between the converted ‘fair’ CAGE peaks and the new CAGE peaks identified by decomposition-based peak identification (DPI).
| Species | Category | Converted ‘fair’ CAGE peaks | New CAGE peaks (DPI) |
|---|---|---|---|
| Human | All CAGE peaks | 201,295 | 195,444 |
| | Overlapping with the other dataset | 189,679 (94.2%) | 186,828 (95.6%) |
| | Non-overlapping with the other dataset | 11,616 (5.8%) | 8,616 (4.4%) |
| Mouse | All CAGE peaks | 158,878 | 155,006 |
| | Overlapping with the other dataset | 152,189 (95.8%) | 149,212 (96.3%) |
| | Non-overlapping with the other dataset | 6,689 (4.2%) | 5,794 (3.7%) |
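The split into overlapping and non-overlapping peaks was obtained in the study with bedtools. The sketch below only illustrates the underlying interval logic in plain Python and is not the authors' pipeline: the function names and toy peaks are hypothetical, and treating peaks on opposite strands as non-overlapping is an assumption.

```python
from typing import List, Sequence, Tuple

# (chromosome, start, end, strand), BED-style half-open coordinates
Peak = Tuple[str, int, int, str]


def overlaps(a: Peak, b: Peak) -> bool:
    """True if both peaks share a base on the same chromosome and strand (assumed rule)."""
    return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]


def split_by_overlap(
    query: Sequence[Peak], reference: Sequence[Peak]
) -> Tuple[List[Peak], List[Peak]]:
    """Split query peaks into those overlapping any reference peak and the rest.

    Naive O(len(query) * len(reference)) scan, adequate for a small example only.
    """
    overlapped: List[Peak] = []
    non_overlapped: List[Peak] = []
    for peak in query:
        if any(overlaps(peak, ref) for ref in reference):
            overlapped.append(peak)
        else:
            non_overlapped.append(peak)
    return overlapped, non_overlapped


def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage with one decimal."""
    return round(100.0 * part / total, 1)


# Hypothetical toy peaks, not taken from the FANTOM5 data.
fair_peaks = [("chr1", 100, 120, "+"), ("chr1", 500, 520, "-"), ("chr2", 10, 30, "+")]
new_peaks = [("chr1", 110, 130, "+"), ("chr1", 505, 515, "+"), ("chr3", 10, 30, "+")]

overlapped, novel = split_by_overlap(new_peaks, fair_peaks)
novel_share = percent(len(novel), len(new_peaks))
```

The same `percent` helper reproduces the table entries, for example `percent(11_616, 201_295)` for the human ‘fair’ peaks that do not overlap a new peak.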
Figure 2. Correlation between the CAGE tag counts of the aligned CAGE reads and the liftOver CAGE peaks.
The scatterplot shows the correlation between the tag counts within the regions of aligned CAGE reads and within the liftOver CAGE peaks. [2a] human and [2b] mouse.
Counts of CAGE peaks associated with transcripts, genes and proteins.
The table shows the number of (robust) CAGE peaks associated with known transcripts, genes in Entrez Gene, HGNC and MGI, and proteins in UniProt. The numbers in the GRCh38/hg38 and GRCm38/mm10 rows were counted by the reprocessing project. The numbers for GRCh37/hg19 and NCBI37/mm9 were retrieved from the original paper.
| Species | Assembly | Transcripts | Entrez Gene | HGNC | MGI | UniProt |
|---|---|---|---|---|---|---|
| Human | GRCh37/hg19 | 93,558 | 56,011 | 82,257 | — | 82,150 |
| | GRCh38/hg38 | 108,791 | 57,935 | 96,998 | — | 97,560 |
| Mouse | NCBI37/mm9 | 61,072 | 47,755 | — | — | 56,744 |
| | GRCm38/mm10 | 89,471 | 47,657 | — | 84,308 | 79,319 |
Number of peaks associated with Entrez Gene categories.
The table shows the number of (robust) CAGE peaks associated with Entrez Gene categories. The numbers in the GRCh38/hg38 and GRCm38/mm10 rows were counted by the reprocessing project. The numbers for GRCh37/hg19 and NCBI37/mm9 were retrieved from the original paper.
| Species | Assembly | | | | | |
|---|---|---|---|---|---|---|
| Human | GRCh37/hg19 | 79,735 | 489 | 1,755 | 126 | 163 |
| | GRCh38/hg38 | 90,351 | 737 | 5,731 | 124 | 300 |
| Mouse | NCBI37/mm9 | 55,217 | 435 | 1,356 | 22 | 16 |
| | GRCm38/mm10 | 77,224 | 208 | 3,156 | 16 | 934 |