
Linking FANTOM5 CAGE peaks to annotations with CAGEscan.

Nicolas Bertin1,2, Mickaël Mendez1,2, Akira Hasegawa1,2, Marina Lizio1,2, Imad Abugessaisa1,2, Jessica Severin1,2, Mizuho Sakai-Ohno1,2, Timo Lassmann1,2, Takeya Kasukawa1, Hideya Kawaji1,2,3, Yoshihide Hayashizaki2,3, Alistair R R Forrest1,2,3, Piero Carninci1,2, Charles Plessy1,2.   

Abstract

The FANTOM5 expression atlas is a quantitative measurement of the activity of nearly 200,000 promoter regions across nearly 2,000 different human primary cells, tissue types and cell lines. Generation of this atlas was made possible by the use of CAGE, an experimental approach to localise transcription start sites at single-nucleotide resolution by sequencing the 5' ends of capped RNAs after their conversion to cDNAs. While 50% of CAGE-defined promoter regions could be confidently associated to adjacent transcriptional units, nearly 100,000 promoter regions remained gene-orphan. To address this, we used the CAGEscan method, in which random-primed 5'-cDNAs are paired-end sequenced. Pairs starting in the same region are assembled in transcript models called CAGEscan clusters. Here, we present the production and quality control of CAGEscan libraries from 56 FANTOM5 RNA sources, which enhances the FANTOM5 expression atlas by providing experimental evidence associating core promoter regions with their cognate transcripts.

Year:  2017        PMID: 28972578      PMCID: PMC5625555          DOI: 10.1038/sdata.2017.147

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

CAGE (Cap Analysis Gene Expression[1]) is the method of choice for studying gene regulation through quantitative analysis of transcription start sites (TSS, sequence ontology term 0000315)[2]. By sequencing the 5′ end of cDNA-converted capped RNAs, CAGE enables the identification of core promoter regions and 5′ end transcriptional activity. Large-scale application of CAGE by the FANTOM consortium to nearly 2,000 human RNA sources, including primary cells, whole-tissue extracts and cell lines[3,4], identified nearly 200,000 core promoter regions active within the human genome[5]. Although CAGE locates TSS at single-nucleotide resolution, determining their connection to downstream known gene structures or to independent novel RNAs is limited to positional computational inference and low-throughput gene-by-gene experimental validation. Half (101,893/201,802) of FANTOM5’s active core promoter regions did not co-localise within a reasonable distance with the 5′ termini of annotated gene models. To experimentally associate these orphan core promoter regions with transcriptional units, we employed CAGEscan[6], an approach in which paired-end sequencing of the 5′ end of cDNA-converted capped RNAs together with their cognate random priming sites enables the unequivocal association of individual TSS with transcript exons. In a previous project focused on analysing the translatome of Purkinje neurons in rat[7], the CAGEscan approach annotated 43% of the core promoters that we detected as active in rat Purkinje neurons but that had no direct overlap with Ensembl transcripts. Here, we selected the 56 RNA sources that showed the greatest transcriptome diversity upon FANTOM5 CAGE profiling and prepared individual CAGEscan libraries, with 6 of these 56 RNA sources prepared in duplicate (see Table 1).
Using the FANTOM5 core promoter atlas as seed, we clustered the CAGEscan paired-end reads into a collection of 112,315 models called CAGEscan clusters, by collating all the pairs whose alignment started in the same FANTOM5 CAGE peak. To de-orphanise FANTOM5 promoters, we intersected the CAGEscan clusters with GENCODE 18 gene models. Of the 85% that intersected, 33,632 clusters had no annotation in FANTOM5, thus revealing novel and alternative promoters of known genes. We made these data available along with the FANTOM5 CAGE atlas data, and ready for manual inspection and analysis via the ZENBU genome browser (http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=ZkJi4RdBAFhnsudxePrZxD; see Fig. 1).
Table 1

Summary of the libraries prepared.

Library | Source Name | Description | Total | Unextracted | Artefacts | rDNA | Non aligned | Non proper | Duplicates | Promoter | Exon | Non annotated
NCig10013 | 10002-101A5 | SABiosciences XpressRef Human Universal Total RNA, pool1 | 12,980,474 | 865,232 | 53,578 | 8,620,490 | 303,381 | 336,125 | 914,464 | 352,035 | 557,012 | 978,157
NCig10014 | 10012-101C3 | brain, adult, pool1 | 10,041,908 | 789,460 | 56,053 | 3,579,617 | 156,712 | 499,657 | 1,967,826 | 479,841 | 1,007,678 | 1,505,064
NCig10015 | 10016-101C7 | heart, adult, pool1 | 17,071,911 | 657,065 | 67,493 | 12,680,315 | 189,587 | 360,818 | 1,791,722 | 341,589 | 679,820 | 303,502
NCig10016 | 10026-101D8 | testis, adult, pool1 | 12,778,881 | 735,402 | 56,663 | 8,828,467 | 229,250 | 357,108 | 761,921 | 321,059 | 575,393 | 913,618
NCig10017 | 10030-101E3 | retina, adult, pool1 | 7,438,898 | 983,120 | 49,209 | 2,016,040 | 91,931 | 396,226 | 1,395,594 | 341,442 | 574,800 | 1,590,536
NCig10018 | 11210-116A4 | Smooth Muscle Cells—Aortic, donor0 | 16,069,580 | 636,805 | 76,008 | 14,079,255 | 126,255 | 163,413 | 505,347 | 152,188 | 219,216 | 111,093
NCig10019 | 12176-128I7 | Whole blood (ribopure), donor090325, donation1 | 11,721,271 | 745,633 | 59,630 | 5,626,776 | 168,819 | 557,490 | 1,749,239 | 502,772 | 701,914 | 1,608,998
NCig10020 | 10019-101D1 | lung, adult, pool1 | 13,194,146 | 853,743 | 90,406 | 8,943,567 | 141,998 | 347,985 | 819,745 | 309,787 | 599,126 | 1,087,789
NCig10021 | 10022-101D4 | prostate, adult, pool1 | 11,134,114 | 746,116 | 64,413 | 6,095,769 | 167,516 | 486,536 | 1,152,553 | 366,192 | 716,651 | 1,338,368
NCig10022 | 10025-101D7 | spleen, adult, pool1 | 8,339,981 | 852,309 | 62,976 | 3,130,575 | 143,638 | 431,102 | 1,038,248 | 402,237 | 785,577 | 1,493,319
NCig10023 | 10150-102I6 | medial frontal gyrus, adult, donor10252 | 4,512,027 | 960,285 | 25,800 | 467,916 | 81,388 | 389,764 | 534,385 | 224,651 | 506,845 | 1,320,993
NCig10024 | 10151-102I7 | amygdala, adult, donor10252 | 6,314,079 | 858,652 | 33,503 | 1,112,179 | 112,044 | 450,492 | 801,134 | 321,988 | 702,445 | 1,921,642
NCig10025 | 10153-102I9 | hippocampus, adult, donor10252 | 5,068,313 | 892,716 | 32,233 | 861,081 | 80,184 | 360,888 | 643,485 | 241,503 | 530,452 | 1,425,771
NCig10026 | 10154-103A1 | thalamus, adult, donor10252 | 8,151,958 | 817,751 | 35,872 | 2,262,386 | 95,945 | 505,911 | 1,296,220 | 377,511 | 716,479 | 2,043,883
NCig10027 | 10155-103A2 | medulla oblongata, adult, donor10252 | 9,999,787 | 1,397,247 | 78,367 | 2,366,286 | 132,873 | 610,522 | 1,541,331 | 446,644 | 906,737 | 2,519,780
NCig10028 | 10157-103A4 | parietal lobe, adult, donor10252 | 9,456,143 | 936,084 | 71,291 | 1,471,269 | 153,886 | 694,447 | 1,387,954 | 463,355 | 1,096,767 | 3,181,090
NCig10029 | 10158-103A5 | substantia nigra, adult, donor10252 | 7,656,663 | 1,078,146 | 59,251 | 2,360,698 | 85,562 | 425,712 | 1,188,939 | 344,322 | 602,764 | 1,511,269
NCig10030 | 10159-103A6 | spinal cord, adult, donor10252 | 9,651,183 | 888,981 | 79,200 | 2,801,018 | 116,324 | 594,991 | 1,353,570 | 442,320 | 903,764 | 2,471,015
NCig10031 | 10160-103A7 | pineal gland, adult, donor10252 | 7,577,434 | 1,011,960 | 65,055 | 1,343,792 | 103,948 | 545,004 | 944,389 | 348,323 | 744,959 | 2,470,004
NCig10032 | 10161-103A8 | globus pallidus, adult, donor10252 | 11,489,499 | 821,077 | 80,387 | 4,015,936 | 130,251 | 673,588 | 1,632,249 | 469,685 | 916,587 | 2,749,739
NCig10033 | 10162-103A9 | pituitary gland, adult, donor10252 | 8,630,256 | 970,964 | 49,755 | 1,932,932 | 124,341 | 563,707 | 1,591,606 | 455,105 | 847,858 | 2,093,988
NCig10034 | 10163-103B1 | occipital cortex, adult, donor10252 | 9,407,509 | 905,254 | 44,694 | 1,193,623 | 136,650 | 708,731 | 1,223,230 | 432,869 | 1,030,595 | 3,731,863
NCig10035 | 10164-103B2 | caudate nucleus, adult, donor10252 | 6,816,957 | 1,102,408 | 38,186 | 869,711 | 109,813 | 476,042 | 1,310,808 | 346,046 | 754,780 | 1,809,163
NCig10036 | 10165-103B3 | locus coeruleus, adult, donor10252 | 6,753,026 | 1,045,453 | 49,711 | 1,251,961 | 97,962 | 454,490 | 1,173,365 | 330,395 | 729,525 | 1,620,164
NCig10037 | 10166-103B4 | cerebellum, adult, donor10252 | 6,025,035 | 1,095,992 | 54,415 | 368,370 | 62,650 | 519,173 | 650,125 | 254,418 | 492,016 | 2,527,876
NCig10038 | 11207-116A1 | Endothelial Cells—Aortic, donor0 | 15,261,564 | 718,897 | 98,322 | 12,128,937 | 142,908 | 291,064 | 820,844 | 298,247 | 455,971 | 306,374
NCig10039 | 11222-116B7 | Fibroblast—Gingival, donor4 (GFH2) | 5,865,574 | 885,111 | 167,833 | 2,081,881 | 108,043 | 284,519 | 1,080,129 | 346,774 | 547,431 | 363,853
NCig10040 | 11224-116B9 | CD14+ Monocytes, donor1 | 11,232,175 | 651,017 | 101,461 | 4,297,440 | 152,438 | 540,035 | 1,268,085 | 510,000 | 645,877 | 3,065,822
NCig10041 | 11229-116C5 | CD14+ monocyte derived endothelial progenitor cells, donor1 | 10,775,321 | 1,032,309 | 242,145 | 3,950,479 | 190,791 | 539,087 | 1,666,613 | 561,795 | 835,546 | 1,756,556
NCig10042 | 11245-116E3 | Fibroblast—Aortic Adventitial, donor1 | 9,543,436 | 735,498 | 828,670 | 3,376,827 | 198,604 | 517,858 | 2,135,913 | 655,705 | 710,317 | 384,044
NCig10043 | 11246-116E4 | Intestinal epithelial cells (polarized), donor1 | 6,681,741 | 919,980 | 392,820 | 1,056,095 | 120,525 | 433,407 | 2,003,503 | 513,395 | 536,761 | 705,255
NCig10044 | 11247-116E5 | Mesothelial Cells, donor1 | 7,150,721 | 870,202 | 443,481 | 1,547,197 | 127,726 | 418,103 | 2,291,855 | 516,801 | 516,012 | 419,344
NCig10045 | 11248-116E6 | Anulus Pulposus Cell, donor1 | 9,329,478 | 673,123 | 467,283 | 4,733,836 | 191,255 | 418,230 | 1,674,689 | 474,236 | 571,373 | 125,453
NCig10046 | 11249-116E7 | Pancreatic stromal cells, donor1 | 6,917,860 | 895,678 | 266,841 | 1,598,919 | 129,096 | 447,633 | 1,839,323 | 563,482 | 606,168 | 570,720
NCig10047 | 11256-116F5 | Small Airway Epithelial Cells, donor1 | 8,934,394 | 762,215 | 197,286 | 3,113,793 | 175,723 | 506,591 | 2,330,196 | 629,638 | 801,987 | 416,965
NCig10048 | 11273-116H4 | Mammary Epithelial Cell, donor1 | 10,019,381 | 890,533 | 198,834 | 4,561,811 | 208,742 | 497,865 | 2,025,351 | 497,139 | 721,321 | 417,785
NCig10049 | 11278-116H9 | Placental Epithelial Cells, donor1 | 11,212,007 | 523,668 | 434,079 | 7,019,440 | 196,493 | 358,154 | 1,908,353 | 304,347 | 296,384 | 171,089
NCig10050 | 11282-116I4 | Skeletal muscle cells differentiated into Myotubes—multinucleated, donor1 | 8,911,706 | 825,574 | 278,816 | 3,852,864 | 174,167 | 447,392 | 2,002,086 | 521,214 | 534,883 | 274,710
NCig10051 | 11468-119C1 | Preadipocyte—omental, donor1 | 5,109,588 | 863,018 | 244,070 | 1,743,473 | 94,850 | 257,299 | 790,493 | 304,494 | 524,536 | 287,355
NCig10052 | 11487-119E2 | Mast cell—stimulated, donor1 | 4,388,468 | 1,047,459 | 53,428 | 390,687 | 86,272 | 312,008 | 1,219,897 | 244,001 | 294,922 | 739,794
NCig10053 | 10411-106B6 | renal cell carcinoma cell line:OS-RC-2 | 6,905,711 | 774,316 | 209,703 | 2,297,666 | 117,997 | 421,779 | 1,058,325 | 387,862 | 580,214 | 1,057,849
NCig10054 | 10412-106B7 | malignant trichilemmal cyst cell line:DJM-1 | 9,858,285 | 728,347 | 139,130 | 3,031,554 | 164,712 | 630,434 | 2,045,672 | 648,552 | 895,267 | 1,574,617
NCig10055 | 10414-106B9 | maxillary sinus tumor cell line:HSQ-89 | 9,125,063 | 857,517 | 135,611 | 2,250,778 | 123,989 | 579,817 | 1,951,209 | 498,578 | 697,019 | 2,030,545
NCig10056 | 10431-106D8 | epidermoid carcinoma cell line:Ca Ski | 5,074,986 | 1,071,422 | 146,593 | 982,637 | 87,543 | 343,508 | 859,004 | 378,966 | 452,570 | 752,743
NCig10057 | 10436-106E4 | signet ring carcinoma cell line:Kato III | 8,693,941 | 840,579 | 145,687 | 3,244,444 | 137,763 | 512,628 | 1,657,263 | 503,623 | 614,540 | 1,037,414
NCig10058 | 10442-106F1 | schwannoma cell line:HS-PSS | 7,714,618 | 941,029 | 176,159 | 1,799,562 | 134,668 | 519,866 | 1,659,980 | 589,180 | 733,263 | 1,160,911
NCig10059 | 10444-106F3 | glioblastoma cell line:A172 | 8,266,061 | 861,701 | 175,094 | 2,804,921 | 186,736 | 495,954 | 1,209,931 | 520,670 | 712,755 | 1,298,299
NCig10060 | 10454-106G4 | chronic myelogenous leukemia cell line:K562 | 4,756,581 | 1,045,797 | 109,272 | 645,627 | 70,593 | 363,272 | 675,740 | 342,295 | 400,380 | 1,103,605
NCig10061 | 10464-106H5 | acute lymphoblastic leukemia (T-ALL) cell line:Jurkat | 9,344,079 | 869,111 | 131,089 | 2,562,674 | 178,216 | 687,478 | 1,748,129 | 774,916 | 819,425 | 1,573,041
NCig10062 | 10508-107D4 | neuroblastoma cell line:CHP-134, tech_rep1 | 4,622,691 | 962,974 | 148,947 | 278,618 | 57,421 | 405,741 | 662,738 | 258,098 | 391,938 | 1,456,216
NCig10063 | 10552-107I3 | cervical cancer cell line:D98-AH2, tech_rep1 | 4,307,425 | 1,005,845 | 156,179 | 421,514 | 70,319 | 310,350 | 1,186,016 | 271,445 | 368,888 | 516,869
NCig10064 | 10558-107I9 | osteosarcoma cell line:HS-Os-1, tech_rep1 | 4,374,077 | 983,856 | 182,894 | 548,737 | 80,879 | 357,116 | 711,651 | 286,493 | 395,130 | 827,321
NCig10065 | 10410-106B5 | extraskeletal myxoid chondrosarcoma cell line:H-EMC-SS, tech_rep1 | 3,965,350 | 928,677 | 138,036 | 393,526 | 64,950 | 343,707 | 582,912 | 220,600 | 400,189 | 892,753
NCig10066 | 10441-106E9 | synovial sarcoma cell line:HS-SY-II, tech_rep1 | 4,039,831 | 844,408 | 197,018 | 574,814 | 57,821 | 348,974 | 523,331 | 235,006 | 375,904 | 882,555
NCig10067 | 10474-106I6 | myeloma cell line:PCM6, tech-rep1 | 4,582,185 | 810,459 | 186,856 | 755,594 | 71,301 | 358,371 | 807,453 | 278,280 | 416,507 | 897,364
NCig10068 | 10424-106D1 | splenic lymphoma with villous lymphocytes cell line:SLVL | 4,458,999 | 852,283 | 163,002 | 455,663 | 80,461 | 376,376 | 969,585 | 280,831 | 377,707 | 903,091
NCig10126 | 10508-107D4 | neuroblastoma cell line:CHP-134, tech_rep2 | 5,259,146 | 995,795 | 48,625 | 550,450 | 63,938 | 396,327 | 701,319 | 298,112 | 438,554 | 1,766,026
NCig10127 | 10552-107I3 | cervical cancer cell line:D98-AH2, tech_rep2 | 4,097,389 | 1,015,542 | 60,740 | 646,950 | 64,609 | 304,041 | 930,823 | 243,837 | 338,193 | 492,654
NCig10128 | 10558-107I9 | osteosarcoma cell line:HS-Os-1, tech_rep2 | 4,681,628 | 968,282 | 75,235 | 865,891 | 76,828 | 336,463 | 737,523 | 296,891 | 409,283 | 915,232
NCig10129 | 10410-106B5 | extraskeletal myxoid chondrosarcoma cell line:H-EMC-SS, tech_rep2 | 3,118,570 | 822,752 | 40,614 | 436,600 | 112,425 | 377,956 | 276,992 | 152,854 | 276,872 | 621,505
NCig10130 | 10441-106E9 | synovial sarcoma cell line:HS-SY-II, tech_rep2 | 3,232,761 | 726,773 | 64,905 | 633,633 | 81,424 | 473,478 | 240,861 | 160,678 | 250,046 | 600,963
NCig10131 | 10474-106I6 | myeloma cell line:PCM6, tech-rep2 | 3,985,344 | 720,124 | 60,988 | 898,043 | 78,483 | 322,369 | 566,406 | 231,781 | 346,042 | 761,108

The RNA identifier (Source.Name) can be searched in the FANTOM5 SSTAR database[15,20]. The RNA samples are also described in the SDRF files distributed alongside the FASTQ sequences and alignments, as well as the raw alignment statistics.
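
As a reading aid for Table 1, the short Python sketch below takes the counts of library NCig10013 and expresses each category as a fraction of the library total. It rests on two assumptions that are a reading of the table rather than a statement from the authors: that the nine columns after Total partition it, and that Promoter, Exon and Non annotated together make up the CAGEscan pairs that passed all filters. The function name `fractions` is illustrative only.

```python
# Counts for library NCig10013, copied from Table 1.
row = {
    "Total": 12_980_474,
    "Unextracted": 865_232,
    "Artefacts": 53_578,
    "rDNA": 8_620_490,
    "Non aligned": 303_381,
    "Non proper": 336_125,
    "Duplicates": 914_464,
    "Promoter": 352_035,
    "Exon": 557_012,
    "Non annotated": 978_157,
}


def fractions(counts):
    """Return every category except Total as a fraction of Total."""
    total = counts["Total"]
    return {key: value / total for key, value in counts.items() if key != "Total"}


# Sum of the nine categories (assumed to partition the total).
parts_sum = sum(value for key, value in row.items() if key != "Total")

# Assumed definition of the CAGEscan pairs: the three annotation classes.
cagescan_pairs = row["Promoter"] + row["Exon"] + row["Non annotated"]

shares = fractions(row)
```

For this library the categories add up exactly to the total, and rDNA-matching pairs account for roughly two thirds of it.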
Figure 1

ZENBU view of CAGEscan data.

CAGEscan clusters revealing new promoters for the SH3BGRL2 gene. Features on the plus and minus strand are displayed in green and purple respectively. Promoter regions of interest are highlighted with ellipses in track (d). (a) Genomic coordinates. (b) FANTOM5 CAGE signal as a quantitative histogram. (c) CAGEscan CAGE signal. (d) CAGEscan meta-clusters, combining pairs for all libraries. The name of the seed CAGE peak is indicated on the left of each cluster. (e) NCBI Gene bodies. (f) GENCODE 19 annotations. (g) GenBank mRNA sequences. (h) EST sequences supporting the CAGEscan clusters.

Methods

All human samples used in the project were either exempted material (available in public collections or commercially available) or provided under informed consent. All non-exempt material is covered under RIKEN Yokohama Ethics applications (H17-34 and H21-14). The CAGEscan libraries were prepared as described earlier[8]. In brief, 500 ng of RNA were reverse-transcribed in the presence of random primers and template-switching oligonucleotides, amplified by PCR and sequenced paired-end (2×36 nt) on Illumina GAIIx sequencers, one sample per lane. The barcode sequence GCTATA, present in every sample, acted as the spacer that we introduced in ref. 9 to decrease the amount of strand-invasion artefacts. The paired-end sequences were then processed with the MOIRAI workflow system[10], with a template implementing the workflow OP-WORKFLOW-CAGEscan-FANTOM5-v1.0, described below and in Fig. 2.
Figure 2

FANTOM5 CAGEscan processing workflow.

Processing pipeline. The diagram made of boxes connected by black arrows displays the MOIRAI workflow completed for one (NCig10013) of the 62 CAGEscan libraries. The coloured text and arrows overlaid on the diagram represent the points where the main alignment statistics are calculated to summarise the number of read pairs passing all the filters (CAGEscan pairs) or discarded at each step of the processing pipeline (Unextracted, rDNA, Artefacts, Non-aligned, Non-proper, Duplicates).

For each pair, the first (CAGE) and second (CAGEscan) reads in FASTQ format were demultiplexed. The first 9 bases of the CAGE reads were trimmed as they contain the sample barcode and the template-switching linker. CAGEscan paired-end reads that did not contain the exact barcode and linker sequences were discarded. The first 6 bases of the CAGEscan reads were trimmed because they originate from the random primers and not the cDNAs, and are therefore prone to errors caused by mismatches during hybridisation to the RNAs, which are well tolerated by the reverse-transcriptase[11]. The CAGE and CAGEscan reads were then filtered independently with the TagDust program version 1.13 (ref. 12), using the sequences of empty constructs and primers as artefact library. They were then compared to reference sequences of ribosomal genes (GenBank: U13369.1) using the rRNAdust program version 1.03. Reads whose mates were discarded by these two filters were then removed. The cleaned paired-end reads in FASTQ format were then aligned to the human genome version hg19 with BWA version 0.7.15 (ref. 13) using standard parameters, except that the maximum insert length (-a) was set to 2 Mbp to allow pairs to map on different exons, and that insert size detection was disabled (-A). Extra header records (for SQ: AS and for RG: CN, ID, LB, PU, SM, and PL) were added to ease processing and tracking. The resulting BWA alignments in SAM format were then converted to BAM format, and unmapped as well as non-properly paired CAGE reads were discarded (flag 0x42). The resulting ‘CAGEscan pairs’ provide individual experimental information on the association of a single-nucleotide-resolution TSS with the body of a gene product. The CAGEscan pairs were then converted to BED12 format using the program pairedBamToBed12 version 1.2, in which the score field is the sum of the mapping qualities of each read of the pair.
The CAGEscan pairs were then assembled into CAGEscan clusters using the CAGEscan-Clustering script version 1.2 and the Phase 1+2 FANTOM5 DPI CAGE peaks as seeds. The CAGEscan-Clustering script also takes advantage of the BED12 format, reporting the number of CAGEscan paired-end reads used to assemble each cluster via the score field, and the name and position of the seeding CAGE peak via the name, thickStart and thickEnd fields respectively. Finally, the CAGEscan clusters from all libraries were combined into a single global assembly of ‘meta-clusters’ using the same program, and output in BED12 files where the score indicates the number of libraries contributing data to each meta-cluster.
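
The seeded clustering idea can be illustrated with a minimal Python sketch. This is not the CAGEscan-Clustering script itself: the data layout (dictionaries with `chrom`, `strand`, `start`, `end` and `blocks` keys), the function names and the peak names are hypothetical, and merging overlapping blocks into cluster exons is an assumption about how a transcript model could be summarised. Coordinates are 0-based and half-open, as in BED.

```python
from collections import defaultdict


def five_prime(pair):
    """0-based position of the first transcribed base of the CAGE read."""
    return pair["start"] if pair["strand"] == "+" else pair["end"] - 1


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def cluster_pairs(peaks, pairs):
    """Group pairs by the seed peak that contains their 5' end."""
    by_peak = defaultdict(list)
    for pair in pairs:
        pos = five_prime(pair)
        for peak in peaks:
            same_strand = (peak["chrom"], peak["strand"]) == (pair["chrom"], pair["strand"])
            if same_strand and peak["start"] <= pos < peak["end"]:
                by_peak[peak["name"]].append(pair)
                break  # a pair is assigned to at most one seed
    clusters = {}
    for name, members in by_peak.items():
        blocks = merge_intervals([b for p in members for b in p["blocks"]])
        clusters[name] = {"score": len(members), "blocks": blocks}
    return clusters


# Toy data (hypothetical peaks and pairs).
peaks = [
    {"name": "peak_plus", "chrom": "chr1", "strand": "+", "start": 100, "end": 110},
    {"name": "peak_minus", "chrom": "chr1", "strand": "-", "start": 900, "end": 910},
]
pairs = [
    {"chrom": "chr1", "strand": "+", "start": 102, "end": 500, "blocks": [(102, 138), (464, 500)]},
    {"chrom": "chr1", "strand": "+", "start": 105, "end": 300, "blocks": [(105, 141), (264, 300)]},
    {"chrom": "chr1", "strand": "+", "start": 200, "end": 400, "blocks": [(200, 236), (364, 400)]},
    {"chrom": "chr1", "strand": "-", "start": 600, "end": 905, "blocks": [(600, 636), (869, 905)]},
]
clusters = cluster_pairs(peaks, pairs)
```

In this toy example the third pair starts outside any seed and is left unclustered, while the score of each cluster counts the pairs it was assembled from, mirroring the use of the score field described above.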

Code availability

The MOIRAI workflow template used to process the libraries is available as a supplemental XML file (Data Citation 1). MOIRAI enabled the design of a complete data processing pipeline based on the following software: FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/), TagDust 1.13 (ref. 12), rRNAdust 1.03 (http://fantom.gsc.riken.jp/5/sstar/Protocols:rRNAdust) (note that for new projects, we recommend TagDust 2 instead of TagDust 1 and rRNAdust), BWA 0.7.15-r1140[13], SAMtools 0.1.19-44428cd[14], pairedBamToBed12 1.2 (https://github.com/Population-Transcriptomics/pairedBamToBed12, Data Citation 2), CAGEscan-Clustering.pl 1.2 (https://github.com/nicolas-bertin/CAGEscan-Clustering, Data Citation 3) and promexinstats.sh for the annotation (see Data Citation 1). The software above and standard Unix tools are sufficient to re-implement the pipeline in a different workflow system.

Data Records

Each CAGEscan library is described with a Sample and Data Relationship Format (SDRF) record, together with the rest of the FANTOM5 data[15]. For each library, raw sequences in FASTQ format, alignment data in BAM format (including unmapped reads), CAGEscan pairs in BED12 format, CAGEscan clusters in BED12 format and alignment statistics as plain-text, tab-delimited triples (subject, predicate, object) are available in the FANTOM5 data repository (http://fantom.gsc.riken.jp/5/datafiles/phase2.3/basic/). The raw sequences have also been deposited in the DDBJ Sequence Read Archive (Data Citation 4).
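
For readers who want to inspect the BED12 files programmatically, a minimal Python sketch of a parser is given below. It follows the standard BED12 column order; the class and function names are illustrative, and the example line (including its peak name and score) is made up rather than taken from the FANTOM5 repository.

```python
from dataclasses import dataclass


@dataclass
class Bed12:
    chrom: str
    start: int
    end: int
    name: str         # in CAGEscan clusters: name of the seeding CAGE peak
    score: int        # in CAGEscan clusters: number of pairs in the cluster
    strand: str
    thick_start: int  # in CAGEscan clusters: start of the seeding CAGE peak
    thick_end: int    # in CAGEscan clusters: end of the seeding CAGE peak
    blocks: list      # (start, end) of each block in genome coordinates


def parse_bed12(line):
    """Parse one tab-delimited BED12 line into a Bed12 record."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 fields, found {len(fields)}")
    start = int(fields[1])
    # blockSizes and blockStarts are comma-separated, optionally with a trailing comma.
    sizes = [int(x) for x in fields[10].rstrip(",").split(",")]
    offsets = [int(x) for x in fields[11].rstrip(",").split(",")]
    blocks = [(start + off, start + off + size) for size, off in zip(sizes, offsets)]
    return Bed12(fields[0], start, int(fields[2]), fields[3], int(fields[4]),
                 fields[5], int(fields[6]), int(fields[7]), blocks)


# Hypothetical cluster line, for illustration only.
example = "chr1\t1000\t5000\thypothetical_peak\t7\t+\t1000\t1020\t0\t2\t200,300,\t0,3700,\n"
record = parse_bed12(example)
```

The blocks are converted from offsets relative to the cluster start into absolute genome coordinates, which is usually more convenient when intersecting with annotations.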

Technical Validation

We derived individual library alignment statistics from the MOIRAI data processing pipeline (see Table 1 and Figs 2 and 3a). The statistics count the number of reads discarded at key steps of the processing. ‘Unextracted’ are pairs where the linker was not found, ‘Artefacts’ are pairs that matched the artefact library, or had a low complexity, ‘rDNA’ are pairs that matched the reference rDNA locus (including rRNAs and their spacer regions), ‘Non-aligned’ are pairs where one or both mates were not aligned to the genome, and ‘Non-proper’ are pairs where the mates were not aligned in head-to-head orientation within 2 Mbp. ‘Duplicates’ are the pairs removed during the deduplication step. That is, when there are n pairs with identical coordinates, 1 is kept and n−1 are discarded as ‘Duplicates’. These statistics show that the amount of PCR duplicates was not larger than the number of CAGEscan pairs, suggesting that the libraries prepared in this study have not been fully exhausted by sequencing.
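
The last three discard categories can be made concrete with a small Python sketch. The flag bits are the standard ones from the SAM specification, and 0x42 (the mask quoted in the Methods) is the combination of "first read of the pair" and "properly paired". How the pipeline derives its categories from the alignments is assumed here for illustration; the function names and the representation of a pair as a coordinate tuple are hypothetical.

```python
from collections import Counter

# Standard SAM flag bits.
PROPER_PAIR = 0x2
UNMAPPED = 0x4
MATE_UNMAPPED = 0x8
FIRST_IN_PAIR = 0x40


def classify(flag):
    """Classify a pair from the SAM flag of its first (CAGE) read."""
    if flag & (UNMAPPED | MATE_UNMAPPED):
        return "Non-aligned"  # one or both mates not aligned
    if not flag & PROPER_PAIR:
        return "Non-proper"   # both aligned, but not as a proper pair
    return "Aligned"


def deduplicate(pairs):
    """Of n pairs with identical coordinates, keep 1 and count n - 1 duplicates."""
    counts = Counter(pairs)
    kept = list(counts)  # one representative per distinct coordinate tuple
    duplicates = sum(n - 1 for n in counts.values())
    return kept, duplicates


# Toy pairs as (chrom, start, end, strand) tuples.
toy_pairs = [("chr1", 100, 500, "+")] * 3 + [("chr1", 100, 480, "+")]
kept, n_duplicates = deduplicate(toy_pairs)
```

With the toy input, three identical pairs collapse to one, so two pairs are counted as duplicates and two distinct pairs are kept.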
Figure 3

Alignment and annotation statistics. Quality control statistics.

(a) Fraction of pairs passing all filters (CAGEscan pairs) or discarded at key steps of the processing pipeline (see Fig. 2). The central block of stacked bars represents each library individually. The left block aggregates them by sequencing batch, named by the sequencing run identifier. The right block aggregates the libraries by sample type. Each sample type is represented by one colour, which is also used to colour the library identifiers and the sequence identifiers in the other blocks. Batches comprising multiple types are indicated by multiple colours. (b) Fraction of pairs starting in a Promoter, Exon, or Other (non-promoter, non-exon) region.

The library alignment statistics, as well as statistics describing the distribution of CAGEscan TSSs on GENCODE 19 annotations (Fig. 3b), also suggest that the biological nature of the samples (cancer cell lines, primary cells, tissue samples and brain tissue) strongly influenced the performance of the CAGEscan protocol used in this study. Although they displayed the best performance in terms of alignment (largest fraction of CAGEscan pairs), brain tissue-derived samples had the lowest rate of start sites overlapping known promoters, hinting at a much greater diversity of alternative promoter usage in human brain. However, since all brain tissue-derived samples in this study were taken from a single donor, this observation may result from a technical batch effect rather than being a general feature of the human brain transcriptome. To assess the reproducibility and consistency of our libraries, we computed a Jaccard similarity index between the lists of FANTOM5 CAGE peaks detected in each possible pair of libraries. For each sample analysed in duplicate, the library with the highest similarity was the replicate (Fig. 4). Hierarchical clustering of the libraries tended to group the samples by type rather than by batch. Accordingly, library NCig10014, typed as ‘Tissue’ together with other samples obtained from Ambion’s FirstChoice Human Total RNA Survey Panel, and containing its brain RNA pool, clustered with the donor-derived ‘Brain samples’. Together with the similarity of replicates, this provides confidence that the data reflect the biological contents of the libraries and not batch effects.
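
The Jaccard similarity index between two libraries is the number of CAGE peaks detected in both, divided by the number detected in at least one. A minimal Python sketch of this comparison follows; the library and peak names are toy values, and `nearest_library` is a hypothetical helper that mimics the check that each replicate is most similar to its partner.

```python
def jaccard(peaks_a, peaks_b):
    """Jaccard index: size of the intersection over size of the union."""
    a, b = set(peaks_a), set(peaks_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_library(libraries):
    """For each library, name the other library with the highest Jaccard index."""
    best = {}
    for name, peaks in libraries.items():
        scores = {other: jaccard(peaks, other_peaks)
                  for other, other_peaks in libraries.items() if other != name}
        best[name] = max(scores, key=scores.get)
    return best


# Toy peak lists (hypothetical names).
libraries = {
    "sampleA_rep1": {"p1", "p2", "p3", "p4"},
    "sampleA_rep2": {"p1", "p2", "p3", "p5"},
    "sampleB": {"p4", "p6", "p7"},
}
nearest = nearest_library(libraries)
```

In the toy data the two replicates share three of five distinct peaks (index 0.6) and are each other's nearest library.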
Figure 4

Similarity between libraries.

Heatmap of the Jaccard similarity indexes computed between each pair of libraries. Sample type and batches are indicated by a colour code near library names, and pairs of replicates are indicated by an asterisk superimposed on the square displaying their similarity index.

Usage Notes

We have seeded the CAGEscan clustering with FANTOM5 CAGE-defined core promoter regions; however, alternative seeding strategies could be envisioned. The 5′ ends of the CAGEscan pairs themselves could be clustered by peak calling and used as a seed, which is the default mode of operation of the pairedBamToBed12 tool. Foregoing the discovery of alternative promoters, CAGEscan clusters could also be seeded using promoter regions defined by GENCODE models. To discover potential enhancer-associated non-coding RNAs, regions corresponding to FANTOM5 enhancers[16] could also be used. We used a simple alignment strategy that did not take splicing into account. Thus, pairs overlapping splice junctions could not be mapped and CAGEscan clusters lack coverage at the beginning and end of each exon, but this only mildly impacts the main purpose of the method. In addition, since the CAGEscan pairs are anchored at the 5′ end of the transcripts, splice junctions occurring close to the TSS may render some whole loci unmappable. Indeed, transcript databases such as GENCODE reveal splice junctions very near to the TSS. Trimming the CAGE reads to 20 nt rescued some loci, but other loci were lost due to the decrease in alignment stringency (data not shown). One of the most striking differences between the HeliScopeCAGE-based FANTOM5 CAGE data and the nanoCAGE-based FANTOM5 CAGEscan data is a larger number of start sites in the gene body, far from the promoter. This can be explained by the lower stringency of the nanoCAGE protocol, which uses template-switching for capturing 5′ ends from limiting amounts of samples[6], where the HeliScopeCAGE protocol, which uses CAP Trapper[17], would not be possible. Readers curious about the position of the random priming site, indicated by the end position of the CAGEscan pairs, will notice that their distribution is very far from random.
Control experiments performed using different batches of random primers ordered from different manufacturers confirmed that the quality of the oligonucleotides was not in question (data not shown). In the latest version of the nanoCAGE protocol[18], this problem was solved by fragmenting the cDNAs with the ‘tagmentation’ method. Altogether, we recommend using our latest protocol for making new libraries. In this study, the CAGEscan libraries were prepared using the nanoCAGE method, but the CAGEscan workflow, which can use any paired-end sequencing of CAGE libraries where the 3′ sequencing read is at a random position in the cDNA, can be applied to other publicly available datasets, for instance those made with the RAMPAGE method[19].

Additional Information

How to cite this article: Bertin, N. et al. Linking FANTOM5 CAGE peaks to annotations with CAGEscan. Sci. Data 4:170147 doi: 10.1038/sdata.2017.147 (2017). Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
  20 in total

1.  Unamplified cap analysis of gene expression on a single-molecule sequencer.

Authors:  Mutsumi Kanamori-Katayama; Masayoshi Itoh; Hideya Kawaji; Timo Lassmann; Shintaro Katayama; Miki Kojima; Nicolas Bertin; Ai Kaiho; Noriko Ninomiya; Carsten O Daub; Piero Carninci; Alistair R R Forrest; Yoshihide Hayashizaki
Journal:  Genome Res       Date:  2011-05-19       Impact factor: 9.043

2.  NanoCAGE: A Method for the Analysis of Coding and Noncoding 5'-Capped Transcriptomes.

Authors:  Stéphane Poulain; Sachi Kato; Ophélie Arnaud; Jean-Étienne Morlighem; Makoto Suzuki; Charles Plessy; Matthias Harbers
Journal:  Methods Mol Biol       Date:  2017

3.  Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells.

Authors:  Erik Arner; Carsten O Daub; Kristoffer Vitting-Seerup; Robin Andersson; Berit Lilje; Finn Drabløs; Andreas Lennartsson; Michelle Rönnerblad; Olga Hrydziuszko; Morana Vitezic; Tom C Freeman; Ahmad M N Alhendi; Peter Arner; Richard Axton; J Kenneth Baillie; Anthony Beckhouse; Beatrice Bodega; James Briggs; Frank Brombacher; Margaret Davis; Michael Detmar; Anna Ehrlund; Mitsuhiro Endoh; Afsaneh Eslami; Michela Fagiolini; Lynsey Fairbairn; Geoffrey J Faulkner; Carmelo Ferrai; Malcolm E Fisher; Lesley Forrester; Daniel Goldowitz; Reto Guler; Thomas Ha; Mitsuko Hara; Meenhard Herlyn; Tomokatsu Ikawa; Chieko Kai; Hiroshi Kawamoto; Levon M Khachigian; S Peter Klinken; Soichi Kojima; Haruhiko Koseki; Sarah Klein; Niklas Mejhert; Ken Miyaguchi; Yosuke Mizuno; Mitsuru Morimoto; Kelly J Morris; Christine Mummery; Yutaka Nakachi; Soichi Ogishima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry Ovchinnikov; Robert Passier; Margaret Patrikakis; Ana Pombo; Xian-Yang Qin; Sugata Roy; Hiroki Sato; Suzana Savvi; Alka Saxena; Anita Schwegmann; Daisuke Sugiyama; Rolf Swoboda; Hiroshi Tanaka; Andru Tomoiu; Louise N Winteringham; Ernst Wolvetang; Chiyo Yanagi-Mizuochi; Misako Yoneda; Susan Zabierowski; Peter Zhang; Imad Abugessaisa; Nicolas Bertin; Alexander D Diehl; Shiro Fukuda; Masaaki Furuno; Jayson Harshbarger; Akira Hasegawa; Fumi Hori; Sachi Ishikawa-Kato; Yuri Ishizu; Masayoshi Itoh; Tsugumi Kawashima; Miki Kojima; Naoto Kondo; Marina Lizio; Terrence F Meehan; Christopher J Mungall; Mitsuyoshi Murata; Hiromi Nishiyori-Sueki; Serkan Sahin; Sayaka Nagao-Sato; Jessica Severin; Michiel J L de Hoon; Jun Kawai; Takeya Kasukawa; Timo Lassmann; Harukazu Suzuki; Hideya Kawaji; Kim M Summers; Christine Wells; David A Hume; Alistair R R Forrest; Albin Sandelin; Piero Carninci; Yoshihide Hayashizaki
Journal:  Science       Date:  2015-02-12       Impact factor: 47.728

4.  The Sequence Alignment/Map format and SAMtools.

Authors:  Heng Li; Bob Handsaker; Alec Wysoker; Tim Fennell; Jue Ruan; Nils Homer; Gabor Marth; Goncalo Abecasis; Richard Durbin
Journal:  Bioinformatics       Date:  2009-06-08       Impact factor: 6.937

5.  A promoter-level mammalian expression atlas.

Authors:  Alistair R R Forrest; Hideya Kawaji; Michael Rehli; J Kenneth Baillie; Michiel J L de Hoon; Vanja Haberle; Timo Lassmann; Ivan V Kulakovskiy; Marina Lizio; Masayoshi Itoh; Robin Andersson; Christopher J Mungall; Terrence F Meehan; Sebastian Schmeier; Nicolas Bertin; Mette Jørgensen; Emmanuel Dimont; Erik Arner; Christian Schmidl; Ulf Schaefer; Yulia A Medvedeva; Charles Plessy; Morana Vitezic; Jessica Severin; Colin A Semple; Yuri Ishizu; Robert S Young; Margherita Francescatto; Intikhab Alam; Davide Albanese; Gabriel M Altschuler; Takahiro Arakawa; John A C Archer; Peter Arner; Magda Babina; Sarah Rennie; Piotr J Balwierz; Anthony G Beckhouse; Swati Pradhan-Bhatt; Judith A Blake; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; A Maxwell Burroughs; Andrea Califano; Carlo V Cannistraci; Daniel Carbajo; Yun Chen; Marco Chierici; Yari Ciani; Hans C Clevers; Emiliano Dalla; Carrie A Davis; Michael Detmar; Alexander D Diehl; Taeko Dohi; Finn Drabløs; Albert S B Edge; Matthias Edinger; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Michela Fagiolini; Lynsey Fairbairn; Hai Fang; Mary C Farach-Carson; Geoffrey J Faulkner; Alexander V Favorov; Malcolm E Fisher; Martin C Frith; Rie Fujita; Shiro Fukuda; Cesare Furlanello; Masaaki Furino; Jun-ichi Furusawa; Teunis B Geijtenbeek; Andrew P Gibson; Thomas Gingeras; Daniel Goldowitz; Julian Gough; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Matthias Harbers; Jayson Harshbarger; Akira Hasegawa; Yuki Hasegawa; Takehiro Hashimoto; Meenhard Herlyn; Kelly J Hitchens; Shannan J Ho Sui; Oliver M Hofmann; Ilka Hoof; Furni Hori; Lukasz Huminiecki; Kei Iida; Tomokatsu Ikawa; Boris R Jankovic; Hui Jia; Anagha Joshi; Giuseppe Jurman; Bogumil Kaczkowski; Chieko Kai; Kaoru Kaida; Ai Kaiho; Kazuhiro Kajiyama; Mutsumi Kanamori-Katayama; Artem S Kasianov; Takeya Kasukawa; Shintaro Katayama; Sachi Kato; Shuji Kawaguchi; Hiroshi Kawamoto; Yuki I Kawamura; 
Tsugumi Kawashima; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; S Peter Klinken; Alan J Knox; Miki Kojima; Soichi Kojima; Naoto Kondo; Haruhiko Koseki; Shigeo Koyasu; Sarah Krampitz; Atsutaka Kubosaki; Andrew T Kwon; Jeroen F J Laros; Weonju Lee; Andreas Lennartsson; Kang Li; Berit Lilje; Leonard Lipovich; Alan Mackay-Sim; Ri-ichiroh Manabe; Jessica C Mar; Benoit Marchand; Anthony Mathelier; Niklas Mejhert; Alison Meynert; Yosuke Mizuno; David A de Lima Morais; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Efthymios Motakis; Hozumi Motohashi; Christine L Mummery; Mitsuyoshi Murata; Sayaka Nagao-Sato; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Kenichi Nakazato; Erik van Nimwegen; Noriko Ninomiya; Hiromi Nishiyori; Shohei Noma; Shohei Noma; Tadasuke Nozaki; Soichi Ogishima; Naganari Ohkura; Hiroko Ohmiya; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Arnab Pain; Robert Passier; Margaret Patrikakis; Helena Persson; Silvano Piazza; James G D Prendergast; Owen J L Rackham; Jordan A Ramilowski; Mamoon Rashid; Timothy Ravasi; Patrizia Rizzu; Marco Roncador; Sugata Roy; Morten B Rye; Eri Saijyo; Antti Sajantila; Akiko Saka; Shimon Sakaguchi; Mizuho Sakai; Hiroki Sato; Suzana Savvi; Alka Saxena; Claudio Schneider; Erik A Schultes; Gundula G Schulze-Tanzil; Anita Schwegmann; Thierry Sengstag; Guojun Sheng; Hisashi Shimoji; Yishai Shimoni; Jay W Shin; Christophe Simon; Daisuke Sugiyama; Takaaki Sugiyama; Masanori Suzuki; Naoko Suzuki; Rolf K Swoboda; Peter A C 't Hoen; Michihira Tagami; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Zuotian Tatum; Mark Thompson; Hiroo Toyoda; Tetsuro Toyoda; Eivind Valen; Marc van de Wetering; Linda M van den Berg; Roberto Verardo; Dipti Vijayan; Ilya E Vorontsov; Wyeth W Wasserman; Shoko Watanabe; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Emily J Wood; Yoko Yamaguchi; Masayuki Yamamoto; Misako Yoneda; Yohei Yonekura; Shigehiro Yoshida; Susan E Zabierowski; Peter G Zhang; Xiaobei Zhao; Silvia Zucchelli; Kim M Summers; Harukazu Suzuki; Carsten O Daub; Jun Kawai; Peter Heutink; Winston Hide; Tom C Freeman; Boris Lenhard; Vladimir B Bajic; Martin S Taylor; Vsevolod J Makeev; Albin Sandelin; David A Hume; Piero Carninci; Yoshihide Hayashizaki
Journal:  Nature       Date:  2014-03-27       Impact factor: 49.962

6.  High-fidelity promoter profiling reveals widespread alternative promoter usage and transposon-driven developmental gene expression.

Authors:  Philippe Batut; Alexander Dobin; Charles Plessy; Piero Carninci; Thomas R Gingeras
Journal:  Genome Res       Date:  2012-08-30       Impact factor: 9.043

7.  Digital expression profiling of the compartmentalized translatome of Purkinje neurons.

Authors:  Anton Kratz; Pascal Beguin; Megumi Kaneko; Takahiko Chimura; Ana Maria Suzuki; Atsuko Matsunaga; Sachi Kato; Nicolas Bertin; Timo Lassmann; Réjan Vigot; Piero Carninci; Charles Plessy; Thomas Launey
Journal:  Genome Res       Date:  2014-06-05       Impact factor: 9.043

8.  FANTOM5 transcriptome catalog of cellular states based on Semantic MediaWiki.

Authors:  Imad Abugessaisa; Hisashi Shimoji; Serkan Sahin; Atsushi Kondo; Jayson Harshbarger; Marina Lizio; Yoshihide Hayashizaki; Piero Carninci; Alistair Forrest; Takeya Kasukawa; Hideya Kawaji
Journal:  Database (Oxford)       Date:  2016-07-09       Impact factor: 3.451

9.  Suppression of artifacts and barcode bias in high-throughput transcriptome analyses utilizing template switching.

Authors:  Dave T P Tang; Charles Plessy; Md Salimullah; Ana Maria Suzuki; Raffaella Calligaris; Stefano Gustincich; Piero Carninci
Journal:  Nucleic Acids Res       Date:  2012-11-24       Impact factor: 16.971

10.  FANTOM5 CAGE profiles of human and mouse samples.

Authors:  Shuhei Noguchi; Takahiro Arakawa; Shiro Fukuda; Masaaki Furuno; Akira Hasegawa; Fumi Hori; Sachi Ishikawa-Kato; Kaoru Kaida; Ai Kaiho; Mutsumi Kanamori-Katayama; Tsugumi Kawashima; Miki Kojima; Atsutaka Kubosaki; Ri-Ichiroh Manabe; Mitsuyoshi Murata; Sayaka Nagao-Sato; Kenichi Nakazato; Noriko Ninomiya; Hiromi Nishiyori-Sueki; Shohei Noma; Eri Saijyo; Akiko Saka; Mizuho Sakai; Christophe Simon; Naoko Suzuki; Michihira Tagami; Shoko Watanabe; Shigehiro Yoshida; Peter Arner; Richard A Axton; Magda Babina; J Kenneth Baillie; Timothy C Barnett; Anthony G Beckhouse; Antje Blumenthal; Beatrice Bodega; Alessandro Bonetti; James Briggs; Frank Brombacher; Ailsa J Carlisle; Hans C Clevers; Carrie A Davis; Michael Detmar; Taeko Dohi; Albert S B Edge; Matthias Edinger; Anna Ehrlund; Karl Ekwall; Mitsuhiro Endoh; Hideki Enomoto; Afsaneh Eslami; Michela Fagiolini; Lynsey Fairbairn; Mary C Farach-Carson; Geoffrey J Faulkner; Carmelo Ferrai; Malcolm E Fisher; Lesley M Forrester; Rie Fujita; Jun-Ichi Furusawa; Teunis B Geijtenbeek; Thomas Gingeras; Daniel Goldowitz; Sven Guhl; Reto Guler; Stefano Gustincich; Thomas J Ha; Masahide Hamaguchi; Mitsuko Hara; Yuki Hasegawa; Meenhard Herlyn; Peter Heutink; Kelly J Hitchens; David A Hume; Tomokatsu Ikawa; Yuri Ishizu; Chieko Kai; Hiroshi Kawamoto; Yuki I Kawamura; Judith S Kempfle; Tony J Kenna; Juha Kere; Levon M Khachigian; Toshio Kitamura; Sarah Klein; S Peter Klinken; Alan J Knox; Soichi Kojima; Haruhiko Koseki; Shigeo Koyasu; Weonju Lee; Andreas Lennartsson; Alan Mackay-Sim; Niklas Mejhert; Yosuke Mizuno; Hiromasa Morikawa; Mitsuru Morimoto; Kazuyo Moro; Kelly J Morris; Hozumi Motohashi; Christine L Mummery; Yutaka Nakachi; Fumio Nakahara; Toshiyuki Nakamura; Yukio Nakamura; Tadasuke Nozaki; Soichi Ogishima; Naganari Ohkura; Hiroshi Ohno; Mitsuhiro Ohshima; Mariko Okada-Hatakeyama; Yasushi Okazaki; Valerio Orlando; Dmitry A Ovchinnikov; Robert Passier; Margaret Patrikakis; Ana Pombo; Swati Pradhan-Bhatt; Xian-Yang Qin; 
Michael Rehli; Patrizia Rizzu; Sugata Roy; Antti Sajantila; Shimon Sakaguchi; Hiroki Sato; Hironori Satoh; Suzana Savvi; Alka Saxena; Christian Schmidl; Claudio Schneider; Gundula G Schulze-Tanzil; Anita Schwegmann; Guojun Sheng; Jay W Shin; Daisuke Sugiyama; Takaaki Sugiyama; Kim M Summers; Naoko Takahashi; Jun Takai; Hiroshi Tanaka; Hideki Tatsukawa; Andru Tomoiu; Hiroo Toyoda; Marc van de Wetering; Linda M van den Berg; Roberto Verardo; Dipti Vijayan; Christine A Wells; Louise N Winteringham; Ernst Wolvetang; Yoko Yamaguchi; Masayuki Yamamoto; Chiyo Yanagi-Mizuochi; Misako Yoneda; Yohei Yonekura; Peter G Zhang; Silvia Zucchelli; Imad Abugessaisa; Erik Arner; Jayson Harshbarger; Atsushi Kondo; Timo Lassmann; Marina Lizio; Serkan Sahin; Thierry Sengstag; Jessica Severin; Hisashi Shimoji; Masanori Suzuki; Harukazu Suzuki; Jun Kawai; Naoto Kondo; Masayoshi Itoh; Carsten O Daub; Takeya Kasukawa; Hideya Kawaji; Piero Carninci; Alistair R R Forrest; Yoshihide Hayashizaki
Journal:  Sci Data       Date:  2017-08-29       Impact factor: 6.444

  6 in total

Review 1.  Computational Methods for Mapping, Assembly and Quantification for Coding and Non-coding Transcripts.

Authors:  Isaac A Babarinde; Yuhao Li; Andrew P Hutchins
Journal:  Comput Struct Biotechnol J       Date:  2019-05-07       Impact factor: 7.271

2.  Update of the FANTOM web resource: expansion to provide additional transcriptome atlases.

Authors:  Marina Lizio; Imad Abugessaisa; Shuhei Noguchi; Atsushi Kondo; Akira Hasegawa; Chung Chau Hon; Michiel de Hoon; Jessica Severin; Shinya Oki; Yoshihide Hayashizaki; Piero Carninci; Takeya Kasukawa; Hideya Kawaji
Journal:  Nucleic Acids Res       Date:  2019-01-08       Impact factor: 16.971

3.  A versatile 5' RACE-Seq methodology for the accurate identification of the 5' termini of mRNAs.

Authors:  Panagiotis G Adamopoulos; Panagiotis Tsiakanikas; Irene Stolidi; Andreas Scorilas
Journal:  BMC Genomics       Date:  2022-02-26       Impact factor: 3.969

Review 4.  Long Non-coding RNAs: Mechanisms, Experimental, and Computational Approaches in Identification, Characterization, and Their Biomarker Potential in Cancer.

Authors:  Anshika Chowdhary; Venkata Satagopam; Reinhard Schneider
Journal:  Front Genet       Date:  2021-07-01       Impact factor: 4.599

5.  Global Analysis of Transcription Start Sites in the New Ovine Reference Genome (Oar rambouillet v1.0).

Authors:  Mazdak Salavati; Alex Caulton; Richard Clark; Iveta Gazova; Timothy P L Smith; Kim C Worley; Noelle E Cockett; Alan L Archibald; Shannon M Clarke; Brenda M Murdoch; Emily L Clark
Journal:  Front Genet       Date:  2020-10-23       Impact factor: 4.599

6.  Evidence That STK19 Is Not an NRAS-dependent Melanoma Driver.

Authors:  Marta Rodríguez-Martínez; Thierry Boissiére; Melvin Noe Gonzalez; Kevin Litchfield; Richard Mitter; Jane Walker; Svend Kjær; Mohamed Ismail; Julian Downward; Charles Swanton; Jesper Q Svejstrup
Journal:  Cell       Date:  2020-06-11       Impact factor: 41.582

