Steven W Wingett, Simon Andrews, Peter Fraser, Jörg Morf.
Abstract
We have previously developed and described a method for measuring RNA co-locations within cells, called Proximity RNA-seq, which promises insights into RNA expression, processing, storage and translation. Here, we describe transcriptome-wide Proximity RNA-seq datasets obtained from human neuroblastoma SH-SY5Y cell nuclei. To aid future users of this method, we also describe and release our analysis pipeline, CloseCall, which maps cDNA to a custom transcript annotation and allocates cDNA-linked barcodes to barcode groups. CloseCall then performs Monte Carlo simulations on the data to identify pairs of transcripts that are co-barcoded more frequently than expected by chance. Furthermore, co-barcoding frequencies derived for individual transcripts, dubbed valency, serve as proxies for the RNA density or connectivity of a given transcript. We outline how this pipeline was applied to these sequencing datasets and openly share the processed data outputs and access to a virtual machine that runs CloseCall. The resulting data specify the spatial organization of RNAs and build hypotheses for potential regulatory relationships between RNAs.
Year: 2020 PMID: 31992717 PMCID: PMC6987088 DOI: 10.1038/s41597-020-0372-3
Source DB: PubMed Journal: Sci Data ISSN: 2052-4463 Impact factor: 6.444
Fig. 1 Workflow of Proximity RNA-seq, CloseCall and sample processing. Reverse transcription and RNA co-barcoding in droplets generate sequencing libraries with cDNAs, whose RNA templates were in spatial proximity, sharing the same barcode (blue box). After Illumina sequencing, single-end sequence reads from the FASTQ file are validated to contain a defined primer sequence. Barcodes of cDNAs are extracted and, if near-identical, grouped together (green box). In parallel, the cDNA part of each read is mapped onto the genome and allocated to a custom transcriptome annotation (yellow box). The Monte Carlo simulation takes merged datasets as input and randomises cDNA-barcode pairings to derive expected co-barcoding frequencies for RNA pairs.
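The barcode grouping step in the green box can be illustrated with a minimal sketch. It assumes that "near-identical" means a Hamming distance of at most one mismatch between equal-length barcodes and that groups are formed by single linkage; both the threshold and the function names are illustrative assumptions, not CloseCall parameters (the pipeline itself maps trimmed barcodes and derives groups from the resulting SAM file).

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def group_barcodes(barcodes, max_mismatches=1):
    """Cluster barcodes so that near-identical ones fall into the same group.

    Single-linkage clustering with union-find. The mismatch threshold is an
    illustrative assumption, not a value taken from CloseCall.
    """
    unique = sorted(set(barcodes))
    parent = {bc: bc for bc in unique}

    def find(bc):
        # Follow parent links to the group representative (with path halving).
        while parent[bc] != bc:
            parent[bc] = parent[parent[bc]]
            bc = parent[bc]
        return bc

    # All-vs-all comparison: acceptable for a toy example only.
    for a, b in combinations(unique, 2):
        if len(a) == len(b) and hamming(a, b) <= max_mismatches:
            parent[find(b)] = find(a)

    groups = defaultdict(list)
    for bc in unique:
        groups[find(bc)].append(bc)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical barcodes: the first two differ at a single position.
example = ["ACGTACGT", "ACGTACGA", "TTTTCCCC", "TTTTCCCC", "GGGGAAAA"]
groups = group_barcodes(example)
print(groups)
```

An all-vs-all comparison scales quadratically and would not be practical for the tens of millions of unique barcodes in a real library, which is one reason an alignment-based approach is a more natural fit at that scale.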
Primers and adapters used in Proximity RNA-seq.
| Name | Modification | Sequence (5′-3′) | Purification |
|---|---|---|---|
| Primer-F | | CCATCTCATCCCTGCGTGTC | HPLC |
| Primer-R | | CCTATCCCCTGTGTGCCTTG | HPLC |
| Primer-R | 5′ dual biotin | CCACTACGCTCGCTATCCTATCCCCTGTGTGCCTTG | HPLC |
| Random barcode template | | CCATCTCATCCCTGCGTGTCNNNNNNNNNNNNNNNNNNNNNNNNNNGATCGTCGGACTGTAGAACTCCCTATAGTGAGTCGTATTACAAGGCACACAGGGGATAGG | PAGE |
| Random tail primer F | | NNNNNNNNNNNNNNNCCATCTCATCCCTGCGTGTC | HPLC |
| T7 probe | 5′ Cy5 | CGTCGGACTGTAGAACTCCCTATAGTGAGTCGTA | HPLC |
| PolyC12 primer | | GCCTTGGCACCCGAGAATTCCACCCCCCCCCCCC | PAGE |
| RP1 | | AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA | PAGE |
| Index 6 | | CAAGCAGAAGACGGCATACGAGATATTGGCGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA | PAGE |
| Index 12 | | CAAGCAGAAGACGGCATACGAGATTACAAGGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA | PAGE |
| Index 5 | | CAAGCAGAAGACGGCATACGAGATCACTGTGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA | PAGE |
| Index 19 | | CAAGCAGAAGACGGCATACGAGATTTTCACGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA | PAGE |
CloseCall scripts.
| Analysis Area (Fig. | Step | Script Name | Input | Main Output |
|---|---|---|---|---|
| Transcript Analysis | 1 | create_features.pl | List of gene annotations and list of repeat elements | Transcript annotation file |
| Barcode Analysis | 2 | check_sequence_present.pl | FASTQ file | FASTQ file |
| Barcode Analysis | 3 | problem_barcodes.pl | FASTQ file from step 2 | FASTQ file |
| Barcode Analysis | 4 | map_trimmed_barcodes.pl | FASTQ file from step 3 | SAM file |
| Barcode Analysis | 5 | group_trimmed_barcodes.pl | SAM file from step 4 | Text file of relationships |
| Barcode Analysis | 6 | assign_read_to_barcode_group.pl | FASTQ file from step 3 and relationships file from step 5 | FASTQ file |
| Transcript Analysis | 7 | trim.pl | FASTQ file from step 6 | FASTQ file |
| Transcript Analysis | 8 | CloseCall (pipeline master script) | FASTQ file from step 7 | FASTQ screen summary files |
| Transcript Analysis | 9 | mapper_hisat2.pl | FASTQ file from step 8, genome index files, splice sites | SAM file |
| Transcript Analysis | 10 | map_editor.pl | SAM file from step 9 and list of repeat elements | BAM file |
| Transcript Analysis | 11 | create_data_file_include_noninteracting.pl | BAM file from step 10 | Interactions text file |
| Summary Results | 12 | CloseCall (pipeline master script) | Interactions file from step 11 | Summary results files |
| Transcript Analysis | 13 | identify_reads_by_regions.pl | Interactions file from step 11 | Interactions text file |
| Transcript Analysis | 14 | remove_gene_duplicates.pl | Interactions file from step 12 | Interactions text file |
| Transcript Analysis | 15 | format_for_simulation.pl | Interactions file from step 14 and list of repeat elements | Interactions text file |
| Transcript Analysis | 16 | createDitags_features.pl | Text file from step 15 | Pairwise interactions file |
| Summary Results | 17 | calc_frequeny_interactions.pl | Text file from step 15 | Summary results files |
| Transcript Analysis | 18 | reporter.pl | Summary files generated by the previous steps as the pipeline proceeds | Summary results files |
| Additional Script | Valency | feature_valency_distribution.pl | Text file from step 15 | Text file listing frequencies |
| Monte Carlo Simulation | Sim1 | anacondamontecarlo.jar | Text file from step 15 | Text file of the simulation results, comparing the observed interaction frequencies with those simulated. |
| Monte Carlo Simulation | Sim2 | collate_monte_carlo_results.pl | Simulation results from step Sim1 | Text file of the simulation results, comparing the observed interaction frequencies with those simulated. |
| Monte Carlo Simulation | Sim3 | multiple_testing_correction.pl | Simulation results from either step Sim1 or step Sim2 | Text file listing statistical significance for RNA pairs |
The CloseCall pipeline comprises a series of Perl scripts and a master script (named CloseCall), which executes each of the other scripts in turn as data proceed from step to step. The table lists the name and main functions (in italics) of the scripts in the pipeline, starting from the input FASTQ files produced by sequencing, through the file taken as input by the Monte Carlo simulation, to the final list of statistically significant RNA-RNA proximities.
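The idea behind the Monte Carlo steps (Sim1 to Sim3) can be sketched as follows. This is not the anacondamontecarlo.jar implementation: it is a minimal illustration that assumes the input is a list of barcode groups, each holding the transcript names of its reads, and that a randomisation shuffles reads across groups while keeping group sizes and transcript abundances fixed. The function names, the toy data and the empirical p value formula are assumptions made for the example.

```python
import random
from collections import Counter
from itertools import combinations


def count_pairs(groups):
    """Count, per unordered transcript pair, the barcode groups containing both."""
    counts = Counter()
    for members in groups:
        for pair in combinations(sorted(set(members)), 2):
            counts[pair] += 1
    return counts


def randomise(groups, rng):
    """Shuffle reads across groups, keeping group sizes and transcript abundances."""
    pool = [t for members in groups for t in members]
    rng.shuffle(pool)
    shuffled, start = [], 0
    for members in groups:
        shuffled.append(pool[start:start + len(members)])
        start += len(members)
    return shuffled


def monte_carlo(groups, n_sim=200, seed=0):
    """Return {pair: (observed, mean_simulated, empirical_p)} for observed pairs."""
    rng = random.Random(seed)
    observed = count_pairs(groups)
    totals, at_least = Counter(), Counter()
    for _ in range(n_sim):
        sim = count_pairs(randomise(groups, rng))
        for pair, obs in observed.items():
            totals[pair] += sim[pair]
            if sim[pair] >= obs:
                at_least[pair] += 1
    # Add-one correction so that an empirical p value is never exactly zero.
    return {
        pair: (obs, totals[pair] / n_sim, (at_least[pair] + 1) / (n_sim + 1))
        for pair, obs in observed.items()
    }


# Hypothetical toy data: transcripts A and B are always co-barcoded, as are C and D.
toy_groups = [["A", "B"]] * 10 + [["C", "D"]] * 10
results = monte_carlo(toy_groups)
for pair, (obs, expected, p) in sorted(results.items()):
    print(pair, obs, round(expected, 2), round(p, 4))
```

The ratio of the observed count to the mean simulated count corresponds to the observed/randomised co-barcoding ratio used in the figures; the p values from such a simulation would still need multiple testing correction before pairs are called significant.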
Fig. 2 Quality of Proximity RNA-seq fragments, comparative read mapping statistics and randomisations. (a) Base calling quality scores (y axis) with background colours indicating very good (green), medium (orange) and poor (red) quality plotted against read position (x axis) using FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). (b) Schematic of a sequenced Proximity RNA-seq library fragment with base distributions along raw fragment sequences. Of note, only 50 bases of the cDNA part were subsequently used to ensure the usage of high base quality scores. (c) Read mapping statistics for different datasets. (d) Randomisations of RNA proximities taking transcript-specific abundances and barcode group sizes on beads into account. (e) Plots of observed to randomised co-barcoding count ratios against observed co-barcoding counts. Each dot represents one RNA-RNA pairing. The top panel depicts the co-barcoding ratio of a randomised data set (randomised 1) to the average of all its randomisations, which provides a background distribution. The bottom panel shows the ratio of actual, observed co-barcoding to the average of all its randomisations. In both plots, pairwise RNA proximities with at least 2 observations were included and coloured in grey. Coloured in yellow are proximities with at least 3 observations and a local background-corrected p value <= 0.01, in red are proximities with at least 3 observations and a Benjamini-Hochberg corrected p value <= 0.05.
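The red category in panel (e) combines an observation threshold with a Benjamini-Hochberg corrected p value. The sketch below shows the standard Benjamini-Hochberg adjustment followed by that filter; it is not the code of multiple_testing_correction.pl, the local background correction used for the yellow category is not covered, and the RNA pairs and p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value to the smallest so that adjusted values
    # never increase as the raw p value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical RNA pairs: (name, observed co-barcoding count, raw p value).
pairs = [
    ("pair_1", 5, 0.01),
    ("pair_2", 2, 0.04),
    ("pair_3", 7, 0.03),
    ("pair_4", 3, 0.005),
]
adjusted = benjamini_hochberg([p for _, _, p in pairs])

# Keep pairs with at least 3 observations and an adjusted p value <= 0.05.
significant = [
    name
    for (name, count, _), q in zip(pairs, adjusted)
    if count >= 3 and q <= 0.05
]
print(adjusted, significant)
```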
Samples and read processing statistics.
| Row | Pipeline step | Measure | p2 library | p5 library | p8 library |
|---|---|---|---|---|---|
| 2 | Fixed_Seq_Check | Sequencing reads | 90967875 | 235347955 | 132988019 |
| 3 | | Matching Fixed_Seq | 85407381 | 221586324 | 112960113 |
| 4 | | %Matching Fixed_Seq | 93.9 | 94.2 | 84.9 |
| 5 | Barcode filtering | LC & Adapter Pass | 85213377 | 221124118 | 112733569 |
| 6 | | %LC & Adapter Pass | 99.8 | 99.8 | 99.8 |
| 7 | | Unique barcodes | 8508280 | 33612246 | 22583834 |
| 8 | | Barcode groups | 7161395 | 30135828 | 20674495 |
| 9 | | Reads (row 5) allocated to barcode groups (row 8) | 84883356 | 220848562 | 112687674 |
| 10 | Mapping | Uniquely mapped (incl. rRNA annotation) | 72444996 | 174241156 | 72111191 |
| 11 | | Multi-mapped on RNA repeats | 546788 | 1873537 | 625221 |
| 12 | | Multi-mapped (excl. row 11) | 6609030 | 14837602 | 5038160 |
| 13 | | Unmapped | 5282542 | 29896267 | 34913102 |
| 14 | | %Mapped | 85.99 | 79.74 | 64.55 |
| 15 | Deduplication | Reads for further use (row 10 + 11) | 72991784 | 176114693 | 72736412 |
| 16 | | %PCR deduplication, barcode group size limit (of row 15) | 9.92 | 9.44 | 9.6 |
| 17 | | %Within annotation (of row 16) | 92.89 | 93.55 | 94.84 |
| 18 | | %Proxy reads (of row 17) | 90.48 | 93.59 | 89.27 |
| 19 | | Proxy reads (of row 17) | 5888752 | 13988521 | 5627689 |
| 21 | Summary | Proxy reads (row 19) | 5888752 | 13988521 | 5627689 |
| 22 | | Barcode groups with proxy reads (row 19) | 5047022 | 12993806 | 4768049 |
| 23 | | Reads in barcode group 1 | 4323281 | 12116171 | 4059143 |
| 24 | | Reads in barcode group 2 | 1243424 | 1550542 | 1163214 |
| 25 | | Reads in barcode group 3 | 263778 | 267969 | 319842 |
| 26 | | Reads in barcode group 4 | 49468 | 45988 | 72452 |
| 27 | | Reads in barcode group 5 | 8075 | 7065 | 11970 |
| 28 | | Reads in barcode group 6 | 726 | 786 | 1068 |
| 29 | | %Proxy reads co-barcoded | 26.6 | 13.4 | 14.8 |
Breakdown of read numbers at different steps of CloseCall processing for the sequenced replicate libraries pool 2, pool 5 and pool 8 (p2, p5 and p8). Fixed_Seq: Primer sequences for on-bead barcode amplification. LC: low complexity barcode sequences. Of note, Proximity RNA-seq libraries are sequenced close to saturation (approximately 10% of unique reads remain after deduplication) in order to increase the number of detected co-barcoding events. Up to 25% of proxy reads are co-barcoded.
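Two of the percentage rows can be recomputed from the read counts as a worked check. The sketch below uses values copied from rows 2, 3, 9 and 15 of the table. That %Mapped equals row 15 divided by row 9 is an inference from the numbers rather than something stated in the legend, and the dictionary keys are illustrative names.

```python
# Read counts copied from the table above (rows 2, 3, 9 and 15).
libraries = {
    "p2": {"sequenced": 90967875, "matching": 85407381,
           "grouped": 84883356, "for_further_use": 72991784},
    "p5": {"sequenced": 235347955, "matching": 221586324,
           "grouped": 220848562, "for_further_use": 176114693},
    "p8": {"sequenced": 132988019, "matching": 112960113,
           "grouped": 112687674, "for_further_use": 72736412},
}


def percent(part, whole, digits):
    """Express part as a percentage of whole, rounded to the given digits."""
    return round(100 * part / whole, digits)


# Row 4: reads matching the fixed primer sequence, as a share of all reads.
pct_matching = {
    name: percent(c["matching"], c["sequenced"], 1)
    for name, c in libraries.items()
}

# Row 14 (assumed definition): reads kept after mapping (row 15) as a share
# of reads allocated to barcode groups (row 9).
pct_mapped = {
    name: percent(c["for_further_use"], c["grouped"], 2)
    for name, c in libraries.items()
}

print(pct_matching)
print(pct_mapped)
```

Under these assumptions the recomputed values agree with rows 4 and 14 of the table for all three libraries.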
Fig. 3 Pairwise Spearman rank correlation coefficients between different Proximity RNA-seq libraries. Correlations between different Proximity RNA-seq datasets using the number of proxy reads (upper, right triangle) or the number of detected co-barcoding events between two transcripts (lower, left triangle) as variables. Libraries p2, p5 and p8 are biological replicates with crosslinked nuclear homogenates from SH-SY5Y cells. p1 is a control library with randomly barcoded, non-clonal beads, and p3 is a control with reverse-crosslinked RNA. p4 was prepared in parallel and is comparable in sequencing depth to p1 and p3. p4 was subsequently sequenced more deeply to generate p5.
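For readers who want to reproduce this kind of comparison on the shared data, the Spearman coefficient is the Pearson correlation of the rank-transformed values, with tied values given their average rank. The following is a minimal pure-Python sketch; the per-transcript proxy read counts are hypothetical and the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical proxy read counts for the same five transcripts in two libraries.
library_a = [1, 2, 3, 4, 5]
library_b = [5, 6, 7, 8, 7]
rho = spearman(library_a, library_b)
print(round(rho, 4))
```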
Fig. 4 Validation of Proximity RNA-seq. (a) Transcript structure of pre-45S rRNA. RNA proximities between precursor parts and mature rRNA are indicated by arcs. Arc height and thickness represent -log10 p values with a plateau at 10. (b) Matrix of p values, indicated in matrix cells, for pre-45S rRNA and 5S proximities. White matrix cells signify pairs for which no data were obtained. (c) Matrix of observed/randomised co-barcoding count ratios. The colour gradient represents observed/randomised co-barcoding count ratios, and the numbers in matrix cells are observed co-barcoding counts. (d–g) Candidate transcript-specific proximity networks (p value <= 0.1) in which thicker edge lines indicate proximities with lower p values; purple and bigger nodes represent higher valency transcripts, green nodes lower valency and black nodes middle valency. Grey nodes are transcripts without assigned valency. (d) SNORD17 network. (e) RNase MRP. (f) FGF14. (g) MSI2.
| Measurement(s) | RNA • Proximity |
| Technology Type(s) | RNA sequencing |
| Factor Type(s) | biological replicate |
| Sample Characteristic - Organism | Homo sapiens |