Gloria M Sheynkman, Katharine S Tuttle, Florent Laval, Elizabeth Tseng, Jason G Underwood, Liang Yu, Da Dong, Melissa L Smith, Robert Sebra, Luc Willems, Tong Hao, Michael A Calderwood, David E Hill, Marc Vidal.
Abstract
Most human protein-coding genes are expressed as multiple isoforms, which greatly expands the functional repertoire of the encoded proteome. While at least one reliable open reading frame (ORF) model has been assigned for every coding gene, the majority of alternative isoforms remain uncharacterized due to (i) vast differences in overall expression levels between isoforms expressed from the same gene, and (ii) the difficulty of obtaining full-length transcript sequences. Here, we present ORF Capture-Seq (OCS), a flexible method that addresses both challenges for targeted full-length isoform sequencing applications using collections of cloned ORFs as probes. As a proof-of-concept, we show that an OCS pipeline focused on genes coding for transcription factors increases isoform detection by an order of magnitude when compared to unenriched samples. In short, OCS enables rapid discovery of isoforms from custom-selected genes and will accelerate mapping of the human transcriptome.
Year: 2020 PMID: 32393825 PMCID: PMC7214433 DOI: 10.1038/s41467-020-16174-z
Source DB: PubMed Journal: Nat Commun ISSN: 2041-1723 Impact factor: 14.919
Fig. 1 ORF Capture-Seq (OCS) method for accelerated discovery of full-length isoforms.
a Schematic of the OCS method. ORF clones of target genes are pooled and used as templates for a biotin-dUTP-labeling PCR, creating randomly biotinylated amplicons that are fragmented to generate a probe set. In this study, PCR-based amplicons derived from the clones were used as template. These OCS probes can be used in targeted sequencing applications, such as enrichment of full-length cDNA for sequencing on the PacBio platform. b Transcriptional abundances in human brain cDNA. These values were used as the basis for selecting three low- to moderate-abundance transcription factors (TFs) as target genes (purple labels) and two high-abundance genes (yellow labels) as background controls. Length is an average of all transcripts annotated for each gene (GENCODE v22). TPM values were obtained from processing Illumina sequencing data (Methods). TPM, transcripts per million. c Comparison of IDT vs OCS-based target enrichment. Each bar shows the relative proportion of cDNA from target (purple) versus background (yellow) genes as quantified by qPCR (average of two technical replicates). A total of three individual capture reactions were performed per day (see Supplementary Fig. 1e for full dataset) over two days (Day A, B). Only one of the three reactions is shown in this figure. d Individual gene expression, ranked in descending abundance, as quantified by Illumina sequencing and Kallisto (Methods). Each bar represents one gene. Only the 20 most abundant genes are shown. Bars are color coded as background controls (yellow), target genes (purple), and all remaining genes that were not targeted (gray). On-target percentages are the fraction of transcriptional abundance corresponding to the three targeted TFs (ARNTL, STAT1, ZNF268) in each capturant. Fold enrichment is computed by dividing the percentage of targets in the capturant by the percentage in the unenriched input. Source data are available in the Source Data file.
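The on-target percentage and fold enrichment defined in panel d can be sketched as below. This is a minimal illustration, not the authors' pipeline: the function names, the dictionary layout (gene name mapped to abundance, e.g. TPM), and all numbers are assumptions made for the example, and `OTHER_GENES` is a hypothetical placeholder for everything that was not targeted.

```python
def on_target_percentage(abundance_by_gene, target_genes):
    """Percentage of total abundance (e.g. TPM) contributed by the target genes."""
    total = sum(abundance_by_gene.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    on_target = sum(v for g, v in abundance_by_gene.items() if g in target_genes)
    return 100.0 * on_target / total


def fold_enrichment(input_abundance, capturant_abundance, target_genes):
    """On-target percentage in the capturant divided by that in the unenriched input."""
    pct_input = on_target_percentage(input_abundance, target_genes)
    if pct_input == 0:
        raise ValueError("targets are absent from the input; fold enrichment is undefined")
    return on_target_percentage(capturant_abundance, target_genes) / pct_input


# Hypothetical abundances, invented for illustration only.
targets = {"ARNTL", "STAT1", "ZNF268"}
input_cdna = {"ARNTL": 1.0, "STAT1": 3.0, "ZNF268": 1.0, "OTHER_GENES": 995.0}
capturant = {"ARNTL": 20.0, "STAT1": 50.0, "ZNF268": 10.0, "OTHER_GENES": 20.0}

pct_in = on_target_percentage(input_cdna, targets)
pct_cap = on_target_percentage(capturant, targets)
fold = fold_enrichment(input_cdna, capturant, targets)
print(f"input {pct_in:.1f}% on-target, capturant {pct_cap:.1f}% on-target, fold {fold:.0f}x")
```

Because fold enrichment is a ratio of two percentages, it is undefined when the targets are undetected in the input, which is why the sketch raises an error in that case rather than returning infinity.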
Fig. 2 Benchmarking OCS analytical performance.
a Schematic of the benchmarking experiment using ERCC standards. b Enrichment of ERCC targets. The x-axis represents the nominal concentration of ERCC RNAs spiked into the starting pool of RNA (input) and the y-axis represents the estimated abundance of each ORF in the input cDNA (top row) or capturant (bottom row). Each point represents a distinct ERCC standard (92 in total) which was targeted (light blue) or not targeted (pink). c Summary statistics related to capture efficiencies for ERCC capture reactions. d Schematic of probe synthesis using the SIRV system. e Relationship between enrichment efficiency and isoform overlap. The isoform overlap represents the absolute number of nucleotides overlapping between (i) the template isoform used to generate probes, and (ii) the target isoform present in the sample. Negative controls are non-overlapping isoforms. Capture efficiencies were computed by dividing the read depth of each SIRV (isoform) in the capturant by its read depth in the input cDNA. Error bars, standard deviation. Source data are available in the Source Data file.
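The capture efficiency described for panel e is a simple ratio of read depths, which can be sketched as follows. The function names and the replicate values are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption: the caption states only that error bars show a standard deviation.

```python
from statistics import mean, stdev


def capture_efficiency(capturant_depth, input_depth):
    """Read depth of one isoform in the capturant divided by its depth in the input cDNA."""
    if input_depth <= 0:
        raise ValueError("input read depth must be positive")
    return capturant_depth / input_depth


def summarize_efficiencies(depth_pairs):
    """Mean and sample standard deviation over (capturant_depth, input_depth) pairs."""
    efficiencies = [capture_efficiency(c, i) for c, i in depth_pairs]
    spread = stdev(efficiencies) if len(efficiencies) > 1 else 0.0
    return mean(efficiencies), spread


# Hypothetical read depths for one isoform across three capture reactions.
replicates = [(200.0, 10.0), (300.0, 10.0), (400.0, 10.0)]
avg_eff, sd_eff = summarize_efficiencies(replicates)
print(f"capture efficiency: {avg_eff:.1f} +/- {sd_eff:.1f}")
```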
Fig. 3 Multiplexing parameters for enrichment of human transcription factors.
a Rank abundance bar plots for unenriched (input) and enriched (capturant) cDNA. Data are shown for (i) input brain cDNA, and (ii) the series of capturants prepared using probe sets with increasing numbers of TF genes. Only the top 50 most abundantly expressed genes, calculated per sample, are shown. Each bar corresponds to a single gene, colored by whether that gene is targeted (purple) or not targeted (gray). Fraction of total transcripts was calculated by dividing the transcript abundance (TPM) of all transcripts from a gene by the total transcript abundance for the sample (Methods). On-target rates, as calculated for the entire sample, are displayed at the upper right of each plot. b Absolute number of targeted genes (dark purple) and isoforms (light purple) detected from each capture reaction. c Relationship between the number of genes multiplexed and the fraction of genes for which a full-length read was detected. d As in c, but showing the decrease in the average number of isoforms per targeted gene for each experiment. Source data are available in the Source Data file.
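The per-gene "fraction of total transcripts" and the top-N ranking used in panel a can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the transcript and gene identifiers, the mapping structure, and the TPM values are all hypothetical.

```python
from collections import defaultdict


def gene_fractions(transcript_tpm, transcript_to_gene):
    """Sum transcript TPMs per gene, then divide by the sample-wide total."""
    gene_tpm = defaultdict(float)
    for transcript_id, tpm in transcript_tpm.items():
        gene_tpm[transcript_to_gene[transcript_id]] += tpm
    total = sum(gene_tpm.values())
    if total <= 0:
        raise ValueError("total abundance must be positive")
    return {gene: tpm / total for gene, tpm in gene_tpm.items()}


def top_genes(fractions, n=50):
    """Genes ranked by descending fraction, truncated to the n most abundant."""
    return sorted(fractions.items(), key=lambda item: item[1], reverse=True)[:n]


# Hypothetical transcripts and TPM values, for illustration only.
tpm = {"tx1": 30.0, "tx2": 10.0, "tx3": 60.0}
tx_to_gene = {"tx1": "GENE_A", "tx2": "GENE_A", "tx3": "GENE_B"}

fractions = gene_fractions(tpm, tx_to_gene)
ranked = top_genes(fractions, n=50)
print(ranked)
```

Ranking is computed per sample, so the same gene can fall inside the top 50 of a capturant while being absent from the top 50 of the input.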
Fig. 4 Full-length transcription factor isoforms across diverse human tissues.
a Rank abundance bar plot for cDNA enriched for human TFs. Only the top 50 most abundantly expressed genes are shown. Each bar corresponds to a single gene, colored by whether that gene is targeted (purple) or not targeted (gray). Abundances as shown on the y-axis are computed by dividing the number of full-length reads mapped to the gene by the total number of full-length reads. On-target rates for input and capturant are shown. b Gains in coverage of target genes upon enrichment. Increases in number of genes, isoforms, and full-length reads are shown. These numbers were generated from an equal number of full-length raw reads subsampled from the unenriched (input) and enriched (capturant) cDNA. c Fraction of all GENCODE genes and transcripts detected in the capturant. A gene is considered detected if at least one full-length read is detected for that locus. An isoform is considered detected if its full set of junctions is identical between the GENCODE-annotated and sequenced transcripts. The fraction detected was also computed for genes with higher representation in the probe set (1 TPM or higher) and for genes with evidence of expression in the tissues interrogated (10 TPM or higher). d Fraction of novel splice sites, junctions, and full-length isoforms in the TF enrichment experiment. Unique splice sites and junctions are only counted once. The 5′ splice site corresponds to the splice donor and the 3′ splice site corresponds to the splice acceptor. SS, splice site. e Proportions of known and novel isoforms. Known isoforms are further divided by completeness. Novel isoforms are further divided by whether all splice sites are found in GENCODE (novel in catalog, NIC) or if the isoform contains a novel splice site (novel not in catalog, NNC). Match categories are defined by the isoform annotation tool SQANTI. g Example of isoforms from the gene MITF identified from the TF763 capture experiment.
Source data are available in the Source Data file.
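The known / NIC / NNC distinction described in panels c and e can be illustrated with a simplified sketch. This is not SQANTI's implementation: it assumes each isoform is represented as an ordered tuple of (donor, acceptor) coordinates on a single strand of a single locus, it ignores the "completeness" subdivision of known isoforms, and all coordinates and names are hypothetical.

```python
def splice_sites(annotated_chains):
    """Collect every annotated donor and acceptor coordinate from the junction chains."""
    donors, acceptors = set(), set()
    for chain in annotated_chains:
        for donor, acceptor in chain:
            donors.add(donor)
            acceptors.add(acceptor)
    return donors, acceptors


def classify_isoform(junctions, annotated_chains):
    """Return 'known', 'NIC', or 'NNC' for a multi-exon isoform.

    known: the full junction chain is identical to an annotated chain.
    NIC:   the chain is new, but every splice site is annotated.
    NNC:   at least one donor or acceptor is not annotated.
    """
    chain = tuple(junctions)
    if not chain:
        raise ValueError("mono-exonic transcripts are outside the scope of this sketch")
    if chain in annotated_chains:
        return "known"
    donors, acceptors = splice_sites(annotated_chains)
    all_sites_annotated = all(d in donors and a in acceptors for d, a in chain)
    return "NIC" if all_sites_annotated else "NNC"


# Hypothetical annotation: two isoforms of one gene, as (donor, acceptor) chains.
annotation = {
    ((100, 200), (300, 400)),
    ((100, 200), (350, 400)),
}

label_known = classify_isoform([(100, 200), (300, 400)], annotation)
label_nic = classify_isoform([(100, 400)], annotation)   # new pairing of annotated sites
label_nnc = classify_isoform([(100, 200), (310, 400)], annotation)  # donor 310 is unannotated
print(label_known, label_nic, label_nnc)
```

A production classifier would also need to key splice sites by chromosome and strand; the sketch omits this to keep the matching logic visible.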