Micro-dissection and integration of long and short reads to create a robust catalog of kidney compartment-specific isoforms.

Hongyang Li¹, Ridvan Eksi¹, Daiyao Yi¹, Bradley Godfrey², Lisa R Mathew³, Christopher L O'Connor², Markus Bitzer², Matthias Kretzler¹,², Rajasree Menon¹,², Yuanfang Guan¹,².

Abstract

Studying isoform expression at the microscopic level has always been a challenging task. A classical example is kidney, where glomerular and tubulo-interstitial compartments carry out drastically different physiological functions and thus presumably their isoform expression also differs. We aim at developing an experimental and computational pipeline for identifying isoforms at microscopic structure-level. We microdissected glomerular and tubulo-interstitial compartments from healthy human kidney tissues from two cohorts. The two compartments were separately sequenced with the PacBio RS II platform. These transcripts were then validated using transcripts of the same samples by the traditional Illumina RNA-Seq protocol, distinct Illumina RNA-Seq short reads from European Renal cDNA Bank (ERCB) samples, and annotated GENCODE transcript list, thus identifying novel transcripts. We identified 14,739 and 14,259 annotated transcripts, and 17,268 and 13,118 potentially novel transcripts in the glomerular and tubulo-interstitial compartments, respectively. Of note, relying solely on either short or long reads would have resulted in many erroneous identifications. We identified distinct pathways involved in glomerular and tubulo-interstitial compartments at the isoform level, creating an important experimental and computational resource for the kidney research community.

Year: 2022 PMID: 35468141 PMCID: PMC9037928 DOI: 10.1371/journal.pcbi.1010040

Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475

Introduction

Gene expression patterns at the microscopic level can facilitate the understanding of the function of a tissue [1]. While single-cell sequencing can now provide gene expression levels for individual cells [2], assigning those cells to their respective microscopic structures remains a challenging task. Furthermore, the majority of single-cell sequencing protocols use short reads. While short reads can provide some information about isoform expression levels, the assignment of reads to isoforms is not conclusive. We aim to address this challenge by developing an integrative protocol that micro-dissects microscopic structures of a tissue and then applies both short-read and long-read sequencing to identify isoforms, particularly novel isoforms, in these structures. We used the kidney as a model tissue because of its disease relevance and the clear delineation of its two types of compartments: glomerular and tubulo-interstitial. The glomeruli filter the blood; extra fluid and wastes then pass into the tubules and form urine. Chronic kidney disease is prevalent in 14.5% of the US population [3]. Kidney diseases are traditionally classified based on etiology and pathological appearance, but in reality they are a collection of diseases with unique mechanisms, progression rates, and therapeutic responses despite sharing similar histopathological appearances [4]. Micro-dissected renal biopsy specimens are rich sources of gene expression data for the distinct glomerular and tubulo-interstitial compartments of the kidney. Kidney transcriptomic studies typically use generic transcriptome databases such as Ensembl [5], GENCODE [6], or NCBI [6,7], which are incomplete and miss some of the critical kidney-specific transcripts.
Moreover, when the complete set of annotated transcripts is used, some short reads from expressed annotated transcripts may be assigned to very similar non-expressed annotated transcripts. This results in miscalculated Fragments Per Kilobase of transcript per Million mapped reads (FPKM) values and a loss of power in differential expression analysis, both of which can be remedied by long reads. In addition, alternative splicing (AS) greatly increases the information content of genomes by producing multiple transcripts from a single gene, further complicating the picture. Approximately 95% of multi-exonic human genes undergo AS, producing more than 100,000 distinct transcripts from approximately 20,000 protein-coding genes [8]. Therefore, a complete and reliable database of transcript isoforms that are specifically expressed in the kidney, built from accurate long reads, will improve the characterization of new mechanisms, biomarkers, and therapeutic targets. Recent progress in single-molecule long-read sequencing has provided powerful new tools to resolve the inaccuracies of short-read RNA or DNA sequencing [9-12]. Pacific Biosciences (PacBio) has developed a technique based on Single Molecule Real Time (SMRT) sequencing. The read length of the PacBio RS II platform used in this study is about 10 kb, covering the entire length of most eukaryotic transcripts. This capability allows long sections of genomic DNA or transcripts to be sequenced without fragmentation or PCR amplification. Thus, PacBio’s full-length or nearly full-length transcripts eliminate the need for transcript assembly in downstream analysis. One limitation of the PacBio platform is its relatively high sequencing error rate; fortunately, these errors are randomly distributed across the reads. Recent improvements in the read length and base-calling algorithms of the PacBio SMRT analysis platform, together with the use of circular molecules, have mitigated this error rate.
When the read length exceeds the length of the cDNA template, each base pair is covered on both strands multiple times, and these low-quality base calls are aggregated to derive high-quality, single-molecule Reads of Inserts (ROIs). The PacBio long-read transcriptome sequencing platform (Iso-Seq) has been successfully applied to human and other species, and it has shown that the use of Iso-seq has a significant advantage over short-read RNA-Seq methods for identifying novel isoforms, detecting AS and gene fusion events [13-16]. In recent years, many computational methods have been developed to identify isoforms through analyzing RNA-seq reads. For example, Mandalorion was developed to analyze long reads and identify isoforms at the single-cell level in murine B cells [17]. Many expressed genes, including cell surface receptors, were found to display complex isoforms. Later on, FLAIR was developed to analyze full-length transcripts, and study differential splicing events and isoforms in leukemia samples with and without SF3B1 mutation [18]. This work demonstrates the power of full-length isoform analysis in connecting different alternative splicing events in cancer. Recently, TALON was developed as a step of the ENCODE4 pipeline in analyzing long-read transcriptomes [19]. TALON was used in multiple transcriptomes to identify both known and novel transcripts across datasets and platforms. In this study, we examined human kidney cortex tissue to study the overall transcriptome of glomerular (hereafter referred to as “glo”) and tubulo-interstitial (hereafter referred to as “tub”) compartments using both long reads from PacBio Iso-Seq platform and RNA-seq short reads from Illumina platform. To validate full-length transcripts sequenced by PacBio, we used short reads to generate high-quality sets of transcript isoforms from two distinct clinical sources. 
Then, we compared the confirmed transcript isoforms to the set of known transcript isoforms in GENCODE and provided the final list of expressed transcript isoforms and their annotation status for researchers in downstream kidney transcriptome studies. With this data collection, we identified a large number of novel transcript isoforms and pathways in these two kidney compartments, creating an important resource for the kidney research community. Most importantly, the experimental and computational pipeline developed in this study is promising to be applied to other tissues to acquire the isoform catalog at microscopic levels.

Materials and methods

Ethics statement

The study is approved by the University of Michigan Institutional Review Board (HUM00002468: Expression Analysis in Human Renal Disease). All participants have provided written consent to the study.

RNA extraction from human kidney cortical tissue

We used healthy human kidney cortex cores obtained from five patients who underwent tumor nephrectomies and 20 healthy samples from the European Renal cDNA Bank (ERCB) study [20]. The kidney cores were immediately placed in RNAlater solution at 4°C for 12 to 24 hours and then stored at -20°C [20]. Micro-dissection of glomerular and tubulo-interstitial compartments was performed as previously described [21,22].

RNA library preparation/sequencing using the Illumina platform

For Illumina RNA-seq runs, RNA quality was assessed with the TapeStation (Agilent, Santa Clara, CA) following the manufacturer’s recommended protocols. Samples with RNA Integrity Numbers (RINs) of 8 or higher were prepared using the Illumina TruSeq mRNA Sample Prep v2 kit (Catalog #s RS-122-2001, RS-122-2002, Illumina, San Diego, CA) following the manufacturer’s recommended protocols. Total RNA (0.1–3 μg) was enriched for mRNA using polyA purification, and the mRNA was then fragmented and copied into first-strand cDNA using reverse transcriptase and random primers. The 3′ ends of the cDNA were then adenylated, and the adapters were ligated. One of the adapters carried a six-nucleotide barcode unique to each sample, allowing us to sequence more than one sample in each lane of a HiSeq flow cell (Illumina). The products were purified and enriched by polymerase chain reaction to create the final cDNA library. Final libraries were checked for quality and quantity by TapeStation (Agilent) and by qPCR using Kapa’s library quantification kit for Illumina sequencing platforms (catalog # KK4835, Kapa Biosystems, Wilmington, MA) following the manufacturer’s recommended protocols. The libraries were clustered on the cBot (Illumina) and sequenced on a HiSeq 2000 (Illumina) in High Output mode using version 3 reagents according to the manufacturer’s recommended protocols: four samples per lane on a 50-cycle paired-end run for tumor nephrectomy samples, and one sample per lane on a 100-cycle paired-end run for ERCB samples.

RNA sequencing with the PacBio platform

For the PacBio library, equal proportions of RNA from the tumor nephrectomy samples were pooled to form 500 ng of RNA for each compartment and processed for next-generation sequencing (NGS) library preparation. PacBio sequencing library preparation was done according to the manufacturer’s recommendation for Isoform Sequencing using the Clontech SMARTer PCR cDNA synthesis kit and the BluePippin Size-Selection System. cDNA SMRTbell templates were fractionated into 1–2 kb, 2–3 kb, 3–6 kb, and 5–10 kb size ranges. Sixteen SMRT cells were used in total: two for each size fraction of both the glomerular and tubulo-interstitial compartments. Sequencing was performed on a Pacific Biosciences PacBio RS II by the University of Michigan DNA Sequencing Core. The glomerular and tubulo-interstitial compartments each had eight sequencing cells, generating 132,240 and 125,047 SMRT consensus full-length transcripts (CFLs), respectively. These CFLs were supported by 206,415 and 232,845 SMRT Reads of Inserts (ROIs). These were determined to be full-length based on the presence of the 5’ and 3’ cDNA primer sequences and a polyA tail at the 3’ end.

Sequence generation and alignments

Illumina RNA-Seq reads were aligned to the human genome (hg19 assembly) using STAR [23] version 2.5. STAR was run in 2-pass mode with the parameters suggested under the “ENCODE options” heading in the STAR manual. Pacific Biosciences SMRT raw reads were initially processed using the Pacific Biosciences SMRT analysis software version 2.3.0. The polymerase reads were partitioned into subreads. Reads of Inserts (ROIs) were generated using the default number of polymerase full passes. The Iso-Seq classify tool was then used to separate the ROIs into full-length non-chimeric and non-full-length reads. We defined full-length reads as those containing 5’ and 3’ cDNA primers and polyA tails. Then, the Iso-Seq cluster tool was used to cluster all the full-length reads derived from the same transcript to produce the consensus full-length transcripts (CFLs). CFLs that were not polished by Quiver were used in the rest of the analysis because it had been reported that Quiver polishing sometimes obscured the introns [15]. SMRT CFLs were aligned to the human genome (hg19 assembly) using GMAP [24]. We kept only reads mapping to a single location (argument -n 1). For the next phase of the analysis, the pipeline for identifying transcription start sites, splice junctions, and transcription end sites was adapted from the TRIMD pipeline [15,24]. Single-exon CFLs were excluded from the validation steps that follow, as most of them were potential intronic fragments resulting from unprocessed pre-mRNAs. Single-exon CFLs were added back to the analysis before the collapse step.

Identification of transcription start sites (TSS)

CFL 5′ end clusters were generated from CFL 5′ ends mapping within 8 bp of each other. Only CFL isoforms whose 5′ ends did not contain mismatches were used. A single consensus TSS was determined for each cluster by calculating the weighted average of the start coordinates of the CFLs within the cluster, with weights given by the number of SMRT reads for each start coordinate. A consensus TSS was considered validated if there was an annotated transcription start site within 10 bp [24]. Annotated TSS were extracted from the GENCODE comprehensive annotation set (version 24).
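The clustering and validation rule for 5′ ends can be illustrated with a minimal Python sketch. It is not the authors' code: the function names are hypothetical, and it assumes that "within 8 bp of each other" means consecutive sorted coordinates at most 8 bp apart, that the consensus is rounded to the nearest base, and that the input has already been restricted to one chromosome and strand.

```python
from bisect import bisect_left


def cluster_ends(end_counts, max_gap=8):
    """Cluster end coordinates and return (consensus, total_reads) per cluster.

    end_counts maps a genomic coordinate to the number of supporting reads.
    Assumption: neighbouring sorted coordinates at most `max_gap` bp apart
    belong to the same cluster.
    """
    clusters, current = [], []
    for pos in sorted(end_counts):
        if current and pos - current[-1] > max_gap:
            clusters.append(current)
            current = []
        current.append(pos)
    if current:
        clusters.append(current)

    result = []
    for members in clusters:
        total = sum(end_counts[p] for p in members)
        # Read-count-weighted average of the coordinates in the cluster.
        weighted = sum(p * end_counts[p] for p in members) / total
        result.append((round(weighted), total))
    return result


def is_near_annotation(coord, annotated_sorted, window=10):
    """True if any annotated site lies within `window` bp of `coord`."""
    i = bisect_left(annotated_sorted, coord - window)
    return i < len(annotated_sorted) and annotated_sorted[i] <= coord + window
```

For example, `cluster_ends({1000: 3, 1004: 1, 1010: 2, 1050: 5})` groups the first three coordinates into one cluster and leaves 1050 on its own. The same clustering helper applies to 3′ ends, which the text treats with the same 8 bp rule.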

Identification of splice junctions

Splice junctions from Iso-Seq CFLs were identified using GMAP, and splice junctions for Illumina reads were identified with STAR [23]. A splice junction from an Iso-Seq CFL had to be spanned by at least one full-length read to be identified. It was marked as validated if at least three short reads spanned it or if the junction was already annotated. Annotated junctions were extracted from the GENCODE comprehensive annotation set (version 24).
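The junction validation rule amounts to a simple set filter. The following Python sketch is illustrative only; the function name and the representation of a junction as a `(chromosome, donor, acceptor, strand)` tuple are assumptions, not part of the published pipeline.

```python
def validated_junctions(cfl_junctions, short_read_counts, annotated_junctions,
                        min_short_reads=3):
    """Return the CFL junctions that are annotated or have short-read support.

    cfl_junctions:       iterable of junction tuples seen in Iso-Seq CFLs
    short_read_counts:   dict mapping a junction to its spanning short-read count
    annotated_junctions: set of junctions taken from the annotation
    """
    validated = set()
    for junction in cfl_junctions:
        supported = short_read_counts.get(junction, 0) >= min_short_reads
        if junction in annotated_junctions or supported:
            validated.add(junction)
    return validated
```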

Identification of transcription end sites (TES)

Illumina reads that have poly(A) tails were extracted from SAM alignment files. These putative poly(A) reads are reads whose FLAG code marks them as first-of-pair and that either end with a run of at least five As, at least two of which are mismatched, on the plus strand, or start with a run of at least five Ts, at least two of which are mismatched, on the minus strand. The aligned base next to the mismatched location was considered a candidate TES. TES within 8 bp of each other were considered a single candidate TES. The consensus TES coordinate was determined as a weighted average of the putative 3′ ends, weighted by the number of short reads supporting each TES coordinate. A TES was marked as validated if there was an Illumina TES on the same strand within four bases upstream or ten bases downstream, or if there was an annotated TES within 10 bp [15]. Iso-Seq CFL 3′ ends that aligned within 8 bp of each other on the genome were considered a single candidate TES. The CFL consensus TES were determined by calculating weighted averages of the end coordinates, with weights given by the number of PacBio consensus sequence reads ending at each coordinate. Only putative PacBio 3′ end sites supported by at least three SMRT reads were kept.
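The asymmetric validation window for end sites depends on strand, which is easy to get wrong. A minimal Python sketch follows, under the assumptions that "upstream" means smaller coordinates on the plus strand and larger coordinates on the minus strand, and that the candidate lists have already been restricted to the same chromosome and strand. The function name and arguments are hypothetical.

```python
def tes_is_validated(pacbio_tes, strand, illumina_tes, annotated_tes,
                     upstream=4, downstream=10, annotation_window=10):
    """Check one PacBio consensus TES against short-read and annotated TES.

    illumina_tes and annotated_tes are lists of coordinates on the same
    chromosome and strand as pacbio_tes (an assumption of this sketch).
    """
    if strand == "+":
        low, high = pacbio_tes - upstream, pacbio_tes + downstream
    else:
        # On the minus strand, downstream runs toward smaller coordinates.
        low, high = pacbio_tes - downstream, pacbio_tes + upstream

    if any(low <= site <= high for site in illumina_tes):
        return True
    return any(abs(site - pacbio_tes) <= annotation_window
               for site in annotated_tes)
```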

CFL validation and comparison to known annotations

The TSS, splice junctions, and transcription end sites of each glomerular CFL were compared to coordinates extracted from short reads from ERCB glomerular samples and from annotated transcripts. For the tubulo-interstitial compartment, transcript features were compared separately to short-read RNA-seq data from the matching tumor nephrectomy samples and from ERCB tubulo-interstitial samples. Iso-Seq CFL validation was done by validating every splice junction present in the CFL: an Iso-Seq CFL was considered validated if every splice junction was validated based on the criteria explained above. The two short-read RNA-seq datasets for the tubulo-interstitial compartment were combined for this step. Single-exon CFLs were added to the set of validated multi-exon CFLs, pending further investigation based on their location relative to known transcripts. The combined set of validated multi-exon CFLs and single-exon CFLs was collapsed with the collapse_isoforms_by_sam.py script in the tofu package provided by PacBio, which is the developmental version of the official Iso-Seq protocol. The collapsed isoforms were then compared to the GENCODE comprehensive set of annotations (version 24) with the cuffcompare tool from the Tuxedo suite [25-28].
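The whole-CFL criterion can be sketched as a three-way partition. This Python example is a hedged illustration rather than the authors' implementation; the function name and the dictionary layout (CFL identifier mapped to its list of junctions, empty for single-exon CFLs) are assumptions.

```python
def partition_cfls(cfl_to_junctions, validated):
    """Split CFLs into validated multi-exon, single-exon, and rejected sets.

    cfl_to_junctions: dict mapping a CFL id to its list of junctions
                      (an empty list denotes a single-exon CFL)
    validated:        set of junctions that passed junction-level validation
    """
    validated_multi, single_exon, rejected = [], [], []
    for cfl_id, junctions in cfl_to_junctions.items():
        if not junctions:
            # Single-exon CFLs skip junction validation and are carried along.
            single_exon.append(cfl_id)
        elif all(j in validated for j in junctions):
            validated_multi.append(cfl_id)
        else:
            rejected.append(cfl_id)
    return validated_multi, single_exon, rejected
```

The first two lists together form the input to the collapse step described above.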

Data records

The set of validated and collapsed isoform sequences with their corresponding annotation class based on cuffcompare tool can be found in “Glom_validated_transcripts.fa” (Data Citation 1) for glomerular compartment and in “Tubulo_validated_transcripts.fa” (Data Citation 2) for tubulo-interstitial compartment. Both files are in fasta format. The ID line for each sequence entry contains an internal transcript ID (PB.X.X), chromosome name, location, cuffcompare class code and associated gene’s ENSEMBL gene id.

Technical validation

For short-read RNA-seq runs, only samples with 8 or higher RINs were used. For long-read RNA-seq with the PacBio RS II platform, equal proportions of RNA from the tumor nephrectomy samples were pooled to form 500 ng of RNA for each compartment and processed for next-generation sequencing (NGS) library preparation. Throughout this study, every sample preparation step was done according to the manufacturer’s recommendation.

Results

Long-read sequencing analysis of kidney transcriptome

In this study, we micro-dissected glomerular and tubulo-interstitial compartments from 25 healthy human kidney cortex core samples following [21,22] (for the detailed protocol, please see ). Among the 25 samples, five were from patients who underwent tumor nephrectomies and 20 were healthy samples from the European Renal cDNA Bank (ERCB) study [20]. For the Illumina RNA-seq short reads, we followed TapeStation’s standard protocols to assess RNA quality. We checked the quality of the final cDNA libraries using Kapa’s library quantification kit following the manufacturer’s recommended protocols. The short reads were then mapped to the human genome (hg19) using STAR version 2.5 in 2-pass mode following the standard “ENCODE options” in the STAR manual. For the PacBio long reads, PacBio’s Iso-Seq protocol was used for library preparation and long-read sequencing, and PacBio’s SMRT Portal was used for the initial data analyses. We processed the reads using SMRT analysis software version 2.3.0 and generated reads of inserts using the default number of polymerase full passes. The consensus full-length transcripts (CFLs) were also mapped to hg19 using GMAP. About 98% of the CFLs were uniquely mapped for both kidney compartments. The glomerular and tubulo-interstitial compartments each had eight sequencing cells, from which we generated 132,240 (glo) and 125,047 (tub) SMRT CFLs, validated by two high-quality sets of transcript isoforms from two distinct clinical sources based on short reads. These were determined to be full-length based on the presence of the 5’ and 3’ cDNA primer sequences and a polyA tail at the 3’ end. Glomerular CFLs ranged from 300 to 27,890 bases in length with a mean of 2,446 bases. Tubulo-interstitial CFLs ranged from 300 to 22,765 bases in length with a mean of 2,862 bases. Size fractionation of the library before sequencing reduced the bias toward shorter transcripts.

Overall study design.

The features of every multi-exon consensus full-length transcript (CFL) found in PacBio reads were validated through Illumina short reads from two different RNA-seq datasets. Whole CFLs were then validated at every splice junction before comparison to GENCODE annotation.

We further identified 42,896 and 38,831 putative transcription start sites (TSS) from PacBio glomerular and tubulo-interstitial multi-exon CFLs, respectively. As demonstrated in Fig 2 using the sample gene AIF1, we could not reliably infer TSS from Illumina short reads, so our validation criteria relied only on annotated TSS. The top tracks in the figure are Illumina short reads. From the read coverage of the Illumina reads alone, we cannot pinpoint a location for the TSS, because there is no cliff where read coverage suddenly starts. As a result, we only used TSS that exist in the annotation file. Gene AIF1 has multiple CFLs with varying start positions. During transcript collapsing, CFLs that are identical other than their start positions are merged, and the start coordinate of the CFL with the longest 5’ end is taken as the true 5’ end of the merged transcript. For gene AIF1, we identified 13 TSS and 6 of them were more than 10 bp away from the annotated TSS.

Fig 2

Variation of transcription start and end positions of consensus full-length transcripts.

(A) Thirteen multi-exon consensus full-length transcripts (CFLs) from the AIF1 gene locus. (B) The first exon, with vertical (green) lines marking the locations of annotated transcription start sites (TSS). The thirteen CFLs have a total of 13 different TSS; only 6 of the 13 TSS are within 10 bp of an annotated TSS. (C) The last exon, with a vertical (red) line marking the only annotated transcription end site (TES). The thirteen CFLs have a total of 7 different TES; only 4 of the 7 TES are within 10 bp of the annotated TES.

By mapping PacBio multi-exon CFLs to the genome, we identified 171,742 and 185,185 splice junctions in glomerular and tubulo-interstitial CFLs, respectively. Each splice junction has at least one full-length SMRT read spanning it. We further used short reads to validate the new splice junction findings. The validated set of splice junctions is the union of annotated junctions and splice junctions that have at least three short reads spanning them in the Illumina RNA-Seq data. For the glomerular compartment, approximately 65% of PacBio junctions were validated. In the tubulo-interstitial compartment, using the shallower tumor nephrectomy Illumina RNA-seq data, we validated approximately 64% of all PacBio tubular splice junctions. The deeper ERCB RNA-seq data allowed us to validate an additional 8,824 − 3,105 = 5,719 novel tubulo-interstitial splice junctions.

TSS: transcription start site; TES: transcription end site; CFL: consensus full-length transcript; ERCB: European Renal cDNA Bank; TN: tumor nephrectomy.

We further identified 24,625 and 26,260 putative transcription end sites (TES) in PacBio glomerular and tubulo-interstitial multi-exon CFLs, respectively, through the process explained in Materials and Methods. Then, we extracted short reads containing polyA tails from the Illumina RNA-Seq data and extracted polyA site coordinates from those reads.
We considered putative TES from PacBio validated if they were near either an annotated TES or a polyA site extracted from short reads. Although the variation is not as prevalent as at the 5’ ends, the transcription end sites extracted from PacBio CFLs vary considerably, and this variation is incompletely captured by short reads because only a limited number of reads have polyA tails. We decided to provide the CFLs present in the sample with their original 3’ end locations so that readers can process CFLs with different 3’ UTR lengths according to their study objectives.

CFL validation and collapsing into final structures

Since exact TSS and TES coordinates do not agree well with annotated coordinates, we did not use these transcript features in whole-CFL validation. Our validation criterion required a CFL to have all its junctions validated either by short-read support or by annotation. With this criterion, 53,540 tubulo-interstitial multi-exon CFLs and 47,785 glomerular multi-exon CFLs were marked as validated. Any two isoforms that differed at the 3’ end by more than 100 bp (a threshold defined by PacBio) were considered different isoforms. If two isoforms differed only at their 5’ ends, meaning that one isoform had zero, one, or more additional 5’ exons relative to the other while all remaining exons agreed, the shorter isoform was considered identical to the longer one and was collapsed into it. After collapsing, 45,778 and 58,378 isoforms remained for the tubulo-interstitial and glomerular compartments, respectively.
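The collapse rule can be expressed as a pairwise test on intron chains. The sketch below is a simplified illustration and not the logic of the tofu script itself. It assumes each multi-exon isoform is described by a hypothetical dictionary holding its introns as `(donor, acceptor)` pairs in 5′ to 3′ order and its 3′ end coordinate, and that both isoforms lie on the same chromosome and strand.

```python
def can_collapse(shorter, longer, max_3p_diff=100):
    """Decide whether `shorter` can be merged into `longer`.

    Each isoform is a dict with:
      "introns": list of (donor, acceptor) pairs in 5'->3' order
      "end3":    genomic coordinate of the 3' end
    The shorter isoform may lack 5' exons, so its intron chain must equal
    the 3'-most part of the longer chain, and the 3' ends must differ by
    no more than `max_3p_diff` bp.
    """
    s, l = shorter["introns"], longer["introns"]
    if not s or len(s) > len(l):
        # This sketch only handles multi-exon isoforms.
        return False
    same_3p_chain = l[len(l) - len(s):] == s
    close_3p_end = abs(shorter["end3"] - longer["end3"]) <= max_3p_diff
    return same_3p_chain and close_3p_end
```

Comparing intron chains rather than exon coordinates means the variable 5′ start of the first shared exon does not block a merge, which matches the decision above not to rely on exact TSS positions.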

Comparison of validated isoforms to annotated transcripts

We compared the list of collapsed CFLs to the annotated set of transcripts from GENCODE version 24. The comparison was made with the cuffcompare tool in Cufflinks, which classifies each input transcript into one of twelve distinct classes based on its overlap with annotated transcripts. Seven of these classes were prevalent among the collapsed transcripts from the tubulo-interstitial and glomerular compartments. The “Complete match of intron chain with an annotated transcript” and “Contained within a reference transcript” classes comprise the set of expressed annotated transcripts in the sample. Transcripts in the “A transfrag falling entirely within a reference intron” class were discarded. These are single-exon transcripts, which are most likely by-products of the intron decay process [13]. The other single-exon transcript classes, “Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron” and “Exonic overlap with reference on the opposite strand”, were also discarded. These are possible pre-mRNA fragments that were pulled down because of an inefficient polyA selection step. The class of “Potentially novel isoforms” includes transcripts that share at least one junction with an annotated transcript and either have junctions that do not occur in any annotated transcript or combine annotated junctions in a novel way. Again, readers should be cautious about the exact location of the TSS in these transcripts, as the transcripts may be truncated. The class of “Intergenic transcripts” includes transcripts that map to an intergenic location. Most of these intergenic transcripts are single-exon transcripts and are more likely regulatory non-coding RNAs. The two sets of validated and collapsed isoform sequences with their corresponding annotation classes are provided.
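The class descriptions quoted above correspond to cuffcompare's single-character class codes (`=`, `c`, `j`, `i`, `e`, `x`, `u`). A short Python sketch of the keep/discard logic follows. The grouping labels and function name are illustrative, the input is assumed to be a plain dictionary of transcript identifiers to class codes, and class codes not discussed in the text are simply skipped here, which is an assumption of this sketch.

```python
from collections import Counter

# cuffcompare class codes retained in the catalog, with the label used here.
KEPT_CLASSES = {
    "=": "annotated",          # complete match of intron chain
    "c": "annotated",          # contained within a reference transcript
    "j": "potentially novel",  # shares at least one junction with a reference
    "u": "intergenic",         # unknown, intergenic transcript
}
# Codes discarded in the text: intronic, pre-mRNA-like, opposite-strand overlap.
DISCARDED_CLASSES = {"i", "e", "x"}


def summarize_classes(transcript_codes):
    """Count kept transcripts per label; discarded and other codes are skipped."""
    counts = Counter()
    for code in transcript_codes.values():
        label = KEPT_CLASSES.get(code)
        if label is not None:
            counts[label] += 1
    return dict(counts)
```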

Example novel isoforms and novel intergenic transcripts identified by PacBio long-read sequencing

NPHS2 (podocin) is a protein-coding gene located on chromosome 1. This gene has 8 exons and two protein-coding splice variants according to the GENCODE database. NCBI’s RefSeq lists three other predicted protein-coding splice variants. Our validated set of glomerular isoforms has 11 different splice variants for this gene. Two of these precisely match variants in GENCODE, and another two match predicted splice variants in RefSeq. Four other splice variants have the same first and last exons, but their different exon combinations make them novel variants. The remaining three novel splice variants have alternative end sites. Among the seven novel splice variants, there is one novel junction, which is supported by multiple short reads.

Gene model for podocin gene NPHS2.

(a) Transcript structures of the two annotated transcripts of NPHS2 in GENCODE. The second, shorter annotated transcript is missing exon 5. (b) Transcripts found in the PacBio glomerular sample and validated by our method. The first two isoforms (dark green) match exactly the annotated transcripts in GENCODE. The next two isoforms (light green) match two predicted transcript variants in NCBI’s RefSeq annotation (XM_017002298.1 and XM_005245483.3). The remaining seven isoforms (gray) are potential novel transcript variants of NPHS2. The number of reads uniquely mapped to each novel junction is noted above the junction.

Among the final set of isoforms, there are 4,208 and 9,501 intergenic transcripts from the tubulo-interstitial and glomerular compartments, respectively. The majority of these transcripts have a single exon and therefore no junctions that require validation. There are 76 and 55 multi-exon intergenic transcripts in the tubulo-interstitial and glomerular compartments, respectively. Because they passed the validation criteria, all the junctions in these transcripts are supported by multiple short reads.

Comparison of the expressed set of annotated transcripts in glomerular and tubulo-interstitial compartments

Glomerular and tubulo-interstitial compartments are expected to have distinct transcriptome profiles. In this section, we compared the sets of expressed annotated transcripts from the two compartments. 3,993 and 13,536 annotated transcripts are expressed in these compartments, of which 8,198 are common transcripts. We performed KEGG pathway enrichment on the transcripts that are uniquely expressed in each compartment. A full list of enriched pathways is in S1 Table. The top three enriched pathways for each compartment are as follows. Glomerular-only expressed transcripts are enriched for non-alcoholic fatty liver disease (NAFLD), RAP1 signaling pathway, and ubiquitin-proteasome pathway. Tubulo-interstitial transcripts are enriched in metabolic, ribosomal, and aldosterone-induced sodium reabsorption pathways.
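The compartment comparison reduces to set operations on transcript identifiers. A minimal Python sketch with made-up placeholder identifiers is shown here; the function name is hypothetical and the identifiers are not taken from the study's data.

```python
def compare_compartments(glo_transcripts, tub_transcripts):
    """Return compartment-specific and shared annotated transcript IDs."""
    glo, tub = set(glo_transcripts), set(tub_transcripts)
    return {
        "glo_only": glo - tub,   # input to glomerular enrichment analysis
        "tub_only": tub - glo,   # input to tubulo-interstitial enrichment analysis
        "common": glo & tub,
    }


# Placeholder identifiers for illustration only.
result = compare_compartments(["tx_A", "tx_B", "tx_C"], ["tx_B", "tx_C", "tx_D"])
```

Note that the shared set can never be larger than either input set, which is a quick sanity check to run on any reported Venn diagram counts.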

Annotated transcripts and enriched KEGG pathways.

(a) Venn diagram of expressed annotated transcripts from glomerular and tubulointerstitial compartments. (b) Top three KEGG pathways and corresponding p-values for the set of transcripts enriched only in glomerular and tubulointerstitial compartments. (c) Illustrated pathways enriched in the two compartments of the kidney.

Discussion

Recent advances in computational methods, such as Mandalorion [17], FLAIR [18] and TALON [19], have shown great promise for unveiling biological insights into the complex alternative splicing events and isoforms in a variety of cell types and tissues. To address the specific situation in our work, we developed an in-house pipeline to integrate both long and short RNA-seq reads across individuals. This pipeline enables us to validate reported isoforms as well as discover new isoforms in two kidney compartments. In this study, we aimed to explore the potential of combining long- and short-read sequencing information to create catalogs of expressed isoforms for microscopic structures. Towards this goal, we used human kidney tissues, which consist of structures with distinct physiological functions. By micro-dissecting glomerular and tubular compartments, we generated biosamples of which one half went to short-read sequencing and the other half to long-read sequencing. We designed a pipeline that first uses long reads to identify all putative isoforms and then uses short reads to confirm their junctions. This approach is distinct from current single-cell sequencing efforts and traditional bulk sequencing in that the study target is a microscopic structure carrying out a specific physiological function. We found tens of thousands of novel transcripts in glomerular and tubular compartments validated by both long- and short-read sequencing, creating a rich repertoire of transcripts for future functional studies. Preliminary analysis of these transcripts demonstrates drastically different pathway enrichment between the glomerular and tubular compartments, supporting the success of this approach. We envision several future directions of investigation. First, such experimental protocols for microscopic investigation of isoforms can be carried out in other tissues.
Second, many novel transcripts were identified in this study; connecting them to isoform function prediction methods, e.g., [29], will help us elucidate the pathways in which these isoforms act. Third, connecting the expression patterns from these microscopic structures to single-cell level sequencing can help us understand the variability of the expression patterns of these isoforms. Fourth, the focus of this paper is on healthy tissues. It will be interesting to study the isoform changes between the healthy state and the disease state of the same microscopic structures. We foresee that all of these will be exciting opportunities once such microscopic dissection followed by isoform identification with long and short reads becomes well accepted in the medical field.

Distribution of the lengths of the consensus full-length transcripts.

(TIF)

Compartment-specific gene expression enrichment analysis.

(XLSX)

24 Oct 2021

Dear Dr. Guan,

Thank you very much for submitting your manuscript "Micro-dissection and integration of long and short reads to create a robust catalog of kidney compartment-specific isoforms" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. In light of the reviews (below this email), we would like to invite the resubmission of a significantly-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent to reviewers for further evaluation. Please prepare and submit your revised manuscript within 60 days.

Sincerely,
Katalin Susztak, Guest Editor, PLOS Computational Biology
Ilya Ioshikhes, Deputy Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: In this paper, the authors used both PacBio Iso-Seq long reads and Illumina RNA-seq short reads’ information to analyze and identify expressed isoforms for defining microscopic structures. They introduced a pipeline, in which they used long-reads of kidney tissues from two compartments: glomerular and tubule-interstitial and validated the reads by RNA-seq reads of the same tissue. From their experiment, they identified a substantial number of transcription start sites (TSSs), transcription end sites (TESs) and Splice junctions from a filtered set of high quality transcripts. The manuscript is well written and easy to understand. The figures and tables are also clear and elaborate. However, I have some concerns about the work:

Major:
1. The novelty of the study is limited. The authors use the existing methods to analyze the Iso-seq and RNA-seq data, and identify expressed annotated and unannotated isoforms. The contribution to the computational biology field is not clear. This study can be easy applied to other tissue samples with both matched RNA-seq and Iso-seq data.
2. hg19 annotation is quite old (2009). It would be better to run the experiments based on the latest hg38 annotation.

Minor:
1. Abstract line 2: insterstitial -> interstitial
2. Abstract line 5: microdissed -> microdissected?

Reviewer #2: The paper titled “Micro-dissection and integration of long and short reads to create a robust catalog of kidney compartment-specific isoforms” describes an developed experimental and computational pipeline for identifying isoforms at microscopic structure-level. It applies Pacific Biosciences SMRT analysis software and Illumina reads approach to discover novel transcripts. Although it is a promising approach, current manuscript lacks details and interpretations.

1. The authors claimed that they developed the approach for isoform identification. However, as far as I know, there are some additional methods that have been developed recent years, such as Mandalorion (Byrne et al. Nat. Comm. 2017) FLAIR (Tang et al, Nat. Comm. 2020), TALON (Wyman et al. biorxiv). I do not mean that they need to compare to all existing methods, but it is important that a comparison is performed to demonstrate the performance of this developed approach.
2. In general, it would be better to describe with more details of read correction, transcript assembly, and transcript quantification in the results part of the manuscript to illustrate the power of this approach and the reason that this approach performs good, such as what are the advantages and what are the drawback of this approach.
3. To test the accuracy of this approach, it would be better to provide the rate of false discovered isoforms and illustrate the reason of these false discovered isoforms.
4. For enrichment results, it would be better to show P-values, gene/transcript count for each pathway, and top pathways in one Figure.
5. The Introduction and Discussion sections are not comprehensive and do not present readers a view of the field. The authors should expand it and describe what are already available and what are the specific features of existing methods for PacBio data. Some recently published methods on novel isoform discovery are not cited. While I understand that this paper focuses on application of PacBio data, it is still important to summarize state-of-the-art methods to present a comprehensive view of the current state of art.
6. The font size in Figures is too small to read

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available? The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file).

Reviewer #1: No: Illumina RNA-seq and PacBio Iso-seq of the 25 samples are not provided.
Reviewer #2: Yes

Do you want your identity to be public for this peer review?

Reviewer #1: No
Reviewer #2: No

9 Dec 2021

Submitted filename: response.pdf

19 Mar 2022

Dear Dr Yuanfang Guan,

We are pleased to inform you that your manuscript 'Micro-dissection and integration of long and short reads to create a robust catalog of kidney compartment-specific isoforms' has been provisionally accepted for publication in PLOS Computational Biology. Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email.

Best regards,
Katalin Susztak, Guest Editor, PLOS Computational Biology
Ilya Ioshikhes, Deputy Editor, PLOS Computational Biology

Reviewer's Responses to Questions

Comments to the Authors:

Reviewer #1: The authors have responded well to the previous critiques.

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: None

Do you want your identity to be public for this peer review?

Reviewer #1: No

11 Apr 2022

PCOMPBIOL-D-21-01107R1
Micro-dissection and integration of long and short reads to create a robust catalog of kidney compartment-specific isoforms

Dear Dr Guan,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

With kind regards,
Olena Szabo
PLOS Computational Biology

Table 1

Validation of PacBio transcript features from kidney tissue.

	TSS	Splice Junctions	TES	Validated Multi-exon CFLs
Glomerular
PacBio with Illumina ERCB
Annotation only	3732	1194	4850	47785
Short reads only		5583	39
Short reads or annotation		111506	5071
Total	38831	171742	26260	71492
Tubulointerstitial
PacBio with Illumina TN
Annotation only	3534	7329	5278	53540
Short reads only		3105	8
Short reads or annotation		119932	5318
PacBio with Illumina ERCB
Annotation only		516	4883
Short reads only		8824	104
Short reads or annotation		125651	5425
Total	42896	185185	24625	75573

TSS, transcription start site; TES, transcription end site; CFL, consensus full-length transcript; ERCB, European Renal cDNA Bank; TN, tumor nephrectomy.

Table 2

Classification of validated-collapsed isoforms to GENCODE annotation using the Cuffcompare tool.

	Type of Match	Validated Tubulo-interstitial Isoforms	Validated Glomerular Isoforms
1	Complete match of intron chain with an annotated isoform	10407	10882
2	Contained within a reference isoform	3852	3857
Total annotated transcripts		14259	14739
3	Potentially novel isoform	8910	7767
4	Intergenic transcript	4208	9501
Total novel transcripts		13118	17268
5	A transfrag falling entirely within a reference intron	11407	16627
6	Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron	3729	4956
7	Exonic overlap with reference on the opposite strand	2310	3420
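
As a quick consistency check on Table 2, the sketch below re-derives the annotated and novel totals from the per-class counts. The counts are copied from the table. The single-character keys are the standard Cuffcompare class codes that correspond to each description; the table itself does not print them, so their use here, along with the dictionary layout and the helper name `total`, is an assumption made for illustration.

```python
# Per-class counts from Table 2: (tubulo-interstitial, glomerular).
# Keys are Cuffcompare class codes matching the table's descriptions.
table2 = {
    "=": (10407, 10882),  # complete match of intron chain
    "c": (3852, 3857),    # contained within a reference isoform
    "j": (8910, 7767),    # potentially novel isoform
    "u": (4208, 9501),    # intergenic transcript
    "i": (11407, 16627),  # transfrag entirely within a reference intron
    "e": (3729, 4956),    # single-exon transfrag overlapping exon and intron
    "x": (2310, 3420),    # exonic overlap on the opposite strand
}


def total(codes):
    """Sum both compartments over the given class codes."""
    tubulo = sum(table2[code][0] for code in codes)
    glomerular = sum(table2[code][1] for code in codes)
    return tubulo, glomerular


annotated = total("=c")  # rows 1 and 2
novel = total("ju")      # rows 3 and 4
print("annotated:", annotated)
print("novel:", novel)
```

The sums reproduce the totals printed in the table (14259 and 14739 annotated, 13118 and 17268 novel), which also match the numbers quoted in the abstract.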
