Details in the evaluation of circular RNA detection tools: Reply to Chen and Chuang.

Xiangxiang Zeng1,2, Wei Lin2, Maozu Guo3, Quan Zou4.   

Year:  2019        PMID: 31022173      PMCID: PMC6527241          DOI: 10.1371/journal.pcbi.1006916

Source DB:  PubMed          Journal:  PLoS Comput Biol        ISSN: 1553-734X            Impact factor:   4.475


Chia-Ying Chen and Trees-Juen Chuang (referred to as CYC & TJC below) recently submitted their comment [1] on our previous paper [2]. In their paper, they scrutinized the CircBase [3] candidates that we used and pointed out several weak points of our paper. In summary, they suggested that the positive dataset we derived from CircBase required further evaluation. They also indicated that using all of these candidates as our dataset was not appropriate. They further suggested that three main confounding factors may affect our assessment of circRNA detection tools and that the performance of the tools should be re-evaluated.

Before we discuss their comment, we briefly introduce the positive dataset we used. First, as stated in our previous paper, the 14,689 candidates detected in HeLa cells were downloaded from CircBase and were reported by the study of Salzman et al. [4]. These candidates were not identified with the find_circ [5] tool. As described in the study of Salzman et al. [4], all UCSC annotated exons in scrambled order were used to construct a custom database and identify circRNA candidates. Second, in our positive dataset, simulated paired-end reads were generated for each candidate at a constant coverage of 10× for the intervening sequence, with a minimum of two read pairs crossing the back-spliced junction site.

We now discuss the three confounding factors listed in their paper. First, they suggested removing 1046 candidates with unannotated exon boundaries from the positive dataset, especially candidates without canonical splice signals, such as GT-AG, GC-AG, or AT-AC, at the junctions. As mentioned above, the CircBase-deposited circRNA candidates that we used were identified by Salzman et al. [4]; the candidates identified by their method should all match the exon boundaries. The discrepancies may be caused by the use of different gene annotation files. Salzman et al. [4] used UCSC known genes [6], whereas CYC & TJC used NCBI RefSeq-identified mRNA annotation files. We manually checked several candidates marked with “junctions with unannotated exon boundaries” in CYC & TJC’s Supplemental Dataset S1. The junction sites of these candidates were annotated as exon boundaries in the UCSC known genes annotation file (http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz). Thus, detection of circRNAs with annotated exon boundaries relies on the gene annotation files used, and novel candidates may be missed because of the incompleteness of the current database [7].

For example, Szabo et al. [7] reinforced an annotation-based algorithm with a de novo module and discovered a validated circRNA from the not-fully-annotated RMST gene and several U12 circRNAs produced from unannotated boundaries. Such a case was also demonstrated by Xiao-Ou Zhang et al. [8]. They detected thousands of novel exons (non-RefSeq, non-Ensembl, or non-UCSC known genes) in circRNAs by using an updated CIRCexplorer2 tool, and several of them were confirmed by Northern blot analysis and Sanger sequencing after RT-PCR [8]. Other examples were shown by Salzman et al. [4], who found that several noncoding RNA genes expressed circular isoforms in mouse and human. Gao et al. also provided evidence of intronic or intergenic circRNAs [9]. Moreover, the well-known CDR1as [5, 10] is an intergenic circRNA by definition.

In studying the mechanism of circularization, Starke et al. observed that both canonical splice sites are essential; however, they could not rule out the potential use of cryptic sites for circularization [11]. Their experimental data showed that when the normal 5′ or 3′ splice site was mutated, circRNAs could still be formed with the use of cryptic, noncanonical 5′ and 3′ splice sites [11]. Given the above-mentioned evidence, excluding candidates with unannotated exon boundaries or without canonical splice sites remains open to discussion.
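The canonical-signal criterion discussed above (GT-AG, GC-AG, or AT-AC at the junction) can be illustrated with a minimal sketch. It assumes a toy in-memory chromosome string and 0-based, half-open circRNA coordinates on the forward strand; the function names and the coordinate convention are illustrative assumptions and are not taken from CYC & TJC's pipeline or from any of the tools evaluated.

```python
# Canonical (donor, acceptor) intron dinucleotide pairs named in the text.
CANONICAL_SIGNALS = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def junction_signal(chrom_seq, start, end, strand):
    """Return the (donor, acceptor) dinucleotides flanking a back-spliced junction.

    chrom_seq  : chromosome sequence as a plain string (hypothetical toy input)
    start, end : 0-based, half-open circRNA coordinates; start >= 2 is assumed
    strand     : "+" or "-"
    """
    left = chrom_seq[start - 2:start].upper()   # 2 nt just before the circle
    right = chrom_seq[end:end + 2].upper()      # 2 nt just after the circle
    if strand == "+":
        # Acceptor (e.g. AG) sits before the first exon, donor (e.g. GT) after the last.
        return right, left
    # On the minus strand the roles swap and the bases are reverse-complemented.
    return revcomp(left), revcomp(right)


def has_canonical_signal(chrom_seq, start, end, strand):
    """True if the junction is flanked by GT-AG, GC-AG, or AT-AC."""
    return junction_signal(chrom_seq, start, end, strand) in CANONICAL_SIGNALS
```

A filter of this kind flags only the flanking dinucleotides; whether the boundary counts as "annotated" still depends on which annotation file (for example UCSC known genes or RefSeq) is consulted, which is the point made in the text.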
Second, they suggested the removal of 2316 candidates for which the concatenated exon sequences flanking the back-spliced junction sites exhibited ambiguous alignments. We checked these candidates on HeLa and Hs68 samples. As shown in Table 1, we found that some of them were not depleted (≥ onefold enrichment) or were even significantly enriched (≥ fivefold enrichment) after RNase R treatment. (A detailed discussion of two examples can be found in Section I of the Supplementary File.) Therefore, it is inappropriate to suggest that all of the candidates with ambiguous alignments are false calls and should be excluded from the analysis. However, sequencing reads produced from these candidates may result in multiple hits because of their ambiguous alignments, and it is important to take into account factors such as sequencing base quality, alignment mismatches, the minimum number of bases overhanging both sides of the junction sites, and the mapping uniqueness of the supporting back-spliced junction reads [7].
Table 1

‘2316 ambiguous CircBase circRNAs’ on HeLa and Hs68 samples.

Dataset  HeLa                                                                Hs68
Tools    RNaseR-  RNaseR+  Not depleted  Percent (%)  Enriched  Percent (%)  RNaseR-  RNaseR+  Not depleted  Percent (%)  Enriched  Percent (%)
CF       110      168      79            71.82        24        21.82        102      407      85            83.33        59        57.84
CE       110      167      79            71.82        24        21.82        103      407      86            83.50        60        58.25
CIRI     148      217      111           75.00        27        18.24        126      390      112           88.89        81        64.29
DCC      96       137      64            66.67        17        17.71        91       330      81            89.01        53        58.24
FC       82       91       41            50.00        11        13.41        52       227      40            76.92        32        61.54
KNIFE    170      199      111           65.29        25        14.71        131      395      109           83.21        73        55.73
MS       88       122      61            69.32        11        12.50        70       249      61            87.14        42        60.00
NCLS     34       37       19            55.88        3         8.82         23       100      16            69.57        11        47.83
PF       186      206      114           61.29        26        13.98        141      449      118           83.69        84        59.57
SG       178      213      105           58.99        23        12.92        137      366      80            58.39        56        40.88
UB       55       63       20            36.36        3         5.45         29       64       5             17.24        4         13.79

Note: Candidates with ≥ 2 supporting back-spliced junction reads were used in the analysis, and the number of supporting reads was normalized by sequencing depth before fold change calculation. After RNase R treatment, detected candidates with ≥ onefold enrichment were defined as ‘Not depleted’, while candidates with ≥ fivefold enrichment were regarded as ‘Enriched’. CF: circRNA_finder; CE: CIRCexplorer; FC: find_circ; MS: MapSplice; SG: Segemehl; NCLS: NCLScan; PF: PTESFinder; UB: UROBORUS.
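The classification described in the note to Table 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the reads-per-million scaling, the function names, and the input layout are hypothetical choices made here, since the note specifies only that supporting read counts were normalized by sequencing depth before the fold change was computed. The ≥ 2 supporting-read filter is not modelled.

```python
def fold_enrichment(reads_untreated, reads_treated, depth_untreated, depth_treated):
    """Depth-normalized fold change of back-spliced junction reads.

    reads_*  : supporting back-spliced junction read counts (RNaseR- / RNaseR+)
    depth_*  : total sequenced reads in each library
    Reads-per-million scaling is an assumption made for this sketch.
    """
    rpm_untreated = reads_untreated / depth_untreated * 1e6
    rpm_treated = reads_treated / depth_treated * 1e6
    return rpm_treated / rpm_untreated


def classify(fold):
    """Return (not_depleted, enriched) using the thresholds from the note.

    'Not depleted' means >= onefold; 'Enriched' means >= fivefold, so every
    enriched candidate is also counted as not depleted.
    """
    return fold >= 1.0, fold >= 5.0
```

For instance, a candidate with 4 junction reads in a 2-million-read untreated library and 12 junction reads in a 3-million-read RNase R-treated library has a twofold enrichment, so it would count as not depleted but not as enriched.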

Third, they suggested that “unqualified reads” with ambiguous alignments and the different supporting-read counting methods of the tools affected our reported results. We would first like to clarify that the results of CIRI, MapSplice, and find_circ that we provided in our previous paper [2] only included candidates with ≥ 2 supporting back-spliced junction reads because of the limited output of the three tools under their default parameter settings. Thus, no circRNAs with a single supporting read for these tools are included in Fig 3B of CYC & TJC’s comment paper. If candidates with one supporting read were reported by the three tools, then the total number of CircBase circRNAs identified by all 11 tools would be expected to exceed 3580 events (Fig 3B of CYC & TJC’s comment paper). As for “unqualified reads”, the 4 reads they listed in Fig 3C of their paper were back-spliced junction reads generated by CIRI-simulator [9] to support this circRNA. (A detailed discussion of two of these reads can be found in Section II of the Supplementary File.) As for the “different counting methods” used by different tools, these may affect the detection of small circRNAs. If the spliced length of a candidate is smaller than the insert size of the sequencing library, then both mates of a paired-end read may cross the back-spliced junction site.
If both mates of a paired-end read cross the back-spliced junction site, then this case benefits all tools because it increases the opportunities to detect the back-spliced junction event. For Fig 4 of our previous paper, by focusing our analysis on common candidates with spliced length exceeding the insert size of the sequencing library, we eliminated the influence of different counting methods. For Table 1 of our previous paper, we generated sufficient (≥ 2) back-spliced junction reads for each circRNA in the positive dataset. Keeping candidates with ≥ 2 supporting reads for further analysis is also common practice [5, 9, 12, 13], although reliable methods to reduce false-positive circRNAs remain to be developed. In summary, it is feasible to assess the sensitivity of each tool by keeping candidates with ≥ 2 supporting reads (Table 1 & Fig 4 of our previous paper).

Finally, CYC & TJC emphasized that either both RTase- and non-RTase-based experiments or at least two different types of RTase-based experiments should be conducted to validate the authenticity of the circRNA candidates. We believe that the origins (different tissues/cell lines) of our collected circRNAs will not affect the fairness of our evaluation. However, we acknowledge that not all of the 282 circRNAs, which we compiled from 17 published studies, were validated using the methods indicated by CYC & TJC; circRNAs validated in this way should be collected where possible.

In our previous paper [2], to evaluate the performance of 11 circRNA detection tools, we generated a synthetic positive dataset from 14,689 candidates deposited in CircBase [3] that were previously identified from HeLa cells by using an annotation-based method [4]. Although the authenticity of these candidates remains to be verified, they should all match the exon boundaries annotated in the UCSC knownGene database [6]. In CYC & TJC’s comment paper, they further scrutinized these candidates.
After analysis, they suggested that three main confounding factors may compromise the fairness of our assessment. First, they suggested the removal of candidates with unannotated exon boundaries, particularly those without canonical splice sites. Second, they suggested excluding candidates with ambiguous alignments. As discussed in a previous study [14] and also shown by our data, although these heuristic filtering steps can eliminate particular types of false positives, they may create blind spots and reduce sensitivity. Third, they suggested that our evaluation of the tools was affected by unqualified reads with ambiguous alignments and by different supporting-read counting methods. However, all the unqualified reads listed in Fig 3C of the comment paper are back-spliced junction reads generated by CIRI-simulator [9]. The discrepancies may be caused by the failure of BLAT [15] to detect supporting reads of which only a small portion spans the back-spliced junction sites. In our previous paper, prior to further analysis, relevant steps were adopted to minimize the effect of different counting methods. In summary, CYC & TJC underlined several knowledge-based filtering steps and an experimental validation method to address the bioinformatic and experimental challenges in detecting circRNAs, but whether these heuristic filtering steps should be enforced still requires further discussion.

Finally, we reanalyzed the positive and mixed datasets with their suggested removal of ‘uncertain circRNA candidates’. The data in Table 1 of our previous paper were updated as Table 2 below. In general, our previous conclusions drawn from these two datasets are robust to the change.
Table 2

Summary of accuracy measures on the positive and mixed datasets.

Datasets Positive                                    Mixed
Tools    #Detected  TP     S (%)  P (%)  F1    AUC   #Detected  TP     S (%)  P (%)  F1    AUC
CIRI     10714      10686  92.29  99.74  0.96  0.92  10850      10668  92.13  98.32  0.95  0.92
CF       8186       8109   70.03  99.06  0.82  0.70  8239       8109   70.03  98.42  0.82  0.70
DCC      7506       7460   64.43  99.39  0.78  0.64  7510       7460   64.43  99.33  0.78  0.64
FC       9085       9035   78.03  99.45  0.87  0.78  9795       9035   78.03  92.24  0.85  0.66
SG       11381      10677  92.21  93.81  0.93  0.89  12126      10308  89.02  85.01  0.87  0.84
CE       9970       9936   85.81  99.66  0.92  0.86  9972       9936   85.81  99.64  0.92  0.86
MS       8208       8168   70.54  99.51  0.83  0.70  8206       8159   70.46  99.43  0.82  0.70
UB       8434       7985   68.96  94.68  0.80  0.67  8517       7500   64.77  88.06  0.75  0.58
KNIFE    11406      11360  98.11  99.60  0.99  0.98  11819      11300  97.59  95.61  0.97  0.92
PF       10496      10458  90.32  99.64  0.95  0.90  10524      10465  90.38  99.44  0.95  0.90
NCLS     7218       7214   62.30  99.94  0.77  0.62  7220       7216   62.32  99.94  0.77  0.62

TP: true positives; S: sensitivity; P: precision; F1: F1 score; AUC: area under precision/recall curve; CF: circRNA_finder; CE: CIRCexplorer; FC: find_circ; MS: MapSplice; SG: Segemehl; NCLS: NCLScan; PF: PTESFinder; UB: UROBORUS. Note: there were a total of 11579 true positives in these two datasets after we removed the ‘uncertain circRNA candidates’ listed in CYC & TJC’s comment paper. Merging the data from their Dataset S1 and Dataset S2 files yielded 3110 such candidates rather than 3150 (probably a typo in their paper).
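The S, P, and F1 columns of Table 2 follow from the #Detected and TP counts together with the total of 11579 true positives given in the note. A minimal sketch using the standard definitions is shown below; the function name is a hypothetical choice, and the AUC column is not reproduced because it depends on per-candidate scores that are not given here.

```python
TOTAL_TRUE_POSITIVES = 11579  # total stated in the note to Table 2


def accuracy_measures(detected, tp, total_true=TOTAL_TRUE_POSITIVES):
    """Return (sensitivity %, precision %, F1) rounded as in Table 2.

    detected   : number of candidates reported by a tool (#Detected)
    tp         : number of reported candidates that are true positives (TP)
    total_true : number of true circRNAs in the dataset
    """
    sensitivity = tp / total_true
    precision = tp / detected
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return round(100 * sensitivity, 2), round(100 * precision, 2), round(f1, 2)


# CIRI on the positive dataset: 10714 detected, 10686 true positives.
print(accuracy_measures(10714, 10686))  # (92.29, 99.74, 0.96)
```

The removed-candidate count is also consistent with the dataset size: 14,689 original candidates minus 3110 removed candidates leaves 11579.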

Supplementary File. (I) Examples of not-depleted or even enriched “ambiguous CircBase circRNAs” after RNase R treatment. (II) Examples of back-spliced junction read pairs being mistaken as “unqualified reads”. (DOCX)
  14 in total

1.  The UCSC Known Genes.

Authors:  Fan Hsu; W James Kent; Hiram Clawson; Robert M Kuhn; Mark Diekhans; David Haussler
Journal:  Bioinformatics       Date:  2006-02-24       Impact factor: 6.937

2.  Exon circularization requires canonical splice signals.

Authors:  Stefan Starke; Isabelle Jost; Oliver Rossbach; Tim Schneider; Silke Schreiner; Lee-Hsueh Hung; Albrecht Bindereif
Journal:  Cell Rep       Date:  2014-12-24       Impact factor: 9.423

Review 3.  Detecting circular RNAs: bioinformatic and experimental challenges.

Authors:  Linda Szabo; Julia Salzman
Journal:  Nat Rev Genet       Date:  2016-10-14       Impact factor: 53.242

4.  Natural RNA circles function as efficient microRNA sponges.

Authors:  Thomas B Hansen; Trine I Jensen; Bettina H Clausen; Jesper B Bramsen; Bente Finsen; Christian K Damgaard; Jørgen Kjems
Journal:  Nature       Date:  2013-02-27       Impact factor: 49.962

5.  MapSplice: accurate mapping of RNA-seq reads for splice junction discovery.

Authors:  Kai Wang; Darshan Singh; Zheng Zeng; Stephen J Coleman; Yan Huang; Gleb L Savich; Xiaping He; Piotr Mieczkowski; Sara A Grimm; Charles M Perou; James N MacLeod; Derek Y Chiang; Jan F Prins; Jinze Liu
Journal:  Nucleic Acids Res       Date:  2010-08-27       Impact factor: 16.971

6.  Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development.

Authors:  Linda Szabo; Robert Morey; Nathan J Palpant; Peter L Wang; Nastaran Afari; Chuan Jiang; Mana M Parast; Charles E Murry; Louise C Laurent; Julia Salzman
Journal:  Genome Biol       Date:  2015-06-16       Impact factor: 13.583

7.  circBase: a database for circular RNAs.

Authors:  Petar Glažar; Panagiotis Papavasileiou; Nikolaus Rajewsky
Journal:  RNA       Date:  2014-09-18       Impact factor: 4.942

8.  CIRI: an efficient and unbiased algorithm for de novo circular RNA identification.

Authors:  Yuan Gao; Jinfeng Wang; Fangqing Zhao
Journal:  Genome Biol       Date:  2015-01-13       Impact factor: 13.583

9.  Circular RNA profile in gliomas revealed by identification tool UROBORUS.

Authors:  Xiaofeng Song; Naibo Zhang; Ping Han; Byoung-San Moon; Rose K Lai; Kai Wang; Wange Lu
Journal:  Nucleic Acids Res       Date:  2016-02-11       Impact factor: 16.971

10.  Cell-type specific features of circular RNA expression.

Authors:  Julia Salzman; Raymond E Chen; Mari N Olsen; Peter L Wang; Patrick O Brown
Journal:  PLoS Genet       Date:  2013-09-05       Impact factor: 5.917

