| Literature DB >> 34179780 |
Sébastien Riquier1, Chloé Bessiere1, Benoit Guibert1, Anne-Laure Bouge2, Anthony Boureux1, Florence Ruffle1, Jérôme Audoux2, Nicolas Gilbert1, Haoliang Xue3, Daniel Gautheret3, Thérèse Commes1.
Abstract
The huge body of publicly available RNA-sequencing (RNA-seq) libraries is a treasure of functional information allowing to quantify the expression of known or novel transcripts in tissues. However, transcript quantification commonly relies on alignment methods requiring a lot of computational resources and processing time, which does not scale easily to large datasets. K-mer decomposition constitutes a new way to process RNA-seq data for the identification of transcriptional signatures, as k-mers can be used to quantify accurately gene expression in a less resource-consuming way. We present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers into RNA-seq datasets and quickly visualize large dataset characteristics. The core tool, Kmerator, produces specific k-mers for 97% of human genes, enabling the measure of gene expression with high accuracy in simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata including library protocol, sample features or contaminations from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. Moreover, we demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers such as mutations, gene fusions or long non-coding RNAs for human health applications.Entities:
Year: 2021 PMID: 34179780 PMCID: PMC8221386 DOI: 10.1093/nargab/lqab058
Source DB: PubMed Journal: NAR Genom Bioinform ISSN: 2631-9268
Figure 1.Kmerator Suite and Kmerator levels: definitions. (A) The Kmerator Suite is a set of three tools: (1) Kmerator extracts gene/transcript k-mer signatures. It takes as input a reference genome and a reference transcriptome + a list of gene or transcript sequences to extract specific k-mers from. The output is a set of fasta files (one per input gene/transcript sequence) with the specific k-mers. (2) countTags quantifies input k-mers in a set of input sequencing raw files (fastq files) and outputs a count table. (3) KmerExploR is a particular application of Kmerator/countTags to visualize input RNA-seq dataset (set of fastq files) characteristics. The default usage includes characteristics related to the sequencing protocol (ribosomal depletion, polyA+, strand-specific protocol, 5′/3′ bias), tissue origin (sex) and possible contaminations (mycoplasma, virus, other species or HeLa cell line). Users can also visualize their own signatures with the advanced usage. Details are given in the text and Supplementary Figure S1. (B) Kmerator extracts gene/transcript k-mer signatures with three possible levels of stringency. This figure describes how the different levels are defined (transcript, gene or chimera) for two example genes A and B. Example gene A has three isoforms: A1, A2 and A3. A1 is the only one with a free interval, i.e. a region not covered by other isoforms, and is defined as the principal transcript (APPRIS database). Therefore, at the transcript level, each transcript has its own specific k-mer set, depending on its coverage with other isoforms. At the gene level, the principal transcript defined with the APPRIS database is used, and specific k-mers can be common to several isoforms. At the chimera level (example of A1–B1 fusion), the k-mer is not described in annotations.
List of predictor genes, by category, included in KmerExploR and associated RNA-seq dataset names used in this paper
| Datasets | Predictor genes | Total | References and details | |
|---|---|---|---|---|
|
| Dataset- FEATURES | HIST2H2AC, HIST2H2AB, HIST1H4J, HIST1H4I, HIST1H4F, HIST1H4D, HIST1H4C, HIST1H4B, HIST1H3I, HIST1H3H, HIST1H3G, HIST1H3F, HIST1H3E, HIST1H3C, HIST1H3B, HIST1H3A, HIST1H2BN, HIST1H2BM, HIST1H2BL, HIST1H2BH, HIST1H2BF, HIST1H2BE, HIST1H2BB, HIST1H2BA, HIST1H2AK, HIST1H2AH, HIST1H2AG, HIST1H2AB, HIST1H1T, HIST1H1E, HIST1H1D, HIST1H1B, HIST1H1A | 24 512 |
|
|
| Dataset- FEATURES | VPS29, SNRPD3, REEP5, RAB7A, PSMB4, PSMB2, GPI, EMC7, CHMP2A, C1orf43, VPS29_rev, SNRPD3_rev, REEP5_rev, RAB7A_rev, PSMB4_rev, PSMB2_rev, GPI_rev, EMC7_rev, CHMP2A_rev, C1orf43_rev | 36 638 |
|
|
| Dataset- FEATURES | UTY, TMSB4Y, TBL1Y, RPS4Y1, NLGN4Y, EIF1AY, DDX3Y | 21 996 |
|
|
| Dataset- FEATURES | VPS29, SNRPD3, REEP5, RAB7A, PSMB4, PSMB2, GPI, EMC7, CHMP2A, C1orf43 | 12 705 |
|
|
| Dataset-MYCO | Mycoplasma_orale, Mycoplasma_hyorhinis, Acholeplasma_laidlawii, Mycoplasma_hominis, Mycoplasma_arginini, Mycoplasma_fermentans | 363 025 | ( |
|
| Dataset-VIRUS-CCLE | Human_gammaherpesvirus_4, Human_herpesvirus_4, Human_herpesvirus_8, Murine_leukemia_virus, Hepatitis_C_virus_genotype, Human_immunodeficiency_virus_1, Human_T_lymphotropic_virus_1, Squirrel_monkey_retrovirus, Human_T_lymphotropic_virus_2, Human_papillomavirus_type_92, Hepatitis_B_virus_strain, Human_immunodeficiency_virus_2, MuLV_related_virus_22Rv1/CWR, Bovine_viral_diarrhea_virus | 516 882 | ( |
|
| Dataset-HELA-CCLE | L1_mut7486, L1_mut7258, L1_mut6842, L1_mut6625, L1_mut6460, L1_mut6401, L1_mut5875, E7_mut806, E7_mut751, E6_mut549, E6_mut485, E6_mut287, E6_mut104, E1_mut2269, E1_mut1994, E1_mut1843, E1_mut1807, E1_mut1353, E1_mut1012 | 589 | ( |
|
| Dataset- SPECIES | Homo_sapiens_MT_CO1, Danio_rerio_mt_co1, Zea_mays_COX1, Saccharomyces_cerevisiae_COX1, Rattus_norvegicus_Mt_co1, Mus_musculus_mt_Co1, Gallus_gallus_MT_CO1, Drosophila_melanogaster_mt_CoI, Caenorhabditis_elegans_ctc_3_MTCE, Arabidopsis_thaliana_COX1 | 12 119 | MT-CO1 (and orthologs) |
|
| Dataset- LEUCEGENE | PML–RARA, RARA–PML, RUNX1T1–RUNX1, RUNX1–RUNX1T1 | 724 | |
|
| Dataset- LEUCEGENE | NONE | 78 | ( |
|
| Dataset- LEUCEGENE | TET2, KRAS, CEBPA | 10 |
The samples included in each dataset and some metadata are detailed in Supplementary Table S1.
Figure 2.Kmerator performances on the whole transcriptome. We extracted k-mer signatures from all the human Ensembl transcriptome v91 at both gene (54 874 coding and non-coding genes, left) and transcript (i.e. 199 181 transcripts, right) levels. (A) The first pie chart represents the proportion of genes having specific k-mers (turquoise) versus those without specific k-mers (red). (B) In the same way, we represented the proportion of transcripts having specific k-mers (turquoise) or not (red). For these two classes, we looked at the percentage having free intervals, i.e. regions in the transcript not shared with other isoforms (secondary pie). Most of the transcripts lacking specific k-mers do not have free intervals (91%). We tested Kmerator sensitivity to quantify simulated data, at both gene (C) and transcript (D) levels. We represented the k-mer counts normalized per billion of k-mers in the sample (Y-axis) as a function of the true expression in TPM (X-axis), on the whole simulated dataset. R is the Spearman’s correlation coefficient between k-mer counts and TPM. Each point on the graph is a transcript and the color scale depends on the transcript density on the graph.
Figure 3.KmerExploR default usage: basic features. All presented bar plots are direct output of KmerExploR and they are generated from the Dataset-FEATURES described in Supplementary Table S1 (103 paired-end ENCODE samples) except for the orientation (C), which is a subset of eight RNA-seq from the Dataset-FEATURES. For each bar plot, the legend lists the set of predictor genes for which k-mer mean counts are computed (see also Table 1). Samples are on the X-axis. Panels (A), (B) and (C) have the mean k-mer counts by gene normalized per billion of k-mers on the Y-axis. (A) Sex determination. Samples are sorted by sex in the order female, then male. (B) PolyA+ selection versus ribo-depletion by histone detection. Samples are sorted by protocol in this order: polyA, ribo-depletion, unknown. (C) Stranded versus unstranded sequencing protocol. For this category, both fastq files by sample are shown. The first four samples are unstranded and the last four samples are stranded. (D) Read position biases along 5′ UTR, 3′ UTR and CDS regions. After computing k-mer mean counts by gene, they are summed up by 5′ UTR, 3′ UTR or CDS regions and converted in % (Y-axis).
Figure 4.KmerExploR default usage: contaminations. All presented bar plots are direct output of KmerExploR and all bar plot datasets are described in Supplementary Table S1. For each bar plot, the legend lists the set of predictors for which k-mer mean counts are computed (details in Table 1). Samples are on the X-axis. Panels (A), (B) and (D) have the mean k-mer counts by gene normalized per billion of k-mers on the Y-axis. (A) Mycoplasma contamination on the Dataset-MYCO (33 single-read samples). (B) Virus detection on the Dataset-VIRUS-CCLE (22 paired-end samples). (C) Species determination on the Dataset-SPECIES (27 paired-end samples). For this category, after computing k-mer mean counts by species, they are converted in % (Y-axis) to avoid big expression differences between species. (D) HeLa determination on the Dataset-HELA-CCLE (three paired-end samples). The sample in the middle is a HeLa cell line and the two others are negative controls (SF767 and SiHa cells).
Figure 5.KmerExploR advanced usage: quantification of transcriptomic events outside the annotations. All presented bar plots are direct output of KmerExploR and they are all generated from the Dataset-LEUCEGENE described in Supplementary Table S1 (131 paired-end samples). This dataset includes normal CD34+ cells as control (in green on the X-axis) and different AML subtypes (in black on the X-axis). For each bar plot, the legend lists the set of predictors for which k-mer mean counts normalized per billion (Y-axis) are computed. (A) Chimera detection. Two well-known fusion RNAs associated with chromosomal translocation and their reciprocal counterparts are presented: RUNX1–RUNXT1 t(x,21) and PML–RARA t(15,17). (B) Mutation detection. TET2, KRAS and CEBPA genes are used in AML diagnosis. The bar plot shows four different mutations for these genes, detected specifically in some AML samples. The reference allele for each of these mutations is detected in almost all samples. (C) New lncRNA detection: NONE ‘chr2-p21’ lncRNA described in (23). This transcript is expressed in the whole dataset but shows different levels of expression depending on AML subtype.