Adam McDermaid, Xin Chen, Yiran Zhang, Cankun Wang, Shaopeng Gu, Juan Xie, Qin Ma.
Abstract
One of the main benefits of modern RNA-Sequencing (RNA-Seq) technology is more accurate gene expression estimation than previous generations of expression data, such as microarrays, could provide. However, numerous issues can cause an RNA-Seq read to map to multiple locations on the reference genome with the same alignment score, which occurs in plant, animal, and metagenome samples. Such a read is called a multiple-mapping read (MMR). The impact of these MMRs is reflected in gene expression estimation and all downstream analyses, including differential gene expression and functional enrichment. Current analysis pipelines lack tools to effectively test the reliability of gene expression estimations and thus cannot ensure the validity of downstream analyses. Our investigation of 95 RNA-Seq datasets from seven plant and animal species (totaling 1,951 GB) indicates that, on average, roughly 22% of all reads are MMRs. Here we present a machine learning-based tool called GeneQC (Gene expression Quality Control), which can accurately estimate the reliability of each gene's expression level derived from an RNA-Seq dataset. The underlying algorithm is based on extracted genomic and transcriptomic features, which are combined using elastic-net regularization and mixture model fitting to provide a clearer picture of mapping uncertainty for each gene. GeneQC allows researchers to determine which expression estimations are reliable and to conduct further analysis on the gene expression that is of sufficient quality. The tool also enables researchers to investigate continued re-alignment methods to determine more accurate gene expression estimates for genes with low reliability. Application of GeneQC reveals a high level of mapping uncertainty in plant samples and limited but severe mapping uncertainty in animal samples. GeneQC is freely available at http://bmbl.sdstate.edu/GeneQC/home.html.
Keywords: EM-algorithm; RNA-Seq read alignment; elastic-net; gene expression; k-means clustering; machine learning; mapping uncertainty; mixture model fitting
Year: 2018 PMID: 30154828 PMCID: PMC6102479 DOI: 10.3389/fgene.2018.00313
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Table 1. Multi-mapped reads.
| Datasets | 10 | 10 | 10 | 10 | 13 | 11 | 11 | 10 | 10 | 95 |
| Size(GB) | 153.7 | 152.3 | 151.8 | 385.7 | 348.1 | 249.9 | 249.9 | 129.9 | 129.9 | 1,951 |
| Unique-mapped | 69–89% | 55–82% | 52–88% | 47–66% | 61–69% | 56–71% | 59–70% | 41–73% | 41–75% | 55% |
| Multi-mapped | 8–17% | 9–25% | 5–34% | 17–33% | 17–25% | 16–27% | 15–24% | 9–37% | 9–36% | 22% |
| Un-mapped | 2–17% | 8–23% | 4–16% | 13–25% | 9–18% | 12–21% | 12–22% | 3–31% | 2–31% | 23% |
| (Multi-mapped)/(total mapped) | 8–18% | 10–31% | 6–39% | 22–39% | 21–28% | 19–32% | 19–28% | 11–47% | 11–47% | 29% |
The alignment statistics for the 95 analyzed datasets across seven species, indicating the ranges of percentages for the uniquely aligned, multi-mapped, and un-mapped reads, as well as the proportion of multi-mapped out of the total mapped reads.
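The last row of the table can be derived from the unique-mapped and multi-mapped rows. The following is a minimal sketch of that arithmetic using the overall percentages in the final column; the function name is illustrative and not part of GeneQC.

```python
def multi_mapped_share(unique_pct: float, multi_pct: float) -> float:
    """Return multi-mapped reads as a fraction of all mapped reads."""
    total_mapped = unique_pct + multi_pct
    if total_mapped <= 0:
        raise ValueError("no mapped reads")
    return multi_pct / total_mapped


# Overall figures from the final column: 55% unique-mapped, 22% multi-mapped.
share = multi_mapped_share(55.0, 22.0)
print(f"{share:.0%}")  # rounds to 29%, the value in the last row
```

Un-mapped reads are excluded from the denominator, which is why this proportion (29%) is higher than the 22% of all reads reported as multi-mapped.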
Figure 1. Mapping uncertainty and GeneQC. (A) The MMR percentages for the 95 datasets across seven species. More detailed information is given in Table 1; (B) GeneQC takes a read alignment, reference genome, and annotation file as inputs; (C) The first step of GeneQC is to extract features related to mapping uncertainty for each annotated gene; (D) Using the extracted features, elastic-net regularization is used to calculate the D-score, which represents the mapping uncertainty for each gene; (E) A series of Mixture Normal and Mixture Gamma distributions are fit to the D-scores; and (F) The mixture models are used to categorize the D-scores into different levels of mapping uncertainty along with a statistical alternative likelihood value for each gene.
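Steps (E) and (F) rest on mixture model fitting with the EM algorithm. The sketch below is a deliberately simplified illustration, not GeneQC's procedure: it fits only a two-component normal mixture (no gamma mixtures, no model selection, no likelihood value) to a handful of hypothetical D-scores, and every function name and parameter is an assumption made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_normals(values, iterations=200, var_floor=1e-4):
    """Fit a two-component 1-D normal mixture by EM; component 0 starts at the minimum."""
    n = len(values)
    means = [min(values), max(values)]
    centre = sum(values) / n
    start_var = max(sum((v - centre) ** 2 for v in values) / n, var_floor)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, var_floor)  # floor keeps a tight cluster from collapsing
    return weights, means, variances


def classify(value, weights, means, variances):
    """Label a value by whichever component gives it the larger weighted density."""
    dens = [weights[k] * normal_pdf(value, means[k], variances[k]) for k in range(2)]
    return "higher uncertainty" if dens[1] > dens[0] else "lower uncertainty"


# Hypothetical D-scores, chosen only to form two visible clusters.
d_scores = [0.02, 0.03, 0.02, 0.03, 0.04, 0.50, 0.51, 0.49, 0.50, 0.48]
weights, means, variances = fit_two_normals(d_scores)
print(classify(0.03, weights, means, variances))
print(classify(0.50, weights, means, variances))
```

Initializing the two means at the smallest and largest values keeps the component order fixed, so component 1 always corresponds to the higher D-scores.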
Figure 2. Genomic, transcriptomic, and network feature development. (A) Genes with significant similarity are displayed, with D1 being the maximum value of ss*l. In this situation, genes y2, y3, and y4 all have the same ss value, but gene y3 has a longer consecutive string of matching base pairs (l) than the others, making it the most similar genomic location. (B) Graphical representation of the sets of reads aligned to each gene. D2 is the largest overlapping proportion of shared ambiguous or multi-mapped reads between the target gene, gene i, and all other genomic locations that have at least one read potentially aligned to both locations. (C) This graph displays the significant interactions of gene i with other genomic locations. Each node represents a genomic location, with the red edges representing sequence similarity scores and black edges representing multi-mapping proportions. In this situation, D1 = 310, D2 = 0.24, and D3 = log10(3+1) = 0.602.
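The three features in this caption can be sketched in a few lines. The code below is an illustration under stated assumptions rather than GeneQC's implementation: D2 is taken as the shared reads divided by the target gene's reads, D3 is taken as the base-10 logarithm of the number of significant interactions plus one (which reproduces the 0.602 in panel C, since log10(4) ≈ 0.602), and all function names and input values are hypothetical.

```python
import math


def d1_sequence_similarity(candidates):
    """Largest ss * l over (ss, l) pairs for the similar genomic locations."""
    return max(ss * l for ss, l in candidates)


def d2_shared_read_proportion(target_reads, reads_by_location):
    """Largest fraction of the target gene's reads shared with any single other location."""
    if not target_reads:
        return 0.0
    return max(
        (len(target_reads & reads) / len(target_reads) for reads in reads_by_location.values()),
        default=0.0,
    )


def d3_degree_weight(n_interactions):
    """Assumed form: log10 of the number of significant interactions plus one."""
    return math.log10(n_interactions + 1)


# Hypothetical inputs: three locations with equal ss but different match lengths l.
d1 = d1_sequence_similarity([(0.9, 200), (0.9, 250), (0.9, 180)])

# Hypothetical read IDs: 12 of the target's 50 reads also align to loc_a, 5 to loc_b.
target = set(range(50))
others = {"loc_a": set(range(38, 60)), "loc_b": set(range(45, 70))}
d2 = d2_shared_read_proportion(target, others)

d3 = d3_degree_weight(3)
print(round(d3, 3))  # 0.602, as in panel C
```

With equal ss values, the location with the longest matching run determines D1, which mirrors the y3 case described in panel A.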
GeneQC example output.
| D1 | D2 | D3 | D-score | Mapping uncertainty | Alternative likelihood |
| --- | --- | --- | --- | --- | --- |
| 1439.981 | 0.022727 | 1.041393 | 0.022765 | Low | 0.106445 |
| 228 | 1 | 0.69897 | 0.509935 | High | 0.012702 |
| 2560 | 1 | 0.477121 | 0.498094 | High | 0.015754 |
| 321.9987 | 0.005017 | 2.060698 | 0.020863 | Low | 0.10397 |
| 365 | 0.0224 | 1.78533 | 0.027916 | Low | 0.113361 |
| 157 | 0.04878 | 0.954243 | 0.033132 | Low | 0.120682 |
| 691.9874 | 0.7809523 | 0.47712125 | 0.39143804 | Medium | 2.15E-54 |
| 855 | 1 | 0.477121 | 0.499807 | High | 0.015276 |
| 4864 | 1 | 0.477121 | 0.495779 | High | 0.016419 |
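One way to use output like this is to separate genes whose expression estimates look reliable from those that may warrant re-alignment. The sketch below assumes the columns are read, in order, as D1, D2, D3, D-score, mapping uncertainty category, and alternative likelihood; the grouping function is illustrative and not a GeneQC feature.

```python
from collections import defaultdict

# Rows copied from the example output; the column meanings are an assumption.
ROWS = [
    (1439.981, 0.022727, 1.041393, 0.022765, "Low", 0.106445),
    (228, 1, 0.69897, 0.509935, "High", 0.012702),
    (2560, 1, 0.477121, 0.498094, "High", 0.015754),
    (321.9987, 0.005017, 2.060698, 0.020863, "Low", 0.10397),
    (365, 0.0224, 1.78533, 0.027916, "Low", 0.113361),
    (157, 0.04878, 0.954243, 0.033132, "Low", 0.120682),
    (691.9874, 0.7809523, 0.47712125, 0.39143804, "Medium", 2.15e-54),
    (855, 1, 0.477121, 0.499807, "High", 0.015276),
    (4864, 1, 0.477121, 0.495779, "High", 0.016419),
]


def group_by_category(rows):
    """Group rows by the mapping uncertainty category in the fifth column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[4]].append(row)
    return dict(groups)


groups = group_by_category(ROWS)
# Low mapping uncertainty: expression estimates treated as reliable.
reliable = groups.get("Low", [])
# Medium or High: candidates for re-alignment or cautious interpretation.
needs_review = groups.get("Medium", []) + groups.get("High", [])
print(len(reliable), len(needs_review))  # 4 5
```

In these rows every High gene has a shared-read proportion of 1 in the second column, while the Low genes all fall below 0.05.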
GeneQC analysis of seven species.
| SRR3305038 | 0.02 | 0.58 | 0.01 | 0.29 | |
| SRR2080995 | 0.04 | 0.46 | 0.16 | 0.24 | |
| SRR5274891 | 0.06 | 0.66 | 0.04 | 0.33 | |
| SRR5188171 | 0.01 | 0.32 | 0.09 | 0.16 | |
| ATW_AAOSW_6_2 _B06BTABXX.IND12 | 0.02 | 0.60 | 0.15 | 0.31 | |
| SRR6029567 | 0.05 | 0.84 | 0.32 | 0.43 | |
| SRR6111161 | 0.06 | 0.84 | 0.28 | 0.42 |
This table shows the sample ID and relevant metrics for each of the seven datasets analyzed. Mean values for D.
Figure 3. GeneQC application. Results from the analysis of seven datasets representing five plant and two animal species. (A) Categorizations for the level of mapping uncertainty per gene are shown relative to all categorizations. (B) Boxplots for the three extracted features of each gene are shown for each analyzed sample. D1, D2, and D3 represent the sequence similarity, proportion of shared MMRs, and degree weight, respectively. Each value is shown normalized between 0 and 1. Only genes with mapping uncertainty are displayed. (C) Derived D-scores for each gene are shown by species, as calculated from the three features in (B). Higher D-scores represent higher levels of mapping uncertainty.