Sonja Zehetmayer, Martin Posch, Alexandra Graf.
Abstract
BACKGROUND: In RNA-sequencing studies, a large number of hypothesis tests are performed to compare the differential expression of genes between several conditions. Filtering has been proposed to remove candidate genes with a low expression level, which may not be relevant and have little or no chance of showing a difference between conditions. This step may reduce the multiple testing burden and increase power.
Keywords: Gene expression; Gene filter; Multiple testing; Next generation sequencing
Year: 2022 PMID: 36153479 PMCID: PMC9509565 DOI: 10.1186/s12859-022-04928-z
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.307
Simulation strategies. More details for each setting can be found in Additional file 1.
| Simulation | Description and data sources |
|---|---|
| NB | The count data are assumed to follow a negative binomial distribution (NB), dispersion and mean parameters are fixed and equal for all |
| NB with distributed parameters | Read counts follow a NB distribution, dispersion and mean parameters vary across genes and are based on real RNA-seq data sets according to [ |
| SimSeq [ | Counts based on real data read counts adjusted by a correction factor to generate differential expressions, dependence between genes is imitated from real data sets Bottomly [ |
| PROPER [ | Read counts follow a NB distribution, dispersion and mean parameters vary across genes and are based on a real RNA-seq data set (Cheung [ |
| PROPER with fixed sequencing depth [ | As PROPER. Here, the empirical average expressions sampled from the Cheung data are standardised to reach a fixed sequencing depth. |
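The simplest setting in the table above (NB with fixed parameters) can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function names, the parameterisation Var = mean + dispersion × mean², and the use of a fold change to create differentially expressed genes are all assumptions made for the example. Counts are drawn as a gamma-Poisson mixture, which is one standard way to generate negative binomial variates.

```python
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate by counting unit-rate exponential arrivals in [0, lam]."""
    count = 0
    t = rng.expovariate(1.0)
    while t <= lam:
        count += 1
        t += rng.expovariate(1.0)
    return count


def nb_count(mean, dispersion, rng):
    """Draw one NB count with E[Y] = mean and Var[Y] = mean + dispersion * mean**2."""
    shape = 1.0 / dispersion
    # Gamma-Poisson mixture: lam ~ Gamma(shape, scale = mean * dispersion),
    # so E[lam] = mean, and Y | lam ~ Poisson(lam).
    lam = rng.gammavariate(shape, mean * dispersion)
    return poisson(lam, rng)


def simulate_counts(n_genes, n_per_group, mean, dispersion, fold_change, n_de, seed=1):
    """Simulate a genes x samples count table for two conditions.

    Hypothetical setup: the first n_de genes have their mean multiplied by
    fold_change in the second condition; all other genes are null.
    """
    rng = random.Random(seed)
    counts = []
    for g in range(n_genes):
        mean2 = mean * fold_change if g < n_de else mean
        row = [nb_count(mean, dispersion, rng) for _ in range(n_per_group)]
        row += [nb_count(mean2, dispersion, rng) for _ in range(n_per_group)]
        counts.append(row)
    return counts


if __name__ == "__main__":
    table = simulate_counts(n_genes=5, n_per_group=3, mean=20.0,
                            dispersion=0.1, fold_change=2.0, n_de=2)
    for row in table:
        print(row)
```

The settings with distributed parameters would replace the single `mean` and `dispersion` values with gene-specific values estimated from a real data set; that step is not shown here.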
Types of filtering methods
| Filter | Description | Considered thresholds |
|---|---|---|
| Mean-based | These filters are based on the gene-wise overall mean counts from both conditions. Genes with a mean expression less than some threshold given by the specified percentile percentage of mean counts are removed by the filter and not considered for the test decision (e.g., [ | Percentile % |
| Max-based | Genes with maximum counts (over both conditions) less than a threshold given by the specified percentile percentage of maximum counts are removed from the analysis and not considered for the test decision (e.g., [ | Percentile % |
| CPM | Robinson and Oshlack (2010) [ | |
| Jaccard | Max-based filter [ | |
| Zero-based | This filter counts the sum of zero counts per gene and removes genes with more than |
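The mean-based and max-based filters described in the table can be sketched as below. This is an illustrative sketch under assumptions, not the implementation used in the paper: the helper names are hypothetical, the percentile is computed by linear interpolation over the gene-wise statistics, and genes whose statistic equals the threshold are kept (the table only states that genes below the threshold are removed).

```python
def percentile(values, pct):
    """Linear-interpolation percentile (pct in [0, 100]) of a non-empty list."""
    s = sorted(values)
    pos = (len(s) - 1) * pct / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def percentile_filter(counts, pct, stat="mean"):
    """Return the indices of genes that pass a mean-based or max-based filter.

    counts: list of per-gene count lists, pooled over both conditions.
    pct:    percentile percentage that defines the threshold.
    stat:   "mean" for the mean-based filter, "max" for the max-based filter.
    """
    if stat == "mean":
        scores = [sum(row) / len(row) for row in counts]
    elif stat == "max":
        scores = [max(row) for row in counts]
    else:
        raise ValueError("stat must be 'mean' or 'max'")
    threshold = percentile(scores, pct)
    # Genes with a statistic below the threshold are removed before testing.
    return [g for g, s in enumerate(scores) if s >= threshold]


if __name__ == "__main__":
    toy = [
        [0, 0, 0, 0],        # never expressed
        [1, 0, 2, 1],        # low expression
        [10, 12, 9, 11],     # moderate, stable expression
        [100, 90, 110, 95],  # high expression
        [0, 30, 0, 0],       # one large count, low mean
    ]
    print(percentile_filter(toy, 60, stat="mean"))  # [2, 3]
    print(percentile_filter(toy, 60, stat="max"))   # [3, 4]
```

In the toy table the two filters disagree on the last gene: a single large count is enough to pass the max-based filter, while the mean-based filter removes it.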
Fig. 1 Power comparison of different filters. Power values for several filtering methods and simulation strategies for , , , (or for SimSeq (Bottomly)). The power of each filtering method is plotted as a function of the actual mean proportion of filtered genes across all simulation runs for the set of genes with at least one non-zero count; only the proportion of the basic filter is based on the total number of hypotheses m. The basic, Jaccard and no filter results are represented by a point because these methods are based on a fixed threshold
Fig. 2 Power comparison of different filters and sequencing depths. Power values for several filtering methods for PROPER simulation for , , , and sequencing depths 5m, 10m, and 50m. The power of each filtering method is plotted as a function of the actual mean proportion of filtered genes across all simulation runs for the set of genes with at least one non-zero count; only the proportion of the basic filter is based on the total number of hypotheses m. The basic, Jaccard and no filter results are represented by a point because these methods are based on a fixed threshold
Fig. 3 Adaptive filter I. Differences in power for adaptive filter and selection of applied filters compared to no filter for several scenarios. The plotted filtering methods and the corresponding percentile percentages are given in the legend. , , or , m, and are parameters on the x-axis, (lfdr adjustment). Note that the range of the y-axis is chosen result-based; filtering methods with low power may not be visible on some plots
Fig. 4 Adaptive filter II. Differences in power for adaptive filter and selection of applied filters compared to no filter for NB sim distributed for Kidney and Sultan data, SimSeq simulation for Bottomly and mouse mammary data sets and PROPER simulation for sequencing depths 5m and 50m for varying , , ( for the SimSeq Bottomly data and for SimSeq mouse data) and (lfdr adjustment). The plotted filtering methods and corresponding percentile percentages are given in the legend. Note that the range of the y-axis is chosen result-based; filtering methods with low power may not be visible on some plots
Description of data sets
| Data set | m (% of genes with only zero counts) | Sample sizes per group | Description |
|---|---|---|---|
| Kidney | 20531 (3) | 72/72 | non-tumour versus tumour samples [ |
| Kidney 2 | 20531 (5) | 10/10 | random sample of Kidney data set |
| Bottomly | 36536 (35) | 10/11 | C57BL/6J versus DBA/2J (mice strains) [ |
| Mouse mammary | 27179 (21) | 6/6 | basal versus luminal cell types in mice [ |
| Sultan | 52580 (83) | 2/2 | human embryonic kidney versus B cell lines [ |
| Airway | 64102 (52) | 4/4 | Airway smooth muscle cell lines [ |
| Airway 2 | 64102 (52) | 2/2 | random sample of Airway data set [ |
| De novo assembly: | | | |
| Yuen | 96831 (12) | 3/3/3/3 | transcriptomes of lucinid clam of 4 organs [ |
| Only data simulation: | | | |
| Cheung | 52580 (76) | 41 | lymphoblastoid cell lines from unrelated individuals [ |
Only a subset of 17580 genes, with a reduced percentage of genes with only zero counts, is used for data simulation.
Real data application
| Data set | No filter | Basic | Mean-based | Max-based | Zero-based | Jaccard |
|---|---|---|---|---|---|---|
| Bottomly | 1443 | 1324 (35) | 1488 (24) | 1417 (15) | 1371 (34) | |
| Sultan | - | 2801 (83) | 3445 (15) | 2864 (42) | | |
| Airway | 0 | 1029 (48) | 1554 (60) | 1576 (60) | 1235 (26) | |
| Airway 2 | - | 102 (52) | 275 (80) | 120 (16) | 197 (54) | |
| Mouse | 9151 | 8772 (21) | 9173 (6) | 9192 (5) | 8593 (28) | |
| Kidney | 13076 | 13075 (3) | 13282 (5) | 11784 (19) | 13072 (2) | |
| Kidney 2 | 5777 | 6355 (3) | 6355 (3) | | | |
| Yuen | | | | | | |
| gill vs. mantle | 7932 | 7932 (6) | 9912 (38) | 8619 (25) | 7932 (0) | |
| gill vs. foot | 7534 | 7534 (8) | 9537 (38) | 8194 (27) | 7543 (0) | |
| gill vs. vmass | 6093 | 6093 (4) | 8079 (44) | 7023 (23) | 6093 (0) | |
| mantle vs. foot | 5291 | 5291 (12) | 5802 (41) | 5470 (13) | 5291 (0) | |
| mantle vs. vmass | 2468 | 2468 (6) | 3655 (47) | 3034 (23) | 2468 (0) | |
| foot vs. vmass | 3605 | 3605 (7) | 5054 (35) | 4178 (27) | 3605 (0) |
Maximum number of rejections for each filtering method, with the corresponding observed proportion of filtered genes in parentheses (for the basic filter based on all genes, for other filters on the non-zero genes), for several data sets. Multiplicity adjustment is performed with lfdrs. Filtering is performed at the end (order (a)). The adaptive filter with the highest number of rejections is highlighted in bold.