Justine Labory1,2, Gwendal Le Bideau2, David Pratella1, Jean-Elisée Yao1, Samira Ait-El-Mkadem Saadi2, Sylvie Bannwarth2, Loubna El-Hami1,2, Véronique Paquis-Fluckinger2, Silvia Bottini1.
Abstract
MOTIVATION: Current advances in omics technologies are paving the way for the diagnosis of rare diseases by providing a complementary assay to identify the responsible gene. The use of transcriptomic data to identify aberrant gene expression (AGE) has been shown to reveal potentially pathogenic events. However, popular approaches for AGE identification are limited by their reliance on statistical tests, which require the choice of an arbitrary cut-off for significance assessment and the availability of several replicates, something that is not always possible in clinical contexts.
Year: 2022 PMID: 36063052 PMCID: PMC9563686 DOI: 10.1093/bioinformatics/btac603
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.931
Fig. 1. The workflow of ABEILLE to identify AGE. It comprises two phases: (A) a supervised phase to identify the parameter intervals of aberration and (B) an unsupervised phase to identify AGE. (A) We feed the VAE with the semi-synthetic dataset (I_expr) and let the model generate the reconstructed counts (R_expr) (A-I). AGEs present in the original dataset represent a perturbation to the data distribution, so the VAE reconstructs their values less faithfully. Comparing I_expr and R_expr should therefore lead to the identification of AGEs (A-II left). To enhance the comparison and evaluate the reconstruction fidelity, we established two novel metrics: the divergence score and the delta count (A-II right). For each gene, the divergence score and delta count are plotted for each patient in the cohort. We then fit a linear regression model and calculate parameters associated with this model to evaluate each point's position on the plot and its reconstruction fidelity (A-III). These parameters were used to feed a decision tree using the CART algorithm. The resulting decision tree is shown in A-IV. This decision tree gives the intervals of the regression parameters used to classify gene expression as AGE or NGE. (B) The first step is to run the VAE and the linear regression model on an unlabeled dataset in order to calculate the regression parameters from the divergence score and delta count values obtained by comparing I_expr and R_expr (B-I and B-II). The procedure is the same as described for the supervised phase. The regression parameters obtained on the unlabeled dataset are compared to the intervals obtained with the ABEILLE decision tree during the supervised phase. This allows gene expression to be classified as AGE or NGE based on the regression parameters (B-III). An isolation forest approach is used to calculate the anomaly score associated with AGEs (B-IV).
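The per-gene comparison described in the Fig. 1 caption (steps A-II and A-III) can be illustrated with a minimal sketch. The record does not give the formulas for the divergence score or the delta count, so `delta_count` and `divergence_score` below are hypothetical placeholders chosen only for illustration, as are the function name `gene_regression`, the returned fields and the example counts. The VAE, the CART decision tree and the isolation forest stages are not reproduced here.

```python
def delta_count(observed, reconstructed):
    # ASSUMPTION: placeholder definition (signed difference of counts).
    # The actual ABEILLE formula is not given in this record.
    return observed - reconstructed


def divergence_score(observed, reconstructed):
    # ASSUMPTION: placeholder definition (absolute difference scaled by the
    # mean of the two counts, +1 to avoid division by zero).
    return abs(observed - reconstructed) / ((observed + reconstructed) / 2 + 1)


def gene_regression(i_expr, r_expr):
    """For one gene, compare input counts (I_expr) with reconstructed counts
    (R_expr) across patients and fit divergence ~ delta by least squares."""
    deltas = [delta_count(o, r) for o, r in zip(i_expr, r_expr)]
    divs = [divergence_score(o, r) for o, r in zip(i_expr, r_expr)]

    n = len(deltas)
    mean_x = sum(deltas) / n
    mean_y = sum(divs) / n

    # Ordinary least squares for a single predictor.
    sxx = sum((x - mean_x) ** 2 for x in deltas)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(deltas, divs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Per-patient residuals: vertical distance of each point from the line.
    residuals = [y - (slope * x + intercept) for x, y in zip(deltas, divs)]
    return {
        "deltas": deltas,
        "divergences": divs,
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,
    }


if __name__ == "__main__":
    # Made-up counts for one gene in five patients; the last patient is
    # poorly reconstructed, mimicking an aberrant expression value.
    i_expr = [100, 102, 98, 101, 400]
    r_expr = [101, 100, 99, 100, 110]
    result = gene_regression(i_expr, r_expr)
    for patient, (d, s) in enumerate(zip(result["deltas"], result["divergences"])):
        print(patient, d, round(s, 4))
```

In the workflow described by the caption, parameters derived from such a regression would then be compared with the intervals learned by the decision tree; which parameters ABEILLE actually uses is not stated in this record.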
Fig. 2. ABEILLE VAE features capture biological signals. (A) Encoding dimensions 10 and 1 stratify patients by batch group. (B) Encoding dimension 118 separates patients by sex. (C) Heatmap of the full ABEILLE encoding dimensions by sample for the Kremer dataset. Patients are on the y axis and biological features on the x axis, according to the legend in the plot.
Fig. 3. The three novel parameters defined in ABEILLE: the divergence score, the delta count and the Aberrant Score. Divergence score versus delta count (first row) and divergence score versus anomaly score (second row) for (A) the top five AGEs and (B) the bottom five AGEs found by ABEILLE in the Kremer dataset, sorted by anomaly score, and (C) five randomly selected genes for which no aberrant expression was found in any patient of the cohort. Red points represent AGEs, with the patient identifier indicated next to the point; black points represent NGEs.
Fig. 4. Benchmark of ABEILLE, OUTRIDER and OutPyR on real data. (A) The number of AGEs per patient, sorted in descending order for ABEILLE, OUTRIDER and Kremer and in ascending order for OutPyR. (B) Venn diagram representing the number of AGEs shared by ABEILLE, OUTRIDER and Kremer. (C) Mosaic plot showing the proportion of AGEs shared by ABEILLE and Kremer, or by OUTRIDER and Kremer, relative to AGEs that are not shared. (D) Summary of the results of the functional analysis performed on the AGEs identified by the 3 approaches across 11 ontologies, for terms related to mitochondrial biology and diseases. (E) Performance of ABEILLE and OUTRIDER on small datasets.