Hannes Wartmann1, Sven Heins1, Karin Kloiber1, Stefan Bonn1. 1. Institute of Medical Systems Biology, Center for Biomedical AI, University Medical Center Hamburg-Eppendorf, 20251 Hamburg, Germany.
Abstract
BACKGROUND: Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations makes it impossible for researchers to find datasets specific to their needs. FINDINGS: Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. CONCLUSION: Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets by automated annotation making them more searchable.
BACKGROUND: Recent technological advances have resulted in an unprecedented increase in publicly available biomedical data, yet the reuse of the data is often precluded by experimental bias and a lack of annotation depth and consistency. Missing annotations makes it impossible for researchers to find datasets specific to their needs. FINDINGS: Here, we investigate RNA-sequencing metadata prediction based on gene expression values. We present a deep-learning-based domain adaptation algorithm for the automatic annotation of RNA-sequencing metadata. We show, in multiple experiments, that our model is better at integrating heterogeneous training data compared with existing linear regression-based approaches, resulting in improved tissue type classification. By using a model architecture similar to Siamese networks, the algorithm can learn biases from datasets with few samples. CONCLUSION: Using our novel domain adaptation approach, we achieved metadata annotation accuracies up to 15.7% better than a previously published method. Using the best model, we provide a list of >10,000 novel tissue and sex label annotations for 8,495 unique SRA samples. Our approach has the potential to revive idle datasets by automated annotation making them more searchable.
Authors: Peter A C 't Hoen; Marc R Friedländer; Jonas Almlöf; Michael Sammeth; Irina Pulyakhina; Seyed Yahya Anvar; Jeroen F J Laros; Henk P J Buermans; Olof Karlberg; Mathias Brännvall; Johan T den Dunnen; Gert-Jan B van Ommen; Ivo G Gut; Roderic Guigó; Xavier Estivill; Ann-Christine Syvänen; Emmanouil T Dermitzakis; Tuuli Lappalainen Journal: Nat Biotechnol Date: 2013-09-15 Impact factor: 54.908
Authors: Zachary D Stephens; Skylar Y Lee; Faraz Faghri; Roy H Campbell; Chengxiang Zhai; Miles J Efron; Ravishankar Iyer; Michael C Schatz; Saurabh Sinha; Gene E Robinson Journal: PLoS Biol Date: 2015-07-07 Impact factor: 8.029
Authors: Aaron M Smith; Jonathan R Walsh; John Long; Craig B Davis; Peter Henstock; Martin R Hodge; Mateusz Maciejewski; Xinmeng Jasmine Mu; Stephen Ra; Shanrong Zhao; Daniel Ziemek; Charles K Fisher Journal: BMC Bioinformatics Date: 2020-03-20 Impact factor: 3.169