Travis S Johnson (1,2), Tongxin Wang (2,3), Zhi Huang (2,4), Christina Y Yu (1,2), Yi Wu (2), Yatong Han (5), Yan Zhang (1,6), Kun Huang (2,7), Jie Zhang (8)
1. Department of Biomedical Informatics, The Ohio State University College of Medicine, Columbus, OH, USA.
2. Department of Medicine, Indiana University School of Medicine, Indianapolis, IN, USA.
3. Department of Computer Science, Indiana University, Bloomington, IN, USA.
4. School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA.
5. Harbin Engineering University, Harbin, China.
6. The Ohio State University Comprehensive Cancer Center (OSUCCC - James), Columbus, OH, USA.
7. Regenstrief Institute, Indiana University School of Medicine, Indianapolis, IN, USA.
8. Department of Medical and Molecular Genetics, Indiana University School of Medicine, Indianapolis, IN, USA.
Abstract
MOTIVATION: Rapid advances in single-cell RNA sequencing (scRNA-seq) have produced higher-resolution cellular subtypes in multiple tissues and species. Methods are increasingly needed across datasets and species to (i) remove systematic biases, (ii) model multiple datasets with ambiguous labels and (iii) classify cells and map cell type labels. However, most methods address only one of these problems, on broad cell types or simulated data, using a single model type. It is also important to address higher-resolution cellular subtypes, subtype labels from multiple datasets, models trained on multiple datasets simultaneously and generalizability beyond a single model type.
RESULTS: We developed a species- and dataset-independent transfer learning framework (LAmbDA) to train models on multiple datasets (even from different species) and applied our framework to simulated, pancreas and brain scRNA-seq experiments. These models mapped corresponding cell types between datasets with inconsistent cell subtype labels while simultaneously reducing batch effects. We achieved high accuracy in labeling cellular subtypes (weighted accuracy simulated 1 datasets: 90%; simulated 2 datasets: 94%; pancreas datasets: 88% and brain datasets: 66%) using the LAmbDA Feedforward 1 Layer Neural Network with bagging. This method achieved higher weighted accuracy in labeling cellular subtypes than two other state-of-the-art methods, scmap and CaSTLe, on the brain datasets (66% versus 60% and 32%). Furthermore, it correctly predicted ambiguous cellular subtype labels across datasets in 88% of test cases, compared with CaSTLe (63%), scmap (50%) and MetaNeighbor (50%). LAmbDA is model- and dataset-independent and generalizable to diverse data types, representing an advance in biocomputing.
AVAILABILITY AND IMPLEMENTATION: github.com/tsteelejohnson91/LAmbDA.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
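The abstract reports "weighted accuracy" without defining it. As a rough illustration only, the sketch below assumes it means the mean of per-class accuracies (each cell subtype counts equally, regardless of how many cells it contains); this definition, the function name `weighted_accuracy` and the toy labels are assumptions for illustration and are not taken from the LAmbDA code.

```python
from collections import defaultdict


def weighted_accuracy(true_labels, predicted_labels):
    """Mean per-class accuracy (ASSUMED definition, not from the source).

    Each class contributes equally, so rare subtypes are not swamped
    by abundant ones.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one label is required")

    totals = defaultdict(int)   # cells per true class
    correct = defaultdict(int)  # correctly labeled cells per true class
    for truth, guess in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == guess:
            correct[truth] += 1

    per_class = [correct[c] / totals[c] for c in totals]
    return sum(per_class) / len(per_class)


# Hypothetical toy data: 8 abundant cells and 2 rare cells.
true_labels = ["alpha"] * 8 + ["delta"] * 2
predicted_labels = ["alpha"] * 8 + ["alpha", "delta"]

plain = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
weighted = weighted_accuracy(true_labels, predicted_labels)
print(f"plain accuracy: {plain:.2f}, weighted accuracy: {weighted:.2f}")
```

In this toy case one of the two rare cells is mislabeled, so the class-averaged score falls below the plain accuracy, which is why such a metric is often preferred when subtypes are imbalanced.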