Alessandro Romanel (1), Tuo Zhang (2,3), Olivier Elemento (2,4), Francesca Demichelis (1,4). 1. CIBIO, University of Trento, Trento, Italy. 2. Caryl and Israel Englander Institute for Precision Medicine, New York Presbyterian Hospital-Weill Cornell Medicine, New York, NY, USA. 3. Genomics Core Facility. 4. Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, USA.
Abstract
SUMMARY: Whole exome sequencing (WES) is widely used both in translational cancer genomics studies and in precision medicine. Stratifying individuals by ethnicity is fundamental for correctly interpreting the impact of personal genomic variation. We implemented EthSEQ to provide reliable and rapid ethnicity annotation from an individual's WES data, validated it on 1000 Genomes Project and TCGA data (2700 samples), demonstrating high precision, and assessed its computational performance against other tools. EthSEQ can be integrated into any WES-based processing pipeline and exploits multi-core capabilities. AVAILABILITY AND IMPLEMENTATION: R package available at github.com/aromanel/EthSEQ and in the CRAN repository. CONTACT: alessandro.romanel@unitn.it or f.demichelis@unitn.it. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
1 Introduction
Interrogation of the entire coding genome for germline and somatic variants through whole exome sequencing (WES) is rapidly becoming a preferred approach for exploring large cohorts (such as The Cancer Genome Atlas initiative), especially in the context of precision medicine programs (Beltran et al., 2015). In this setting, estimating an individual's ethnic background is fundamental for correctly interpreting variant association studies and the importance of personal genomic variants (Petrovski and Goldstein, 2016; Price et al., 2006; Spratt et al., 2016; Zhang et al.). To enable effective annotation of an individual's ethnicity and improve downstream analysis and interpretation of germline and somatic variants, we developed EthSEQ, a tool that implements a rapid and reliable pipeline for ethnicity annotation from WES data. The tool can annotate the ethnicity of any individual with germline WES data available and can be integrated into any WES-based processing pipeline. EthSEQ also exploits multi-core technologies when available.
2 Approach
EthSEQ provides an automated pipeline, implemented as an R package, that annotates the ethnicity of individuals from WES data by inspecting differential SNP genotype profiles while exploiting the variants covered by the specific assay. As input, the tool requires genotype data at SNP positions for a set of individuals with known ethnicity (the reference model) and either a list of BAM files or genotype data for individuals with unknown ethnicity. EthSEQ then annotates the ethnicity of each individual using an automated procedure (Supplementary Fig. S1a) and returns detailed information about each individual's inferred ethnicity, including aggregated visual reports.
The reference model builds on genotype data of individuals with known ethnicity. Here, data from 1000 Genomes Project individuals are used to construct platform-specific reference models that rely on the most conserved ethnic groups, EUR (Caucasian), AFR (African), EAS (East Asian) and SAS (South Asian), for multiple WES designs: Agilent HaloPlex, Agilent SureSelect and Roche Nimblegen. More generally, EthSEQ also provides a procedure to automatically generate a reference model from a set of genomic regions and genotype data for a set of individuals annotated for ethnicity.
The target model is created in one of two ways. Either the input list of germline BAM files is genotyped at all reference model positions using the genotyping module of ASEQ (Romanel et al.), where depth of coverage ≥ 10X and read/base mapping qualities ≥ 20 are required by default to guarantee confident genotype calls, or genotypes are provided as input to EthSEQ in VCF format.
Principal component analysis (PCA) is next performed with the SNPRelate R package (Zheng et al.) on the aggregated genotype data of the target and reference models; only SNPs that satisfy a user-defined call rate are retained. 
The space defined by the first two PCA components is then automatically inspected, first to generate the smallest convex sets identifying the ethnic groups described in the reference model and then to annotate the ethnicity of the individuals of interest (Supplementary Fig. S1b). Individuals positioned inside an ethnic group set (or inside intersecting group sets) are annotated with the corresponding ethnicity and labeled INSIDE. For individuals positioned outside all ethnic group sets, the relative contribution of each group is computed from the distances to the group centroids using the procedure described in Supplementary Figure S2, and the top-ranked contributing groups are reported (labeled CLOSEST).
To better discern ethnicity annotations across ancestrally close groups within a study cohort (for instance Ashkenazi and Caucasians), a multi-step inference procedure is provided. Given a tree of ethnic group sets such that sibling nodes have non-intersecting ethnic groups and the ethnic groups of each child node are included in those of its parent node, the ethnicity of individuals is inferred by a pre-order traversal of the tree. At each node with ethnic groups S, the annotations resulting from the analysis of the parent node are refined by restricting both the reference and target models to individuals with annotations in S only. The global annotation of all individuals is updated throughout the tree traversal.
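The INSIDE/CLOSEST annotation step described above can be illustrated with a minimal Python sketch. It is not EthSEQ's code: the function names, the toy coordinates and, in particular, the inverse-distance weighting used for the CLOSEST case are assumptions made for illustration, since the actual contribution procedure is the one given in Supplementary Figure S2.

```python
from math import hypot

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive when o->a->b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Smallest convex set of 2-D points (Andrew's monotone chain), CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """True if p lies inside or on the boundary of a CCW convex polygon."""
    n = len(hull)
    if n < 3:
        return False
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def annotate(reference, sample):
    """reference: group -> list of (PC1, PC2) points; sample: one (PC1, PC2) point."""
    hulls = {g: convex_hull(pts) for g, pts in reference.items()}
    hits = sorted(g for g, h in hulls.items() if inside(h, sample))
    if hits:
        return "INSIDE", hits
    # ASSUMPTION: inverse-distance weights stand in for the Supplementary
    # Figure S2 procedure, which is not reproduced here.
    weights = {}
    for g, pts in reference.items():
        cx, cy = centroid(pts)
        dist = max(hypot(sample[0] - cx, sample[1] - cy), 1e-12)
        weights[g] = 1.0 / dist
    total = sum(weights.values())
    ranked = sorted(((w / total, g) for g, w in weights.items()), reverse=True)
    return "CLOSEST", [(g, frac) for frac, g in ranked]

# Toy reference model in the PC1/PC2 plane (made-up coordinates).
reference = {
    "EUR": [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)],
    "AFR": [(10, 0), (12, 0), (12, 2), (10, 2)],
    "EAS": [(0, 10), (2, 10), (2, 12), (0, 12)],
}
print(annotate(reference, (1.0, 0.5)))  # falls within the EUR hull
print(annotate(reference, (4.0, 1.0)))  # outside every hull, ranked by centroid distance
```

A sample that falls in several intersecting hulls is returned with all matching groups, mirroring the "intersecting group sets" case in the text.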
3 Performances and results
The performance of the EthSEQ ethnicity inference method was tested for precision, computational time and dependence on SNP set size on two main datasets: 1000 Genomes Project genotype data and TCGA germline sample data.
Initial precision tests used 1000 Genomes Project data; we randomly divided 2096 individuals into reference and target model groups while preserving the proportions of the ethnic groups, and ran EthSEQ relying only on SNPs in WES platform-specific captured regions (Supplementary Methods). Analyses were performed using reference models built either from the major ethnic group annotations (EUR, AFR, EAS and SAS) or from the annotations for all the corresponding 21 populations reported in the 1000 Genomes Project. In the first case, all individuals' ethnicities were correctly classified (100% precision, with more than 97% of the individuals annotated with the INSIDE label) (Supplementary Fig. S3, Table S1). When the fine-grained annotation was used, ethnicity inference reached a precision of 92.2% with the multi-step refinement analysis. For instance, when considering only European individuals, who span 5 populations, precision reached 94% (see Supplementary Methods for details).
Finally, EthSEQ performance was compared to that of the LASER 2.0 trace module (Wang et al.) on the same data. Results in the PCA space were highly concordant (Supplementary Figs S4 and S5), but on multi-individual analyses EthSEQ was up to 10X faster using a single core and up to 18X faster when exploiting parallel computation (Supplementary Fig. S6; see Supplementary Methods for details).
Further, EthSEQ was run on germline WES data from 604 TCGA (cancergenome.nih.gov) individuals with reported interview-based race classification (as per TCGA nomenclature, race is annotated as 513 White, 42 Black or African American and 49 Asian). 
505 White individuals were annotated by EthSEQ as EUR, 37 Black or African American individuals as AFR and 48 Asian individuals as EAS or SAS, for an overall precision of 97.7%. EthSEQ results were compared to results from the fastSTRUCTURE tool (Raj et al.) fed with genotype data generated by the EthSEQ pre-processing module. For 594 individuals (98.3%) the two analyses inferred the same major ethnic contribution. Both tools inferred 5 individuals originally annotated as Black or African American in TCGA as admixed with a major Caucasian contribution, one originally annotated as Asian as non-admixed Caucasian, and two originally annotated as White as having a major African contribution (see Supplementary Methods and Table S1). In terms of tool-specific results, fastSTRUCTURE, which explained the population structure of the TCGA dataset with 3 clusters and achieved a precision of 98%, inferred 4.6% of individuals as admixed. EthSEQ labeled 7.9% of individuals as CLOSEST, the majority with a main SAS contribution that was not captured by fastSTRUCTURE, and correctly detected secondary African contributions when these were above 15%. The EthSEQ analysis was 3.2X faster.
The effectiveness of the multi-step refinement analysis was recently demonstrated in a precision medicine study (Zhang et al.) where ethnicity-based stratification was key to interpreting the relevance of germline cancer-associated variants. Specifically, our analysis ruled out the possibility that the high fraction of germline cancer-associated variants observed in the clinical cohort of 343 patients with metastatic tumors (Beltran et al., 2015) was due to Ashkenazi ancestry (Carmi et al., 2014), which has been shown to carry a high percentage of cancer-associated variants. 
Given an Agilent HaloPlex reference model including Ashkenazi genome data (Carmi et al., 2014), identifying Ashkenazi individuals required the multi-step analysis to precisely distinguish them from the ancestrally close European individuals; 29.7% of individuals in the cohort were identified as Ashkenazi, confirming the anticipated fraction of about 30% based on an internal cancer registry.
To measure the impact of the number of available SNPs on EthSEQ precision, we extended the performance analyses by randomly down-sampling the number of SNPs in both the 1000 Genomes Project and the TCGA datasets (Supplementary Methods). Supplementary Figure S7 shows that, using the multi-step refinement analysis, 2000 SNPs are sufficient to reach more than 98% precision. Overall, these data indicate that EthSEQ is also amenable to targeted NGS data.
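The multi-step refinement used in the Ashkenazi and down-sampling analyses can be sketched as a pre-order tree traversal. The Python sketch below is illustrative only: the `infer` function is a toy stand-in (an assumption, not EthSEQ's PCA and convex-set annotation) that keeps the single coordinate with the largest variance in the current reference and assigns each target to the nearest group centroid on that coordinate. The group label ASK, the tree layout and all coordinates are hypothetical. The point is that recomputing the projection on a reduced reference can separate groups that overlap when all groups are analyzed together.

```python
from statistics import mean, pvariance

def infer(reference, targets):
    """One annotation round (toy stand-in for PCA plus annotation; an assumption)."""
    ref_points = [p for pts in reference.values() for p in pts]
    dims = range(len(ref_points[0]))
    # Keep the coordinate that varies most within the current reference.
    axis = max(dims, key=lambda d: pvariance([p[d] for p in ref_points]))
    centroids = {g: mean(p[axis] for p in pts) for g, pts in reference.items()}
    return {name: min(centroids, key=lambda g: abs(p[axis] - centroids[g]))
            for name, p in targets.items()}

def multi_step(node, reference, targets, annotation):
    """Pre-order traversal: refine at this node, then visit its children."""
    groups = node["groups"]
    # Restrict the reference model to the groups of this node.
    ref = {g: pts for g, pts in reference.items() if g in groups}
    # Restrict the targets to those not yet annotated (root) or currently
    # annotated with one of this node's groups (deeper nodes).
    tgt = {n: p for n, p in targets.items()
           if n not in annotation or annotation[n] in groups}
    if ref and tgt:
        annotation.update(infer(ref, tgt))  # update the global annotation
    for child in node.get("children", []):
        multi_step(child, reference, targets, annotation)
    return annotation

# Hypothetical data: EUR and ASK differ only on the second coordinate.
reference = {
    "EUR": [(0.0, 0.0), (0.2, 0.1), (-0.1, -0.1)],
    "ASK": [(0.1, 1.0), (0.0, 1.2), (-0.1, 0.9)],
    "AFR": [(10.0, 0.5), (10.2, 0.4), (9.9, 0.6)],
    "EAS": [(-10.0, 0.5), (-10.1, 0.4), (-9.8, 0.6)],
}
targets = {"t1": (0.02, 1.1), "t2": (10.1, 0.5), "t3": (0.05, 0.05)}
tree = {"groups": {"EUR", "ASK", "AFR", "EAS"},
        "children": [{"groups": {"EUR", "ASK"}}]}

single_step = infer(reference, targets)
refined = multi_step(tree, reference, targets, {})
print(single_step)
print(refined)
```

In this toy setting the single-step round places `t1` with EUR because the dominant coordinate only separates continental groups, while the second round, restricted to EUR and ASK, reassigns it; `t2` is never revisited because its annotation is outside the child node's groups.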
4 Conclusions
We presented EthSEQ, a rapid, reliable and easy-to-use R package to annotate individuals' ethnicity from WES data. EthSEQ can process single-sample or multi-sample datasets, provides a large variety of pre-computed platform-specific reference models, and offers a simple and transparent way to generate ethnicity annotations starting from a list of BAM files. It can be easily integrated into any WES-based processing pipeline and exploits multi-core capabilities.
Funding
This work has been supported by the Prostate Cancer Foundation Challenge Award 2014 (F.D., A.R.), the Caryl and Israel Englander Institute for Precision Medicine, New York and the European Research Council ERCCoG648670 (F.D.).
Conflict of Interest: none declared.
Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330
Authors: Himisha Beltran; Kenneth Eng; Juan Miguel Mosquera; Alexandros Sigaras; Alessandro Romanel; Hanna Rennert; Myriam Kossai; Chantal Pauli; Bishoy Faltas; Jacqueline Fontugne; Kyung Park; Jason Banfelder; Davide Prandi; Neel Madhukar; Tuo Zhang; Jessica Padilla; Noah Greco; Terra J McNary; Erick Herrscher; David Wilkes; Theresa Y MacDonald; Hui Xue; Vladimir Vacic; Anne-Katrin Emde; Dayna Oschwald; Adrian Y Tan; Zhengming Chen; Colin Collins; Martin E Gleave; Yuzhuo Wang; Dimple Chakravarty; Marc Schiffman; Robert Kim; Fabien Campagne; Brian D Robinson; David M Nanus; Scott T Tagawa; Jenny Z Xiang; Agata Smogorzewska; Francesca Demichelis; David S Rickman; Andrea Sboner; Olivier Elemento; Mark A Rubin Journal: JAMA Oncol Date: 2015-07 Impact factor: 31.777
Authors: Daniel E Spratt; Tiffany Chan; Levi Waldron; Corey Speers; Felix Y Feng; Olorunseun O Ogunwobi; Joseph R Osborne Journal: JAMA Oncol Date: 2016-08-01 Impact factor: 31.777
Authors: Shai Carmi; Ken Y Hui; Ethan Kochav; Xinmin Liu; James Xue; Fillan Grady; Saurav Guha; Kinnari Upadhyay; Dan Ben-Avraham; Semanti Mukherjee; B Monica Bowen; Tinu Thomas; Joseph Vijai; Marc Cruts; Guy Froyen; Diether Lambrechts; Stéphane Plaisance; Christine Van Broeckhoven; Philip Van Damme; Herwig Van Marck; Nir Barzilai; Ariel Darvasi; Kenneth Offit; Susan Bressman; Laurie J Ozelius; Inga Peter; Judy H Cho; Harry Ostrer; Gil Atzmon; Lorraine N Clark; Todd Lencz; Itsik Pe'er Journal: Nat Commun Date: 2014-09-09 Impact factor: 14.919
Authors: Verena Sailer; Kenneth Wa Eng; Tuo Zhang; Rohan Bareja; David J Pisapia; Alexandros Sigaras; Bhavneet Bhinder; Alessandro Romanel; David Wilkes; Evan Sticca; Joanna Cyrta; Rema Rao; Sheena Sahota; Chantal Pauli; Shaham Beg; Samaneh Motanagh; Myriam Kossai; Jacqueline Fontunge; Loredana Puca; Hanna Rennert; Jenny Zhaoying Xiang; Noah Greco; Rob Kim; Theresa Y MacDonald; Terra McNary; Mirjam Blattner-Johnson; Marc H Schiffman; Bishoy M Faltas; Jeffrey P Greenfield; David Rickman; Eleni Andreopoulou; Kevin Holcomb; Linda T Vahdat; Douglas S Scherr; Koen van Besien; Christopher E Barbieri; Brian D Robinson; Howard Alan Fine; Allyson J Ocean; Ana Molina; Manish A Shah; David M Nanus; Qiulu Pan; Francesca Demichelis; Scott T Tagawa; Wei Song; Juan Miguel Mosquera; Andrea Sboner; Mark A Rubin; Olivier Elemento; Himisha Beltran Journal: JCO Precis Oncol Date: 2019-09-20
Authors: Francesco Orlando; Alessandro Romanel; Blanca Trujillo; Michael Sigouros; Daniel Wetterskog; Orsetta Quaini; Gianmarco Leone; Jenny Z Xiang; Anna Wingate; Scott Tagawa; Anuradha Jayaram; Mark Linch; Mariam Jamal-Hanjani; Charles Swanton; Mark A Rubin; Alexander W Wyatt; Himisha Beltran; Gerhardt Attard; Francesca Demichelis Journal: NAR Cancer Date: 2022-05-27
Authors: Franklin W Huang; Juan Miguel Mosquera; Andrea Garofalo; Coyin Oh; Maria Baco; Ali Amin-Mansour; Bokang Rabasha; Samira Bahl; Stephanie A Mullane; Brian D Robinson; Saud Aldubayan; Francesca Khani; Beerinder Karir; Eejung Kim; Jeremy Chimene-Weiss; Matan Hofree; Alessandro Romanel; Joseph R Osborne; Jong Wook Kim; Gissou Azabdaftari; Anna Woloszynska-Read; Karen Sfanos; Angelo M De Marzo; Francesca Demichelis; Stacey Gabriel; Eliezer M Van Allen; Jill Mesirov; Pablo Tamayo; Mark A Rubin; Isaac J Powell; Levi A Garraway Journal: Cancer Discov Date: 2017-05-17 Impact factor: 39.397
Authors: Haakon B Nygaard; E Zeynep Erson-Omay; Xiujuan Wu; Brianne A Kent; Cecily Q Bernales; Daniel M Evans; Matthew J Farrer; Carles Vilariño-Güell; Stephen M Strittmatter Journal: J Gerontol A Biol Sci Med Sci Date: 2019-08-16 Impact factor: 6.053
Authors: Daniel J Renouf; Steven J M Jones; Kasmintan A Schrader; My Linh Thibodeau; Eric Y Zhao; Caralyn Reisle; Carolyn Ch'ng; Hui-Li Wong; Yaoqing Shen; Martin R Jones; Howard J Lim; Sean Young; Carol Cremin; Erin Pleasance; Wei Zhang; Robert Holt; Peter Eirew; Joanna Karasinska; Steve E Kalloger; Greg Taylor; Elisa Majounie; Melika Bonakdar; Zusheng Zong; Dustin Bleile; Readman Chiu; Inanc Birol; Karen Gelmon; Caroline Lohrisch; Karen L Mungall; Andrew J Mungall; Richard Moore; Yussanne P Ma; Alexandra Fok; Stephen Yip; Aly Karsan; David Huntsman; David F Schaeffer; Janessa Laskin; Marco A Marra Journal: Cold Spring Harb Mol Case Stud Date: 2019-04-01
Authors: Yanran Wang; Maximilian Miller; Yuri Astrakhan; Britt-Sabina Petersen; Stefan Schreiber; Andre Franke; Yana Bromberg Journal: Genome Med Date: 2019-09-30 Impact factor: 11.117
Authors: Jian Carrot-Zhang; Nyasha Chambwe; Jeffrey S Damrauer; Theo A Knijnenburg; A Gordon Robertson; Christina Yau; Wanding Zhou; Ashton C Berger; Kuan-Lin Huang; Justin Y Newberg; R Jay Mashl; Alessandro Romanel; Rosalyn W Sayaman; Francesca Demichelis; Ina Felau; Garrett M Frampton; Seunghun Han; Katherine A Hoadley; Anab Kemal; Peter W Laird; Alexander J Lazar; Xiuning Le; Ninad Oak; Hui Shen; Christopher K Wong; Jean C Zenklusen; Elad Ziv; Andrew D Cherniack; Rameen Beroukhim Journal: Cancer Cell Date: 2020-05-11 Impact factor: 38.585