Tomoya Tanjo, Yosuke Kawai, Katsushi Tokunaga, Osamu Ogasawara, Masao Nagasaki.
Abstract
Studies in human genetics deal with a plethora of human genome sequencing data, both generated from specimens and available from public sources. With the development of various bioinformatics applications, maintaining research productivity, managing human genome data, and analyzing downstream data are essential. This review aims to guide researchers who struggle to process and analyze these large-scale genomic data, so that they can extract relevant information for improved downstream analyses. Here, we discuss worldwide human genome projects whose data could be integrated with any dataset for improved analysis. Human whole-genome sequencing data are costly to both store and process; therefore, we focus on the development of data formats and software that manipulate whole-genome sequencing data. Once the sequencing is complete and the format and data processing tools have been selected, a computational platform is required. For the platform, we describe a multi-cloud strategy that balances cost, performance, and customizability. High-quality published research relies on reproducibility to ensure reliable results, reusability for application to other datasets, and scalability to accommodate future growth in datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines that are indispensable for human genomic data analysis and that differ from those for model organisms. Finally, we summarize our perspective on the ideal future of data processing and analysis.
Year: 2020 PMID: 33097812 PMCID: PMC7728600 DOI: 10.1038/s10038-020-00862-1
Source DB: PubMed Journal: J Hum Genet ISSN: 1434-5161 Impact factor: 3.172
Large-scale cohort studies with genomic information
| Project | Description | Country |
|---|---|---|
| Human Genome Project (HGP) | The initial sequencing program of the human genome | International |
| International HapMap Project | Study of common patterns of human genetic variation using SNP arrays | International |
| 1000 Genomes Project | Determination of human genetic variation by whole-genome sequencing at population scale | International |
| Human Genome Diversity Project | Collection of biological samples and genetic data from different population groups throughout the world | International |
| Simons Genome Diversity Project | Whole-genome sequencing project of diverse human populations | International |
| GenomeAsia 100K | WGS-based genome study of people in South and East Asia | International |
| UK Biobank | Biobank study involving 500,000 residents in the UK | UK |
| Genomics England | WGS-based genome study of patients with rare diseases and their families, and of cancer patients, in England | UK |
| FinnGen | Nationwide biobank and genome cohort study in Finland | Finland |
| Tohoku Medical Megabank Project | Biobank and genome cohort study for a local area (the north-east region) of Japan | Japan |
| Biobank Japan | Nationwide patient-based biobank and genome cohort study in Japan | Japan |
| Trans-Omics for Precision Medicine (TOPMed) | A genomic medicine research project to perform omics analysis of pre-existing cohort samples | USA |
| BioMe Biobank | Electronic health record-linked biobank of patients from the Mount Sinai Healthcare System | USA |
| Michigan Genomics Initiative | Electronic health record-linked biobank of patients from the University of Michigan Health System | USA |
| BioVU | Repository of DNA samples and genetic information at Vanderbilt University Medical Center | USA |
| DiscovEHR | Electronic health record-linked genome study of participants in Geisinger’s MyCode Community Health Initiative | USA |
| eMERGE | Consortium of biorepositories with electronic medical record systems and genomic information | USA |
| Kaiser Permanente Research Bank | Nationwide biobank collecting genetic information from a blood sample, medical record information, and survey data on lifestyle from seven areas of the US | USA |
| Million Veteran Program | Genome cohort study and biobank of participants in the Department of Veterans Affairs (VA) health care system | USA |
| CARTaGENE | Biobank study of 43,000 Québec residents | Canada |
| Lifelines | Multigenerational cohort study that includes over 167,000 participants from the northern population of the Netherlands | Netherlands |
| Taiwan Biobank | Nationwide biobank and genome cohort study of residents in Taiwan | Taiwan |
| China Kadoorie Biobank | Genome cohort study of patients with chronic diseases in China | China |
Fig. 1 A simple hello world example written in three workflow description languages: (a) Nextflow, (b) WDL, and (c) CWL
Fig. 2 Example of a GUI editor for a workflow engine: a snapshot of the Rabix Composer. The flow shows an RNA-Seq pipeline
Fig. 3 Example of a “Dockerfile”, a tool description, and a workflow description for the kallisto workflow in CWL. (a) A “Dockerfile” to build kallisto, a tool for processing RNA-Seq fastq data. (b) A CWL sub-workflow used in (d); it creates the kallisto index file from the target transcript sequences in fasta format. (c) A CWL sub-workflow used in (d); it processes RNA-Seq fastq files to estimate transcript abundances. (d) The main CWL workflow, which runs the sub-workflows in (b) and (c)
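
The figure images are not reproduced in this record. As a rough illustration of the kind of description that the caption of Fig. 1 (c) refers to, a minimal CWL hello world could look like the sketch below. It is not the authors' figure content: the input name `message`, the output name `greeting`, and the file name `hello.txt` are assumptions chosen for illustration.

```yaml
# Minimal CWL tool description (illustrative sketch, not the original Fig. 1c).
cwlVersion: v1.2
class: CommandLineTool

# The command to run; inputs with an inputBinding are appended as arguments.
baseCommand: echo

inputs:
  message:                 # hypothetical input name
    type: string
    default: "Hello, world!"
    inputBinding:
      position: 1

# Capture standard output in a file and expose it as the tool's output.
stdout: hello.txt          # hypothetical file name
outputs:
  greeting:                # hypothetical output name
    type: stdout
```

Assuming the description is saved as `hello.cwl` and the reference runner `cwltool` is installed, it could be run with `cwltool hello.cwl --message "Hello, CWL!"`. Because the command, its inputs, and its outputs are all declared explicitly, the same file can be executed unchanged by any CWL-compliant workflow engine, which is the reproducibility and reusability property the abstract emphasizes.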