Zeeshan Ahmed, Eduard Gibert Renart, Deepshikha Mishra, Saman Zeeshan.
Abstract
Whole genome and exome sequencing (WGS/WES) are the most popular next-generation sequencing (NGS) methodologies and are now often used to detect rare and common genetic variants of clinical significance. We emphasize that automated sequence data processing, management, and visualization should be an indispensable component of modern WGS and WES data analysis for sequence assembly, variant detection (SNPs, SVs), imputation, and resolution of haplotypes. In this manuscript, we present a newly developed findable, accessible, interoperable, and reusable (FAIR) bioinformatics-genomics pipeline, the Java-based Whole Genome/Exome Sequence Data Processing Pipeline (JWES), for efficient variant discovery and interpretation, and big data modeling and visualization. JWES is a cross-platform, user-friendly, product-line application comprising three modules: (a) data processing, (b) storage, and (c) visualization. The data processing module performs the series of tasks required for variant calling, the data storage module efficiently manages high-volume gene-variant data, and the data visualization module supports variant data interpretation with Circos graphs. The performance of JWES was tested and validated in-house with different experiments, using Microsoft Windows, macOS Big Sur, and UNIX operating systems. JWES is an open-source and freely available pipeline, allowing scientists to take full advantage of all the computing resources available without requiring much computer science knowledge. We have successfully applied JWES for processing, management, and gene-variant discovery, annotation, prediction, and genotyping of WGS and WES data to analyze variable complex disorders. In summary, we report the performance of JWES with some reproducible case studies, using open-access and in-house generated, high-quality datasets.
Keywords: bioinformatics application; database; gene; variants; whole exome; whole genome
Year: 2021 PMID: 34370400 PMCID: PMC8409305 DOI: 10.1002/2211-5463.13261
Source DB: PubMed Journal: FEBS Open Bio ISSN: 2211-5463 Impact factor: 2.693
Fig. 1. JWES pipeline for whole genome and exome data processing, modeling, and downstream analysis. The figure shows all the data processing and analysis steps: input, QC, trimming, alignment, sort, mark duplicates, insert size, sort and index, create realignment targets, realign indels, Base Quality Score Recalibration (BQSR), analyze covariates, apply BQSR, recalibrate, extract filtered, compute coverage, and annotate and predict.
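The step order in the Fig. 1 caption can be written down as an ordered enumeration. The following is only a minimal sketch under stated assumptions: the stage names are copied from the caption, while the class `JwesStageSketch`, the enum `Stage`, and the helper `remainingAfter` are hypothetical names introduced for illustration and are not taken from the JWES source code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: stage names follow the Fig. 1 caption; the class,
// enum, and helper below are illustrative and not part of JWES itself.
class JwesStageSketch {

    // Stages in the order listed in the figure caption.
    enum Stage {
        INPUT, QC, TRIMMING, ALIGNMENT, SORT, MARK_DUPLICATES, INSERT_SIZE,
        SORT_AND_INDEX, CREATE_REALIGNMENT_TARGETS, REALIGN_INDELS, BQSR,
        ANALYZE_COVARIATES, APPLY_BQSR, RECALIBRATE, EXTRACT_FILTERED,
        COMPUTE_COVERAGE, ANNOTATE_AND_PREDICT
    }

    // Returns the stages that still have to run after `last` has completed,
    // preserving the caption order.
    static List<Stage> remainingAfter(Stage last) {
        List<Stage> rest = new ArrayList<>();
        for (Stage s : Stage.values()) {
            if (s.ordinal() > last.ordinal()) {
                rest.add(s);
            }
        }
        return rest;
    }

    public static void main(String[] args) {
        // Print the numbered stage list.
        for (Stage s : Stage.values()) {
            System.out.println((s.ordinal() + 1) + ". " + s);
        }
        // Example: what is left once duplicates have been marked.
        System.out.println(remainingAfter(Stage.MARK_DUPLICATES));
    }
}
```

Encoding the order once makes it straightforward to resume an interrupted run from the last completed stage, although whether JWES itself supports resumption is not stated in the source.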
Fig. 2. JWES pipeline data and workflow. The figure shows the overall roadmap of JWES, which includes input preparation, automatic script generation, output file management, and variant data storage in the database.
Fig. 3. JWES database design. The figure shows the entity-relationship diagram (ERD) of the JWES database, which includes three tables: WES Info, WES Samples, and WES Variant.
Table 1. Publicly available NGS datasets and the total number of variants extracted from each using JWES. The table gives an overview of the whole genome/exome sequencing projects selected for variant discovery with JWES: data type, project ID, number of samples, sample IDs, total variant count, source URL, and date last accessed.
| Data type | Project ID | Number of samples | Sample IDs | Total variants | Source URL | Date accessed |
|---|---|---|---|---|---|---|
| WGS | PRJNA657985 | 1 | SRR12474733 | 43 685 | | 06-28-2021 |
| WGS | PRJNA657938 | 1 | SRR12486921 | 2 736 453 | | 06-28-2021 |
| WGS | PRJNA624223 | 1 | SRR12328890 | 1 793 959 | | 06-28-2021 |
| WGS | PRJEB39632 | 3 | ERR4387385, ERR4387386, ERR4387388 | 154 016 | | 06-28-2021 |
| WGS | PRJNA649101 | 7 | SRR12336742, SRR12336753, SRR12336755, SRR12336756, SRR12336761, SRR12336765, SRR12336766 | 54 930 | | 06-28-2021 |
| WGS/ATAC-seq | PRJNA207663 | 1 | SRR891275 | 20 749 | | 02-28-2021 |
Table 2. JWES performance evaluation on high-quality, in-house generated WGS datasets. The table reports the processing time (hours) the JWES pipeline took to complete variant calling. Performance depends on several factors, including the raw sample size, VCF file size, memory, number of nodes, and designated CPUs per task.
| Sample ID | Total variants | Raw data size | VCF file size (SNP / Indel) | Time (h) | Number of nodes | CPUs per task | Memory |
|---|---|---|---|---|---|---|---|
| 1 | 4 867 674 | 1.2 TB | 2.6 GB / 654 MB | 65 | 1 | 8 | 46 GB |
| 2 | 4 928 789 | 1.5 TB | 2.6 GB / 678 MB | 74 | 1 | 8 | 46 GB |
| 3 | 5 808 057 | 1.7 TB | 3.1 GB / 812 MB | 77 | 1 | 8 | 46 GB |
| 4 | 4 897 749 | 1.3 TB | 2.6 GB / 657 MB | 61 | 1 | 8 | 46 GB |
| 5 | 4 883 410 | 1.4 TB | 2.6 GB / 671 MB | 70 | 1 | 8 | 46 GB |
| 6 | 4 983 681 | 1.6 TB | 2.6 GB / 698 MB | 83 | 1 | 8 | 46 GB |
| 7 | 5 000 735 | 1.5 TB | 2.6 GB / 698 MB | 88 | 1 | 8 | 46 GB |
| 8 | 5 902 241 | 1.9 TB | 3.1 GB / 837 MB | 95 | 1 | 8 | 46 GB |
| 9 | 4 870 099 | 1.1 TB | 2.6 GB / 654 MB | 57 | 1 | 8 | 46 GB |
| 10 | 4 925 968 | 1.3 TB | 2.6 GB / 675 MB | 67 | 1 | 8 | 46 GB |
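To relate processing time to input size, the raw data sizes and run times in Table 2 can be combined into an hours-per-terabyte figure. The sketch below is illustrative only: the two arrays copy the raw data size and time columns of Table 2, while the class `JwesThroughputSketch` and its methods are hypothetical names and not part of JWES. Across the ten samples the totals are 737 h for 14.5 TB, or roughly 51 h per TB on one node with 8 CPUs per task and 46 GB of memory.

```java
// Hypothetical sketch: the arrays copy the raw data size and "Time (h)"
// columns of Table 2; class and method names are illustrative only.
class JwesThroughputSketch {

    // Raw data size per sample, in terabytes (samples 1 to 10).
    static final double[] RAW_TB = {1.2, 1.5, 1.7, 1.3, 1.4, 1.6, 1.5, 1.9, 1.1, 1.3};

    // Processing time per sample, in hours (samples 1 to 10).
    static final int[] HOURS = {65, 74, 77, 61, 70, 83, 88, 95, 57, 67};

    static int totalHours() {
        int sum = 0;
        for (int h : HOURS) {
            sum += h;
        }
        return sum;
    }

    static double totalTerabytes() {
        double sum = 0.0;
        for (double tb : RAW_TB) {
            sum += tb;
        }
        return sum;
    }

    // Aggregate rate: total hours divided by total raw data volume.
    static double hoursPerTerabyte() {
        return totalHours() / totalTerabytes();
    }

    public static void main(String[] args) {
        // Per-sample rate, then the aggregate over all ten samples.
        for (int i = 0; i < HOURS.length; i++) {
            System.out.printf("Sample %d: %.1f h/TB%n", i + 1, HOURS[i] / RAW_TB[i]);
        }
        System.out.printf("Aggregate: %.1f h/TB%n", hoursPerTerabyte());
    }
}
```

The per-sample rates differ from the aggregate, so input size alone does not fully determine run time in these measurements.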
Fig. 4. JWES visualization. The figure presents a Circos graph plotting all the variants for all chromosomes. The internal histogram represents the total number of variants found in the protein-coding genes, and the external histogram represents variants found in the noncoding genes.
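The two histogram tracks described in the Fig. 4 caption amount to per-chromosome variant counts split by gene class. The following is a minimal sketch of that counting step, not the JWES visualization module: the `Variant` type, its fields, and the `track` method are hypothetical names chosen for illustration, and the rendering of the Circos plot itself is out of scope.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: Variant and its fields are illustrative names,
// not the JWES data model or its database schema.
class CircosTrackSketch {

    static final class Variant {
        final String chromosome;
        final boolean inProteinCodingGene;

        Variant(String chromosome, boolean inProteinCodingGene) {
            this.chromosome = chromosome;
            this.inProteinCodingGene = inProteinCodingGene;
        }
    }

    // Counts variants per chromosome for one histogram track:
    //   coding == true  -> internal track (protein-coding genes)
    //   coding == false -> external track (noncoding genes)
    static Map<String, Integer> track(List<Variant> variants, boolean coding) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Variant v : variants) {
            if (v.inProteinCodingGene == coding) {
                counts.merge(v.chromosome, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Calling `track(variants, true)` and `track(variants, false)` on the same list yields the values for the internal and external histograms respectively.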