Yang Liu, Saad M Khan, Juexin Wang, Mats Rynge, Yuanxun Zhang, Shuai Zeng, Shiyuan Chen, Joao V Maldonado Dos Santos, Babu Valliyodan, Prasad P Calyam, Nirav Merchant, Henry T Nguyen, Dong Xu, Trupti Joshi.
Abstract
BACKGROUND: With the advances in next-generation sequencing (NGS) technology and significant reductions in sequencing costs, it is now possible to sequence large collections of germplasm in crops for detecting genome-scale genetic variations and to apply the knowledge towards improvements in traits. To efficiently facilitate large-scale NGS resequencing data analysis of genomic variations, we have developed "PGen", an integrated and optimized workflow using the Extreme Science and Engineering Discovery Environment (XSEDE) high-performance computing (HPC) virtual system, iPlant cloud data storage resources and Pegasus workflow management system (Pegasus-WMS). The workflow allows users to identify single nucleotide polymorphisms (SNPs) and insertion-deletions (indels), perform SNP annotations and conduct copy number variation analyses on multiple resequencing datasets in a user-friendly and seamless way.
Year: 2016 PMID: 27766951 PMCID: PMC5074001 DOI: 10.1186/s12859-016-1227-y
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Details of soybean NGS resequencing datasets generated
| Datasets | Number of sequenced lines | Coverage | Raw Data | # of reads (Millions) | Data Source |
|---|---|---|---|---|---|
| MSMC | 106 | 14.7 | 1.9 | 196.4 | Valliyodan et al. 2016 |
| USB Phase I | 300 | 17.6 | 3.63 | 194.40 | Unpublished |
| USB Phase II | 50 | 44.2 | 1.97 | 486.24 | Unpublished |
| Soja lines | 45 | 16.7 | 0.55 | 182.71 | Unpublished |
| Brazilian lines | 28 | 14.8 | 0.34 | 184.87 | Maldonado et al. 2016 |
Fig. 1 Flowchart of steps and tools utilized in PGen workflow and downstream analysis
The PGen workflow consists of several individual tasks with diverse core and memory requirements, which were assigned after testing according to each tool's support for multiple threads and its memory cost
| Tasks | Base code | Cores (Threads) | Memory (GB) |
|---|---|---|---|
| Indexing of reference genome | BWA/samtools/picard tools | 1 | 4 |
| Alignment to reference genome | BWA | 1 | 21 |
| Sorting sam files | Picard tools | 1 | 21 |
| Removal of PCR duplicates | Picard tools | 1 | 21 |
| Add or replace read groups | Picard tools | 1 | 21 |
| Create realign target | GATK_RealignerTargetCreator | 15 | 20 |
| Realign indels | GATK_IndelRealigner | 1 | 10 |
| Calling variants | GATK_HaplotypeCaller | 1 | 3 |
| Select SNPs and indels | GATK_SelectVariants | 14 | 10 |
| Filtering variants | GATK_VariantFiltration | 14 | 10 |
| Create genotype GVCF | GATK_GenotypeGVCFs | 1 | 10 |
| Merge GVCFs | GATK_CombineGVCFs | 1 | 20 |
| Combine variants | GATK_CombineVariants | 1 | 10 |
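The per-task requirements in the table above can be captured in a small data structure, which makes it easy to see which steps are multithreaded and what the largest memory request is. The following is a minimal sketch only: the `Task` class and the helper functions are hypothetical names introduced for illustration and are not part of PGen or Pegasus-WMS; the values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One row of the task table (hypothetical container, not a PGen type)."""
    name: str
    base_code: str
    cores: int
    memory_gb: int


# Values copied from the task table, in table order.
TASKS = [
    Task("Indexing of reference genome", "BWA/samtools/picard tools", 1, 4),
    Task("Alignment to reference genome", "BWA", 1, 21),
    Task("Sorting sam files", "Picard tools", 1, 21),
    Task("Removal of PCR duplicates", "Picard tools", 1, 21),
    Task("Add or replace read groups", "Picard tools", 1, 21),
    Task("Create realign target", "GATK_RealignerTargetCreator", 15, 20),
    Task("Realign indels", "GATK_IndelRealigner", 1, 10),
    Task("Calling variants", "GATK_HaplotypeCaller", 1, 3),
    Task("Select SNPs and indels", "GATK_SelectVariants", 14, 10),
    Task("Filtering variants", "GATK_VariantFiltration", 14, 10),
    Task("Create genotype GVCF", "GATK_GenotypeGVCFs", 1, 10),
    Task("Merge GVCFs", "GATK_CombineGVCFs", 1, 20),
    Task("Combine variants", "GATK_CombineVariants", 1, 10),
]


def multithreaded(tasks):
    """Return the tasks that request more than one core."""
    return [t for t in tasks if t.cores > 1]


def peak_memory_gb(tasks):
    """Return the largest single-task memory request in GB."""
    return max(t.memory_gb for t in tasks)


if __name__ == "__main__":
    for t in multithreaded(TASKS):
        print(f"{t.name}: {t.cores} cores, {t.memory_gb} GB")
    print(f"Peak memory request: {peak_memory_gb(TASKS)} GB")
```

Under these assumptions, a compute node would need to satisfy the largest core request and the largest memory request listed in the table for any single task to be schedulable.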
Fig. 2 Message passing interface (MPI) jobs within PGen workflow using five soybean line examples
Fig. 3 Interfaces of PGen workflow submission via SoyKB website: a PGen introduction and structure webpage, b upload page for input files to iPlant folder for computation, c create workflow page with inputs of raw data, reference genome, filtering options and selected computing resource, d workflow monitoring with submission history and debug information, and e workflow result page for downloading outputs
Summary of results for NGS resequencing datasets analyzed with the PGen workflow
| Datasets | # of sequenced lines | # of SNPs | # of Indels | # of Non synonymous SNPs | # of CNVs |
|---|---|---|---|---|---|
| MSMC | 106 | 10,218,141 | 10,218,141 | 297,245 | 3,330 |
| USB Phase I | 300 | 11,972,497 | 1,590,729 | 221,013 | 7,444 |
| USB Phase II | 50 | 7,865,994 | 1,213,795 | 152,171 | 6,892 |
| Soja lines | 45 | 18,066,361 | 2,198,125 | 356,129 | 6,022 |
| Brazilian lines | 28 | 5,835,185 | 1,329,844 | 541,762 | 3,880 |
Comparison of running time of PGen workflow of one sample using different computing resources
| Resources | Job-runtime (sec) | Invocation-runtime (sec) | Cumulative job wall time | Host |
|---|---|---|---|---|
| ISI | 42374.0 | 41461.091 | 8 h, 29 mins | workflow.isi.edu |
| TACC-Stampede | 14054.0 | 31173.54 | 9 h, 11 mins | stampede.tacc.utexas.edu |
| TACC-Wrangler | 27146.0 | 27670.924 | 3 h, 25 mins | wrangler.tacc.utexas.edu |
Fig. 4 NGS Resequencing Data Browser tool in SoyKB: a SoyKB NGS browser introduction and data resource webpage, b summary page of sample name, plant introduction (PI) number and raw data information, c data quality summary page with downloadable FastQC HTML report, d SNP page of variants information with selected gene region for each soybean line, e SNPEFF page showing annotated variant effects for selected gene region or gene IDs, and f CNV page with gain and loss CNV region information per chromosome