Peter Sona1, Jong Hui Hong1, Sunho Lee1, Byong Joon Kim1, Woon-Young Hong1, Jongcheol Jung1, Han-Na Kim2, Hyung-Lae Kim2, David Christopher3, Laurent Herviou3, Young Hwan Im3, Kwee-Yum Lee1,4, Tae Soon Kim1,5, Jongsun Jung6.
Abstract
BACKGROUND: The use of whole genome sequences has increased recently with the rapid progress of next-generation sequencing (NGS) technologies. However, storing raw sequence reads for large-scale genome analysis poses hardware challenges. Despite advances in genome analytic platforms, efficient approaches remain relevant, especially for the human genome. In this study, an Integrated Genome Sizing (IGS) approach is adopted to speed up the analysis of multiple whole genomes in a high-performance computing (HPC) environment. The approach splits a genome (GRCh37) into 630 chunks (fragments), so that multiple chunks can be processed in parallel for sequence analyses across cohorts.
Keywords: Genome analysis; Genome sizing; Infrastructure; Sequencing; Statistics; Storage; Whole genome
Year: 2018 PMID: 30509173 PMCID: PMC6276166 DOI: 10.1186/s12859-018-2499-1
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Basic communication and data processing in IGS. A BAM file (a) is used to generate three major databases (b). The data are organically arranged and cloned into four-dimensional (4-D) information (Phenotype variable ID, Marker ID, Sample ID, and Function annotation), as shown in panel (c). For each request, IGS extracts 4-D data. All extracted information forms a sub-clone (d), which is passed to an in-house statistical tool, IGscan (e), for statistical analysis
Fig. 2 Schematic representation of the genome chunking workflow. (a) The step-by-step procedure for creating 630 chunks from a genome. (b) Generation of chunks by setting cut points, and the practical steps involved in creating chunks from a given chromosome length. The entire genome is divided into 5 Mb segments (n = 630) by making virtual cuts. Next, a 2.5 Mb distance is added to both sides of each initial cut point to determine the presence of functional sites and to allow Haploview interval analysis among intergenic regions. (c) Haploview analysis: the three distinct regions, marked 1, 2, and 3, are the new cut points of a chunk, selected by Haploview analysis to identify the relationships among the selected SNPs. We recalculated the position of each cut point to include related biological information and obtain informative chunks (denoted by 20_6_7_hap.LD.PNG, 20_7_8_hap.LD.PNG, and 20_8_9_hap.LD.PNG, respectively). These regions represent precise sequence information ranging from 4 to 6 Mb in length, which could be related to CNVs or genes. (d) Distribution of chunks. The graph illustrates the distribution of functionally related and functionally unrelated chunks, and the classification of chunks by their respective numbers of markers
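The first two steps in panel (b), placing virtual cuts every 5 Mb and opening a 2.5 Mb window on each side of every cut, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy chromosome length are assumptions, and the Haploview-based refinement of cut points in panel (c) is not reproduced.

```python
from typing import List, Tuple

SEGMENT_BP = 5_000_000  # 5 Mb initial segment size (from the caption)
FLANK_BP = 2_500_000    # 2.5 Mb added to both sides of each cut point


def initial_cut_points(chrom_length: int, segment: int = SEGMENT_BP) -> List[int]:
    """Return the interior virtual cut positions, one every `segment` bp."""
    return list(range(segment, chrom_length, segment))


def search_windows(chrom_length: int,
                   segment: int = SEGMENT_BP,
                   flank: int = FLANK_BP) -> List[Tuple[int, int]]:
    """Return the (start, end) window around each cut, clamped to the chromosome."""
    windows = []
    for cut in initial_cut_points(chrom_length, segment):
        start = max(1, cut - flank)
        end = min(chrom_length, cut + flank)
        windows.append((start, end))
    return windows


# Hypothetical 21 Mb chromosome, chosen only to show clamping at the end.
toy_windows = search_windows(21_000_000)
```

In the workflow described by the caption, each such window would then be inspected (for functional sites and linkage structure) to pick the final cut position, which is why the final chunks are not exactly 5 Mb long.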
Example of chunk distribution of chromosome 6 of the reference genome
| Chunk ID | Chromosome | Start (bp) | End (bp) | Related function |
|---|---|---|---|---|
| 6_7_220 | 6 | 29,678,325 | 35,156,630 | HLA region |
| 6_8_221 | 6 | 35,156,631 | 40,140,014 | – |
| 6_9_222 | 6 | 40,140,015 | 46,461,804 | Microvascular_complications_of_diabetes_1 |
| 6_10_223 | 6 | 46,461,805 | 49,686,975 | – |
| 6_11_224 | 6 | 49,686,976 | 55,283,232 | – |
| 6_12_225 | 6 | 55,283,233 | 60,365,606 | – |
| 6_13_226 | 6 | 60,365,607 | 66,419,118 | Epilepsy/Dyslexia/EYES SHUT /DROSOPHIL |
| 6_14_227 | 6 | 66,419,119 | 67,721,229 | – |
| 6_15_228 | 6 | 67,721,230 | 73,114,845 | – |
| 6_16_229 | 6 | 73,114,846 | 77,870,236 | – |
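Because the chunks in the table are contiguous and non-overlapping, a genomic position can be mapped to its chunk with a binary search over the start coordinates. The sketch below uses only the rows listed above; the `find_chunk` and `chunk_length` helpers are hypothetical names introduced for illustration and are not part of IGS.

```python
from bisect import bisect_right
from typing import Optional

# (chunk ID, start, end) for chromosome 6, copied from the table above.
CHUNKS = [
    ("6_7_220", 29_678_325, 35_156_630),
    ("6_8_221", 35_156_631, 40_140_014),
    ("6_9_222", 40_140_015, 46_461_804),
    ("6_10_223", 46_461_805, 49_686_975),
    ("6_11_224", 49_686_976, 55_283_232),
    ("6_12_225", 55_283_233, 60_365_606),
    ("6_13_226", 60_365_607, 66_419_118),
    ("6_14_227", 66_419_119, 67_721_229),
    ("6_15_228", 67_721_230, 73_114_845),
    ("6_16_229", 73_114_846, 77_870_236),
]
STARTS = [start for _, start, _ in CHUNKS]


def find_chunk(pos: int) -> Optional[str]:
    """Return the ID of the chunk containing `pos`, or None if not covered."""
    i = bisect_right(STARTS, pos) - 1
    if i < 0:
        return None
    chunk_id, start, end = CHUNKS[i]
    return chunk_id if start <= pos <= end else None


def chunk_length(chunk_id: str) -> int:
    """Return the chunk length in bp (coordinates are inclusive)."""
    for cid, start, end in CHUNKS:
        if cid == chunk_id:
            return end - start + 1
    raise KeyError(chunk_id)
```

For example, `find_chunk(31_000_000)` returns `"6_7_220"`, the chunk annotated with the HLA region, and positions outside the listed rows return `None`.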
Fig. 3 Three mosaic structures representing the organization of chunks. (a) A data point of allele depth, genotype, or haplotype. Each dot designates a single chunk entity. Repeated addition of chunks yields (b) a matrix of three databases
Performance comparison of Maha-Fs and SGI-UV300
| Method | Process step | Maha-Fs (core:memory 4:12) | Maha-Fs (core:memory 16:64) | SGI-UV300 (core:memory 4:12) | SGI-UV300 (core:memory 16:64) |
|---|---|---|---|---|---|
| mapping | 1. split | 35.1 | 37.7 | 7.1 | 7.1 |
| | 2. sickle | 1.2 | 0.4 | 2.2 | 0.2 |
| | 3. BWA-MEM | 9.4 | 2.5 | 7.6 | 3.1 |
| | 4. Picard-Fix Mate Information | 2.6 | 1.9 | 4.3 | 2.7 |
| recalibration | 5. Picard-Mark Duplicates | 1.5 | 1.1 | 7.3 | 2.7 |
| | 6. GATK-RealignerTarget Creator | 2.9 | 1.8 | 4.2 | 2.7 |
| | 7. GATK-Indel Re-aligner | 1.6 | 1.1 | 2.7 | 1.7 |
| | 8. GATK-Base Re-calibrator | 2.1 | 1.0 | 5.5 | 1.1 |
| | 9. GATK-Print Reads | 2.9 | 2.2 | 7.4 | 3.2 |
| | 10. GATK-Haplotype Caller | 8.0 | 5.6 | 10.8 | 8.1 |
| Time | Total process time (min) | 79.8 | 55.2 | 59.0 | 32.5 |
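The reported totals can be turned into speedup ratios to make the comparison easier to read. The sketch below uses only the "Total process time (min)" row; the dictionary layout and the `speedup` helper are illustrative assumptions, and the ratios are simple quotients of the published totals rather than results reported by the authors.

```python
# Total process time in minutes, keyed by (platform, core:memory setting).
TOTAL_MIN = {
    ("Maha-Fs", "4:12"): 79.8,
    ("Maha-Fs", "16:64"): 55.2,
    ("SGI-UV300", "4:12"): 59.0,
    ("SGI-UV300", "16:64"): 32.5,
}


def speedup(baseline: tuple, candidate: tuple) -> float:
    """Return how many times faster `candidate` is than `baseline`."""
    return TOTAL_MIN[baseline] / TOTAL_MIN[candidate]


# Effect of moving from 4:12 to 16:64 on each platform.
maha_scaling = speedup(("Maha-Fs", "4:12"), ("Maha-Fs", "16:64"))
sgi_scaling = speedup(("SGI-UV300", "4:12"), ("SGI-UV300", "16:64"))

# Platform comparison at the larger 16:64 setting.
sgi_vs_maha = speedup(("Maha-Fs", "16:64"), ("SGI-UV300", "16:64"))

print(f"Maha-Fs 4:12 -> 16:64:   {maha_scaling:.2f}x")
print(f"SGI-UV300 4:12 -> 16:64: {sgi_scaling:.2f}x")
print(f"SGI-UV300 vs Maha-Fs at 16:64: {sgi_vs_maha:.2f}x")
```

On these totals, the larger allocation shortens the run by roughly 1.4 times on Maha-Fs and roughly 1.8 times on SGI-UV300.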
IGscan QC analysis time versus number of chunks used
| Module | No. of samples | No. of chunks | Time (min) |
|---|---|---|---|
| QC | 2504 | 1 | 15 |
| QC | 2504 | 630 | 60 |
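One way to read these two rows is to compare the measured 60 min for all 630 chunks with a hypothetical serial run in which every chunk costs as much as the single measured chunk. That serial baseline is an assumption made here for illustration; the table does not report it, and real chunks differ in marker count.

```python
SINGLE_CHUNK_MIN = 15  # measured: QC on 1 chunk, 2504 samples
ALL_CHUNKS_MIN = 60    # measured: QC on 630 chunks, 2504 samples
N_CHUNKS = 630

# Assumed baseline: 630 chunks processed one after another at 15 min each.
serial_estimate_min = N_CHUNKS * SINGLE_CHUNK_MIN

# Ratio of the assumed serial baseline to the measured time for all chunks.
ratio_vs_serial = serial_estimate_min / ALL_CHUNKS_MIN

print(f"Assumed serial time: {serial_estimate_min} min")
print(f"Measured time for {N_CHUNKS} chunks: {ALL_CHUNKS_MIN} min")
print(f"Ratio: {ratio_vs_serial:.1f}x")
```

Under that assumption the serial run would take 9450 min, about 157.5 times the measured 60 min.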