Javier Navarro1, Bruno Nevado2, Porfidio Hernández1, Gonzalo Vera3, Sebastián E Ramos-Onsins3. 1. Computer Architecture and Operating Systems Department, Universitat Autònoma de Barcelona, Barcelona, Spain. 2. Department of Plant Sciences, University of Oxford, Oxford, UK. 3. Centre for Research in Agricultural Genomics (CRAG), CSIC-IRTA-UAB-UB, Barcelona, Spain.
Abstract
The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately the frequency of the variants. Instead, algorithms designed to account for the frequency of SNPs give precise results for estimating the levels and the patterns of variability. These algorithms are focused on the unbiased estimation of the variability and not on the high recovery of SNPs. Here, we implemented a fast and optimized parallel algorithm that includes the method developed by Roesti et al and Lynch, which estimates the genotype of each individual at each site, considering the possibility to call both bases from the genotype, a single one or none. This algorithm does not consider the reference and therefore is independent of biases related to the reference nucleotide specified. The pipeline starts from a BAM file converted to pileup or mpileup format and the software outputs a FASTA file. The new program not only reduces the running times but also, given the improved use of resources, it allows its usage with smaller computers and large parallel computers, expanding its benefits to a wider range of researchers. The output file can be analyzed using software for population genetics analysis, such as the R library PopGenome, the software VariScan, and the program mstatspop for analysis considering positions with missing data.
The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately the frequency of the variants. Instead, algorithms designed to account for the frequency of SNPs give precise results for estimating the levels and the patterns of variability. These algorithms are focused on the unbiased estimation of the variability and not on the high recovery of SNPs. Here, we implemented a fast and optimized parallel algorithm that includes the method developed by Roesti et al and Lynch, which estimates the genotype of each individual at each site, considering the possibility to call both bases from the genotype, a single one or none. This algorithm does not consider the reference and therefore is independent of biases related to the reference nucleotide specified. The pipeline starts from a BAM file converted to pileup or mpileup format and the software outputs a FASTA file. The new program not only reduces the running times but also, given the improved use of resources, it allows its usage with smaller computers and large parallel computers, expanding its benefits to a wider range of researchers. The output file can be analyzed using software for population genetics analysis, such as the R library PopGenome, the software VariScan, and the program mstatspop for analysis considering positions with missing data.
Entities:
Keywords:
HPC; MPI; Population genomics; SNP caller; next-generation sequencing; parallelization
The analysis of genome variability using next-generation sequencing (NGS) data has been successfully performed in many species, mainly in model species where large read depth was usually achieved and the reference sequence for that species was available.[1-6]However, the study of nonmodel organisms becomes a challenge when the reference sequence is not available and exists the reference of a related species. Moreover, research projects trying to engage population genomics analysis of nonmodel species with limited budgets are forced to choose between few individuals at relative high read depth or, alternatively, to more individuals but moderate to low read depth.Popular NGS pipelines for calling single-nucleotide polymorphisms (SNPs) are not always adequate for population genomics analyses and should be used with caution. Previous publications[7-11] show that important errors are introduced during the estimation of the levels and the patterns of variability under common experimental designs,[12] moderate or low read depth and/or nonreference species.Bioinformatic tools for the analysis of variants from NGS data have been mainly designed for detecting SNPs but not for the analysis of variability. Accurate estimation of the levels of variability has fundamental importance for the detection of selective events and for understanding the history of the populations. Most popular SNP callers[6,13] are mainly focused on the high recovery detection of SNPs while keeping lower false discovery rates. Li[6] developed a statistical framework for SNP calling, which accurately assigns the frequency spectrum of the alleles using a likelihood ratio test. Although this algorithm is effectively efficient for large sample data sets and even for low read depths, the frequency of the variants deviates from the expected value for small data sets, especially when divergent reference sequences are used.[10] Furthermore, the number of missing positions (positions without information about the nucleotide are present there) in regions with no variants has not been specifically taken into account in most variability analyses. In summary, estimates of variability can be very far from the real one because the inaccurate frequency of the variants and the unclear information about the effective positions to be considered for the analysis.In this work, we present a new version of the software GH caller. GH caller reads pileup and mpileup format files and outputs a FASTA format file ready to be analyzed by different software tools for the study of population genomics, such as PopGenome,[14] VariScan,[15] and mstatspo.[10] It implements the Roesti et al[16] and Lynch[17] algorithm to accurately account the number of effective positions and the frequency of the variants, making this tool highly suitable for population genomics studies, especially at low read depths where it outperforms alternative SNP callers such as SAMtools or GATK (see the work by Nevado et al[10]). This new version of GH caller has been mainly developed to obtain results in a reasonable time, which in the field of NGS analysis is critical. In this sense, GH caller has been optimized for faster processing in 2 ways: (1) modifying the code algorithm and (2) using several high-performance computing (HPC) strategies.
Methods
Features of GH caller
GH caller is a tool for genome variability analysis that reads NGS mapped reads from diploid individual(s) in a pileup or mpileup file, performs SNP calling as developed by Roesti et al,[16] and outputs the results in a multi-FASTA file. For each individual and site, 2 bases are represented, including missing values (N) if any, as they contain the diploid information available for each homologous position. GH caller counts the number of chromosomes that are effectively read per site making the frequency estimation unbiased in relation to the sample, which is useful for calculating neutrality tests and nucleotide variation.[10]
Input and output data
GH caller reads mpileup or pileup formats obtained from BAM standard NGS files using SAMtools.[18] The first 3 columns represent the chromosome or scaffold name, the position number, and the reference sequence, respectively. The rest of columns are grouped in sets of 3 elements, representing the data of a diploid individual with the number of reads, the bases seen, and the base quality. The pileup format contains information for a single individual, whereas mpileup can contain data from multiple individuals. The name of each individual can be included using the flag –names to help the identification of the sequences. Also, the optional out-group sequence can be appended in the same FASTA file using the flags (–outgroup; see Supplementary Table 1 for more options). Note that each pileup file and each FASTA reference file must contain a single chromosome. Also, mpileup data should be filtered for mapping quality, to avoid reads with very low mapping quality introducing errors.The output of GH caller is a multi-FASTA format file with 2 sequences per individual and a length equal to the same length as the input scaffold/chromosome. Each 2 consecutive sequences represent the unphased genotype of each individual.
Algorithm
Frequency caller algorithm for diploid samples
GH caller implements the Roesti algorithm,[16] which is based on the algorithms proposed by Lynch[17] and Hohenlohe et al.[19] In this implementation, a preliminary filtering is convenient for erasing the possible noise produced by very-low-quality sequencing data but also for regions where the numbers of reads were disproportionally high or too low to be used for the variability estimation. In Supplementary Table 1, the filtering options available are described. The American Standard Code for Information Interchange (ASCII) code for each used sequencing platform (flag –platform) can also be included to calculate the quality of the sequences appropriately.The Roesti algorithm implemented in GH caller calculates the genotype of each site and individual with a multinomial (see equations (1) to (3) below), assuming a fixed raw sequence error (flag –error). The 2 most likely genotypes are compared with a likelihood ratio test (LRT) against a χ2 distribution (flag –chivalue) as indicated in the algorithm proposed by Hohenlohe et al. In the case LRT is not significant or the number of reads is below a threshold (eg, flag –minreads = 6), GH caller calculates the likelihood of a single nucleotide (so defines 1 of the 2 nucleotides to estimate as N). Likelihood ratio test is performed for the 2 most likely of the 4 remaining possibilities (A/N, C/N, T/N, and G/N). If significant, it keeps the one with the highest probability; otherwise, it assigns N/N. In a more formal notation, the detailed algorithm explained above is described as follows.Define as the number of reads that pass the base quality threshold per site and per individual. Define minreads as the minimum number of reads to attempt a genotype call.If do (A), otherwise do (B).(A)—Define all 10 possible genotypes for 4 different nucleotides, ie, AA, AC, CC, AG, CG, GG, AT, CT, GT, and TT. As indicated in equations (1a) and (1b) by Hohenlohe et al,[19] for a given nucleotide, namely, and a number of reads for each nucleotide , , , and , define as a given raw sequence error rate, then the likelihood to have a heterozygous genotype (1, 2) is as follows:and the likelihood to be homozygous is as follows:Calculate all 10 combinations and choose the 2 genotypes with the highest likelihood, namely, and .If , keep the highest likelihood genotype , where is the threshold value, given by the user for a 1-degree-of-freedom test. Otherwise, do step B.(B)—Call only one of the nucleotides of the genotype, leaving the remaining a missing site (N). The likelihood is as follows:Calculate all 4 combinations (ie, N and each of the 4 nucleotides) and choose the 2 haplotypes with the highest likelihood, namely, and .If , keep the highest likelihood haplotype (, ie, (1, N)), where is the threshold value given by the user for a 1-degree-of-freedom test. Otherwise, keep the missing genotype (N, N).
Parallel algorithm
The parallel implementation is based on a previous serial version of GH caller.[10] The new approach takes advantage of the inherent data parallelism existing in the algorithm, where all the steps can be applied for each position. Thanks to this property, we can split input data in groups of positions and distribute them among the available processing units, where all computations will be performed concurrently. The main phases of the parallel program are as follows:All the involved processes initialize the parallel environment.The main process (master) indexes the input file, generates a list of chunks of data to be assigned, and maps the corresponding piece of input data to the other processes (workers).The parallel computations start asynchronously on each worker. Locally, each process reads and parses a subset of data, applies the base calling algorithm to these data, and finally writes its partial results to the final FASTA output file.If it was selected as a command line option, a synchronization point (barrier) is used to allow the master process to append a reference/out-group sequence to the FASTA output file.Phases 1 and 3 are independent and only need a synchronization signal to begin, hence are performed in parallel, whereas phases 2 and 4 are serial and must be performed in order in a single master process. This sequence of steps is shown in Figure 1 as a flowchart of the parallel algorithm.
Figure 1.
Flowchart of the master/worker algorithm, a parallel implementation of the Roesti algorithm.
Flowchart of the master/worker algorithm, a parallel implementation of the Roesti algorithm.Our application uses the standard Message Passing Interface (MPI)[20] and distributes the data among the available processes by mapping subsets of data with a default size of 100 MB (hereinafter “chunks”), giving 1 chunk to each worker process at a time. All computational tasks involving these data subsets can be performed simultaneously and individually on each worker process. Each chunk does not need to be replicated in the local memory of other processes, and memory used is released when data are written to disk: this allows GH caller to process data sets without any size limits and use all the aggregated memory available of all the allocated processors, at the same time.
Accuracy of results and usage of computational resources
We had the following goals in our experiments: (1) accuracy of results and (2) usage of computational resources. First, we contrasted the serial and the parallel versions to validate all the changes introduced to the initial algorithm and thus to be sure both applications produce exactly the same results. Second, to analyze the computer performance of this new improved algorithm, we considered NGS data in several scenarios with a single population. The NGS sequences from the western lowland gorilla Gorilla gorilla[21] were used for this purpose. As in the study by Nevado et al,[10] already mapped files to gorilla genomes constructed in this previous work were subsampled using picard-tools DownsampleSam.[22] To obtain all BAM files used in this work as input data, we kept the total sequencing costs fixed but sampling a varying number of individuals (diploid) with different read depth levels. As a result, each individual has a BAM file with a read depth level in a range of 2 to 20 while keeping total sequencing costs fixed at 40× (see Table 1 for more details of the different variables and the ranges considered). We estimated population genome-wide statistics from this data set to compare the overall accuracy of our application versus the new comparable algorithm provided by SAMtools.[6,23] In addition, we have compared the results of GH caller and SAMtools in a set of real data sequences[24]. Here, we have analyzed 6 different lines (VED, IRK, SC, CAL, TRI, and CV) of Cucumis melo and 1 out-group (cucumber, ie, Cucumis sativus). From these lines, we have decided to use the chromosome 1 data set, as is large enough to perform a comparative analysis. We have used 2 different pipelines, both starting from individual pileup files for each variety. First, we used the SAMtools pipeline (as is detailed in the original manuscript of Sanseverino et al[24]) to obtain all estimated SNPs. Following the original manuscript, the SNP file containing all the information was converted into a single multi-FASTA file. Second, we used GH caller to estimate the number and the frequency of the SNPs from pileup files. GH caller outputs FASTA files that were were concatenated into a single one. After that, the 2 multi-FASTA files (1 from SNP calling with SAMtools and 1 from SNP calling of GH caller) were analyzed using mstatspop (available from NumGenomics Web site, http://bioinformatics.cragenomica.es/numgenomics/people/sebas/software/software.html) to obtain the levels and patterns of variability. Finally, we have quantified the benefits from parallelization and optimization done in this application, and we have evaluated the overall effectiveness of the application, particularly, assessing how it can scale across different computing platforms.
Table 1.
Different variables and ranges considered in this study.
Variable
Cases
Data
Notes
Experimental design
2 individuals, 20× read depth
2I20X
Total sequencing cost are fixed at 40×
5 individuals, 8× read depth
5I8X
10 individuals, 4× read depth
10I4X
20 individuals, 2× read depth
20I2X
Bioinformatic tool
SAMtools
This genotype calling approach uses a new algorithm that can properly handle multiallelic SNPs
GH caller
Genotype-haplotype calling approach using our parallel version of GH caller
Different variables and ranges considered in this study.
SAMtools pipeline
We used the well-known package SAMtools to compare results against our GH caller pipeline. In our pipeline, each individual’s data are analyzed separately. We used version 1.2.0 that includes an algorithm to properly handle multiallelic SNPs.[6,23]
Figure 2A shows the main steps of the pipeline.
Figure 2.
SNP calling and filtering pipelines used in our experiments. (A) In case of SAMtools’ pipeline, all steps are repeated for each individual. One extra step is needed to convert VCF output obtained in step 3 to a FASTA format input file required by mstatspop. (B) In case of GH caller, the output file obtained in step 2 is already produced in the required format.
SNP calling and filtering pipelines used in our experiments. (A) In case of SAMtools’ pipeline, all steps are repeated for each individual. One extra step is needed to convert VCF output obtained in step 3 to a FASTA format input file required by mstatspop. (B) In case of GH caller, the output file obtained in step 2 is already produced in the required format.We started with the SAMtools’ command, SAMtools mpileup (step 1), computing genotype likelihoods (-g), reading a maximum of 1000 reads per-position (-d 1000), and using both minimum base quality threshold and minimum mapping quality of 20, and disabling the probabilistic realignment for the computation of base alignment quality (BAQ), which helps to reduce false SNPs’ discoveries caused by misalignments. We instruct the program to generate an uncompressed output (-u), choosing a list of tags as output format (-t): number of high-quality bases (DP), per-sample Phred-scaled strand bias P value (SP), number of high-quality nonreference bases (SP), and number of high-quality ref-forward, ref-reverse, alt-forward, and alt-reverse bases (DP4). All other options were kept with their default values.For analysis with SAMtools, we used the BCFtools call command (step 2a) with the new multiallelic and rare-variant calling algorithm (-m) (http://samtools.github.io/bcftools/call-m.pdf) and aggregating to the output also gVCF blocks of homozygous reference calls, where the minimum depth (-g) should be set to half the depth in each scenario. Variant calls were filtered with bcftools filter command of SAMtools package (step 2b), with the following options (-saFilter -g3 -G10) and a exclude expression (-e) which marks as low-quality SNPs those with quality below 10 (%QUAL < 10) or 15 if less than 2 reads support alternative base (AC < 2 && %QUAL < 15), and those with a minimum depth below threshold, which should be half the average depth for each scenario (FMT/DP < “$MINDEP”).The new filtered VCF format file was transformed to FASTA format using vcf2fasta (available from GitHub, https://github.com/brunonevado/vcf2fas). Finally, FASTA output files were analyzed with mstatspop.
GH caller pipeline
Figure 2B shows the main steps of this pipeline. We started the pipeline with a SAMtools’
mpileup command (step 1), using a minimum base quality threshold and a minimum mapping quality, both with a value of 20 and disabling the probabilistic realignment for the computation of BAQ. Other tool settings are kept with their default values. To start applying a genotype-haplotype per base calling, we run GH caller (step 2) with a set of options similar to the previous work: we used a threshold of 6 reads for genotype calls (–minreads = 6) and we only considered bases with a base quality above 20 (–baseq = 20). Moreover, for each position and individual, we discarded sites having above double and below half the average read depth. That is, for data with an average read depth of 20× we would discard positions above 40 and below 10 reads (–maxdep = 40, –mindep = 10). We set the raw sequencing error used to calculate the genotype/haplotype likelihoods to 0.01 (–error = 0.01) (see the algorithm above and the definition of the error parameter in the equations). Finally, as parameter to apply the LRT, we set the χ2 distribution to its default value (see options in Supplementary Table 1).
Estimating population genetics statistics
The different statistics of variability allow us to detect differences in the patterns of expected variability considering different aspects of the site frequency spectrum, ie, low, intermediate, or high SNP frequencies, and detect possible biases. The variability estimators of Watterson (θ),[25] Tajima[26] (π), Achaz[27] (θ), Fay and Wu[28] (θ), and Fu and Li[29] (θ), were calculated over windows of 100 kB for the entire chromosome. Also, the neutrality tests of Tajima[30] (TD), Fu and Li[29] (D FLD), and Fay and Wu[28] (H) were considered. Population statistics were calculated using the mstatspop tool from NumGenomics Web site in both pipelines (Figure 2, step 3).
Computing performance
We used a hetereogeneous cluster, which consists of 2 groups of nodes with different hardware configurations. As first hardware configuration (scenario A), we chose 1 server with 1 processor Intel Xeon E31240 with frequency 3.30 GHz providing 4 cores with 2 threads per core, 32 GB of RAM and running the operating system Scientific Linux 6.3. In this scenario, our application was evaluated using a local file system (XFS) using 1 internal SATA hard drive of 3 TB. This scenario is equivalent in terms of computational power to current workstation computers. The second configuration (scenario B) is composed by 6 servers or nodes. Each node has 2 processors Intel Xeon X5660 processors with frequency 2.8 GHz providing 6 cores per processor with 2 threads per core, 96 GB of RAM, and running the same Linux distribution than scenario A. All these 6 nodes were connected via network links of 10 GB Ethernet. In this scenario, our application was evaluated using the distributed file system Lustre version 2.5.3, configured with 1 Metadata Server and 2 object storage servers, which were connected to the same network switch. This scenario represents typical shared clusters found in many research centers. In both scenarios, we used mpicxx for Open MPI version 1.8.8, and gcc 4.9.1 to compile all the source codes, with -03 optimization enabled.The performance of GH caller is evaluated using the execution time as a metric measured in seconds. In our case, execution time is the maximum amount of time the application could take to execute on a specific hardware platform. is essentially the execution time of the original GH caller application (serial) using only 1 core as computational resources, is the number of cores, and is the running time of the parallel algorithm using cores.First, we compared the execution time elapsed with our new parallel version using both scenarios (A and B) while keeping the same workload and increasing the number of resources (N). Next, we ran the serial version of GH caller with all data sets, so we obtained as a baseline time to compare gains obtained at different times . We performed each experiment 10 times for each pileup input file and measured the time of program execution. Average execution time is calculated taking these 10 replicates.
Memory usage
The memory usage of GH caller is evaluated using the maximum resident set size (RSS memory) for each process during program execution. Resident set size is used to show how much of the memory allocated for a process is currently in main memory, including both stack and heap memory. It does not include neither swap nor memory used by shared libraries as long as the pages from those libraries were currently in memory. We obtained all these values using the accounting tool sacct provided by SLURM (Simple Linux Utility for Resource Management),[31] the software used in the cluster to manage all the jobs submitted by the users. The sacct command displays accounting data for all jobs and job steps in the SLURM job accounting database in a variety of forms for analysis.To identify the impact in memory usage caused using subsets of data with a restricted size, we evaluated our application choosing the default chunk size of 100 MB and compared our results with the serial version of GH caller to find whether there are any gains obtained with respect to the original. In this case, we used only scenario A to run our experiment due to the fact that the differences between hardware in both scenarios should have no impact in memory usage. We launched the same experiments 5 times for each pileup input file and obtained the maximum RSS memory for each process measured by the SLURM accounting system during program execution. An average RSS memory size is calculated by taking all execution replicates.
Results
Accuracy of optimized GH caller versus updated SAMtools SNP caller
In the same direction to what was described by Nevado et al[10] for a previous SAMtools version, there is a strong dependency of SAMtools SNP caller to the average read depth when the level of variability and frequency-dependent neutrality tests are estimated, and it is much less dependant in the case of GH caller (Figures 3 and 4 and Supplementary Figures 4 and 5). As it can be observed, there are no changes in the results of the parallel version with respect to the accuracy of the original GH caller. In case of the second real data analysis (Figures 3 and 4), it is observed that the SAMtools SNP caller has much stronger correlation between levels of variability, and the variance in the read depth per window is compared, suggesting that the differences in the levels of variability and the patterns of variability between GH caller and SAMtools are consequence of artefactual patterns not well considered using the SAMtools SNP caller.
Figure 3.
The levels of variability for different statistics of variability (Theta Watterson, Theta Fu&Li, and Theta Fay&Wu) are estimated per windows of 100 MB across the chromosome 1 using SAMtools (blue) and GH caller (red).
Figure 4.
Correlation of the levels of variability and the variance in the number of read depths per individual. The levels of variability for different statistics of variability (Theta Watterson, Theta Fu&Li, and Theta Fay&Wu from left to right) are estimated per windows of 100 MB across the chromosome 1 using SAMtools (upper plots in blue) and GH caller (bottom plots in red). A significant value indicates that the estimated variability is dependent on the read depth.
The levels of variability for different statistics of variability (Theta Watterson, Theta Fu&Li, and Theta Fay&Wu) are estimated per windows of 100 MB across the chromosome 1 using SAMtools (blue) and GH caller (red).Correlation of the levels of variability and the variance in the number of read depths per individual. The levels of variability for different statistics of variability (Theta Watterson, Theta Fu&Li, and Theta Fay&Wu from left to right) are estimated per windows of 100 MB across the chromosome 1 using SAMtools (upper plots in blue) and GH caller (bottom plots in red). A significant value indicates that the estimated variability is dependent on the read depth.
Usage of computational resources
The main objective of the experiments described here is to demonstrate the utility of the new application for the analysis of variability by improving the usage of computational resources. To obtain all input files used to benchmark our application, we converted all alignment BAM files to mpileup format with different read depth levels and number of individuals required.Table 2 shows the real time elapsed running GH caller in scenarios A and B. It can be clearly observed that the total execution time decreases by including more computational resources in both scenarios.
Table 2.
Execution time running GH caller in scenarios A (workstation) and B (cluster).
Data
Size, GB
Scenario A, No. of cores
Scenario B, No. of cores
2
4
2
4
8
16
32
64
2I20X
14
7 m 44 s
3 m 00 s
7 m 16 s
2 m 34 s
1 m 15 s
54 s
49 s
46 s
5I8X
16
12 m 40 s
4 m 38 s
12 m 24 s
4 m 20 s
2 m 04 s
1 m 32 s
1 m 03 s
58 s
10I4X
18
21 m 32 s
7 m 43 s
21 m 44 s
7 m 27 s
3 m 46 s
2 m 23 s
1 m 32 s
1 m 20 s
20I2X
20
35 m 26 s
12 m 53 s
38 m 29 s
13 m 08 s
6 m 29 s
3 m 50 s
2 m 21 s
1 m 52 s
Execution time running GH caller in scenarios A (workstation) and B (cluster).Supplementary Figure 1 shows the results from the experiments performed in scenario A (workstation). In this experiment, we scaled up to the total number of cores available in this server (4 cores) performing an SNP calling process over all data sets (2I20X, 5I8X, 10I4X, and 20I2X). This results in an elapsed time reduced from 35 minutes 26 seconds (2 cores) to 12 minutes 53 seconds (4 cores), which is a 2.5× decrease using the same input data (20I2X). However, potential gains are being limited due the lack of processing units.The second set of experiments has been performed in scenario B (cluster). In this case, we scaled up to a maximum of 64 cores (6 nodes). Supplementary Figure 2 shows how execution time can be reduced up to 20× (20I2X) by adding more computational resources. We can observe how efficiency decreases when scaling up computational resources, the reason of which will be discussed in the next section.Finally, Figure 5 compares execution time of GH caller using the original version with 1 core versus the parallel version using up to 64 cores over all data sets. Looking at the figure, we can conclude that regardless of input data, execution time can be reduced using the parallel version of GH caller together with all the available resources.
Figure 5.
Comparison of execution times using the original version with 1 core versus the parallel version using all the available cores in both scenarios A and B. We run an SNP calling process over all data sets (2I20X, 5I8X, 10I4X, and 20I2X). In all cases, execution time can be reduced using this parallel version of GH caller, especially using all resources from scenario B (cluster).
Comparison of execution times using the original version with 1 core versus the parallel version using all the available cores in both scenarios A and B. We run an SNP calling process over all data sets (2I20X, 5I8X, 10I4X, and 20I2X). In all cases, execution time can be reduced using this parallel version of GH caller, especially using all resources from scenario B (cluster).In conclusion, we can see that there is a significant reduction at the total execution time used by this new version of GH caller in both scenarios, especially with data sets with highest number of individuals.Table 3 shows the results of executing both GH caller serial (1 core) and parallel (64 cores) with input data of different sizes. As we can observe in the first column, previous serial version of GH caller increases memory consumption accordingly to the size of input data: this version uses a maximum RSS memory size of 0.64 GB with an input data of 14 GB (case 2I20X) but more than 5 GB with an input of 20 GB (20I2X). As it is expected, it increases memory usage in 10× because the number of sequences analyzed is 10 times more when using data set 2I20X than data set 20I2X (see Table 1). This happens because this version stores all results in memory before writing data to disk. For the case of the parallel version, these values are the maximum value achieved by core. A chunk size of 100 MB is used as default when applying the mapping process to distribute the data among all the available MPI worker processes. We adopted this size as a default because the preliminary results (Supplementary Figure 3) shown an average main memory usage of approximately 3 GB that can easily fit in commonly used infrastructures in these days. Moreover, our application allows to choose a custom chunk size using a command line option (-s chunk size). As it can be observed, we broke the previous relation between input data size and memory usage present in serial version. As a consequence, GH caller now allows to limit memory usage using different “chunk” sizes, regardless of the size of input data.
Table 3.
Maximum memory usage per process using scenario B, cluster.
Data
Size, GB
Serial version (1 core)
Parallel version (64 cores)
Max. RSS, MB
Max. RSS, MB
2I20X
14
640.78
2548.54
5I8X
16
1450.15
2522.96
10I4X
18
2763.65
2839.49
20I2X
20
5276.84
3009.27
Abbreviation: RSS, resident set size.
Maximum memory usage per process using scenario B, cluster.Abbreviation: RSS, resident set size.Maximum memory usage (RSS) per process in benchmark execution using scenario B: cluster. Comparison has been done between serial version of GH caller (1 core) and our parallel version using 64 cores with a default chunk size of 100 MB.Because the chunk size employed has a direct impact with the memory used by our application, running this parameter could be useful if our hardware has a different relation between number of processor units and RAM memory, allowing GH caller to take advantage of all computational resources in platforms with different bounds in this ratio (eg, commodity hardware with a total of 8 GB of main memory and 8 cores has maximum of 1 GB of memory per processor unit). This feature additionally allows GH caller to use the whole memory available to the bunch of allocated processors, at the same time.
Discussion
We demonstrated the utility of the GH caller application by analyzing 2 different aspects: execution time and effective memory usage. We emphasize here that our purpose was not to present a discussion of the effects in genetic estimates of experimental designs using different bioinformatics approaches, as this has been done by Nevado et al,[10] but ensure that the accuracy of results still remains at the same level that previous version already presented while improving execution times and memory requirements.We have shown that our parallel implementation can provide a performance advantage on workstations with a limited number of cores and a local file system, but particularly when we can add more computational resources using a data-intensive computing system (cluster). Even though execution time can be considerably reduced by adding more computational resources, we observed a limited efficiency when we scale up the number of processes. We suspect that our input data set is still too small with respect to the number of resources employed. In fact, this can be seen taking as an example the execution time used by the larger data set (20I2X): in this example, total elapsed time was 1 minute and 52 seconds; therefore, the average time employed for every process using 64 processes was just 1.75 seconds . Consequently, gains obtained parallelizing the algorithm are masked by waiting times and other overheads introduced by the MPI implementation. Furthermore, another reason for reduced scalability is the fact that our implementation is not optimized for parallel I/O and cannot take advantage of collective operations, thus leading to poor performance when there are many MPI processes writing to the same output file.[32] Future implementation could explore the possibility of applying different techniques already employed by specialized input/output libraries in HPC such as Parallel netCDF[33] and HDF5[34]: collective input/output operations, data reordering to improve write performance, asynchronous read/write, buffering, and/or lazy write methods. However, for this release, we preferred to focus our efforts in providing solutions where users having access to a workstation or cluster could benefit.In terms of performance, it must be emphasized that the whole elapsed time of the variant analysis pipeline using GH caller has been reduced as a result of the work done at the internal “base-calling” step. Consequently, the time employed at the first step—running SAMtools mpileup command—becomes more relevant. We propose 2 different approaches to reduce this time: first option could be using a new method to generate the mpileup file. This step can be done by dividing input data into regions of interest and performing this reduction process in parallel. Based on this idea, other applications tried to speed up this step using different strategies,[35] so viability and gains of this solution should be explored. Second approach is to extend the functionality of current application and provide new methods to read BAM files directly in a more efficient way than SAMtools mpileup command does, taking advantage of data parallelism property already cited.In respect to the accuracy of the estimation of the new optimized GH caller and their comparison with SAMtools SNP caller, the new optimized GH caller is identical, in terms of output results, to the previous serial version of GH caller. The levels of variability are weakly affected by the read depth at the studied ranges.[10] The strong differences in the levels of variability between GH caller and SAMtools SNP caller are caused by the different objectives to what they were designed. Although SAMtools SNP caller is designed and optimized to detect true SNPs in a sample, GH caller is designed to estimate accurately the level of variability, ie, does not matter if many SNPs are lost if the variability and the frequency of the variants are accurately estimated. Moreover, it is worth to mention that GH caller is designed for resequencing data of diploid individuals, and that while it can be used for other experiments, some specific biases need to be taken into account, eg, allele-specific expression in RNA-Seq data could affect the variability estimates obtained from RNA.In conclusion, we have presented a first parallel version of GH caller that can significantly reduce total execution time and CPU time, but maintaining the quality of results. We have proposed a parallelization strategy that guarantees a good speedup with a limited number of processors and can be executed in either a cluster or a workstation environment. Moreover, this new version of GH caller is not limited by input or output data size, and the memory available of all resources can potentially be used. With our contribution, GH Caller allows taking the full advantage of computational resources available both in clusters and workstations.
Download and Usage
The latest stable software files, both the initial (sequential) and the new (parallel) versions, including sample files and documentation, are available from https://github.com/CRAGENOMICA/pGHcaller. There you also will find how to compile and run the code as well as some example data sets.
Authors: Paul A Hohenlohe; Susan Bassham; Paul D Etter; Nicholas Stiffler; Eric A Johnson; William A Cresko Journal: PLoS Genet Date: 2010-02-26 Impact factor: 5.917
Authors: Miguel Carneiro; Carl-Johan Rubin; Federica Di Palma; Nuno Ferrand; Kerstin Lindblad-Toh; Leif Andersson; Frank W Albert; Jessica Alföldi; Alvaro Martinez Barrio; Gerli Pielberg; Nima Rafati; Shumaila Sayyab; Jason Turner-Maier; Shady Younis; Sandra Afonso; Bronwen Aken; Joel M Alves; Daniel Barrell; Gerard Bolet; Samuel Boucher; Hernán A Burbano; Rita Campos; Jean L Chang; Veronique Duranthon; Luca Fontanesi; Hervé Garreau; David Heiman; Jeremy Johnson; Rose G Mage; Ze Peng; Guillaume Queney; Claire Rogel-Gaillard; Magali Ruffier; Steve Searle; Rafael Villafuerte; Anqi Xiong; Sarah Young; Karin Forsberg-Nilsson; Jeffrey M Good; Eric S Lander Journal: Science Date: 2014-08-29 Impact factor: 47.728
Authors: Javier Prado-Martinez; Peter H Sudmant; Jeffrey M Kidd; Heng Li; Joanna L Kelley; Belen Lorente-Galdos; Krishna R Veeramah; August E Woerner; Timothy D O'Connor; Gabriel Santpere; Alexander Cagan; Christoph Theunert; Ferran Casals; Hafid Laayouni; Kasper Munch; Asger Hobolth; Anders E Halager; Maika Malig; Jessica Hernandez-Rodriguez; Irene Hernando-Herraez; Kay Prüfer; Marc Pybus; Laurel Johnstone; Michael Lachmann; Can Alkan; Dorina Twigg; Natalia Petit; Carl Baker; Fereydoun Hormozdiari; Marcos Fernandez-Callejo; Marc Dabad; Michael L Wilson; Laurie Stevison; Cristina Camprubí; Tiago Carvalho; Aurora Ruiz-Herrera; Laura Vives; Marta Mele; Teresa Abello; Ivanela Kondova; Ronald E Bontrop; Anne Pusey; Felix Lankester; John A Kiyang; Richard A Bergl; Elizabeth Lonsdorf; Simon Myers; Mario Ventura; Pascal Gagneux; David Comas; Hans Siegismund; Julie Blanc; Lidia Agueda-Calpena; Marta Gut; Lucinda Fulton; Sarah A Tishkoff; James C Mullikin; Richard K Wilson; Ivo G Gut; Mary Katherine Gonder; Oliver A Ryder; Beatrice H Hahn; Arcadi Navarro; Joshua M Akey; Jaume Bertranpetit; David Reich; Thomas Mailund; Mikkel H Schierup; Christina Hvilsom; Aida M Andrés; Jeffrey D Wall; Carlos D Bustamante; Michael F Hammer; Evan E Eichler; Tomas Marques-Bonet Journal: Nature Date: 2013-07-03 Impact factor: 49.962