| Literature DB >> 26346699 |
Erinija Pranckevičiene1, Tautvydas Rančelis2, Aidas Pranculis3, Vaidutis Kučinskas4.
Abstract
BACKGROUND: Every next generation sequencing (NGS) platform relies on proprietary and open source computational tools to analyze sequencing data. NGS tools for Illumina platforms are well documented which is not the case with AB SOLiD systems. We applied several computational and variant calling pipelines to analyse targeted exome sequencing data obtained using AB SOLiD 5500 system. Our investigated tools comprised proprietary LifeScope's pipeline in combination with open source color-space competent mapping programs and a variant caller. We present instrumental details of the pipelines that were used and quantitative comparative analysis of variant lists generated by LifeScope's pipeline versus open source tools.Entities:
Mesh:
Year: 2015 PMID: 26346699 PMCID: PMC4562342 DOI: 10.1186/s13104-015-1385-4
Source DB: PubMed Journal: BMC Res Notes ISSN: 1756-0500
Fig. 1Schema comprising an investigated workflow of exome analysis by LifeScope and the alternative pipeline. This schema represents exome’s computational pipeline steps that were applied in the study. The LifeScope’s pipeline includes proprietary programs to perform alignment and variant calling. Alternatively we applied another four approaches based on combinations of LifeScope, SHRiMP, MAQ, BFAST mapping programs and GATK modules for variant calling, using the same exomes as with Lifescope program
Percentages of the exome target regions by the mapping programs in family exomes at the cutoffs of 5×, 10×, 15× and 20×
| Coverage | 5 | 10 | 15 | 20 |
|---|---|---|---|---|
| LifeScope | 94.28 | 91.29 | 87.69 | 83.56 |
| SHRiMP | 93.97 | 89.92 | 84.94 | 79.07 |
| MAQ | 90.91 | 85.17 | 78.82 | 71.92 |
| BFAST | 86.72 | 78.76 | 70.47 | 62.04 |
Fig. 2Coverage of target regions by mapping methods. Average coverage of the targeted exome regions by the mapped reads in family exomes is shown. Five coverage intervals that we created to assess mappings of different methods are presented in the legend. They comprise intervals of [1,10), [10,20), [20,30), [30,60), [60,100) and equal and higher than 100. Each individual barplot shows a percentages of the targeted regions falling into a coverage category for each mapping method: LifeScope, SHRiMP, MAQ and BFAST. The targeted regions are mostly covered by 30–60 reads in all mapping methods
Transition transversion ratio in different variant calling approaches
| Approach | LifeScope | LifeScope-GATK | MAQ-GATK | SHRiMP-GATK | BFAST-GATK |
|---|---|---|---|---|---|
| Ti/Tv | 2.62 | 2.45 | 2.59 | 2.3 | 2.68 |
Counts of SNPs identified in the family exomes present in the datasets of known variants
| LifeScope | LifeScope-GATK | MAQ-GATK | SHRiMP-GATK | BFAST-GATK | |
|---|---|---|---|---|---|
| Proband | |||||
| Total |
|
|
|
|
|
| dbsnp138 | 37,312 | 58,342 | 48,538 | 65,185 | 42,909 |
| 1000G | 35,511 | 54,745 | 46,706 | 62,778 | 42,030 |
| esp6500 | 31,319 | 29,745 | 26,041 | 27,386 | 25,169 |
| In target regions |
|
|
|
|
|
| dbsnp138 | 20,892 | 18,869 | 16,707 | 16,768 | 15,688 |
| 1000G | 19,504 | 17,954 | 16,260 | 16,304 | 15,321 |
| esp6500 | 17,894 | 16,469 | 14,912 | 14,891 | 14,138 |
| Mother | |||||
| Total |
|
|
|
|
|
| dbsnp138 | 34,075 | 42,778 | 27,641 | 38,337 | 26,626 |
| 1000G | 32,455 | 40,316 | 26,472 | 36,856 | 26,170 |
| esp6500 | 29,095 | 26,101 | 17,909 | 23,348 | 19,139 |
| In target regions |
|
|
|
|
|
| dbsnp138 | 20,099 | 17,296 | 12,470 | 15,359 | 12,623 |
| 1000G | 18,806 | 16,459 | 12,143 | 14,939 | 12,384 |
| esp6500 | 17,281 | 15,099 | 11,163 | 13,658 | 11,469 |
| Father | |||||
| Total |
|
|
|
|
|
| dbsnp138 | 35,106 | 45,723 | 36,209 | 42,193 | 27,669 |
| 1000G | 33,381 | 42,939 | 34,777 | 40,782 | 27,204 |
| esp6500 | 29,765 | 26,728 | 22,177 | 24,458 | 19,715 |
| In target regions |
|
|
|
|
|
| dbsnp138 | 20,348 | 17,531 | 14,827 | 15,661 | 12,931 |
| 1000G | 19,015 | 16,668 | 14,447 | 15,296 | 12,692 |
| esp6500 | 17,457 | 15,323 | 13,326 | 14,042 | 11,756 |
For each exome and the approach a total number of all identified SNPs and the SNPs that are only in the targeted regions are shown in italic
Agreement between variant calling approaches on all called SNPs
| Exome | Proband | Mother | Father |
|---|---|---|---|
| Total number of SNP variants | 86,840 | 55,738 | 68,915 |
| Number of SNP variants identified by all approaches |
|
|
|
| Fraction of common variants in total identified by LifeScope (%) | 56 | 40 | 32 |
| LifeScope-GATK (%) | 32 | 44 | 25 |
| MAQ-GATK (%) | 44 | 50 | 31 |
| SHRiMP-GATK (%) | 32 | 36 | 27 |
| BFAST-GATK (%) | 50 | 52 | 41 |
| Number of SNP variants and fraction of total | 2936 (7.6 %) | 4543 (12.9 %) | 4246 (11.7 %) |
| LifeScope-GATK | 5251 (8.7 %) | 4468 (10.2 %) | 4555 (9.7 %) |
| MAQ-GATK | 1943 (3.9 %) | 1074 (3.8 %) | 1369 (3.7 %) |
| SHRiMP-GATK | 14,276 (21.1 %) | 2210 (5.7 %) | 1644 (3.9 %) |
| BFAST-GATK | 1097 (2.5 %) | 574 (2.1 %) | 9851 (35.2 %) |
By common are denoted variants that have been identified by all approaches. By specific are identified variants that were identified exclusively by one approach. The percentages are computed as fraction of the total number of all variants shown in Table 3 identified by that method
Fig. 3Empirical cumulative distribution function (ECDF) of variant quality (QUAL) property assigned by GATK for variants identified in alignments produced by different mapping programs. To compute ECFD only variants that have been identified by all approaches together were used. ECDF’s of different alignments are color-coded: BFAST by blue line, LifeScope-GATK by black, MAQ by green and SHRiMP by red. Panels correspond to the family exomes. ECDF plots of QUAL per method in proband exome are on the top, the mother exome is in the middle and exome of the father is on the bottom. Median QUAL value of LifeScope-GATK consistently apears around 300 in all exomes. For BFAST-GATK it is around 200. Other approaches differ across the exomes
Summary of coverage [depth of coverage (DP) assigned to a variant by GATK] for variants called by the different approaches in family exomes
| Proband variants | All variants per method | Common 21,687 variants | Variants in COSMIC | ||||
|---|---|---|---|---|---|---|---|
| Identification method | Coverage quartiles Q25,Med,Q75 | Total # | Sites % ≥ 8× | Coverage quartiles Q25,Med,Q75 | Sites % ≥ 8× | Coverage quartiles Q25,Med,Q75 | Total # |
| LifeScope | 12,25,46 | 38,626 | 87 | 16,28,46 | 92 | 11,28,51 | 89 |
| LifeScope-GATK | 7,14,24 | 60,313 | 75 | 14,21,30 | 95 | 16,22,35 | 21 |
| MAQ-GATK | 7,13,23 | 49,483 | 74 | 10,17,25 | 86 | 25,33,64 | 15 |
| SHRiMP-GATK | 6,16,34 | 67,669 | 72 | 21,32,50 | 98 | 15,52,120 | 22 |
| BFAST-GATK | 7,13,22 | 43,536 | 73 | 11,18,27 | 89 | 7,8,10 | 10 |
Coverage is summarized as median and 1st and 3rd quartile showing a central tendency. This coverage tendency is presented for: all called variants, common variants called by all methods and approach-specific variants in COSMIC. An additional quality measure—a fraction of variants—covered at least by 8 reads is shown for each method for all and common variants
SNPs annotated in ClinVar and COSMIC databases per method and per person
| LifeScope | LifeScope-GATK | MAQ-GATK | BFAST-GATK | |
|---|---|---|---|---|
| Proband | ||||
| ClinVar total | 1328 | 1261 | 1116 | 1150 |
| Deleterious variants | 44;34pat + 10drug | 42;33pat + 9drug | 38;29pat + 9drug | 34;27pat + 7drug |
| Deleterious at 15× | 32;23pat + 9drug | 38;28pat + 10drug | 24;19pat + 5drug | 21;15pat + 6drug |
| Loss going to 15× | 27 % | 9.5 % | 37 % | 38 % |
| COSMIC total | 468 | 366 | 310 | 299 |
| COSMIC variants at 15× | 349 | 238 | 195 | 168 |
| Loss going to 15× | 25 % | 35 % | 37 % | 44 % |
| Mother | ||||
| ClinVar total | 1288 | 1170 | 841 | 947 |
| Deleterious variants | 47;38pat + 9drug | 41;33pat + 8drug | 38;30pat + 8drug | 36;30pat + 6drug |
| Deleterious at 15× | 20;15pat + 5drug | 30;24pat + 6drug | 11;10pat + 1drug | 11; 8pat + 3drug |
| Loss going to 15× | 57 % | 27 % | 71 % | 69 % |
| COSMIC total | 395 | 292 | 197 | 194 |
| COSMIC at 15× | 231 | 124 | 73 | 63 |
| Loss going to 15× | 42 % | 52 % | 63 % | 68 % |
| Father | ||||
| ClinVar total | 1248 | 1126 | 950 | 959 |
| Deleterious variants | 43;36pat + 7drug | 37;31pat + 6drug | 31;26pat + 5drug | 36;30pat + 6drug |
| Deleterious at 15× | 19;16pat + 3drug | 26;22pat + 4drug | 12;11pat + 1drug | 11;9pat + 2drug |
| Loss going to 15× | 56 % | 30 % | 61 % | 69 % |
| COSMIC total | 444 | 340 | 270 | 197 |
| COSMIC at 15× | 288 | 176 | 122 | 74 |
| Loss going to 15× | 35 % | 37 % | 55 % | 62 % |
Total number of identified variants in ClinVar is shown together with the number of patogenic (pat) and drug response (drug) variants. Counts of high confidence variants covered at least by 15 are presented for both: ClinVar and COSMIC. A fraction of total variants which would be lost by going to a higher coverage is presented by a percentage
Fig. 4Agreement between different variant calling approaches with respect to called SNPs that are present in ClinVar and COSMIC databases. Venn diagrams show how much different approaches agree in identifying harmful variants. The middle area of each diagram shows number of variants common to all methods. The Venn diagram leafs show number of variants specific to each method. On the left are diagrams representing SNPs in ClinVar database and on the right is distribution of identified SNPs present in COSMIC database. The ittop part of the figure shows diagrams of proband, the itmiddle represents mother and the bottom represents father. Substantial agreement between the methods was observed on pathogenic and drug response ClinVar variants
GATK steps
| Step | Command | Description |
|---|---|---|
| Input BAM |
| Marking duplicates |
| Step 1. |
| Replacing all read groups in the INPUT file with a new read group |
| Step 2. |
| Reorder reads in BAM file to match the contig ordering in a provided reference file |
| Step 3. |
| Sorting the aligned reads by coordinate order |
| Step 4. |
| Generating BAM index |
|
| Indel Realignment I (Creating a target list of intervals to be realigned) | |
|
| Indel Realignment II (Performing realignment of the target intervals) | |
| Step 5. |
| Sorting the aligned reads by coordinate order |
| Step 6. |
| Generating BAM index |
|
| Base quality score recalibration I (data-driven adjustment of base quality scores) | |
|
| Base quality score recalibration II (Applying the recalibration to sequence data) | |
| Step 7. |
| Calling variants in sequence data |
| Step 8. |
| Select SNPs from the input file |
| Step 9. |
| Building SNP recalibration model |
|
| Applying SNP recalibration model |
Fig. 5Decision tree diagram showing most discriminative annotations in classification of different categories of common variants identified in proband. Rectangular terminal nodes indicate fractions of variants (percentage) classified by the rules in each branch of the binary tree. Variable n indicates the number of samples from the training set assigned to that node. Node is associated with a class label of the most prevalent variant class. For example node 15 is associated with SHRiMP-GATK variants. Variant class identified by LifeScope-GATK is denoted by lg; SHRiMP-GATK is denoted by s; MAQ-GATK is denoted by m and BFAST-GATK by b. Tree nodes represented by ellipses show GATK variant annotations which were the most important in classifying the variants at each subsequent level. Classification rules are indicated by less or equal than and greater than conditions applied on a threshold value of the parameter. The diagram shows a considerable fraction of SHRiMP-GATK variants in node 15 characterized by larger depth of coverage (DP > 37). Large fraction of MAQ-GATK (m) variants are characterized by higher values of quality by depth (QD) (node 14) and have better GATK-assigned quality (see node 10). A group of LifeScope-GATK (lg) variants (node 14) are characterized by higher quality by depth (QD > 30.34). Another group (node 6) has lower QD and a negative value of BaseQRankSum, indicating poorer base quality support for alternative alleles. A tree size while running C50 algorithm was controlled constraining a split by minimum number of cases (parameter minCases) equal to 5000
Parameters in LifeScope’s secondary and tertiary analysis software modules
| Default value | |
|---|---|
| Alignment parameters | |
| Minimum number of non-matches allowed during indel finding | 9 |
| Maximum deletion size (in a gapped alignment in the first pass) | 19 |
| Maximum insertion size (in a gapped alignment in the first pass) | 4 |
| The minimum edge length required for insertions and deletions on the first pass | 12 |
| Number of mismatches allowed for gap alignments | 3 |
| Minimum mapping quality value (MAPQ) allowed for aligned read | 8 |
| Minimum edge length required for insertions and deletions | 12 |
| The seed window side allowed to the left of the anchor alignment | 40 |
| The seed window side allowed to the right of the anchor alignment | 80 |
| Maximum number of alignments for a read on the first pass which gives the maximum number of hits that are reported in the mapping output | 50 |
| SNPs analysis module using diBayes algorithm variant calling parameters | |
| Minimum allele ratio (Het) | 0.15 |
| Minimum coverage (Het) | 2 |
| Minimum non-reference base QV (Het) | 28 |
| Minimum average non-reference base QV (Hom) | 28 |
| Minimum base quality value for a position | 28 |
| Minimum base quality value of the non-reference allele of a position | 28 |
| Mapping quality value of the read |
|
| SNP call stringency. Alleles on both strands | Not required |
| Threshold of mismatch/alignment-lengh ratio | 1 |
| Base candidate allele quality value |
|
| Minimum number of unique start positions required to call heterozygote, homozygote | 2 |
| Proportion of the total reads containing either of the two candidate alleles | 0.65 |
List of open source tools used in our study of the alternatives to the LifeScope pipeline
| Analysis category | Tool | Web reference |
|---|---|---|
| Convert XSQ to color-space fastq | XSQtools solid2fastq | Life Technologies website Bfast package |
| Alignment to the reference genome | Maq v.0.7.1 |
|
| Default parameters, -n 3 | ||
| Shrimp 2.2.3 |
| |
| Default parameters, -h 85 –strata -o 3 | ||
| Bfast 0.7.0 |
| |
| Default parameters | ||
| Preprocessing | Picard 1.111 |
|
| Samtools 0.1.18 |
| |
| Variant calling | GATK v.3.1 |
|
| Steps and parameters in Table | ||
| Variant summaries | Vcftools 0.1.12 |
|
| Bcftools v.0.2.0 |
| |
| Variant annotation | Annovar |
|