Masaaki Kobayashi, Hajime Ohyanagi, Hideki Takanashi, Satomi Asano, Toru Kudo, Hiromi Kajiya-Kanegae, Atsushi J Nagano, Hitoshi Tainaka, Tsuyoshi Tokunaga, Takashi Sazuka, Hiroyoshi Iwata, Nobuhiro Tsutsumi, Kentaro Yano.
Abstract
The recent availability of large-scale genomic resources enables so-called genome-wide association studies (GWAS) and genomic prediction (GP) studies, particularly with next-generation sequencing (NGS) data. The effectiveness of GWAS and GP depends not only on their mathematical models but also on the quality and quantity of the variants employed in the analysis. In NGS single nucleotide polymorphism (SNP) calling, conventional tools require more reads to achieve higher SNP sensitivity and accuracy. In this study, we aimed to develop a tool, Heap, that enables robustly sensitive and accurate calling of SNPs, particularly with low-coverage NGS data, which must be aligned to the reference genome sequences in advance. To reduce false-positive SNPs, Heap determines genotypes and calls SNPs at each site, except for sites at either end of a read and sites containing a minor allele supported by only one read. A performance comparison with existing tools showed that Heap achieved the highest F-scores with low-coverage (7X) restriction-site associated DNA sequencing reads of sorghum and rice individuals. Heap will facilitate cost-effective GWAS and GP studies in this NGS era. Code and documentation of Heap are freely available from https://github.com/meiji-bioinf/heap (29 March 2017, date last accessed) and our web site (http://bioinf.mind.meiji.ac.jp/lab/en/tools.html (29 March 2017, date last accessed)).
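The two site-exclusion rules described in the abstract can be illustrated with a minimal Python sketch. This is not Heap's implementation: the `ReadBase` record, the `end_margin` parameter, and the simplified genotype output are assumptions introduced only for illustration, and the actual tool works on aligned reads and applies its own genotype model.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ReadBase:
    """One aligned base covering a site (hypothetical record, not a Heap type)."""
    base: str         # nucleotide observed at the site
    pos_in_read: int  # 0-based offset of the base within its read
    read_len: int     # length of the read


def usable_bases(column, end_margin=1):
    """Drop bases lying within `end_margin` positions of either end of their read."""
    return [b.base for b in column
            if end_margin <= b.pos_in_read < b.read_len - end_margin]


def call_site(column, ref_base, end_margin=1):
    """Return (alleles, is_snp), or None when the site is skipped.

    Assumed rule set, following the abstract: bases at read ends are ignored,
    and a site is skipped if any minor allele is supported by only one read.
    """
    bases = usable_bases(column, end_margin)
    if not bases:
        return None
    ranked = Counter(bases).most_common()
    # Every allele after the most frequent one is treated as a minor allele.
    if any(count == 1 for _, count in ranked[1:]):
        return None
    alleles = tuple(sorted(allele for allele, _ in ranked))
    return alleles, alleles != (ref_base,)
```

For example, a site covered by two mid-read `A` bases and one `G` at the first position of its read would be reported as homozygous reference, because the read-end base is ignored before alleles are counted.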
Keywords: genome-wide association studies (GWAS); genomic prediction (GP); next-generation sequencing (NGS); restriction-site associated DNA sequencing (RAD-seq); single nucleotide polymorphism (SNP)
Year: 2017; PMID: 28498906; PMCID: PMC5737671; DOI: 10.1093/dnares/dsx012
Source DB: PubMed Journal: DNA Res ISSN: 1340-2838 Impact factor: 4.458
Figure 1. Workflow of the Heap algorithm.
Figure 2. Schematic representation of the performance comparison of SNP calling by Heap, Stacks, SAMtools/BCFtools, and GATK. (A) Schematic view of the calculation of definitive answer genotypes from WGS reads. MAPQ is a measure of mapping quality. (B) Schematic view of SNP calling from RAD-seq reads employing Heap, Stacks, SAMtools, and GATK. GQ indicates genotype quality. (C) Definitions of sensitivity, positive predictive value (PPV), and F-score for SNP calling.
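The metrics named in panel C can be sketched in Python as follows. The exact definitions in the figure are not reproduced in this record, so the sketch assumes the standard ones: sensitivity = TP / (TP + FN), PPV = TP / (TP + FP), and the F-score as the harmonic mean of the two. The function name `snp_metrics` and the set-based representation of calls are illustrative assumptions.

```python
def snp_metrics(called, truth):
    """Compute (sensitivity, PPV, F-score) for a set of called SNPs.

    `called` and `truth` are sets of hashable SNP records, for example
    (chromosome, position, genotype) tuples. Standard definitions are assumed.
    """
    tp = len(called & truth)   # called and present in the answer set
    fp = len(called - truth)   # called but absent from the answer set
    fn = len(truth - called)   # in the answer set but not called
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + ppv
    f_score = 2 * sensitivity * ppv / denom if denom else 0.0
    return sensitivity, ppv, f_score
```

With four called SNPs of which three appear in a five-SNP answer set, this gives a sensitivity of 0.6 and a PPV of 0.75.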
Summary of WGS sequencing reads and mapping of the reads in Sorghum
| Sample | Raw reads, count (×10⁶) | Raw reads, total length (Gb) | Preprocessed reads, count (×10⁶) | Preprocessed reads, total length (Gb) | Mapped reads, count (×10⁶) | Mapped reads, rate (%) | Uniquely mapped reads, count (×10⁶) | Uniquely mapped reads, rate (%) | Coverage, mean | Coverage, 1st quartile | Coverage, median | Coverage, 3rd quartile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 219.2 | 22.1 | 200.5 | 19.9 | 189.0 | 94.3 | 100.3 | 50.0 | 21.2 | 11 | 20 | 27 |
| | 205.1 | 20.7 | 191.9 | 19.1 | 181.1 | 94.4 | 97.0 | 50.5 | 20.2 | 12 | 19 | 26 |
| | 214.0 | 21.6 | 195.4 | 19.4 | 185.5 | 94.9 | 104.5 | 53.5 | 18.7 | 10 | 18 | 24 |
| | 232.7 | 23.5 | 213.7 | 21.2 | 202.6 | 94.8 | 113.4 | 53.1 | 20.8 | 11 | 20 | 27 |
| | 260.9 | 26.4 | 239.2 | 23.8 | 230.0 | 96.2 | 131.1 | 54.8 | 24.8 | 14 | 24 | 32 |
| | 206.7 | 20.9 | 181.4 | 18.4 | 171.9 | 94.8 | 98.5 | 54.3 | 19.1 | 12 | 18 | 24 |
| | 215.2 | 21.7 | 196.4 | 19.5 | 188.4 | 95.9 | 105.8 | 53.8 | 20.4 | 12 | 19 | 25 |
| | 226.5 | 22.9 | 209.8 | 20.9 | 203.3 | 96.9 | 118.7 | 56.6 | 21.7 | 12 | 21 | 28 |
| | 235.7 | 23.8 | 218.1 | 21.7 | 210.3 | 96.4 | 118.0 | 54.1 | 22.9 | 14 | 22 | 29 |
| | 192.2 | 19.4 | 177.0 | 17.6 | 170.2 | 96.1 | 97.5 | 55.1 | 19.2 | 12 | 18 | 24 |
| | 250.0 | 25.3 | 233.1 | 23.2 | 225.2 | 96.6 | 129.1 | 55.4 | 24.6 | 16 | 24 | 31 |
| | 270.0 | 27.3 | 237.4 | 23.4 | 225.8 | 95.1 | 120.1 | 50.6 | 23.8 | 14 | 23 | 30 |
| | 232.4 | 23.5 | 207.2 | 20.5 | 196.7 | 94.9 | 105.9 | 51.1 | 20.8 | 12 | 20 | 26 |
| | 217.6 | 22.0 | 200.4 | 19.8 | 191.7 | 95.6 | 107.9 | 53.9 | 20.9 | 13 | 20 | 26 |
| | 216.8 | 21.9 | 197.0 | 19.4 | 188.7 | 95.8 | 108.9 | 55.3 | 20.8 | 12 | 20 | 26 |
| | 203.8 | 20.6 | 177.8 | 17.4 | 170.3 | 95.8 | 96.9 | 54.5 | 19.3 | 12 | 18 | 24 |
| | 272.4 | 27.5 | 239.2 | 23.5 | 229.6 | 96.0 | 129.9 | 54.3 | 25.7 | 15 | 25 | 33 |
| Total | 3871.3 | 391.0 | 3515.5 | 348.6 | 3360.2 | – | 1883.3 | – | – | – | – | – |
| Mean | 227.7 | 23.0 | 206.8 | 20.5 | 197.7 | 95.6 | 110.8 | 53.6 | 21.5 | 12.6 | 20.5 | 27.2 |

All read counts and lengths are shown in millions and billions, respectively.
Mapping rates are calculated as the ratio of the number of the mapped reads against the number of the preprocessed reads.
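As a small worked example of the rate definition in the footnote, the sketch below recomputes the rates for the first sorghum sample from its tabulated counts. The helper name `mapping_rate` is an assumption. Because the tabulated counts are already rounded to one decimal place, recomputed rates for other rows may differ from the published values in the last digit.

```python
def mapping_rate(mapped_reads, preprocessed_reads):
    """Rate (%) = mapped reads / preprocessed reads, rounded to one decimal place."""
    return round(100.0 * mapped_reads / preprocessed_reads, 1)


# First sorghum WGS sample: 200.5 million preprocessed reads,
# 189.0 million mapped reads, 100.3 million uniquely mapped reads.
mapped_rate = mapping_rate(189.0, 200.5)
unique_rate = mapping_rate(100.3, 200.5)
```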
Summary of WGS sequencing reads and mapping of the reads in rice
| Sample | Raw reads, count (×10⁶) | Raw reads, total length (Gb) | Preprocessed reads, count (×10⁶) | Preprocessed reads, total length (Gb) | Mapped reads, count (×10⁶) | Mapped reads, rate (%) | Uniquely mapped reads, count (×10⁶) | Uniquely mapped reads, rate (%) | Coverage, mean | Coverage, 1st quartile | Coverage, median | Coverage, 3rd quartile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 297.9 | 22.3 | 260.7 | 18.3 | 257.0 | 98.6 | 185.4 | 71.1 | 41.6 | 27 | 42 | 56 |
| | 218.3 | 19.3 | 184.7 | 15.8 | 181.3 | 98.2 | 140.6 | 76.1 | 37.4 | 30 | 40 | 47 |
| | 221.9 | 16.6 | 202.2 | 14.9 | 198.0 | 97.9 | 144.3 | 71.4 | 34.2 | 18 | 35 | 48 |
| | 233.8 | 20.7 | 219.4 | 18.7 | 214.0 | 97.6 | 154.3 | 70.4 | 40.5 | 27 | 44 | 54 |
| Total | 971.9 | 79.1 | 866.8 | 67.7 | 850.2 | – | 624.6 | – | – | – | – | – |
| Mean | 243.0 | 19.8 | 216.7 | 16.9 | 212.6 | 98.1 | 156.1 | 72.2 | 38.5 | 25.5 | 40.3 | 51.3 |

All read counts and lengths are shown in millions and billions, respectively.
Mapping rates are calculated as the ratio of the number of the mapped reads against the number of the preprocessed reads.
Summary of RAD-seq reads and mapping of the reads in Sorghum
| Sample | Raw reads, count (×10⁶) | Raw reads, total length (Gb) | Preprocessed reads, count (×10⁶) | Preprocessed reads, total length (Gb) | Mapped reads, count (×10⁶) | Mapped reads, rate (%) | Uniquely mapped reads, count (×10⁶) | Uniquely mapped reads, rate (%) | Coverage in RAD-region, mean | Coverage in RAD-region, 1st quartile | Coverage in RAD-region, median | Coverage in RAD-region, 3rd quartile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 1.6 | 0.2 | 1.5 | 0.1 | 1.4 | 91.4 | 0.6 | 41.8 | 5.8 | 1 | 2 | 4 |
| | 1.9 | 0.2 | 1.9 | 0.2 | 1.8 | 92.7 | 0.9 | 45.0 | 7.8 | 1 | 2 | 7 |
| | 2.2 | 0.2 | 2.1 | 0.2 | 2.0 | 94.3 | 1.0 | 47.8 | 8.1 | 1 | 2 | 7 |
| | 1.4 | 0.1 | 1.4 | 0.1 | 1.3 | 94.0 | 0.6 | 42.4 | 5.5 | 1 | 2 | 5 |
| | 1.5 | 0.2 | 1.4 | 0.1 | 1.3 | 93.8 | 0.6 | 45.2 | 6.2 | 1 | 2 | 6 |
| | 2.7 | 0.3 | 2.6 | 0.3 | 2.5 | 93.6 | 1.3 | 48.1 | 9.2 | 1 | 2 | 7 |
| | 2.3 | 0.2 | 2.2 | 0.2 | 2.1 | 93.3 | 1.1 | 47.7 | 8.5 | 1 | 2 | 7 |
| | 1.7 | 0.2 | 1.7 | 0.2 | 1.6 | 94.0 | 0.8 | 48.3 | 6.8 | 1 | 2 | 7 |
| | 2.4 | 0.2 | 2.3 | 0.2 | 2.2 | 94.2 | 1.1 | 48.6 | 7.9 | 1 | 2 | 7 |
| | 2.3 | 0.2 | 2.2 | 0.2 | 2.1 | 91.9 | 1.1 | 48.6 | 9.3 | 1 | 2 | 7 |
| | 1.4 | 0.2 | 1.4 | 0.1 | 1.3 | 93.1 | 0.7 | 45.8 | 6.4 | 1 | 3 | 6 |
| | 2.9 | 0.3 | 2.9 | 0.3 | 2.7 | 92.2 | 1.4 | 47.8 | 9.8 | 1 | 2 | 6 |
| | 1.8 | 0.2 | 1.8 | 0.2 | 1.6 | 92.3 | 0.8 | 45.2 | 6.4 | 1 | 2 | 5 |
| | 2.2 | 0.2 | 2.1 | 0.2 | 2.0 | 93.7 | 1.0 | 48.0 | 7.9 | 1 | 2 | 6 |
| | 2.0 | 0.2 | 2.0 | 0.2 | 1.9 | 93.9 | 1.0 | 48.8 | 7.4 | 1 | 2 | 6 |
| | 1.1 | 0.1 | 1.1 | 0.1 | 1.0 | 93.3 | 0.5 | 45.7 | 5.4 | 1 | 2 | 6 |
| | 4.1 | 0.4 | 4.0 | 0.4 | 3.5 | 86.4 | 1.9 | 47.5 | 6.9 | 1 | 1 | 3 |
| Total | 35.3 | 3.6 | 34.8 | 3.3 | 32.1 | – | 16.3 | – | – | – | – | – |
| Mean | 2.1 | 0.2 | 2.1 | 0.2 | 1.9 | 92.8 | 1.0 | 46.6 | 7.4 | 1.0 | 2.0 | 6.0 |

All read counts and lengths are shown in millions and billions, respectively.
Mapping rates are calculated as the ratio of the number of the mapped reads against the number of the preprocessed reads.
Figure 3. Performance comparison among SNP calling tools with RAD-seq reads in 17 inbred sorghum lines. Mean values of sensitivities (left), positive predictive values (PPVs) (center), and F-scores (right) of SNP calling by Heap, Stacks, SAMtools, and GATK from RAD-seq reads in 17 inbred sorghum lines are shown. Statistical analysis was performed using the Tukey-Kramer HSD test. Letters above the bars indicate groups that are significantly different (P < 0.05).
Summary of RAD-seq reads and mapping of the reads in rice
| Sample | Raw reads, count (×10⁶) | Raw reads, total length (Gb) | Preprocessed reads, count (×10⁶) | Preprocessed reads, total length (Gb) | Mapped reads, count (×10⁶) | Mapped reads, rate (%) | Uniquely mapped reads, count (×10⁶) | Uniquely mapped reads, rate (%) | Coverage in RAD-region, mean | Coverage in RAD-region, 1st quartile | Coverage in RAD-region, median | Coverage in RAD-region, 3rd quartile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 3.6 | 0.2 | 3.6 | 0.2 | 3.5 | 98.4 | 2.4 | 67.3 | 11.2 | 1 | 2 | 4 |
| | 4.2 | 0.2 | 4.2 | 0.2 | 4.1 | 98.1 | 2.8 | 66.9 | 11.7 | 1 | 1 | 4 |
| | 0.7 | 0.0 | 0.7 | 0.0 | 0.7 | 98.7 | 0.4 | 62.0 | 6.5 | 1 | 2 | 4 |
| | 1.9 | 0.1 | 1.9 | 0.1 | 1.8 | 98.4 | 1.3 | 67.2 | 9.2 | 1 | 2 | 4 |
| Total | 10.3 | 0.5 | 10.3 | 0.5 | 10.1 | – | 6.9 | – | – | – | – | – |
| Mean | 2.6 | 0.1 | 2.6 | 0.1 | 2.5 | 98.4 | 1.7 | 65.9 | 9.7 | 1.0 | 1.8 | 4.0 |

All read counts and lengths are shown in millions and billions, respectively.
Mapping rates are calculated as the ratio of the number of the mapped reads against the number of the preprocessed reads.
Figure 4. Performance comparison among SNP calling tools with RAD-seq reads in 4 inbred rice lines. Mean values of sensitivities (left), PPVs (center), and F-scores (right) of SNP calling by Heap, Stacks, SAMtools, and GATK from RAD-seq reads in 4 inbred rice lines are shown. Letters above the bars indicate groups that are significantly different (P < 0.05), according to the Tukey-Kramer HSD test.