| Literature DB >> 28351369 |
Roman V Briskine1,2, Kentaro K Shimizu3,4.
Abstract
BACKGROUND: Whole genome resequencing projects may implement variant calling using draft reference genomes assembled de novo from short-read libraries. Despite lower quality of such assemblies, they allowed researchers to extend a wide range of population genetic and genome-wide association analyses to non-model species. As the variant calling pipelines are complex and involve many software packages, it is important to understand inherent biases and limitations at each step of the analysis.Entities:
Keywords: Draft reference genome; Polymorphisms; Positional bias; Repetitive elements; Reseqencing; SNPs; Variants
Mesh:
Year: 2017 PMID: 28351369 PMCID: PMC5368935 DOI: 10.1186/s12864-017-3637-2
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
List of analysed assemblies
| Reads | Identifier | Assembler | Pos | k |
|---|---|---|---|---|
| Bcon | abyss_9C | ABySS | 80 | 80 |
| Bcon | merac_6C | Meraculous | 71 | NA |
| Bcon | phus_5C | Phusion | 78 | NA |
| Bcon | sga_7C | SGA | 121 | NA |
| Bcon | soap* | SOAPdenovo2 | 36 | 36 |
| Mzeb | abyss_7C | ABySS | 56 | 56 |
| Mzeb | allp_6C | ALLPATHS-LG | 96 | 96 |
| Mzeb | soap_11E | SOAPdenovo2 | 46 | 46 |
| Sim | allp | ALLPATHS-LG | 96 | 96 |
| Sim | sga_m75 | SGA | 100 | NA |
| Sim | sga_m77 | SGA | 100 | NA |
| Sim | soap_K69 | SOAPdenovo2 | 70 | 70 |
| Sim | soap_K71 | SOAPdenovo2 | 72 | 72 |
This study focused on a subset of the B. constrictor (Bcon) and M. zebra (Mzeb) genome assemblies submitted by the assembler developers to the Assemblathon 2 competition. In addition, we simulated reads from A. thaliana chromosomes 1 and 2 (Sim) and constructed several assemblies with varying parameters. For SGA, we varied the minimum string overlap (-m 75 and -m 77 for sga_m75 and sga_m77 respectively). For SOAPdenovo2, we set the -K parameter to 69 and 71, which corresponded to k = 70 and 72 for soap_K69 and soap_K71 respectively. ‘Pos’ column shows the position (counted from ends of contigs or scaffolds) where variants occur most frequently. ‘NA’ in the ‘k’ column indicates that the choice of k was not reported and could not be determined from other sources
*The Bcon assembly by the SOAPdenovo2 team submitted for the competition was assembled using an incorrectly labelled library. We analysed the corrected version that was constructed after the competition [17]
SNP statistics reported by GATK and thresholds used for filtering
| Abbreviation | Rel | Threshold | Full name |
|---|---|---|---|
| QD | < | 2.0 | Quality by Depth |
| MQ | < | 40.0 | Root mean square of Mapping Quality |
| MQRankSum | < | −12.5 | Mapping Quality Rank Sum test |
| FS | > | 60.0 | Fisher’s exact test for Strand bias |
| SOR | > | 4.0 | Strand bias Odds Ratio |
| ReadPosRankSum | < | −8.0 | Read Position Rank Sum test |
| DP | > | 200.0 | Depth of Coverage |
| GQ | < | 20.0 | Genotype Quality |
SNPs with statistics above or below (Rel) the corresponding threshold were removed from consideration. For detailed description of these statistics and justification for the threshold selection, see Van der Auwera et al. [24] and GATK documentation at https://www.broadinstitute.org/gatk/
Fig. 1Distribution of SNP positions at the 5′ end of contigs. The analysis includes a subset of five B. constrictor and three M. zebra assemblies submitted to Assemblathon 2 [17]. The description of assembly identifiers is given in Table 1
Fig. 2Distribution of SNP positions at the 5′ end of contigs in the simulated data set. The description of assembly identifiers is given in Table 1
SNP counts at position k in the simulated data sets
| Reads | Assembler | Contig | Scaffold | Transf | Untransf | Shared | Masked |
|---|---|---|---|---|---|---|---|
| Sim | allp | 57 | 7 | 21 | 1 | 18 | 12 |
| Sim | sga_m75 | 504 | 463 | NA | NA | NA | NA |
| Sim | sga_m77 | 513 | 479 | NA | NA | NA | NA |
| Sim | soap_K69 | 1481 | 698 | 255 | 451 | 137 | 44 |
| Sim | soap_K71 | 1469 | 769 | NA | NA | NA | NA |
| Bs-1 | allp | 66 | 9 | 35 | 0 | 23 | 15 |
| Bs-1 | sga_m75 | 899 | 711 | NA | NA | NA | NA |
| Bs-1 | sga_m77 | 871 | 687 | NA | NA | NA | NA |
| Bs-1 | soap_K69 | 1916 | 670 | 365 | 429 | 210 | 95 |
| Bs-1 | soap_K71 | 1909 | 692 | NA | NA | NA | NA |
Reads column indicates the origin of aligned reads: ‘Sim’ refers to the simulated paired-end reads while ‘Bs-1’ denotes the actual A. thaliana Bs-1 short-insert library [21]. Contig and Scaffold columns show the number of SNPs at position k in the respective contig and scaffold assemblies. ‘Transf’ column shows the number of SNPs at position k called against scaffolds after the scaffold coordinates were transformed to contig coordinates. Only Sim_allp and Sim_soap_K69 scaffold coordinates were transformed. ‘Untransf’ column indicates the number of SNPs that failed to transform because of contig length threshold (Sim_soap_K69) or scaffold being extended beyond the length specified in the assembler’s scaffold map (Sim_allp). ‘Shared’ column reports the number of SNPs present in both Contig and Tranformed sets. ‘Masked’ column shows the SNPs that appear in Contig and Transformed but not in Scaffold because of the change in their relative positions. All counts except ‘Untransf’ are for SNPs in position k
SNPs called by GATK when using BWA or Bowtie2 (Bt2) to align Bs-1 reads against simulated contig assemblies
| Assembly | Bt2 Total | Bt2 Pos k | BWA Total | BWA Pos k |
|---|---|---|---|---|
| Sim_allp | 97,020 | 3 | 488,839 | 69 |
| Sim_sga_m75 | 97,740 | 28 | 500,829 | 948 |
| Sim_sga_m77 | 97,727 | 27 | 501,055 | 917 |
| Sim_soap_K69 | 96,485 | 35 | 497,002 | 2016 |
| Sim_soap_K71 | 96,003 | 44 | 497,486 | 2004 |
Total columns show the total number of SNPs in any position while ‘Pos k’ columns show the number of SNPs at position k
Fig. 3Distribution of SNP positions at the 5′ end of contigs in the Bs-1 data set with repetitive element annotation. Colour indicates whether the SNPs are within repetitive sequences (blue) or not (orange). SNPs were called from the Bs-1 read alignments. Repetitive elements included all sequences reported by RepeatMasker
Fig. 4Distribution of SNP positions at the 5′ end of contigs in the Bs-1 data set after repetitive element filtering. SNPs were called from the Bs-1 read alignments and SNPs located in the annotated repetitive elements were removed