Xiangqun Zheng-Bradley, Ian Streeter, Susan Fairley, David Richardson, Laura Clarke, Paul Flicek.
Abstract
The 1000 Genomes Project produced more than 100 trillion basepairs of short read sequence from more than 2600 samples in 26 populations over a period of five years. In its final phase, the project released over 85 million genotyped and phased variants on human reference genome assembly GRCh37. An updated reference assembly, GRCh38, was released in late 2013, but there was insufficient time for the final phase of the project analysis to change to the new assembly. Although it is possible to lift the coordinates of the 1000 Genomes Project variants to the new assembly, this is a potentially error-prone process, as coordinate remapping is appropriate only for non-repetitive regions of the genome and for regions that did not change significantly between the two assemblies. It will also miss variants in any region that was newly added to GRCh38. Thus, to produce the highest quality variants and genotypes on GRCh38, the best strategy is to realign the reads and recall the variants based on the new alignment. As the first step of variant calling for the 1000 Genomes Project data, we have finished remapping all of the 1000 Genomes sequence reads to GRCh38 with alternative scaffold-aware BWA-MEM. The resulting alignments are available as CRAM, a reference-based sequence compression format. The data have been released on our FTP site and are also available from the European Nucleotide Archive to facilitate researchers discovering variants on the primary sequences and alternative contigs of GRCh38.
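The limitation of coordinate liftover described in the abstract can be illustrated with a toy sketch. It assumes a simplified model in which the old and new assemblies are related by a list of ungapped aligned blocks; the `Block` layout, the function name `lift_position`, and the example coordinates are all hypothetical and do not reproduce any real liftover tool or chain file.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical block layout: (old_start, old_end, new_start),
# 0-based, half-open on the old assembly, sorted by old_start.
Block = Tuple[int, int, int]


def lift_position(pos: int, blocks: List[Block]) -> Optional[int]:
    """Map a position on the old assembly to the new one, or None if unmapped."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    old_start, old_end, new_start = blocks[i]
    if pos >= old_end:
        return None  # falls in a gap between aligned blocks
    return new_start + (pos - old_start)


# Toy example: old 0-1000 maps to new 0-1000, old 1500-3000 maps to new 1200-2700.
blocks = [(0, 1000, 0), (1500, 3000, 1200)]
print(lift_position(500, blocks))   # inside the first block
print(lift_position(1200, blocks))  # old-assembly sequence with no counterpart
print(lift_position(2000, blocks))  # inside the second block
```

In this toy model, new-assembly coordinates 1000 to 1200 are never the target of any lifted position, which mirrors the point that liftover cannot report variants in sequence newly added to GRCh38; only realignment and fresh variant calling can.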
Keywords: GRCh38; alignment; read mapping; reference genome; sequence reads
Year: 2017 PMID: 28531267 PMCID: PMC5522380 DOI: 10.1093/gigascience/gix038
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 6.524
Figure 1: The alignment pipeline flow chart.
Characteristics of the GRCh38 alignments
| | Low coverage WGS | WES |
|---|---|---|
| Sample count | 2691 (2535) | 2692 (2535) |
| Total bases (Gbp) | 63 744 (60 530) | 28 152 (26 571) |
| Total aligned bases (Gbp) | 66 437 (63 783) | 30 901 (28 297) |
| Percentage mapped | 96.2 (92.6) | 97.5 (93.6) |
| Percentage PCR duplicated | 3.6 (4.1) | 11.9 (13.1) |
| Mapped coverage | 8.2 (7.8) | 3.8 (3.5) |
| Mean target coverage | N/A | 101.09 (104.72) |
| %target base 20X | N/A | 84.4 (87.24) |
| CRAM file size (terabytes) | 21.2 | 9.3 |
Some metrics are presented in comparison with the 1000 Genomes Project phase 3 alignments to the GRCh37 assembly (numbers in parentheses). Mapped coverage was calculated using a nominal 3 Gb genome size.
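As a rough check on the mapped coverage row, the sketch below assumes that mapped coverage is the total aligned bases divided by the sample count and by the nominal 3 Gb genome size. That formula is an assumption inferred from the table note rather than a stated definition, and the function name `mean_mapped_coverage` is hypothetical. Under this assumption the GRCh38 values in the table are reproduced to one decimal place.

```python
def mean_mapped_coverage(total_aligned_gbp: float,
                         sample_count: int,
                         genome_size_gbp: float = 3.0) -> float:
    """Mean per-sample coverage, assuming coverage = aligned bases / (samples * genome size)."""
    if sample_count <= 0 or genome_size_gbp <= 0:
        raise ValueError("sample_count and genome_size_gbp must be positive")
    return total_aligned_gbp / (sample_count * genome_size_gbp)


# GRCh38 values taken from the table above.
wgs = mean_mapped_coverage(66437, 2691)  # low coverage WGS
wes = mean_mapped_coverage(30901, 2692)  # WES
print(f"WGS mapped coverage: {wgs:.1f}x")  # prints 8.2x
print(f"WES mapped coverage: {wes:.1f}x")  # prints 3.8x
```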
Figure 2: Measurements of mapping quality and total read depth by chromosome for the low-coverage WGS sequence. (A) Average mapping quality across all samples. (B) Total read count per site with mapping quality of 0 across all samples. (C) Total read depth in all samples.
Figure 3: Percentage of sites on chromosome Y by total coverage, showing the expected peak at approximately 5000×.
Comparison of GRCh37 and GRCh38 genome accessibility masks
| | N | L | H | Z | Q | P |
|---|---|---|---|---|---|---|
| GRCh37-strict | 7.66% | 1.13% | 0.55% | 17.20% | 2.98% | 70.49% |
| GRCh37-pilot | 7.66% | 1.13% | 0.24% | 2.74% | | 88.23% |
| GRCh38-strict | 5.33% | 1.44% | 1.04% | 18.07% | 0.03% | 74.09% |
| GRCh38-pilot | 5.33% | 1.44% | 0.56% | 3.67% | | 89.00% |
H: accumulative read depth too high; L: accumulative read depth too low; N: bases that are “N”; P: sites passed the accessibility test; Q: mapping quality less than cutoff; Z: too many reads with mapping quality 0.
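The category codes in the legend can be read as a per-site decision procedure. The sketch below is a minimal illustration under stated assumptions: the order in which the tests are applied, the `MaskThresholds` fields, and every threshold value shown are hypothetical placeholders and are not the cutoffs used for the published masks. Setting `min_mean_mapq` to `None` disables the Q test, in the way a mask without a mapping quality criterion would.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MaskThresholds:
    # All fields are hypothetical parameters for illustration only.
    min_depth: int                  # below this, site is "L"
    max_depth: int                  # above this, site is "H"
    max_mq0_fraction: float         # above this fraction of MQ0 reads, site is "Z"
    min_mean_mapq: Optional[float]  # below this mean MAPQ, site is "Q"; None skips the test


def classify_site(base: str, depth: int, mq0_reads: int,
                  mean_mapq: float, t: MaskThresholds) -> str:
    """Return one of N, L, H, Z, Q, P for a single reference site."""
    if base.upper() == "N":
        return "N"  # reference base is unknown
    if depth < t.min_depth:
        return "L"  # accumulated read depth too low
    if depth > t.max_depth:
        return "H"  # accumulated read depth too high
    if depth > 0 and mq0_reads / depth > t.max_mq0_fraction:
        return "Z"  # too many reads with mapping quality 0
    if t.min_mean_mapq is not None and mean_mapq < t.min_mean_mapq:
        return "Q"  # mean mapping quality below cutoff
    return "P"      # site passes the accessibility test


# Placeholder thresholds, not the project's actual cutoffs.
strict_like = MaskThresholds(min_depth=10, max_depth=30,
                             max_mq0_fraction=0.001, min_mean_mapq=56.0)
print(classify_site("A", 20, 0, 60.0, strict_like))
```

Because each site receives exactly one code, the percentages in a row of the table sum to 100%, up to rounding.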
| Software | Installation instructions | Codebase |
|---|---|---|
| eHive | | |
| ReseqTrack | | |
| BWA-MEM | | |
| BioBamBam | | |
| GATK | | |
| CRAMTools | | |