
GenomeWarp: an alignment-based variant coordinate transformation.

Cory Y McLean^1,2, Yeongwoo Hwang^1,3, Ryan Poplin^1,2, Mark A DePristo^1,2.

Abstract

SUMMARY: Reference genomes are refined to reflect error corrections and other improvements. While this process improves novel data generation and analysis, incorporating data analyzed on an older reference genome assembly requires transforming the coordinates and representations of the data to the new assembly. Multiple tools exist to perform this transformation for coordinate-only data types, but none supports accurate transformation of genome-wide short variation. Here we present GenomeWarp, a tool for efficiently transforming variants between genome assemblies. GenomeWarp transforms regions and short variants in a conservative manner to minimize false positive and negative variants in the target genome, and converts over 99% of regions and short variants from a representative human genome.
AVAILABILITY AND IMPLEMENTATION: GenomeWarp is written in Java. All source code and the user manual are freely available at https://github.com/verilylifesciences/genomewarp.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Year: 2019 PMID: 30916319 PMCID: PMC6821237 DOI: 10.1093/bioinformatics/btz218

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The Human Genome Project produced the first full draft of the human genome sequence (International Human Genome Sequencing Consortium, 2001). Since then, the assembly of the human genome has been refined and updated multiple times (International Human Genome Sequencing Consortium, 2004). Higher quality reference genome sequences improve the mapping and alignment of sequence read data, but present challenges for integrating data mapped to other genome assembly versions. The task of converting genomic regions between genome assemblies, known as lift over, is performed by creating gapped pairwise alignment chains (Kent et al.) between the assemblies and then transforming the region coordinates based on those chains. Many tools perform genomic region lift over, including UCSC LiftOver (Kuhn et al.) and CrossMap (Zhao et al.). These tools support lift over of multiple data formats, with CrossMap supporting Binary Alignment Map, Browser Extensible Data, BigWig, General Feature Format, Gene Transfer Format, Sequence Alignment Map, Wiggle and Variant Call Format (VCF). An unsupported data type of particular interest is genome-wide variation, in which both variations with respect to the reference assembly and regions that confidently match the reference assembly are encoded. These data are semantically distinct from VCF, as they allow disambiguation between regions in which genotypes are unknown and those that confidently match the reference. As such, genome-wide variation data attempt to represent an individual’s entire genome sequence, encoded with respect to the reference. Genome-wide variation data are often formatted as a Genome VCF (gVCF) file, which encodes variant sites and confidently called regions of the genome in distinct rows. Many popular variant callers, including DeepVariant (Poplin et al.) and GATK HaplotypeCaller (Van der Auwera et al.), emit gVCF output, and gVCF files are widely used as input to joint genotyping algorithms (Lin et al.; Poplin et al.).
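To make the coordinate-only lift over described above concrete, the following is a minimal sketch, not the implementation used by UCSC LiftOver or CrossMap. It assumes a chain has already been parsed into ungapped, same-strand aligned blocks; the names `AlignedBlock` and `lift_interval` are hypothetical. It also assumes the most conservative policy, in which an interval is lifted only if it lies entirely inside one block.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class AlignedBlock:
    """One ungapped, same-strand block of a chain (0-based, half-open).

    Hypothetical simplification: real chain files also carry strand,
    chromosome names and gap sizes between consecutive blocks.
    """
    source_start: int
    target_start: int
    size: int


def lift_interval(blocks: List[AlignedBlock],
                  start: int, end: int) -> Optional[Tuple[int, int]]:
    """Map the source interval [start, end) to target coordinates.

    Returns None unless the whole interval lies inside a single ungapped
    block, so an interval that spans an alignment gap is never lifted.
    """
    for block in blocks:
        block_end = block.source_start + block.size
        if block.source_start <= start and end <= block_end:
            offset = block.target_start - block.source_start
            return (start + offset, end + offset)
    return None


# Two blocks separated by 10 source bases that are absent from the target.
blocks = [AlignedBlock(100, 150, 50), AlignedBlock(160, 200, 40)]
print(lift_interval(blocks, 110, 120))  # inside the first block
print(lift_interval(blocks, 145, 165))  # spans the gap, so not lifted
```

Only coordinates are transformed here; nothing in this sketch inspects the sequence content of either assembly.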
Translating genome-wide variation data between genome assemblies is more complex than coordinate-only transformations owing to changes in the sequence content between genome assemblies (Fig. 1). Here we describe GenomeWarp, a tool for converting genome-wide short variation data between genome assemblies. Its algorithm is tuned to minimize false positive and negative variants induced by transformation, by marking regions that cannot be guaranteed to transform correctly as unknown. When realigning and recalling variants in a target genome is infeasible, GenomeWarp can accurately convert callsets across genome assemblies.
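The effect of a changed reference base can be illustrated with a small sketch. It is an illustrative simplification rather than GenomeWarp's code: it assumes a single-base site with at most one alternate allele, a same-strand alignment, and a known homologous target position. The function name `transform_site` and its tuple output are hypothetical.

```python
from typing import List, Optional, Tuple

# (target position, reference base, alternate bases, genotype indices)
Call = Tuple[int, str, List[str], Tuple[int, ...]]


def transform_site(source_ref: str, source_alt: Optional[str],
                   genotype: Tuple[int, ...],
                   target_ref: str, target_pos: int) -> Optional[Call]:
    """Re-express a single-base genotype against the target reference base.

    genotype holds allele indices into [source_ref, source_alt]; a
    confidently called reference site is passed with source_alt=None.
    Returns None when the sample matches the target reference.
    """
    if source_alt is None and any(allele != 0 for allele in genotype):
        raise ValueError("non-reference genotype requires an alternate allele")
    # Recover the bases the sample actually carries at this site.
    bases = [source_ref if allele == 0 else source_alt for allele in genotype]
    alts = sorted({base for base in bases if base != target_ref})
    if not alts:
        return None  # the variant disappears in the target assembly
    new_genotype = tuple(
        0 if base == target_ref else alts.index(base) + 1 for base in bases
    )
    return (target_pos, target_ref, alts, new_genotype)


# Homozygous A>G in the source; the target reference is already G.
print(transform_site("A", "G", (1, 1), "G", 500))
# Confident reference call A/A in the source; the target reference is G.
print(transform_site("A", None, (0, 0), "G", 500))
```

The first call yields no variant and the second yields a new homozygous variant, which is why a coordinate-only lift over of either record would be wrong in the target assembly.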

Fig. 1.

Algorithmic issues encountered when mapping variants between assemblies. Gray boxes indicate confidently called regions. Orange boxes indicate reference genome differences between assemblies. Red letters indicate reported variants in the source genome and their corresponding base pairs in the target genome. Homologous base pairs in the source and target genomes are joined by dotted black lines. (A) Reference sequence changes across genome assemblies can create or remove variants. (B) Indel variant representations can be affected by sequence outside the confidently called regions. The homozygous loss of ‘ATG’ in the source genome matches the removal of that sequence in the target genome. (C) Opposite strand alignments can cause indel representation changes. Since indels are left-aligned by convention, when strands are flipped the reference anchor base moves to the other side of the indel. This may also cause the indel location to change. (D) Indel and single nucleotide polymorphism variants can interact with each other within a single confident region.
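Panel (C) can be reproduced with a short sketch. It is a hedged illustration, not GenomeWarp's algorithm: it assumes the target region is the exact reverse complement of the source region, handles deletions only, and uses positions local to the region. The names `reverse_complement` and `flip_deletion` are hypothetical.

```python
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a sequence of canonical DNA bases."""
    return seq.translate(_COMPLEMENT)[::-1]


def flip_deletion(source_seq: str, anchor: int,
                  length: int) -> Tuple[int, str, str]:
    """Re-express a deletion on the opposite strand.

    anchor is the 0-based index of the anchor base in source_seq; the
    deleted bases are source_seq[anchor + 1 : anchor + 1 + length].
    Returns (1-based anchor position, REF, ALT) relative to the
    reverse-complemented sequence, left-aligned within that sequence.
    """
    n = len(source_seq)
    target_seq = reverse_complement(source_seq)
    # 0-based start of the deleted bases on the target strand.
    del_start = n - anchor - length - 1
    if anchor < 0 or del_start < 1:
        raise ValueError("deletion needs a flanking base on both sides")
    # Left-align: shift while the base entering on the left equals the
    # base leaving on the right, keeping one base free for the anchor.
    while (del_start > 1
           and target_seq[del_start - 1] == target_seq[del_start + length - 1]):
        del_start -= 1
    ref = target_seq[del_start - 1:del_start + length]
    alt = target_seq[del_start - 1]
    return (del_start, ref, alt)


# Source: C A G T T T A C with one T deleted, left-aligned as POS=3 GT>G.
print(flip_deletion("CAGTTTAC", anchor=2, length=1))
```

In this example the source anchor is the G preceding the T run, whereas on the opposite strand the record is anchored on the complement of the base that followed the run, and its position shifts to the left end of the corresponding A run.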

2 Materials and methods

The workflow of GenomeWarp is as follows (Supplementary Fig. S1): an input gVCF is converted into source variants and confidently called source regions. The regions are preprocessed to contain only canonical DNA characters by splitting any regions that contain ambiguous bases into non-overlapping regions that exclude those characters. The resulting source regions are then lifted over to the target assembly via a chain file of pairwise alignments, resulting in raw target region outputs. Because chain files can map multiple regions in the source assembly to a single region in the target assembly, target regions are post-processed to omit overlapping regions (Supplementary Fig. S2). For each confidently called region that is lifted over to the target assembly, all variant records within the region are considered jointly with the reference sequences to transform the representations into the set of target assembly variants that reflect the same sequence content. Many edge cases must be handled to accurately transform variants within a confidently called region from a source assembly to a target assembly (Fig. 1). The general transformation algorithm requires creating individual haplotypes based on the source and resolving them with respect to the target (Supplementary Fig. S3). However, because the human genome assemblies are quite similar in mapped sequence content (Supplementary Table S1), the general algorithm is rarely needed in practice and simpler transformations can be applied in common cases. GenomeWarp classifies regions based on reference genome composition, whether the homologous regions between assemblies are on the same genome strand, and whether the region contains any insertion/deletion (indel) variants (Supplementary Table S2). A subset of all region type transformations is supported in GenomeWarp v1.2; regions that require haplotype alignment are not transformed.
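Two of the region-processing steps in this workflow, splitting source regions at ambiguous bases and omitting overlapping target regions, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the GenomeWarp source (which is written in Java). The helper names `split_on_ambiguous` and `drop_overlapping` are hypothetical, soft-masked lowercase bases are assumed to count as canonical, and every member of an overlapping group is assumed to be dropped, which is one conservative reading of the post-processing step.

```python
import re
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]  # (chromosome, start, end), 0-based half-open


def split_on_ambiguous(chrom: str, start: int, sequence: str) -> List[Region]:
    """Split a region into sub-regions containing only A, C, G and T.

    sequence is the reference sequence of the region beginning at start.
    """
    return [
        (chrom, start + match.start(), start + match.end())
        for match in re.finditer(r"[ACGT]+", sequence.upper())
    ]


def drop_overlapping(regions: List[Region]) -> List[Region]:
    """Keep only regions that overlap no other region on their chromosome."""
    ordered = sorted(regions)
    furthest_end: Dict[str, int] = {}  # largest end seen so far, per chromosome
    kept: List[Region] = []
    for i, (chrom, start, end) in enumerate(ordered):
        overlaps_earlier = start < furthest_end.get(chrom, -1)
        # After sorting, the next region has the smallest later start.
        overlaps_later = (
            i + 1 < len(ordered)
            and ordered[i + 1][0] == chrom
            and ordered[i + 1][1] < end
        )
        if not (overlaps_earlier or overlaps_later):
            kept.append((chrom, start, end))
        furthest_end[chrom] = max(end, furthest_end.get(chrom, -1))
    return kept


print(split_on_ambiguous("chr1", 100, "ACGTNNACRTT"))
print(drop_overlapping([("chr1", 0, 10), ("chr1", 5, 15),
                        ("chr1", 20, 30), ("chr2", 5, 15)]))
```

Dropping both members of an overlapping pair converts them to unknown sequence in the target, which trades a small loss of coverage for the guarantee that no target base is described by two conflicting source regions.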
By avoiding alignment, the algorithm does not have to match the alignment parameters used in the original chain file. Unsupported transformations cause the associated confidently called region and its constituent variants to be omitted, effectively turning them into unknown regions. This ensures that the final output of GenomeWarp accurately reflects all variants within the confidently called regions present in the target assembly.
The utility of GenomeWarp is demonstrated by its conversion of HG001, the pilot benchmark callset of the Genome in a Bottle Consortium (GiaB) (Zook et al.), from the GRCh37 to the GRCh38 assembly (Supplementary Table S3). While the GiaB benchmarking regions are likely easier to transform than regions of higher complexity, this should affect the performance of all transformation tools. Over 99.9% of benchmarking regions whose coordinates can be lifted over to GRCh38 are successfully transformed, along with 99.4% of single nucleotide variants and 98.7% of indels. Compared to existing conversion methods, GenomeWarp reduces erroneous single nucleotide polymorphisms 19–35-fold and erroneous indels 9–10-fold (Supplementary Note). Indeed, GenomeWarp was used in the generation of subsequent GiaB GRCh38 reference materials for Complete Genomics, Ion Torrent and SOLiD data (Zook et al.). GenomeWarp completed the conversion using one 2.8 GHz core and 20 GB RAM in 13 min, in contrast to the hundreds of core hours required to align reads and call variants directly. Memory and compute resources scale linearly in the number of regions and variants in the source assembly, and work can be sharded across chromosomes to reduce the total RAM required.
The gold standard methodology for identifying variation in a genome assembly is to align reads to that assembly and call variants based on those reads. However, this gold standard may not be possible if the raw reads no longer exist or are otherwise unavailable for analysis. Realigning and recalling variants may also be impractical for computational or cost reasons. In these cases, GenomeWarp provides a computationally efficient mechanism to accurately transform genome-wide short variation data from one assembly to another.

References

1. Initial sequencing and analysis of the human genome.

Authors: E S Lander; L M Linton; B Birren; et al. (International Human Genome Sequencing Consortium)
Journal: Nature Date: 2001-02-15 Impact factor: 49.962

2. Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes.

Authors: W James Kent; Robert Baertsch; Angie Hinrichs; Webb Miller; David Haussler
Journal: Proc Natl Acad Sci U S A Date: 2003-09-19 Impact factor: 11.205

3. CrossMap: a versatile tool for coordinate conversion between genome assemblies.

Authors: Hao Zhao; Zhifu Sun; Jing Wang; Haojie Huang; Jean-Pierre Kocher; Liguo Wang
Journal: Bioinformatics Date: 2013-12-18 Impact factor: 6.937

4. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls.

Authors: Justin M Zook; Brad Chapman; Jason Wang; David Mittelman; Oliver Hofmann; Winston Hide; Marc Salit
Journal: Nat Biotechnol Date: 2014-02-16 Impact factor: 54.908

5. From FastQ data to high confidence variant calls: the Genome Analysis Toolkit best practices pipeline.

Authors: Geraldine A Van der Auwera; Mauricio O Carneiro; Christopher Hartl; Ryan Poplin; Guillermo Del Angel; Ami Levy-Moonshine; Tadeusz Jordan; Khalid Shakir; David Roazen; Joel Thibault; Eric Banks; Kiran V Garimella; David Altshuler; Stacey Gabriel; Mark A DePristo
Journal: Curr Protoc Bioinformatics Date: 2013

6. Finishing the euchromatic sequence of the human genome.

Authors: International Human Genome Sequencing Consortium
Journal: Nature Date: 2004-10-21 Impact factor: 49.962

7. A universal SNP and small-indel variant caller using deep neural networks.

Authors: Ryan Poplin; Pi-Chuan Chang; David Alexander; Scott Schwartz; Thomas Colthurst; Alexander Ku; Dan Newburger; Jojo Dijamco; Nam Nguyen; Pegah T Afshar; Sam S Gross; Lizzie Dorfman; Cory Y McLean; Mark A DePristo
Journal: Nat Biotechnol Date: 2018-09-24 Impact factor: 54.908

8. An open resource for accurately benchmarking small variant and reference calls.

Authors: Justin M Zook; Jennifer McDaniel; Nathan D Olson; Justin Wagner; Hemang Parikh; Haynes Heaton; Sean A Irvine; Len Trigg; Rebecca Truty; Cory Y McLean; Francisco M De La Vega; Chunlin Xiao; Stephen Sherry; Marc Salit
Journal: Nat Biotechnol Date: 2019-04-01 Impact factor: 54.908

9. The UCSC genome browser and associated tools.

Authors: Robert M Kuhn; David Haussler; W James Kent
Journal: Brief Bioinform Date: 2012-08-20 Impact factor: 11.622

