Literature DB >> 34083788

Rapid genotype imputation from sequence with reference panels.

Yingguang Frank Chan¹, Simon Myers^2,3, Robert W Davies⁴, Marek Kucka¹, Dingwen Su¹, Sinan Shi², Maeve Flanagan⁵, Christopher M Cunniff⁵.

Abstract

Inexpensive genotyping methods are essential to modern genomics. Here we present QUILT, which performs diploid genotype imputation using low-coverage whole-genome sequence data. QUILT employs Gibbs sampling to partition reads into maternal and paternal sets, facilitating rapid haploid imputation using large reference panels. We show this partitioning to be accurate over many megabases, enabling highly accurate imputation close to theoretical limits and outperforming existing methods. Moreover, QUILT can impute accurately using diverse technologies, including long reads from Oxford Nanopore Technologies, and a new form of low-cost barcoded Illumina sequencing called haplotagging, with the latter showing improved accuracy at low coverages. Relative to DNA genotyping microarrays, QUILT offers improved accuracy at reduced cost, particularly for diverse populations that are traditionally underserved in modern genomic analyses, with accuracy nearly doubling at rare SNPs. Finally, QUILT can accurately impute (four-digit) human leukocyte antigen types, the first such method from low-coverage sequence data.

Entities: Chemical

Mesh：

Year: 2021 PMID： 34083788 PMCID： PMC7611184 DOI： 10.1038/s41588-021-00877-0

Source DB: PubMed Journal: Nat Genet ISSN： 1061-4036 Impact factor: 38.330

Introduction

Large genome-wide association studies (GWAS) are essential to modern human genomics. They pinpoint individual genes that contribute to specific phenotypes, allow for accurate measures of disease heritability, reveal genetic relationships between phenotypes, and allow for dissection of the contribution of tissues to specific phenotypes[1]. In addition, large GWASs are the basis to generate accurate polygenic risk scores[2], which are essential to realizing the promise of precision medicine[3]. Over the last decade, most GWASs have first genotyped half a million or more SNPs using a genotyping microarray, and then imputed additional untyped SNPs using haplotype reference panels[4,5]. The algorithms that perform the phasing and imputation are generally derived from the Li and Stephens model[6], which models each sample individual as a mosaic of reference haplotypes. In such models, imputation accuracy increases with reference panel size, due to longer and hence more recent matches between sample and reference haplotypes, leading to increasingly large reference panels, for example the haplotype reference consortium (HRC)[7]. To handle these large panels, sophisticated phasing and imputation methods have been developed to work specifically from genotyping microarray input, resulting in fast run times and very high accuracy[5,8-11]. Recently, low-coverage whole genome sequencing (lc-WGS) has emerged as an alternative to genotyping microarrays for obtaining genotyping information for imputation[12-14]. This approach has become increasingly attractive as both library costs[15,16] and sequencing costs have decreased, and especially for non-model organisms, avoids expensive array design or low throughput costs[17]. Methods to impute from lc-WGS include some derived from approaches designed primarily for array based imputation[18], with others designed specifically for lc-WGS[17,19,20], or designed principally around non-human imputation[21-23]. However, none of these methods are designed specifically for lc-WGS using large haplotype reference panels. Data from sequencing reads from lc-WGS have different properties than probes from genotyping microarrays, and warrant different statistical models for phasing and imputation. Firstly, probes from genotyping microarrays are short, being tens of base pairs long, and as such, information from different probes, covering different SNPs, is nearly always independent. By contrast, sequencing reads are typically 100-250 bases long, may be paired, and with long read sequencing, may be many thousands of bases long. As such, information at nearby SNPs may not be independent, by coming from the same read(s). This issue is particularly acute when reads are expected to span more SNPs on average, for example highly polymorphic regions like the major histocompatibility complex (MHC), in populations of species with high genetic diversity, and with long read sequencing technologies, such as Oxford Nanopore Technologies (ONT) and Pacific Biosciences. Secondly, information from genotyping microarrays is summed across paternal and maternal haplotypes, while sequencing reads come from either the paternal or maternal haplotype. If the maternal or paternal origin of each sequencing read can be determined, imputation becomes much simpler, because maternal and paternal reads can be separated and haploid imputation performed, which has linear or better computational complexity in reference panel size. This reduction of complexity can greatly speed up computation and ensures efficient processing of many thousands of samples in large studies. Here we present QUILT, a method for rapid genotype imputation and phasing from lc-WGS using a large haplotype reference panel. QUILT uses efficient computational storage of reference haplotypes and an iterative two-step procedure with Gibbs sampling to efficiently impute samples from lc-WGS with linear computational complexity in the number of samples, SNPs and reference haplotypes. We evaluate QUILT’s performance versus genotyping microarrays on three exemplar data types, representing the diversity of sequencing approaches currently available: Illumina short read sequencing data; ONT, one example of long read sequencing data; and haplotagging data, a low-cost, scalable form of barcoded short-read (linked-read) sequencing, featuring partial sequencing of long (>100kb) haplotype segments. We evaluate QUILT’s ability to accurately partition reads and impute genotypes across these different data types using one individual sequenced on all three platforms (NA12878), and three larger additional datasets. We further demonstrate QUILT’s effectiveness in imputing across diverse populations from the 1000 Genomes Project, and QUILT’s ability to accurately impute human leukocyte antigen (HLA) Class I and Class II alleles. Finally, we investigate what these results mean for the relative merits of designing studies using lc-WGS, versus genotyping microarray technologies.

Results

Overview of model

Our method, QUILT, models each sample haplotype to be imputed as a mosaic of a panel of reference haplotypes, using a Li and Stephens model[6]. Let K be the number of reference haplotypes. Because samples are diploid, a naïve implementation of this model has computational complexity that is proportional to K, which is computationally prohibitive for large K. However, fundamentally, each sequencing read comes from either the maternal of paternal chromosome. Conceptually, if we could split them into two such sets, the imputation becomes computationally linear in K. Previously, in the STITCH model, we introduced a “pseudo-haploid” approach that had linear computational complexity[17], by assigning approximating probabilities for each read to come from each background. Here, we improved this approach by implementing a Gibbs sampler through efficient re-use of stored forward-backward probabilities, allowing for re-sampling of all read labels into the two parental chromosomes, using a single forward-backward pass of the hidden Markov model (HMM). As Gibbs sampling is slow when using the full reference panel, we iterate between using the Gibbs sampler on a subset of the reference panel to update the read labels, and performing haploid imputation using the read labels and the full reference panel to update the subset of reference panel, and in the terminal iteration, perform the imputation (Figure 1). The Gibbs sampling is fast, by using a fraction of the panel, and accurate, as it uses the best matching reference haplotypes. Final imputation results are taken as an average across multiple Gibbs processes, and the entire algorithm is computationally linear in K. Further details, including phasing as well as computational efficiencies and speedups used in the model, are described in the Methods and Supplementary Note.

Figure 1

Schematic of QUILT model.

Model shown for one Gibbs sampling. Model is initialized for a vector of read labels, and a subset of reference haplotypes. The QUILT model then iteratively proceeds between Gibbs sampling, to obtain new read labels given the current subset of reference haplotypes, and full haploid imputation, to obtain new reference haplotype subsets using the current read labels. QUILT completes after a pre-specified number of iterations. Genotype dosage is taken as an average across Gibbs samplings, while phase is taken from an additional Gibbs sampling using read labels taken as average across previous samplings.

Imputation performance for NA12878

We obtained high coverage sequence for NA12878 using standard Illumina short read (“Illumina”)[24], haplotagged Illumina short read (HT), and Oxford Nanopore Technologies long read sequence (ONT)[25]. We inferred accurate genotypes and haplotypes using high coverage trio data and phasing both using trio information and a haplotype reference panel (Methods), and use these as approximate truth haplotypes, from which we further generated probabilities that each sequencing read ultimately came from either the maternal or paternal haplotype (Methods). In what follows, we generally imputed 200 Mbp of the genome for comparisons, and for a reference panel, we used a subset of the haplotype reference consortium (HRC) dataset with NA12878 removed (therefore using 54,328 haplotypes)[7] (Methods). In later sections we use different subsets of the HRC, after removing exact matches to test sequences. We chose parameter settings for QUILT that gave a reasonable trade-off between accuracy and speed across a wide range of coverages (Supplementary Figure 1). We assessed phasing accuracy both visually by comparing the continuity of inferred to truth read labels, and quantitatively using phase switch error rate. QUILT is able to achieve increasingly accurate read label phasing as the algorithm progresses, eventually achieving a read level phasing with approximately one switch error per megabase (Figure 2; errors are visualised as flips between runs of blue and orange, over a continuous 20 Mbp region of chromosome 20). When comparing across technology platforms, standard short-read Illumina data (unlinked) generated more switches than longer-read HT or ONT data (Figure 2). Indeed phase switch error rates were 0.09% for ONT, 0.08% for HT, and 0.13% for Illumina at 1.0X coverage (Supplementary Table 1). Phasing accuracy was robustly high across a wide spectrum of sequence coverage, increasing only to 0.11% (HT) at 0.25X coverage, though genotyping error rates increased as coverage decreased (5.26% of heterozygous sites at 0.25X, vs 2.12% at 1.0X, were erroneously imputed as homozygous).

Figure 2

Assessment of read label partitioning.

Per analysis, reads are grouped based on assignment to Hap1 or Hap2, with remaining y-axis variation being jitter. x-axis gives central location of read along 20 Mbp of chromosome 20. Reads are coloured blue and orange to reflect high posterior probability of coming from truth maternal or paternal chromosome, while grey indicates equally likely from either truth chromosome. Switches between runs of orange and blue denote probable switch errors. Columns denote effect of multiple iterations (left-most, for haplotagged 1.0X), different technologies (center, for 1.0X), and coverages (right-most, for haplotagged).

We next assessed genotype imputation accuracy across data types, methods, and coverage (Figure 3A, Supplementary Table 2). To explore the contribution of phasing errors to genotyping errors, we supplied our “truth” read label origins to the QUILT framework to generate what we call “Optimal” imputation results (these are optimal in the sense of possessing idealised phase information, but still vary with read coverage). We stratified results by allele frequencies from the separate genome aggregation database (gnomAD) [26]. Throughout the results, we use “rare” to refer to SNPs with allele frequencies 0.1-0.2%, and “common” to refer to SNPs with frequencies 20-50%. Encouragingly, imputation results for QUILT at 0.5X are quite close to Optimal results and show a benefit for HT, particularly at rare SNPs (Optimal, HT, Illumina respectively; rare r2 = 0.76, 0.709, 0.678; common r2 = 0.988, 0.980, 0.975). This benefit for HT remained at lower coverages, for example for rare SNPs at 0.1X (HT, Illumina respectively; r2 = 0.460, 0.416). Comparisons with ONT are complicated slightly by the higher error rate of this platform, and at 0.5X ONT showed similar accuracy to Ilumina at rare SNPs (rare r2 = 0.669), and somewhat worse accuracy for common SNPs (r2 = 0.937), indicating a trade-off between better phasing and lower accuracy of individual reads.

Figure 3

Imputation accuracy of NA12878 sample.

r2 per-bin is aggregated over SNPs with a given gnomAD allele frequency for a given technology, coverage and method.

We compared QUILT results to GLIMPSE[20] for each analysis (Figure 3A, Supplementary Table 2). As GLIMPSE uses genotype likelihoods from a VCF for input, there is no gain in using linked-read HT data under GLIMPSE versus standard Illumina sequencing. At 0.5X, results for QUILT and GLIMPSE were similar, with QUILT HT being the most accurate (QUILT HT, QUILT Illumina, GLIMPSE Illumina respectively; rare r2 = 0.709, 0.678, 0.672; common r2 = 0.980, 0.975, 0.974). For ONT, accuracy was considerably lower with GLIMPSE than QUILT (QUILT ONT, GLIMPSE ONT; rare r2 = 0.669, 0.428; common r2 = 0.937, 0.771). Run time comparisons between both QUILT and GLIMPSE for a variety of reference panel sizes for both low (NA12878, N=3) and moderate (1000G, N=93) sample sizes confirmed that QUILT has approximately linear computational complexity in reference panel size while GLIMPSE has constant computational complexity in reference panel size for moderate sample sizes (Supplementary Figure 2). Both methods had low and approximately linear relationships between memory usage and reference panel size. Both QUILT and GLIMPSE include parameter settings that trade off accuracy versus run time. Therefore we re-ran both methods, across a range of parameter values varying from their defaults, for NA12878 (Supplementary Figure 3, N=3, with three data types), as well as Illumina for 1000 Genomes data (Supplementary Figure 4, N = 93, a more realistic scenario involving more samples). The results indicate (when considering both speed and accuracy) that for lower-coverage data (0.1X) QUILT is favoured over GLIMPSE, for all three data types but especially HT and ONT data. For moderate coverages (0.5X) the methods are similar, except for ONT, and for high-coverage data (1.0, 2.0X) GLIMPSE is favoured over QUILT for the Illumina and HT data types, while QUILT is clearly favoured for ONT data. We next compared QUILT results to those obtainable from genotyping microarrays. We approximated microarray input using high coverage WGS restricted to the relevant array SNP site list. We examined performance versus the commonly used Affymetrix UK Biobank (UKBB) and Illumina Global Screening Array (GSA) arrays, imputing using Beagle version 5.1[8] (Figure 3B, Supplementary Table 3). Both arrays yield similar accuracies at both rare SNPs (UKBB, GSA respectively; r2 = 0.694, 0.683) and common SNPs (r2 = 0.989, 0.984). For rare SNPs, all three sequencing platforms already exceeded this accuracy at 1.0X coverage (HT, Illumina, ONT respectively; r2 = 0.776, 0.754, 0.741). For the best-performing platform (HT), even at 0.5X coverage imputation was at least as accurate as arrays (r2 = 0.709 for rare SNPs, r2 = 0.988 for common SNPs), demonstrating the utility of QUILT and lc-WGS.

Imputation performance across diverse sequenced samples

In three additional datasets, we confirmed the overall patterns we saw using NA12878 between data type and imputation performance. First, we further compared HT and Illumina using 7 offspring individuals from 5 different families of North American (N=3) and Ashkenazi Jewish (N=2) background (Figure 4A, see Supplementary Table 4 for all coverages). At 0.1X, we see the same ordering as with NA12878 for Optimal, QUILT HT, QUILT Illumina and GLIMPSE for both rare (r2 = 0.463, 0.450, 0.407, 0.328) and common (r2 = 0.911, 0.891, 0.850, 0.804) SNPs, although as coverage increased, the approaches became more similar. Second, we again compared HT and Illumina data using 59 samples of British background from 1000 Genomes (GBR, using 1000 Genomes[24] notation) (Figure 4B, see Supplementary Table 5 for all coverages). At 0.25X, we see the same ordering for QUILT HT, QUILT Illumina and GLIMPSE for both rare (r2 = 0.629, 0.593, 0.519 respectively) and common (r2 = 0.947, 0.931, 0.917 respectively) SNPs. Finally, we further compared Illumina and ONT using 7 samples of diverse genetic backgrounds from Shafin et al.[27] (Figure 4C, see Supplementary Table 6 for all coverages). At 1.0X, as for NA12878 imputation, we see that imputation from Illumina is more accurate than ONT, that QUILT is more accurate than GLIMPSE on ONT data, and similarly, within datatypes, we see the same relationship between Optimal, QUILT, and GLIMPSE, both for Illumina (rare r2 = 0.7, 0.64, 0.598; common r2 = 0.97, 0.925, 0.908 respectively) as well as for ONT (rare r2 = 0.638, 0.629, 0.439; common r2 = 0.905, 0.884, 0.718 respectively).

Figure 4

Imputation accuracy of 5-Family, GBR and ONT samples.

r2 per-bin is aggregate over all SNPs in that gnomAD allele frequency bin across all samples, for a given technology, coverage and method.

Imputation performance relative to genotyping arrays

We next evaluated imputation performance versus genotyping microarrays (UKBB and GSA) across diverse samples using QUILT 1000G project samples for five groups (ASW, CEU, CHB, PJL and PUR) from distinct continental populations (Methods)[24]. As expected, imputation accuracy was highest for CEU (Northern and Western European) samples (Figure 5, Supplementary Table 7), who are most similar to the bulk of the HRC reference panel, where for rare SNPs, arrays were comparable to 0.25X sequencing data imputed using QUILT (r2 = 0.63-0.66). Interestingly for other groups, although absolute imputation quality declined, the relative performance of lc-WGS and QUILT increased compared to genotyping arrays. For example, for CHB (Han Chinese in Beijing) samples, lc-WGS and QUILT outperformed arrays for rare SNPs (QUILT, GSA, UKBB respectively; r2 = 0.581, 0.485, 0.43). For common SNPs imputation accuracy was generally high across platforms. Thus, the benefits of sequencing versus genotyping appear considerably greater for populations less similar to a given reference dataset. Higher coverage 1.0X and 2.0X lc-WGS data outperformed arrays at all frequencies, especially lower frequency variants, with QUILT nearly doubling the above array-based values (e.g. for rare SNPs, CHB, QUILT 1.0X r2 = 0.768, 2.0X r2 = 0.840).

Figure 5

Imputation accuracy of 1000 Genomes samples.

r2 per-bin is aggregate over all SNPs in that gnomAD allele frequency bin across all samples, for a given technology, coverage and method.

HLA imputation performance

Methods for accurate HLA imputation using genotypes[28,29] or high-coverage sequence data[30] have previously been developed. As part of QUILT, we developed a novel HLA imputation algorithm for lc-WGS, that uses reads inside an HLA locus for direct read mapping, and uses the remaining reads for imputation using a labelled reference panel (Online Methods). We used reference HLA data from both the HLA database IPD-IMGT/HLA[31] and 1000 Genomes Project, and used withheld 1000 Genomes data for testing. We imputed HLA types for five populations (ASW, CEU, CHB, PJL and PUR) at each of 5 classical HLA loci (HLA-A, HLA-B, HLA-C, HLA-DQB1, HLA-DRB1). In addition, we assessed the accuracy of array-based HLA imputation using the same reference data and a method derived from the approach of SNP2HLA (Methods)[28,32]. The results demonstrate that accurate HLA imputation from low-coverage sequence data is achievable across both Class I (HLA-A, HLA-B and HLA-C) and II (HLA-DQB1 and HLA-DRB1) loci, with the former showing generally higher accuracies (Figure 6, Supplementary Table 8). Using HLA-A as an example, the array based approach had an accuracy of 0.893 across all populations, while at 0.1X QUILT achieved 0.953 (0.995 CEU, 0.950 ASW (Americans of African Ancestry in SW USA)). Accuracy rose to 0.978 when only considering confidently called alleles. Accuracy decreased to 0.946 when the direct read information was not used, indicating that even at 0.1X direct read mapping can be useful. Results were consistent across coverages (Supplementary Figure 5), although direct read mapping provided a relatively larger boost in accuracy at higher coverage (9.3% at 2.0X) versus lower coverage (1.9% at 0.1X). Results for the other Class I loci of HLA-B and HLA-C, as well as HLA-DQB1 were similar, though HLA-B was less accurate across all samples, again mirroring known results[29]. Accuracies for HLA-DRB1 were lower, and this was the only locus where the array-based accuracy (0.781) slightly exceeded QUILT for 0.1X data (0.772). Nonetheless, accuracy at confidently called alleles remained high for QUILT at 0.1X, being 0.985 for HLA-DQB1, and at least 0.944 for every other locus. As expected, across all analyses, imputation accuracy was generally higher for more common HLA alleles, than rare alleles (Supplementary Table 8). Finally, we examined results for 7 specific HLA alleles chosen because they are either disease associated or associated with adverse drug reactions, similarly to previous work[29]. Except for some DRB alleles, accuracy at these medically important alleles was generally very high (Supplementary Table 9).

Figure 6

Imputation accuracy of HLA loci.

Accuracy is percent of correct unphased HLA alleles versus computationally inferred truth. Results are shown both per-population and in aggregate (ALL). Results are given both using only imputation (Imp only), as well as imputation plus direct read mapping (Joint, the default QUILT output). Results are further given at the subset of individuals with confidently inferred alleles (Joint(>0.90)). As reported elsewhere[29], HLA Class I loci (HLA-A, HLA-B and HLA-C) are less diverse than Class II loci (HLA-DRB1 and HLA-DQB1) and thus yield more accurate imputation results.

Relative effective sample size and power

Finally, we performed a cost/benefit analysis of lc-WGS and QUILT-based imputation, versus widely used genotyping microarrays. We assessed the benefit in both a GWAS and a burden test setting, where for the latter the focus is on testing for an excess of rare coding variants in cases at a specific locus (Methods). We tested a scenario of using samples drawn from the CHB population, using estimates of imputation accuracy from the 1000 Genomes CHB analysis, and assumed a fixed cost of 30 GBP per array, and library costs from Meier et al.[16]. We varied both the cost of phenotyping a sample (i.e. non-genotyping costs), as well as the per-X sequencing costs, from approximate costs available today of 1000 USD / 30X genome, to potential lower future costs of 250 USD / 30X genome. Results show that for both settings, sequencing yields nearly uniformly larger effective sample sizes or greater power than genotyping arrays (Figure 7, Supplementary Table 10, Supplementary Figure 6). For GWAS testing, lc-WGS and QUILT yield effective sample sizes 3.5 and 1.4 times larger than using genotyping microarrays for the inexpensive and expensive phenotyping cost, respectively, for identifying associations with rare SNPs of 0.1-0.2% frequency. For the burden test setting, power differences are even more pronounced. The relative gain in effective sample size or power increases as sequencing and phenotyping costs decrease. Results for fixed phenotyping and sequencing costs show the improvement of using lc-WGS and QUILT are especially pronounced at the lowest coverages. At current sequencing costs, the most cost-effective sequencing coverage is low, at less than 1X coverage, though imputation using higher coverage will be more useful in the future as sequencing costs decrease.

Figure 7

Relative increase in effective sample size and power using lc-WGS and QUILT.

Results are shown as a ratio of effective sample size for the GWAS setting, and a ratio of power for the burden test setting. Results use 1000 Genomes CHB imputation accuracy. Results for the top panel are given as a function of coverage, with variable phenotyping and per-X sequencing costs, for a fixed allele frequency (0.1-0.2%). Results for the bottom panel are given as a function of allele frequency, with varying coverage, assuming fixed phenotyping ($5 / sample) and per-X sequencing costs ($500 / 30X). All results assume a library preparation cost of 1.36 GBP /sample and an array cost of 30 GBP / sample.

Discussion

Inexpensive and accurate genotyping solutions are needed to perform the next generation of GWAS. In this work, we showed that QUILT can unlock the power of lc-WGS for this task, simultaneously improving accuracy and reducing costs for diverse data types and individuals from varied populations, compared to traditional genotyping arrays. Here we consider why we observed these results, the broad implications of the results observed here for GWAS and other studies, and finally natural extensions of this work. To begin, it is helpful to consider imputation from lc-WGS, supposing we knew the true read label assignment, i.e. whether each sequencing read came from the maternal or paternal chromosome. Neglecting the impact of errors at rates <1% for either kind of data, standard arrays type on the order of a million SNPs across the genome, typically favouring common variants, and/or higher information content. For comparison, lc-WGS sequencing at 2.0X coverage - if we know the true read label assignment - yields about 1.0X coverage for each haplotype, or about 63% of the entire genome for each haplotype under a simple Poisson model of sequencing coverage. While information from some array SNPs is lost, information from many more non-array SNPs is gained, explaining why QUILT can be more accurate than arrays at higher coverage, particularly for populations other than those for which genotyping microarrays are principally designed for. The above relies on accurate read label assignments, and in practice, while QUILT does make some errors, assignment is highly accurate, explaining why QUILT results are often near “Optimal” values, where direct phase information is provided. This explains why short-read (Illumina-based) sequencing outperforms long-read data (exemplified by ONT), because the slightly improved read label assignment possible from long reads is outweighed by the much lower per-base error rate of the short-read data. The haplotagging approach, meanwhile, improves accuracy versus unlinked Illumina reads as it retains the same per-base error rate, but decreases phasing errors, particularly at low coverages, where there still remains more phasing uncertainty. What do these results mean for GWAS today, and in the future? lc-WGS already provides similar accuracy to genotyping microarrays at only 0.1-0.25X coverage, and much higher accuracy at 1.0-2.0X. Combined with estimates of library and sequencing costs, in terms of power for a given expenditure, it is already preferable to use lc-WGS and QUILT as opposed to genotyping microarrays across a diverse range of scenarios. This is particularly true for discovering rare disease-associated variants, which recent work suggest harbour a disproportionate amount of heritability for human phenotypes[33,34]. Because QUILT, in contrast to other methods for lc-WGS, models reads directly, and can work with high per-base error rates, it is robust across sequencing technologies, ensuring accuracy will remain high regardless of future technological advances. Further, by modelling reads directly, QUILT should work well regardless of the diversity of the region of the genome, or in non-human species where heterozygosity levels are often higher, and can be an order of magnitude higher, than in most human populations[16]. The strong performance of QUILT for HLA imputation, across diverse sample backgrounds, offers a concrete confirmation of this prediction. QUILT has linear computational complexity in sample size and haplotype reference panel size, ensuring reasonable run times across a range of study sizes. In terms of future development of the model, as datasets with millions of lc-WGS become available, they will provide extensive information to improve imputation, and we intend to address this in future work. For example, we might iterate imputation, followed by incorporation of imputed sequenced samples into the reference panel. This might offer particular improvements, for example for newly discovered rare variants not represented in the reference panel. The QUILT model could also be adapted to novel settings. As one example, non-invasive prenatal testing (NIPT) using low coverage sequence is rapidly becoming the clinical standard of care[35], and can generate the genotypes, both fetal and maternal, needed for GWAS[36]. The model presented here could readily be adapted to this data type, to allow separate imputation of the three haplotypes present in NIPT: the maternal transmitted and untransmitted, and paternal transmitted sequences. This would enable recovery of the maternal, fetal and partial paternal genomes, allowing determination of e.g. carrier status for genetic risk variants, and polygenic risk scores. In conclusion, lc-WGS imputation using QUILT is flexible and accurate across a range of coverages and datatypes. It is particularly beneficial for imputing rare SNPs, and for imputing genotypes for human populations poorly represented by existing large haplotype reference panels, and should help empower a new wave of large, ethnically diverse GWAS.

Online methods

QUILT

In the QUILT model, imputation is performed through an iterative two step process. First, sequencing read labels, reflecting maternal or paternal origin, are updated using Gibbs sampling based on a subset of the full reference panel. Second, based on the read labels, the sequencing reads are split into two sets and separately imputed using the full reference panel, and this is used both to determine the best matching haplotypes for the next round of the Gibbs sampler, and in the terminal iteration, provides the imputed genotypes to be output. Here we offer more high-level detail on how this works, with full details offered in the Supplementary Note. The full details includes an explanation of a generative model under which reads would be simulated, detailed mathematics for both the Gibbs sampler and full haploid imputation, as well as a description of the phasing procedure, and an explanation of other parameter choices. Note that throughout the description of the model, the term “read” will refer to potentially discontinuous sequencing information from a single DNA fragment, either a random variable or observed, but known to come from the same underlying molecule: i.e. two reads from a read pair will be called a read, since they are observations from the same molecule. Let Gen be the genotype for some arbitrary SNP, and E[Gen|O] be the expected genotype given the observed sequencing data. Let λ be the parameters of the model, for example the recombination rate. We can use random sampling to estimate the diploid genotype dosage as We use Gibbs samplings to generate draws from H ~ P(H | O, λ). Let the vector h be our realized Gibbs sampled value at any point during the sampler. We begin by initializing h at random, i.e. for read indexed by v, h is drawn equally from {1,2} for all v (where arbitrarily 1 = maternal, 2 = paternal). Let o be the realized observations for random variable O (sequencing read bases and base qualities). We assume that the bases of a sequence read reflect the haplotype being copied from in the haplotype reference panel - specifically, who is being copied from at the central point of the read. Suppose we consider a Li and Stephens model and a hidden Markov model (HMM), where transition probabilities depend on the recombination rate in the usual way, and emission probabilities for a state (copied reference haplotype) at a site depend on the reference haplotype sequence and observed sequencing reads whose central location is that site (by the assumption above, the observations (reads) are independent between sites, ensuring this is a valid HMM). From this, we can calculate P(O=o | H = h, λ) using the forward backward algorithm. Now suppose we want to sample a new value at read label h for read v conditional on all other read labels. Let H be this random variable at this point in the Gibbs sampler, and let H be a random variable representing the remaining read labels. We therefore need to calculate for l in {1,2}, and sample h using this probability (where we note P(O = o, H = h| λ) = P(O=o | H = h, λ) P(H = h| λ) is easy to switch between as P(H = h| λ), the probability of the read labels, is 1/2 to the power of the number of reads). Naively calculating the above requires a new forward backward pass of the HMM. We are able to avoid a new forward-backward pass through efficient re-use of the forward and backward variables, and in fact re-sample all read labels under the Gibbs sampler using a single forward-backward pass. Details of this, and further details including a block Gibbs sampler, are given in the Supplementary Note. Now, while the above is computationally linear in K, the reference panel size, it can be slow for large K. Therefore, we run the above Gibbs sampler using a reduced set of haplotypes, default 400. Further, the above HMM is based on the assumption that each read has a central location, and genotypes in sequencing reads are based on the reference haplotype copied at that central location. This assumption is less accurate for long read sequencing, and moreover, is irrelevant once read labels are known, as once reads are known to come from the same haplotype, membership of a sequenced base in a read no longer matters. This in effect changes the nature of the observations at a given site from sequencing reads to sequenced bases and their base qualities, and allows us to work with haplotype genotype likelihoods. Therefore, given read labels, we split the reads into sets reflecting maternal and paternal origin, generate haplotype genotype likelihoods, and perform full haploid imputation using the entire reference panel. We then use posterior state probabilities to update the subset of the reference panel used by the Gibbs sampler, as well as optionally output the genotype dosages. Full details of this are given in the Supplementary Note.

Datasets

NA12878

Both the NA12878 sample here and the further samples from the 1000 Genomes centre came from the New York Genome Center (NYGC) resequencing effort. We generated haplotagged NA12878 Illumina short read data, and by either considering or ignoring BX tags, used this both for analyses involving haplotagging and those without. We used ~80X ONT data from Bowden et al.[25], and converted it to GRCh38 by first downsampling it using samtools (version 1.0)[37] to ~8X, converting to fastq using samtools, mapped using Minimap2[38] using --alt-drop 1.0 and used --alt with a list containing all non-autosomal, chrX, chrY or chrMT contigs, and sorted and indexed using samtools. Parental genomes were obtained from the Illumina Platinum Genomes project under ENA accession code PRJEB3381, and converted to GRCh38 per-chromosome for chromosomes 6, 20, 21, by first sorting reads by read name using samtools, then converting to fastq format using samtools, then read mapping using bwa mem (v0.7.15), then sorting, then results were combined using samtools cat, and finally re-sorted and indexed.

5-Family

We generated (9.4X to 16.4X coverage, mean 12.2X) haplotagged data both for imputation and for truth data for 7 offspring and 10 parents in 5 families from the Bloom Syndrome Repository (4 trios and one 2 parent plus 3 offspring family). Parental samples were similarly haplotagged (2.8-9.8X, mean 6.5X), though this information was not used for inference of “true” genotype or phasing. Samples were North American, and also include individuals of Ashkenazi Jewish origin. Written informed consent for all 5-Family individuals was obtained by the Institutional Review Board of Weill Cornell Medical College.

GBR

We generated low coverage (mean 0.30X, range 0.13X-0.47X) haplotagged data for the 91 GBR 10000 Genomes project samples, as described in the Haplotagging section of the Supplementary Note. We analysed the 59 samples with assessed depth greater than or equal to 0.25X. We used “truth” data for these samples from the 1000 Genomes project resequencing data from the New York Genome Centre[24].

Shafin et al. Oxford Nanopore Technologies (ONT)

We downloaded high coverage ONT and 10X Illumina data from Shafin et al. for 7 samples[27] HG01243, HG02080, HG03098, HG01109, HG02055, HG02723, HG03492 from Amazon’s pangenomics resource (https://s3-us-west-2.amazonaws.com/human-pangenomics/). We downloaded high coverage Illumina 1000 Genomes data for their parents from the NYGC. For the ONT data, we processed the first 20 million reads from the fastq file using Minimap2 (v2.1), in the same way as the NA12878 sample. For 10X Illumina data, we used the proc10xG toolkit (https://github.com/ucdavis-bioinformatics/proc10xG, commit 7afbfcf), which successively runs process_10xReads.py, bwa mem, samConcat2Tag.py and samtools view -b to generate a mapped bam file.

1000 Genomes data

We used high coverage 1000 Genomes project resequencing data from the New York Genome Centre[24]. We chose one population (ASW, CEU, CHB, PJL, PUR) from each of the 5 continental superpopulations within 1000G (AFR, EUR, EAS, SAS, AMR). To minimize computation time for full imputation we chose every 5th member of this subset to impute, resulting in N=11 ASW, N=23 CEU, N=19 CHB, N=20 PJL, N=20 PUR samples. Due to a lack of consistent parent or offspring high coverage genome availability, we did not phase the data. For imputation of HLA loci we used all members of each population, and we generated and applied a modified reference panel, as described in the HLA imputation methods section.

Mapping and downsampling

All mapping was done against the human reference genome GRCh38, and when performed using bwa mem used (at least) options -Y -K 100000000. Average depth was calculated using samtools depth -a -q 10 -Q 10 on chromosome 20 from 1 to 10 Mbp for each BAM. Reads were grouped into molecules, either using paired read information, or using the BX tag from haplotagging. Downsampling was performed per-molecule (i.e. all reads from both read pairs and given linkage information from the BX tag were taken as being from the same molecule) using a Bernoulli probability determined by the desired coverage. For the 5-Family samples, we used results from an earlier version of the chemistry for haplotagging, as a later re-run achieved less than 2X coverage for each sample. We then considered the proportions of molecules that had no BX tag, had 2 or fewer reads in a BX tag, or had 3 of more reads in a BX tag. When downsampling using data from the old chemistry, we also downsampled to ensure these proportions matched results from the new chemistry. Subsampled BAMs were written either using the original read names (QNAME field of the BAM) or using new reads names that incorporated the BX molecule information.

Reference panel

We used a controlled access version of the HRC[7] with Genbank accession number EGAD00001002729. We used liftOver to convert the reference panel from the GRCh37 to the GRCh38 reference genome using GATK Picard LiftoverVCF (v2.22.2)[39]. From the original GRCh37 HRC reference panel (39,131,578 autosomal variants), we removed 6,803 variants due to failure to map between the GRCh37 and GRCh38 reference genomes and a further 9,010 variants due to mismatching chromosome between the two reference genomes. The resulting autosomal GRCh38 HRC reference panel contain 39,115,765 variants and 27,165 samples with 54,330 haplotypes. For imputation, we created three versions of the reference panel, based on three different subsets of the full panel. We first made a subset where we removed only NA12878, and this was used for the analyses involving NA12878 and the 5-Family individuals. Secondly, we made a version where we removed NA12878, the parents on the ONT samples from Shafin et al. used in this study, and also removed those samples from the 5 populations of 1000 Genomes we used for testing imputation (ASW, CEU, CHB, PJL, PUR), and used this on the ONT and 1000 Genomes analyses. Thirdly, we made a version where we removed NA12878 and the entire GBR population, for the analyses involving the GBR dataset.

Imputation

We used QUILT v0.1.3[40], GLIMPSE[20] version 1.0.0 (github commit 00a0c55a2c23c8f47c832647a5a5e6806bc69802), and Beagle 5.1[8] (beagle.18May20.d20.jar), using default parameters across all runs. All imputation was done in 2 Mbp chunks, with 500kbp flanking buffers for all methods, on three regions: chr6:1-172000000bp, chr21:1-26000000bp, and chr21:14000001-46000000bp. QUILT utilises a pre-constructed compressed internal data format derived from the relevant .hap and .legend format files. For QUILT and GLIMPSE, we used a CEU based recombination map (CEU_omni_recombination_20130507.tar), and liftOver to generate a GRCh38 recombination rate map, across all runs. For Beagle, we used a plink format recombination rate map available from the Beagle web site (plink..fixed.GRCh38.map). For GLIMPSE, we used GATK 3.8-1-0-gf15c1c3ef UnifiedGenotyper --genotyping_mode GENOTYPE_GIVEN_ALLELES --output_mode EMIT_ALL_SITES, to generate genotype likelihoods at sites to impute, which was done using GLIMPSE_phase. For both QUILT and GATK as input for GLIMPSE, we used a minimum base quality and mapping quality of 10 (using -minBQ in GATK). For input to Beagle representing arrays, we intersected the high coverage truth genotypes called using GATK UnifiedGenotyper with sites present on the array, to generate an input VCF. To determine sites, for the UK Biobank Affymetrix array, we used an annotation file (Axiom_tx_v1.na35.annot.csv), removing a small number of array sites that did not map to the autosomes or sex chromosomes, were not unique, or were multi-allelic, and used liftOver to get results in GRCh38, yielding 761,888 SNPs. For the Illumina Global Screening Array (GSA), we used strand files from https://www.well.ox.ac.uk/~wrayner/strand/, specifically GSA-24v3-0_A2-b38.strand.

Assessing imputation and phasing accuracy

We called “truth” genotypes using high coverage Illumina short read bams at HRC bi-allelic SNPs using the GATK UnifiedGenotyper software and option “—alleles” at the bi-allelic SNPs, setting genotyping_mode GENOTYPE_GIVEN_ALLELES and output_mode EMIT_ALL_SITES[39]. We used truth genotypes from high coverage sequencing only at sites where depth was at least 6. Phasing of high coverage trios was done first assuming Mendelian inheritance, excluding triple-heterozygous sites, using bespoke R code, then the excluded sites phased using this scaffold using shapeit4[11] (version 4.0 conda installation) using the HRC reference panel with only NA12878 removed. To assess genotype imputation accuracy, we used imputed (test) dosages and high coverage (truth) genotypes. We assessed results either per-sample or across samples for SNPs in a given frequency range by constructing vectors of test dosages and truth genotypes and taking their squared Pearson correlation (in R using “cor”) at pairwise complete sites (i.e. ignoring truth sites with depth less than 6). SNP frequencies were taken from the gnomAD[26] version 3.0 release, while for the very small number of SNPs in HRC not in gnomAD we used their HRC allele frequencies. We downloaded gnomAD data from https://gnomad.broadinstitute.org/downloads using links exemplified like the following for chromosome 20: https://storage.googleapis.com/gnomad-public/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.chr20.vcf.bgz To assess phasing accuracy we used switch error rates as follows. First consider we have both “truth” integer haplotype data and test imputed haplotype dosages (i.e. two real numbers within the range of 0 to 1), at sites where the truth genotypes are heterozygous. Define as discordant any test sites that are also not heterogyzous (sum of haplotype dosages does not round to 1). On the remaining sites, define a phase switch error as when either the truth haplotypes record a change in which haplotype carries the alternate allele between adjacent heterozygous sites when the test haplotypes do not, or vice versa. We removed from consideration sites that were flipped, i.e. yielding consecutive phase switch errors. The phase switch error rate is the number of phase switch errors divided by the total number of pairs of consecutive heterozygous sites examined, and can be combined across discrete imputed windows.

HLA imputation

We used HLA reference data built as follows. In brief, we downloaded full-length HLA alignments for annotated HLA genes and pseudogenes from the HLA database IPD-IMGT/HLA[31] (version 3.39). This provides a set of (aligned) sequenced alleles for each region, to which reads can be mapped. Separately, for each HLA region, we subsetted reference haplotypes from the HRC to samples from the 1000 Genomes Project[24]. We then obtained unphased HLA types for those samples (version 20181129), previously inferred using high-coverage exome sequence data and using the PolyPheMe software[41], and previously shown to have high accuracy. We phased HLA types onto haplotypes using a bespoke approach, and then excluded all members of the test populations, and those individuals with unphased alleles, from the labelled reference panel. For full details, see the Supplementary Note. For HLA imputation, we used QUILT-HLA v0.1.6 for imputation from lc-WGS. For array based imputation, we were unable to run either SNP2HLA or the SNP2HLA module of HLA-TAPAS without error, so we used a custom HLA imputation approach implementing the algorithm used by SNP2HLA[28,32]. In brief, we first generated genotypes simulating an array, using high coverage WGS data at array sites, then used the HRC to impute additional sites from those genotypes using Beagle 4.1[8], and then finally used the labelled haplotype reference panel and the imputed genotypes to impute HLA types using Beagle 4.1. For details, see the Supplementary Note.

Cost effectiveness

We assessed the relative cost benefits of lc-WGS and genotyping microarrays using imputed results for rare SNPs (0.1-0.2%) and common SNPs (20-50%) for the CHB 1000 Genomes imputation results. We assumed a fixed array cost of 30 GBP per array, library cost of 1.36 GBP[16], and per-X sequencing costs from 1000, 500 and 250 USD / sample, i.e. per-X costs of 1000/30, 500/30 and 250/30, converting into GBP using an exchange rate of 0.79375. For the GWAS type analysis, for a fixed budget, we calculate the number of samples available for imputation as budget / (pheno_cost + librar_cost + per_x_cost x coverage), while for genotyping microarrays, the number of samples is budget / (pheno_cost + cost_array). We then take the effective sample size as the sample size multiplied by the imputation r2, and take the ratio of these to be the relative increase in effective sample size from using lc-WGS. For the burden style analysis, we assume a gene with 10 causal SNPs each possessing a frequency of 0.1% in cases and 0.01% in controls, and use the same imputation r2, i.e. the imputation r2 for SNPS at frequency 0.01% in the population. For simplicity, we may approximate this as 1 causal SNP with a frequency of f 1=1% in cases and f 2=0.1% in controls. We then suppose error is introduced governed by some parameter m, such that in cases, where X is the true genotype, and Y is the observed genotype (with imputation error), that P(X = 0, Y = 0) = 1 - f1 - m f1, P(X = 0, Y = 1) = m f1, P(X = 1, Y = 0) = m f1, P(X = 1, Y = 1) = f1 (1 - m), and for controls, that P(X = 0, Y = 0) = 1 - f2 - m f1, P(X = 0, Y = 1) = m f1, P(X = 1, Y = 0) = m f2, P(X = 1, Y = 1) = f2 (1 - m). We can then calculate m given r2. We then estimated power using 10,000 simulations using Fisher’s Exact test, given an alpha of 0.05 / 20000 (approximately Bonferroni-correcting for the number of genes in the genome), and given equal case and control numbers governed by the same N as for the GWAS analysis, where the budget was 10000 x (30 + pheno_cost), i.e. the default budget was the array cost plus phenotyping cost.

34 in total

1. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data.

Authors: Na Li; Matthew Stephens
Journal: Genetics Date: 2003-12 Impact factor: 4.562

Review 2. The personal and clinical utility of polygenic risk scores.

Authors: Ali Torkamani; Nathan E Wineinger; Eric J Topol
Journal: Nat Rev Genet Date: 2018-09 Impact factor: 53.242

3. A One-Penny Imputed Genome from Next-Generation Reference Panels.

Authors: Brian L Browning; Ying Zhou; Sharon R Browning
Journal: Am J Hum Genet Date: 2018-08-09 Impact factor: 11.025

Review 4. Dissecting the genetics of complex traits using summary association statistics.

Authors: Bogdan Pasaniuc; Alkes L Price
Journal: Nat Rev Genet Date: 2016-11-14 Impact factor: 53.242

5. Fast and accurate long-range phasing in a UK Biobank cohort.

Authors: Po-Ru Loh; Pier Francesco Palamara; Alkes L Price
Journal: Nat Genet Date: 2016-06-06 Impact factor: 38.330

6. Haplotype estimation for biobank-scale data sets.

Authors: Jared O'Connell; Kevin Sharp; Nick Shrine; Louise Wain; Ian Hall; Martin Tobin; Jean-Francois Zagury; Olivier Delaneau; Jonathan Marchini
Journal: Nat Genet Date: 2016-06-06 Impact factor: 38.330

7. A reference panel of 64,976 haplotypes for genotype imputation.

Authors: Shane McCarthy; Sayantan Das; Warren Kretzschmar; Olivier Delaneau; Andrew R Wood; Alexander Teumer; Hyun Min Kang; Christian Fuchsberger; Petr Danecek; Kevin Sharp; Yang Luo; Carlo Sidore; Alan Kwong; Nicholas Timpson; Seppo Koskinen; Scott Vrieze; Laura J Scott; He Zhang; Anubha Mahajan; Jan Veldink; Ulrike Peters; Carlos Pato; Cornelia M van Duijn; Christopher E Gillies; Ilaria Gandin; Massimo Mezzavilla; Arthur Gilly; Massimiliano Cocca; Michela Traglia; Andrea Angius; Jeffrey C Barrett; Dorrett Boomsma; Kari Branham; Gerome Breen; Chad M Brummett; Fabio Busonero; Harry Campbell; Andrew Chan; Sai Chen; Emily Chew; Francis S Collins; Laura J Corbin; George Davey Smith; George Dedoussis; Marcus Dorr; Aliki-Eleni Farmaki; Luigi Ferrucci; Lukas Forer; Ross M Fraser; Stacey Gabriel; Shawn Levy; Leif Groop; Tabitha Harrison; Andrew Hattersley; Oddgeir L Holmen; Kristian Hveem; Matthias Kretzler; James C Lee; Matt McGue; Thomas Meitinger; David Melzer; Josine L Min; Karen L Mohlke; John B Vincent; Matthias Nauck; Deborah Nickerson; Aarno Palotie; Michele Pato; Nicola Pirastu; Melvin McInnis; J Brent Richards; Cinzia Sala; Veikko Salomaa; David Schlessinger; Sebastian Schoenherr; P Eline Slagboom; Kerrin Small; Timothy Spector; Dwight Stambolian; Marcus Tuke; Jaakko Tuomilehto; Leonard H Van den Berg; Wouter Van Rheenen; Uwe Volker; Cisca Wijmenga; Daniela Toniolo; Eleftheria Zeggini; Paolo Gasparini; Matthew G Sampson; James F Wilson; Timothy Frayling; Paul I W de Bakker; Morris A Swertz; Steven McCarroll; Charles Kooperberg; Annelot Dekker; David Altshuler; Cristen Willer; William Iacono; Samuli Ripatti; Nicole Soranzo; Klaudia Walter; Anand Swaroop; Francesco Cucca; Carl A Anderson; Richard M Myers; Michael Boehnke; Mark I McCarthy; Richard Durbin
Journal: Nat Genet Date: 2016-08-22 Impact factor: 38.330

8. The UK Biobank resource with deep phenotyping and genomic data.

Authors: Clare Bycroft; Colin Freeman; Desislava Petkova; Gavin Band; Lloyd T Elliott; Kevin Sharp; Allan Motyer; Damjan Vukcevic; Olivier Delaneau; Jared O'Connell; Adrian Cortes; Samantha Welsh; Alan Young; Mark Effingham; Gil McVean; Stephen Leslie; Naomi Allen; Peter Donnelly; Jonathan Marchini
Journal: Nature Date: 2018-10-10 Impact factor: 49.962

9. Accurate, scalable and integrative haplotype estimation.

Authors: Olivier Delaneau; Jean-François Zagury; Matthew R Robinson; Jonathan L Marchini; Emmanouil T Dermitzakis
Journal: Nat Commun Date: 2019-11-28 Impact factor: 14.919

10. Power and predictive accuracy of polygenic risk scores.

Authors: Frank Dudbridge
Journal: PLoS Genet Date: 2013-03-21 Impact factor: 5.917

6 in total

1. Haplotype-aware inference of human chromosome abnormalities.

Authors: Daniel Ariad; Stephanie M Yan; Andrea R Victor; Frank L Barnes; Christo G Zouves; Manuel Viotti; Rajiv C McCoy
Journal: Proc Natl Acad Sci U S A Date: 2021-11-16 Impact factor: 11.205

Review 2. Functional genomics data: privacy risk assessment and technological mitigation.

Authors: Gamze Gürsoy; Tianxiao Li; Susanna Liu; Eric Ni; Charlotte M Brannon; Mark B Gerstein
Journal: Nat Rev Genet Date: 2021-11-10 Impact factor: 53.242

3. A population-specific reference panel for improved genotype imputation in African Americans.

Authors: Jared O'Connell; Taedong Yun; Meghan Moreno; Helen Li; Nadia Litterman; Alexey Kolesnikov; Elizabeth Noblin; Pi-Chuan Chang; Anjali Shastri; Elizabeth H Dorfman; Suyash Shringarpure; Adam Auton; Andrew Carroll; Cory Y McLean
Journal: Commun Biol Date: 2021-11-05

4. Genomic prediction using low-coverage portable Nanopore sequencing.

Authors: Harrison J Lamb; Ben J Hayes; Imtiaz A S Randhawa; Loan T Nguyen; Elizabeth M Ross
Journal: PLoS One Date: 2021-12-15 Impact factor: 3.240

5. Constructing germline research cohorts from the discarded reads of clinical tumor sequences.

Authors: Alexander Gusev; Noah Zaitlen; Stefan Groha; Kodi Taraszka; Yevgeniy R Semenov
Journal: Genome Med Date: 2021-11-08 Impact factor: 11.117

6. Marker density and statistical model designs to increase accuracy of genomic selection for wool traits in Angora rabbits.

Authors: Chao Ning; Kerui Xie; Juanjuan Huang; Yan Di; Yanyan Wang; Aiguo Yang; Jiaqing Hu; Qin Zhang; Dan Wang; Xinzhong Fan
Journal: Front Genet Date: 2022-09-02 Impact factor: 4.772

6 in total