A novel nonlinear dimension reduction approach to infer population structure for low-coverage sequencing data.

Miao Zhang, Yiwen Liu, Hua Zhou, Joseph Watkins, Jin Zhou.

Abstract

BACKGROUND: Low-depth sequencing allows researchers to increase sample size at the expense of lower accuracy. To incorporate uncertainties while maintaining statistical power, we introduce MCPCA_PopGen to analyze population structure of low-depth sequencing data.
RESULTS: The method optimizes the choice of nonlinear transformations of dosages to maximize the Ky Fan norm of the covariance matrix. The transformation incorporates the uncertainty in calling between heterozygotes and the common homozygotes for loci having a rare allele and is more linear when both variants are common.
CONCLUSIONS: We apply MCPCA_PopGen to samples from two indigenous Siberian populations and reveal hidden population structure accurately using only a single chromosome. The MCPCA_PopGen package is available on https://github.com/yiwenstat/MCPCA_PopGen .

Keywords: Data-adaptive; Dimension reduction; Low-coverage; Non-linear kernel; Population structure

Year: 2021 PMID: 34174829 PMCID: PMC8236193 DOI: 10.1186/s12859-021-04265-7

Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169

Background

High-throughput sequencing technologies are capable of generating billions of short sequence reads at scale [6]. Different sequencing designs and platforms provide options balancing accuracy and cost. High-depth whole-genome sequencing identifies nearly all variants along the genome with high confidence but at high cost [3, 4, 14]. As a cost-effective alternative, low- to medium-depth next-generation sequencing (NGS) has lower accuracy, especially for rare-variant identification and genotype calling, but at much lower cost [2, 23, 30, 40, 42]. Low-coverage sequencing has been shown to be valuable for a variety of population genetic problems, e.g., population structure [37], conservation biology [12], ancient DNA [1], and single-cell RNA sequencing [15]. In humans, ultra-low-coverage sequencing has been widely adopted for non-invasive prenatal tests of maternal plasma [24]. Compared with high-coverage sequencing data, genotypes from low-coverage sequencing data are noisier and thus carry higher levels of uncertainty [29]. Downstream analyses that work from the raw sequencing data and incorporate these uncertainties are advantageous and can be comparable to analyses of high-depth NGS [14, 22]. Therefore, researchers can afford to sequence more samples at comparable cost with minimal sacrifice in statistical power.
One fundamental dimension reduction technique for NGS data is principal component analysis (PCA) [19]. This analysis determines the principal components (PCs), i.e., the linear projection of the original variables onto a low-dimensional vector space that maximally explains the variance of the data. Among its many applications, PCA is a widely adopted tool in genetic studies to infer population structure [26, 27, 32, 33, 44]. However, PCA is not designed to reveal nonlinear relationships that may arise, for example, from the uncertainties in low-depth genomic data.
Several methods, including IsoMap [41], locally linear embedding (LLE) [36], and kernel PCA (KPCA) [39], have been developed to capture nonlinear patterns. KPCA provides nonlinear versions of the PCA algorithm and has been successfully applied to gene expression data for the classification of samples [25, 35]. However, KPCA suffers from two major limitations: (1) the kernel must be pre-specified; (2) the corresponding transformation is identical at each locus, whereas the appropriate form of transformation may depend upon the alleles' characteristics, e.g., rare or common alleles (see Additional file 1: Fig. S1). To optimize the usage of ultra-low-coverage sequencing datasets, we propose an extension of a data-adaptive approach, Maximally Correlated Principal Component Analysis (MCPCA) [11], which naturally addresses both limitations. In addition, our method uses genotype likelihoods rather than any single genotype call. Taking into account the uncertainty of raw sequencing reads provides an opportunity to model the nonlinear patterns in population genetics data. In particular, we employ a continuous value, i.e., dosage (see Fig. 1), to summarize the uncertainty in genotype calling. MCPCA is designed to determine a transformed dosage value at each locus j so as to maximize the sum of a pre-specified number of eigenvalues of the transformed dosage covariance matrix (the Ky Fan norm [10]). We name our method MCPCA_PopGen; it aims to analyze the population structure of low-coverage sequencing data. It applies MCPCA to genotype dosages and finds the optimal transformations to explain a maximum proportion of the variance in the data. Our simulation reveals two major properties of MCPCA_PopGen for analyzing low-coverage sequencing data. For a locus with a low minor allele frequency (MAF), the transformation emphasizes the uncertainty in calling between the heterozygous and the major homozygous genotypes.
On the other hand, the transformation is more linear when variants are common (see Additional file 1: Fig. S1). We performed extensive simulations and demonstrated the benefit of MCPCA over standard PCA and KPCA for low-coverage data. We applied MCPCA to two indigenous Siberian populations. The optimized MCPCA explains a much higher percentage of the variance and more clearly distinguishes these two populations even when limited to the genetic information from a single chromosome.

Fig. 1

The histograms of true genotypes across all 19,530 SNPs and the histograms of genotype dosage values for coverage depths 10, 5, and 1 when MAFs are low, medium, and high. Genotype dosages are the posterior mean of the genotype under additive coding. With values 0, 1 and 2 assigned to the genotypes (major, major), (major, minor) and (minor, minor), respectively, the genotype dosage is $p_1 + 2p_2$, where $p_1$ and $p_2$ denote the conditional (“posterior”) probabilities for the genotypes (major, minor) and (minor, minor)
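As a small illustration of the dosage defined in the caption, the following sketch computes the posterior mean genotype under additive coding from the two posterior probabilities. The function name `genotype_dosage` and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the MCPCA_PopGen package.

```python
def genotype_dosage(p_het, p_hom_minor):
    """Posterior mean genotype under additive coding (0, 1, 2).

    p_het:       posterior probability of the (major, minor) genotype
    p_hom_minor: posterior probability of the (minor, minor) genotype
    """
    if not (0.0 <= p_het <= 1.0 and 0.0 <= p_hom_minor <= 1.0):
        raise ValueError("probabilities must lie in [0, 1]")
    if p_het + p_hom_minor > 1.0 + 1e-9:
        raise ValueError("posterior probabilities sum to more than 1")
    # The (major, major) genotype contributes 0 to the mean.
    return p_het + 2.0 * p_hom_minor


# Confident heterozygous call: the dosage is close to the integer 1.
confident = genotype_dosage(0.98, 0.01)
# Uncertain call, as might occur at low depth: the dosage falls between 0 and 1.
uncertain = genotype_dosage(0.45, 0.05)
```

With confident calls the dosage stays near 0, 1, or 2, whereas uncertain calls produce intermediate values, which is the spread visible in the low-depth histograms.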

Results

Simulation studies

Variance explained by MCPCA_PopGen

We evaluate the MCPCA method (MCPCA_PopGen) using three types of genotype calls: (1) true genotypes, (2) observed genotypes with errors, and (3) genotype dosages. Genotypes were simulated using the ms package [17] from three populations (African, Caucasian, and Asian); the ms commands used to simulate genotypes are included in Additional file 1: Sect. 3. Genotypes take values in {0, 1, 2}, representing the minor allele count carried by each individual at each locus. Observed genotypes were generated by perturbing the known genotypes under specified coverage depths as developed in [8]. Genotype dosage is the posterior mean of the genotype calls under additive coding (Fig. 1) [43]. Details of the simulation procedures are provided in the “Methods” section. As illustrated in Table 1, observed genotypes at low coverage depths have high error rates in these simulated datasets. When the coverage depth is low, the “best-guess” genotypes frequently differ from the true genotypes. In our simulation studies, we evaluate the total variance explained by the top q MCPCs. We compare the computational efficiency across different values of q and different numbers of single nucleotide polymorphisms (SNPs) used to generate PCs. Finally, we compare the performance of MCPCA_PopGen with PCA and KPCA.

Table 1

The percentage of error calling and the average Phred quality scores for observed genotypes across all 19530 SNPs in simulated datasets

Coverage depth	Percentage of error calling (%)	Mean quality score (SD)
1×	70.49	3.37 (1.34)
5×	12.59	15.28 (7.45)
10×	3.19	29.53 (13.27)

Determine the optimal number of MCPCs

Choosing the number of maximally correlated principal components q is essential. A small q may result in loss of information, whereas the computational time increases if a large value of q is selected. To provide a practical guide for choosing the number of MCPCs, we demonstrate in Fig. 2 how much more of the variance is explained with increasing values of q. The MCPCA algorithm is applied to the true genotype data (MCPCA-TG), the dosage data (MCPCA-DS), and the discretized dosage data (MCPCA-Intv, MCPCA-Freq, and MCPCA-Jenks, i.e., the MCPCA algorithm applied to dosage data discretized with the equal width, equal frequency, and Jenks binning methods, respectively). Please refer to the “Methods” section for details.

Fig. 2

(top) The proportions of the variances explained by the first 6 and 10 PCs. The data consist of 150 samples, each having 1458 SNPs. (bottom) The CPU time of implementing MCPCA_PopGen given different choices of q when the number of SNPs ranges from 1000 to 16,000. MCPCA-TG: MCPCA applied to the true genotype data; MCPCA-DS: MCPCA applied to the genotype dosage data; MCPCA-Intv, MCPCA-Freq, and MCPCA-Jenks: MCPCA using the equal width, equal frequency, and Jenks binning methods

We used the proportion of variance explained by the true genotype (MCPCA-TG) as a baseline. As shown in Fig. 2 (top panel), MCPCA-DS explains a much larger proportion of the variance than MCPCA-TG, indicating overfitting due to the excessive number of distinct categories. By implementing MCPCA on optimally discretized dosage values (MCPCA-Intv, MCPCA-Freq, and MCPCA-Jenks), we avoid overfitting. Note that all three discretization methods achieve proportions of explained variance comparable to that of MCPCA-TG. We also illustrate how the CPU time for implementing the proposed algorithm changes as we vary q and the number of SNPs p in the data. For q ranging from 2 to 15, the CPU time grows polynomially as p increases. The computational complexity of each iteration of the MCPCA algorithm is given in [11]. When p is small relative to n, the algorithm is nearly linear in n, which makes this approach suitable for data sets with a large number of individuals (e.g., biobank-scale studies). When the number of SNPs p substantially exceeds the sample size n, or when the two are on the same scale, the MCPCA_PopGen algorithm runs in cubic time. To balance the interpretability, effectiveness, and efficiency of our algorithm, we suggest a choice of q of at most 20 when p is large; a pruning procedure for choosing SNPs for analysis should also be adopted [5]. Our analyses were performed using 11 cores and 6 GB of memory.

Performance comparisons

The performance of MCPCA_PopGen was compared with that of PCA with respect to the proportion of variance explained by the first q PCs. The results were summarized over 100 simulation replicates. In all scenarios, we set q = 10. Figure 3 displays the barplot of variances explained by the top 10 PCs over 100 simulation runs. In all scenarios, MCPCA or PCA on dosage data performs better than on the observed genotypes (PCA-OG and MCPCA-OG), indicating that dosage values preserve more information by taking into account the uncertainty in genotype calling. MCPCA outperforms PCA under the different discretization methods in all scenarios, especially when the coverage depth is low (Fig. 3, left panel). As illustrated in Additional file 1: Fig. S1, MCPCA finds nonlinear transformations of dosage values at loci with low MAF, emphasizing the uncertainty in calling between the heterozygous and the major homozygous genotypes. Among the three discretization methods, MCPCA using the Jenks discretization has the highest explained variance. We also applied KPCA to the observed genotypes (KPCA-OG) and the dosage data (KPCA-DS). The polynomial kernel was adopted instead of the Gaussian kernel, since KPCA performed better with a polynomial kernel in our simulation studies. In all scenarios, KPCA did not perform well at the higher coverage depths. When coverage was low (i.e., 1×), it performed similarly to PCA. These results suggest that an adaptive transformation according to data coverage depth is needed rather than a “one-size-fits-all” approach.
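To make the trade-off in choosing q concrete, here is a minimal sketch, assuming NumPy, that computes the cumulative proportion of variance explained for ordinary PCA on toy data and picks the smallest q after which each further component adds less than 1% of the variance. The toy data, the function name, and the 1% cutoff are illustrative assumptions; the paper does not prescribe this rule, and the sketch uses plain PCA instead of the MCPCA optimization.

```python
import numpy as np


def cumulative_variance_explained(X):
    """Cumulative proportion of variance explained by the leading PCs of X."""
    Xc = X - X.mean(axis=0)                      # center each column
    # Squared singular values are proportional to the covariance eigenvalues.
    s = np.linalg.svd(Xc, compute_uv=False)
    eig = s ** 2
    return np.cumsum(eig) / eig.sum()


rng = np.random.default_rng(0)
# Toy data: 150 samples, 40 features, driven by 3 latent directions plus noise.
latent = rng.normal(size=(150, 3))
loadings = rng.normal(size=(3, 40))
X = latent @ loadings + 0.5 * rng.normal(size=(150, 40))

pve = cumulative_variance_explained(X)
gains = np.diff(pve, prepend=0.0)                # marginal gain of each PC
below = np.flatnonzero(gains < 0.01)             # PCs adding less than 1%
q = int(below[0]) if below.size else len(gains)  # number of PCs worth keeping
```

Plotting `pve` against the number of components gives a curve of the kind summarized in Fig. 2 (top), and the point where it flattens is a natural candidate for q.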

Fig. 3

The average of variances explained by the top 10 PCs over 100 simulation replicates. In these scenarios, KPCA-TG, PCA-TG, and MCPCA-TG explained 0.2893, 0.3768, and 0.3982 of the total variance, respectively. Error bars represent one standard deviation away from the mean

Prediction accuracy of MCPCA

In this section, we illustrate the performance of the MCPCA method in predicting sample identities by utilizing nonlinear patterns among predictors. The true model is shown in Fig. 4a. Two groups of samples were simulated such that a nonlinear curve in two of the predictors clearly separates the two groups (Fig. 4a). We further generated additional predictors from a standard normal distribution. The sample sizes for groups 1 and 2 were set to 200 and 100, respectively. We applied MCPCA, PCA, and KPCA to the simulated data and projected the samples into the two-dimensional spaces formed by their embeddings. MCPCA distinguished the two groups more clearly (Fig. 4b and c). To evaluate the prediction accuracy, we trained random forests to predict sample identities using the two-dimensional embeddings generated by MCPCA, PCA, and KPCA. When implementing MCPCA, three discretization methods (MCPCA-Freq, MCPCA-Intv, and MCPCA-Jenks) were used (see “Methods” section). The within-group and overall accuracies of the predictions were measured through out-of-bag (OOB) prediction errors over 100 simulation replicates. In all scenarios, MCPCA with the different discretization methods achieved higher accuracies than PCA and KPCA and was robust in both groups, even when p was much larger than the sample size (Fig. 4d). To summarize, the MCPCA method enables the discovery of nonlinear transformations of predictors whose linear combinations provide better prediction accuracy.

Fig. 4

a Illustrates that two groups of samples are generated such that a nonlinear curve in two of the predictors separates the two groups. b and c Present the visualization of the two groups using the first two components from MCPCA and PCA, respectively. d Shows the barplots of prediction accuracy of MCPCA, PCA, and KPCA under various scenarios. Error bars represent one standard deviation away from the mean

Application to Siberian population

Based on low-coverage whole exome sequencing data, [16] reported evidence for cold adaptation in two indigenous Siberian populations: the Nganasan (nomadic hunters, NGA) from the Taymyr Peninsula in the Arctic Ocean, and the Yakut (herders, YAK) of North-Central Siberia (more detail on the data is provided in [16]). This low-coverage data set provides an excellent opportunity to test the ability of MCPCA_PopGen to classify the two groups. Utilizing genotype posterior probabilities extracted from Binary Sequence Alignment/Map format (BAM) files by the software ANGSD [22], we calculated the dosage values. For comparison, we also applied ngsPopGen [13] and PCA (PCA-DS) to these data. Like MCPCA_PopGen, the approach in ngsPopGen approximates the covariance matrix among individuals using posterior probabilities of sample allele frequencies, and thus accounts for the uncertainty of low-quality and/or low-coverage sequencing data. For the PCA-DS method, instead of using posterior probabilities, we calculated the covariance matrix using genotype dosages. As the posterior mean of the genotype, dosage also summarizes the uncertainty in genotype calling. Eigen-decomposition of the two resulting covariance matrices then enables us to perform PCA. We illustrate the performance of MCPCA_PopGen in Figs. 5 and 6. For Fig. 5, we applied MCPCA_PopGen, ngsPopGen, and PCA-DS to the data obtained from chromosomes 20, 21, and 22. First, note that MCPCA_PopGen more clearly separates the two populations. In addition, the first two principal components of MCPCA_PopGen explain at least 13% of the variance, whereas those of ngsPopGen and PCA-DS explain around 8-10%. In preparation for Fig. 6, we computed genotype posterior probabilities across all 22 human chromosomes. After filtering, this provides a total of 51,673 SNPs for analysis. We display the top 6 PCs from MCPCA_PopGen, ngsPopGen, and PCA-DS.
The MCPCA plots are consistent with reported histories of these two groups. As shown in [34], the Yakuts are more admixed (with Mongolian populations) than the Nganasan. The top plot seems to show two somewhat distinct Yakuts populations. The data were taken from two villages which do not match the clustering in the MCPCA plot [16]. However, analysis of ancient DNA [21] reveals evidence of Yakuts parent-child relationships in graves 70 km apart, indicative of a mobile population. As noted in [28], PCA may not be able to distinguish between migration and a population split. Both [20, 34] found evidence of severe bottlenecks in the Nganasan. This is displayed in the plot showing that except for one individual, the MCPCA plots for the Nganasan in both the PC3/PC4 and PC5/PC6 plots are very tightly clustered.
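The PCA-DS comparison described in this section (a covariance matrix among individuals computed from dosages, followed by eigen-decomposition) can be sketched as follows, assuming NumPy. The per-SNP centering, the division by the number of SNPs, the function name `pca_from_dosage`, and the toy two-population data are assumptions made for illustration; the exact normalization used in the paper is not stated here, and neither ANGSD nor ngsPopGen is called.

```python
import numpy as np


def pca_from_dosage(dosage, n_components=2):
    """PCA on an (n individuals x p SNPs) dosage matrix with values in [0, 2]."""
    centered = dosage - dosage.mean(axis=0)        # center each SNP
    # n x n covariance among individuals (normalization is an assumption).
    cov = centered @ centered.T / dosage.shape[1]
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top = eigvals[order].clip(min=0)
    explained = top / eigvals.clip(min=0).sum()    # proportion of variance per PC
    scores = eigvecs[:, order] * np.sqrt(top)      # individual coordinates
    return scores, explained


rng = np.random.default_rng(1)
# Toy data: two populations of 20 individuals with diverged allele frequencies.
freq_a = rng.uniform(0.1, 0.9, size=500)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.15, size=500), 0.05, 0.95)
pop_a = rng.binomial(2, freq_a, size=(20, 500)).astype(float)
pop_b = rng.binomial(2, freq_b, size=(20, 500)).astype(float)
dosage = np.vstack([pop_a, pop_b])

scores, explained = pca_from_dosage(dosage)
```

Plotting the two columns of `scores` colored by population gives the kind of scatter plot shown in Fig. 5 (right). In real low-coverage data the entries of `dosage` would be non-integer posterior means.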

Fig. 5

(left) MCPCA plot for chromosome 20–22; (middle) PCA plot (ngsPopGen) for chromosome 20–22; (right) PCA plot (PCA-DS) for chromosome 20–22 for the Nganasan (NGA) and Yakuts (YAK) samples

Fig. 6

(left) MCPCA plot for top 6 PCs; (middle) PCA plot (ngsPopGen) for top 6 PCs; (right) PCA plot (PCA-DS) for top 6 PCs. 51,673 SNPs across Chromosome 1–22 were used

Discussion

In genetic studies, PCA is a widely adopted dimension reduction tool to infer population structure and to adjust for population stratification. Unlike high-density SNP arrays, new sequencing technologies allow us to model the genotype uncertainty of raw sequencing reads rather than make a hard call of any single genotype, and they provide options for balancing accuracy and cost. New approaches are needed to make effective use of this type of data. In this article, we introduce a dimension reduction approach for low-coverage sequencing data. To account for genotype uncertainty, we propose the use of dosage values instead of discrete genotypes. By considering both genotype uncertainty and nonlinear correlations, our method transforms each SNP sequentially by maximizing the sum of the top q eigenvalues of the transformed covariance matrix. The advantage of our method is that the data are used to optimize the transformation for each SNP, which is not possible in KPCA. From our simulations, we learned that the transformation is more nonlinear for SNPs with low MAF, emphasizing the difference between the heterozygous and the major homozygous genotypes, and more linear for common variants. To balance computational feasibility, overfitting, and statistical power, we analyzed three candidate methods to discretize dosage values. In simulation studies, we demonstrate that our method achieves higher fractions of variance explained by meta-features than PCA and KPCA. In the Siberian data analysis, our method more clearly distinguishes the two populations even when limited to the genetic information from one chromosome. Our method is particularly effective in increasing power for low-coverage sequencing data, thus offering an option for researchers with a limited budget to conduct studies in medical and population genetics, as well as to assess population structure for threatened or endangered species.
With its advantage in low-coverage data, we believe MCPCA offers an attractive approach to the study of non-model organisms [7], which are often associated with the absence of closely related reference genomes and with challenging sample material. The limitations of our method are as follows: (1) MCPCA is likely to be computationally intensive if the number of SNPs or the number of output PCs is large; (2) although discretization of the dosage values is deemed necessary for the MCPCA method, it might lead to loss of information. We defer these limitations to future research.

Conclusions

In this paper, we introduce a dimension reduction tool MCPCA_PopGen to analyze population structure of low-depth sequencing data.

Methods

Find optimal MCPCs

Let $D$ be an $n \times p$ matrix whose $(i, j)$th element is the discretized dosage value for the $i$th individual at the $j$th SNP. Let $D_j$ represent the vector of dosage values of the $j$th SNP across the $n$ individuals, and define the nonlinear transformations as $\phi_j$, $j = 1, \dots, p$. Thus $\phi_j(D_j)$ are the vectors of transformed dosage values. We restrict ourselves to standardized transformations, $E[\phi_j(D_j)] = 0$ and $E[\phi_j(D_j)^2] = 1$, and consider the collection of covariance matrices,

$$\mathcal{K} = \left\{ K_\phi : K_\phi(j, j') = E\left[\phi_j(D_j)\,\phi_{j'}(D_{j'})\right],\ 1 \le j, j' \le p \right\}. \qquad (1)$$

For a given value of q, [11] proposed the choice $\phi_j^{*}$, $j = 1, \dots, p$, to maximize the sum of the top q eigenvalues, i.e., $K_{\phi^{*}}$ achieves the Ky Fan q-norm

$$\max_{K_\phi \in \mathcal{K}} \sum_{r=1}^{q} \lambda_r(K_\phi),$$

where $\lambda_r(K_\phi)$ is the $r$th largest eigenvalue of $K_\phi$. MCPCA thus can be considered as a generalization of PCA over all possible nonlinear transformations of predictors. The q optimal maximally correlated principal components (MCPCs) achieve the Ky Fan q-norm. Because PCA is based on computing eigenvalues for the special choice of $\phi$ where each component is a linear function, the sum of the top q eigenvalues for PCA is upper bounded by the Ky Fan q-norm. To solve this optimization problem, we adopted the block coordinate descent algorithm [11]. Implementation of the algorithm for genetic data requires, as with PCA, replacing the expectations in (1) with sample means.
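The objective above can be illustrated numerically. The following sketch, assuming NumPy, evaluates the sum of the top q eigenvalues for a sample covariance matrix of standardized variables, first with linear (identity) transformations and then with one hand-picked nonlinear transformation. The toy variables and the choice of squaring the first variable are assumptions for illustration only; the actual method optimizes the transformations by block coordinate descent [11], which is not implemented here.

```python
import numpy as np


def ky_fan_norm(K, q):
    """Sum of the q largest eigenvalues of a symmetric matrix K."""
    eigvals = np.linalg.eigvalsh(K)              # ascending order
    return float(eigvals[-q:].sum())


def standardize(v):
    """Rescale to zero sample mean and unit sample variance."""
    return (v - v.mean()) / v.std()


rng = np.random.default_rng(0)
x1 = rng.normal(size=2000)
x2 = x1 ** 2 + 0.1 * rng.normal(size=2000)       # depends on x1 nonlinearly

# Linear case: ordinary PCA on the standardized variables.
linear = np.column_stack([standardize(x1), standardize(x2)])
K_linear = linear.T @ linear / len(x1)

# Hand-picked nonlinear transformation of x1 (MCPCA would learn it from data).
transformed = np.column_stack([standardize(x1 ** 2), standardize(x2)])
K_transformed = transformed.T @ transformed / len(x1)

top1_linear = ky_fan_norm(K_linear, 1)
top1_transformed = ky_fan_norm(K_transformed, 1)
```

Because the two variables are nearly uncorrelated but strongly dependent, the leading eigenvalue is close to 1 in the linear case and close to 2 after the transformation, which is the sense in which the PCA value is bounded above by the Ky Fan q-norm.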

Discretize dosage values

Discretization of the dosage values is necessary to create a computationally feasible algorithm. We evaluated several discretization protocols: the equal width, equal frequency, and Jenks binning methods [18], with the number of bins, m, determined by the Freedman-Diaconis rule (equation (S1) in Additional file 1). Discretization is performed for each SNP individually. For equal width binning, we divide the range of the dosage values for a given SNP into m bins of equal interval length. For equal frequency binning, we instead choose the bin boundaries so that each bin contains an equal number of dosage values. However, if the data contain duplicated values, equal frequency binning may not achieve perfectly equal-sized groups. For Jenks binning, we partition the dosage values into m clusters such that the within-cluster variation is minimized and the between-cluster variation is maximized. To avoid the label-switching problem in Jenks binning, we assign labels to the m clusters according to their group means. For ease of presentation, we refer to MCPCA with these discretization methods as MCPCA-Intv, MCPCA-Freq, and MCPCA-Jenks, respectively.
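A rough sketch of two of the binning schemes, assuming NumPy, is given here. The bin-count function uses the textbook Freedman-Diaconis bin width, 2·IQR·n^(-1/3), which may differ in detail from equation (S1) in Additional file 1; the function names and the toy dosage vector are hypothetical, and Jenks binning is omitted because it requires a separate optimization.

```python
import numpy as np


def freedman_diaconis_bins(values):
    """Number of bins from the textbook Freedman-Diaconis rule (at least 1)."""
    q75, q25 = np.percentile(values, [75, 25])
    width = 2.0 * (q75 - q25) / len(values) ** (1.0 / 3.0)
    if width == 0:
        return 1
    return max(1, int(np.ceil((values.max() - values.min()) / width)))


def equal_width_labels(values, m):
    """Labels 0..m-1 from m bins of equal interval length."""
    edges = np.linspace(values.min(), values.max(), m + 1)
    return np.digitize(values, edges[1:-1])      # interior edges only


def equal_frequency_labels(values, m):
    """Labels 0..m-1 from m bins holding (roughly) equal numbers of values."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, m + 1))
    return np.digitize(values, edges[1:-1])


rng = np.random.default_rng(0)
# Toy dosages for one low-MAF SNP: most of the mass lies near 0.
dosage = 2.0 * rng.beta(0.5, 5.0, size=200)

m_fd = freedman_diaconis_bins(dosage)            # data-driven number of bins
labels_width = equal_width_labels(dosage, 4)
labels_freq = equal_frequency_labels(dosage, 4)
```

For a skewed SNP like this one, equal width binning puts most individuals in the first bin, while equal frequency binning assigns 50 individuals to each of the four bins.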

Simulation

We evaluate MCPCA_PopGen using three types of genotype calls.

Perfectly known genotypes. To simulate the genotype data under a variety of assumptions concerning migration, recombination rate, and population size under neutral models, we used the coalescent simulator ms to simulate haplotypes for 50 individuals from each of three populations (African, Caucasian, and Asian) [17]. Then we generated the genotypes of admixed individuals based on the ms output (see Supplemental Material for the ms commands adopted to generate genotypes from admixed populations). After obtaining genotypes, we filtered out rare variants with minor allele frequency (MAF) below 0.05. These data play the role of perfectly known genotypes that come with high-coverage NGS. The genotype is treated as the minor allele count (i.e., 0, 1, or 2) carried by individual i at locus j.

Observed genotypes (with error). We generated the observed genotypes under different coverage depths by perturbing the true genotypes with sequencing qualities sampled from the 1000 Genomes project [8, 9]. More specifically, we simulated each observed genotype by perturbing the true genotype using errors generated from a Bernoulli distribution whose probability is determined by the Phred quality score, which in turn depends on the coverage depth. At a given mean depth, the number of reads for each genotype was sampled from a Gamma distribution with shape and scale parameters 6.3 and depth/6.3 [8, 31, 38]. Then the quality score was sampled from the quality scores in the 1000 Genomes project whose observed number of reads is closest to the number of reads simulated from the mean coverage. Thus, we generated the observed genotypes along with the corresponding base-calling error probabilities.

Dosage genotypes. Dosage genotypes are the posterior mean of the genotype under additive coding. With values 0, 1 and 2 assigned to the genotypes (major, major), (major, minor) and (minor, minor), respectively, the dosage is $p_1 + 2p_2$, where $p_1$ and $p_2$ denote the conditional (“posterior”) probabilities for the genotypes (major, minor) and (minor, minor). Our method can also be applied to dosage data imputed by Mach/Thunder [23].
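The error-generation step for observed genotypes can be sketched with the Python standard library as follows. Three things are assumptions of this sketch and not details confirmed by the text: the standard Phred convention (error probability 10^(-Q/10)), the rule that an erroneous call is drawn uniformly from the two other genotypes, and the fixed quality score used in place of sampling from 1000 Genomes quality scores. All function names are hypothetical.

```python
import random


def phred_to_error_prob(q):
    """Standard Phred convention: Q = -10 * log10(error probability)."""
    return 10.0 ** (-q / 10.0)


def simulate_read_count(mean_depth, rng, shape=6.3):
    """Gamma-distributed depth with shape 6.3 and scale mean_depth / 6.3."""
    return rng.gammavariate(shape, mean_depth / shape)


def perturb_genotype(true_genotype, quality, rng):
    """Return an observed genotype; a Bernoulli draw decides whether it is wrong.

    Picking the wrong genotype uniformly is an assumption of this sketch.
    """
    if rng.random() < phred_to_error_prob(quality):
        return rng.choice([g for g in (0, 1, 2) if g != true_genotype])
    return true_genotype


rng = random.Random(42)

# Read counts at a mean depth of 5x: their average should be close to 5.
depths = [simulate_read_count(5.0, rng) for _ in range(20000)]
mean_depth = sum(depths) / len(depths)

# Heterozygous truth observed at a fixed quality of Q = 10 (10% error).
observed = [perturb_genotype(1, 10.0, rng) for _ in range(20000)]
error_rate = sum(g != 1 for g in observed) / len(observed)
```

In the actual simulation the quality score varies with the simulated read count, so lower depths yield lower qualities and more frequent miscalls, as summarized in Table 1.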

Implementation

MCPCA_PopGen is an open-source package. The source code of MCPCA is provided by [11] in MATLAB. To make it easier to install and use, we provide the entire MCPCA_PopGen package in the high-performance Julia language. Both the ms commands for generating genotypes and the documented source code for MCPCA_PopGen are hosted on GitHub: https://github.com/yiwenstat/MCPCA_PopGen.

Additional file 1. Details of estimating nonlinear transformation, discretization schemes, and simulation commands.

References (40 in total)

1. Generating samples under a Wright-Fisher neutral model of genetic variation.

Authors: Richard R Hudson
Journal: Bioinformatics Date: 2002-02 Impact factor: 6.937

2. Maximum Properties and Inequalities for the Eigenvalues of Completely Continuous Operators.

Authors: K Fan
Journal: Proc Natl Acad Sci U S A Date: 1951-11 Impact factor: 11.205

3. Principal components analysis corrects for stratification in genome-wide association studies.

Authors: Alkes L Price; Nick J Patterson; Robert M Plenge; Michael E Weinblatt; Nancy A Shadick; David Reich
Journal: Nat Genet Date: 2006-07-23 Impact factor: 38.330

4. Analysis commons, a team approach to discovery in a big-data environment for genetic epidemiology.

Authors: Jennifer A Brody; Alanna C Morrison; Joshua C Bis; Jeffrey R O'Connell; Michael R Brown; Jennifer E Huffman; Darren C Ames; Andrew Carroll; Matthew P Conomos; Stacey Gabriel; Richard A Gibbs; Stephanie M Gogarten; Namrata Gupta; Cashell E Jaquish; Andrew D Johnson; Joshua P Lewis; Xiaoming Liu; Alisa K Manning; George J Papanicolaou; Achilleas N Pitsillides; Kenneth M Rice; William Salerno; Colleen M Sitlani; Nicholas L Smith; Susan R Heckbert; Cathy C Laurie; Braxton D Mitchell; Ramachandran S Vasan; Stephen S Rich; Jerome I Rotter; James G Wilson; Eric Boerwinkle; Bruce M Psaty; L Adrienne Cupples
Journal: Nat Genet Date: 2017-10-27 Impact factor: 38.330

5. The Allelic Landscape of Human Blood Cell Trait Variation and Links to Common Complex Disease.

Authors: William J Astle; Heather Elding; Tao Jiang; Dave Allen; Dace Ruklisa; Alice L Mann; Daniel Mead; Heleen Bouman; Fernando Riveros-Mckay; Myrto A Kostadima; John J Lambourne; Suthesh Sivapalaratnam; Kate Downes; Kousik Kundu; Lorenzo Bomba; Kim Berentsen; John R Bradley; Louise C Daugherty; Olivier Delaneau; Kathleen Freson; Stephen F Garner; Luigi Grassi; Jose Guerrero; Matthias Haimel; Eva M Janssen-Megens; Anita Kaan; Mihir Kamat; Bowon Kim; Amit Mandoli; Jonathan Marchini; Joost H A Martens; Stuart Meacham; Karyn Megy; Jared O'Connell; Romina Petersen; Nilofar Sharifi; Simon M Sheard; James R Staley; Salih Tuna; Martijn van der Ent; Klaudia Walter; Shuang-Yin Wang; Eleanor Wheeler; Steven P Wilder; Valentina Iotchkova; Carmel Moore; Jennifer Sambrook; Hendrik G Stunnenberg; Emanuele Di Angelantonio; Stephen Kaptoge; Taco W Kuijpers; Enrique Carrillo-de-Santa-Pau; David Juan; Daniel Rico; Alfonso Valencia; Lu Chen; Bing Ge; Louella Vasquez; Tony Kwan; Diego Garrido-Martín; Stephen Watt; Ying Yang; Roderic Guigo; Stephan Beck; Dirk S Paul; Tomi Pastinen; David Bujold; Guillaume Bourque; Mattia Frontini; John Danesh; David J Roberts; Willem H Ouwehand; Adam S Butterworth; Nicole Soranzo
Journal: Cell Date: 2016-11-17 Impact factor: 41.582

6. A genealogical interpretation of principal components analysis.

Authors: Gil McVean
Journal: PLoS Genet Date: 2009-10-16 Impact factor: 5.917

7. Accurate whole human genome sequencing using reversible terminator chemistry.

Authors: David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal: Nature Date: 2008-11-06 Impact factor: 49.962

8. The UK10K project identifies rare variants in health and disease.

Authors: Klaudia Walter; Josine L Min; Jie Huang; Lucy Crooks; Yasin Memari; Shane McCarthy; John R B Perry; ChangJiang Xu; Marta Futema; Daniel Lawson; Valentina Iotchkova; Stephan Schiffels; Audrey E Hendricks; Petr Danecek; Rui Li; James Floyd; Louise V Wain; Inês Barroso; Steve E Humphries; Matthew E Hurles; Eleftheria Zeggini; Jeffrey C Barrett; Vincent Plagnol; J Brent Richards; Celia M T Greenwood; Nicholas J Timpson; Richard Durbin; Nicole Soranzo
Journal: Nature Date: 2015-09-14 Impact factor: 49.962

9. Rapid, ultra low coverage copy number profiling of cell-free DNA as a precision oncology screening strategy.

Authors: Daniel H Hovelson; Chia-Jen Liu; Yugang Wang; Qing Kang; James Henderson; Amy Gursky; Scott Brockman; Nithya Ramnath; John C Krauss; Moshe Talpaz; Malathi Kandarpa; Rashmi Chugh; Missy Tuck; Kirk Herman; Catherine S Grasso; Michael J Quist; Felix Y Feng; Christine Haakenson; John Langmore; Emmanuel Kamberov; Tim Tesmer; Hatim Husain; Robert J Lonigro; Dan Robinson; David C Smith; Ajjai S Alva; Maha H Hussain; Arul M Chinnaiyan; Muneesh Tewari; Ryan E Mills; Todd M Morgan; Scott A Tomlins
Journal: Oncotarget Date: 2017-09-22

10. Understanding 6th-century barbarian social organization and migration through paleogenomics.

Authors: Carlos Eduardo G Amorim; Stefania Vai; Cosimo Posth; Alessandra Modi; István Koncz; Susanne Hakenbeck; Maria Cristina La Rocca; Balazs Mende; Dean Bobo; Walter Pohl; Luisella Pejrani Baricco; Elena Bedini; Paolo Francalacci; Caterina Giostra; Tivadar Vida; Daniel Winger; Uta von Freeden; Silvia Ghirotto; Martina Lari; Guido Barbujani; Johannes Krause; David Caramelli; Patrick J Geary; Krishna R Veeramah
Journal: Nat Commun Date: 2018-09-11 Impact factor: 14.919