A Thousand Fly Genomes: An Expanded Drosophila Genome Nexus.

Justin B Lack, Jeremy D Lange, Alison D Tang, Russell B Corbett-Detig, John E Pool.

Abstract

The Drosophila Genome Nexus is a population genomic resource that provides D. melanogaster genomes from multiple sources. To facilitate comparisons across data sets, genomes are aligned using a common reference alignment pipeline which involves two rounds of mapping. Regions of residual heterozygosity, identity-by-descent, and recent population admixture are annotated to enable data filtering based on the user's needs. Here, we present a significant expansion of the Drosophila Genome Nexus, which brings the current data object to a total of 1,121 wild-derived genomes. New additions include 305 previously unpublished genomes from inbred lines representing six population samples in Egypt, Ethiopia, France, and South Africa, along with another 193 genomes added from recently published data sets. We also provide an aligned D. simulans genome to facilitate divergence comparisons. This improved resource will broaden the range of population genomic questions that can be addressed from multi-population allele frequencies and haplotypes in this model species. The larger set of genomes will also enhance the discovery of functionally relevant natural variation that exists within and between populations.
© The Author 2016. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.

Keywords:  Drosophila melanogaster; data resource; genomic variation; reference alignment

Year:  2016        PMID: 27687565      PMCID: PMC5100052          DOI: 10.1093/molbev/msw195

Source DB:  PubMed          Journal:  Mol Biol Evol        ISSN: 0737-4038            Impact factor:   16.240


Introduction

The genetics model Drosophila melanogaster has played a pivotal role in population genetic research. A growing number of studies have generated population genomic data from this species, but alignment and filtering criteria typically vary among studies, which obscures direct comparisons between these data sets. The Drosophila Genome Nexus (DGN; Lack ; http://www.johnpool.net/genomes.html; last accessed September 20, 2016) provides the research community with genomes from multiple published sources that are generated using a common reference alignment pipeline. This more consistent data object is intended to facilitate comparisons of genomic variation between data sets with less potential for methodological bias. The DGN pipeline improved upon typical reference alignment protocols by including a second round of mapping to a modified reference genome that incorporates the variants detected in the first round, a practice that resulted in improved genomic coverage and accuracy (Lack ). Version 1.0 of the DGN included 623 genomes of D. melanogaster from individual wild-derived strains, originating from five data sets (table 1). Phase 2 of the Drosophila Population Genomics Project (DPGP; Pool ) included 139 genomes from 22 populations, mainly from Africa. D. melanogaster was known to have originated in sub-Saharan Africa (Lachaise ), and this study identified southern-central Africa as the likely ancestral range. It also identified significant recent gene flow re-entering Africa, potentially related to urban adaptation, and powerful effects of inversions on genomic variation (Pool ). This geographic sampling across Africa was supplemented by a set of genomes denoted in the first DGN publication as AGES (African Genomes Extended Sequencing; Lack ). Phase 3 of DPGP focused on a putative ancestral range population identified in the previous study, and brought this Zambia sample to a total of 197 independent, haploid genomes from a single location (Lack ). 
That study, which also introduced the DGN, confirmed that the focal Zambia sample was maximally diverse among all sampled populations, with minimal presence of non-African admixture (Lack ). Most of the DPGP2 genomes and all of the DPGP3 and AGES genomes were sequenced from haploid embryos (Langley ). Most other DGN genomes were sequenced from inbred or isofemale lines (supplementary tables S1 and S2, Supplementary Material online).
Table 1

Genomic Data Sets Present in the Drosophila Genome Nexus Are Summarized.

| DGN Set | Data reference | Genomes | Populations | Geographic focus | Genome type |
|---|---|---|---|---|---|
| Present in DGN 1.0 | | | | | |
| DGRP | Mackay et al. 2012; Huang et al. 2014 | 205 | 1 | North America | Inbred |
| DPGP3 | Lack et al. 2015 | 197 | 1 | Africa (Zambia) | Haploid |
| DPGP2 | Pool et al. 2012; Langley et al. 2012 | 150 | 22 | Mostly Africa | Mostly Haploid |
| AGES | Lack et al. 2015 | 53 | 12 | Africa | Haploid |
| DSPR | King et al. 2012 | 18 | 18 | Worldwide | Inbred |
| Added in DGN 1.1 | | | | | |
| POOL | (present study) | 305 | 6 | Africa/Europe | Inbred |
| CLARK | Grenier et al. 2015 | 85 | 7 | Worldwide | Inbred |
| NUZHDIN | Campo et al. 2013; Kao et al. 2015 | 58 | 13 | North America | Inbred |
| BERGMAN | Bergman and Haddrill 2015 | 50 | 3 | Africa/Eur./N. Am. | Isofemale |

Note.—Further details concerning the population samples and individual genomes represented in these data sets are given in Table S1 and Table S2, respectively.

Another data source for DGN 1.0 was the Drosophila Genetic Reference Panel (DGRP), which consists of 205 genomes originating from Raleigh, North Carolina, USA (Mackay ; Huang ), and has been widely used in genome-wide association studies. These genomes were from strains inbred for 20 generations, resulting in 87% homozygous regions across euchromatic chromosome arms (Lack ). North American populations appear to have resulted from admixture between European and African gene pools; a recent study that examined population ancestry along DGRP genomes estimated this population to be 20% African, with significant genome-wide evidence for incompatibilities between African and European alleles at unlinked loci (Pool 2015). Beyond the above data sources, DGN 1.0 also included Malawi chromosome extraction line genomes from DPGP Phase 1 (Langley ), which are grouped with DPGP2 genomes in the DGN. It also featured source strain genomes from the Drosophila Synthetic Population Resource (DSPR; King ), a trait mapping resource that encompasses more than 1,700 recombinant inbred lines.

In the present release, labeled as version 1.1 of the DGN, we add a total of 498 genomes. Of these, 305 are newly published in this study, and were sequenced from strains inbred for eight generations. These genomes were added to much smaller samples of genomes originating from a pair of Ethiopian populations (EA, EF), a pair of South African populations (SD, SP), and populations from Egypt (EG) and France (FR). Genomic sequencing was performed using methods identical to those described by (Lack ). Briefly, for each inbred line, ∼30 female flies were used to prepare genomic DNA libraries. Sequencing on a HiSeq 2000 was performed to generate paired-end 100 bp reads with ∼300 bp inserts.

DGN 1.1 also adds 193 genomes from four published studies. The Global Diversity Lines (GDL; Grenier ) include 85 genomes from Australia, China, the Netherlands, the USA, and Zimbabwe. The 50 genomes published by Bergman and Haddrill (2015) originate from France, Ghana, and the USA. Campo studied 35 genomes from a California population. Kao added 23 genomes originating from 12 New World locations. The data sets represented in DGN 1.1 are summarized in table 1. The 74 population samples they encompass are described in supplementary table S1, Supplementary Material online, and many of these are depicted in figure 1. Characteristics of all 1,121 individual strain genomes are given in supplementary table S2, Supplementary Material online. Instead of just three geographic population samples with at least 15 sequenced genomes (as in DGN 1.0), 14 population samples now fit this criterion, with five of these having more than 60 genomes (fig. 1).
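The genome counts quoted in the text can be cross-checked against Table 1 with a few lines of arithmetic. The sketch below simply copies the per-data-set counts from the table; the dictionary and function names are illustrative only and do not correspond to any DGN file or script.

```python
# Genome counts per data set, copied from Table 1.
DGN_1_0 = {"DGRP": 205, "DPGP3": 197, "DPGP2": 150, "AGES": 53, "DSPR": 18}
DGN_1_1_ADDED = {"POOL": 305, "CLARK": 85, "NUZHDIN": 58, "BERGMAN": 50}


def release_totals(previous, added):
    """Return (genomes in the earlier release, genomes added, combined total)."""
    prev_total = sum(previous.values())
    added_total = sum(added.values())
    return prev_total, added_total, prev_total + added_total


def previously_published_additions(added, new_set="POOL"):
    """Genomes added from already-published studies (everything except new_set)."""
    return sum(n for name, n in added.items() if name != new_set)


if __name__ == "__main__":
    print(release_totals(DGN_1_0, DGN_1_1_ADDED))
    print(previously_published_additions(DGN_1_1_ADDED))
```

Under these inputs the totals are 623 genomes in DGN 1.0, 498 added, and 1,121 overall, with 193 of the additions coming from previously published studies, matching the figures given in the text.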
Fig. 1

Geographic locations of selected population samples are shown, with the largest samples in bold print. These populations have at least three sequenced genomes with DGN consensus sequences available.

Importantly, the genomic alignments present in the prior DGN release have not been altered in version 1.1. Instead, we have supplemented the existing data resource by aligning and filtering the additional genomes using exactly the same pipeline described for DGN 1.0, again using the Flybase release 5.57 D. melanogaster reference genome (Lack ). Beginning with raw sequence read data, mapping is performed using BWA v0.5.9 (Li and Durbin 2010) followed by Stampy v1.0.20 (Lunter and Goodson 2010). GATK (DePristo et al. 2011) is then used to realign indels and generate consensus sequences. Called SNPs and indels are then incorporated into a genome-specific modified reference sequence, and read mapping is performed a second time to reduce mismatches. Genomic coordinates are then shifted back to match the original reference numbering. The “site” and “indel” variant call files (VCFs) provided by DGN are the direct output of this pipeline.

DGN also distributes consensus sequence files that feature additional filtering, and may be more appropriate for most analyses. To reduce the error rate, sites within 3 bp of a called indel are masked to “N”. For genomes that may contain residual heterozygosity, genomic intervals of apparent heterozygosity are fully masked. For fully haploid genomes (Langley ), sites with an excess of apparent heterozygosity (e.g., due to technical artifacts or structural variation) are similarly masked as “pseudoheterozygosity”. Following such masking (in addition to removal of non-target chromosome arms from samples such as chromosome extraction line genomes), we find that an average site has homozygous consensus sequence calls from 754 DGN genomes.

We also provide files to enable user-initiated masking for two additional criteria. First, we allow regions of “identity by descent” (IBD) due to relatedness between genomes in the same population sample to be masked. Second, we allow users to mask from sub-Saharan genomes regions of recent admixture from non-African populations (Pool ). Full details on the alignment and filtering processes are given by Lack . Detailed filtering outcomes for heterozygosity, relatedness IBD, and admixture are provided in supplementary tables S3–S5, Supplementary Material online, respectively. Users can also deploy a script to extract FastA alignments for specific genomic regions from downloaded data.

Filtering characteristics of several data sets are depicted in figure 2. Substantial heterozygosity persists in genomes sequenced from inbred lines (GDL, Campo, Kao, Pool, DGRP), in spite of inbreeding efforts that would be expected to reduce heterozygosity to nominal levels under neutral assumptions. Note that in figure 2, “heterozygosity” also includes regions masked due to elevated heterozygous site rates for reasons such as copy number variation or data quality (“pseudoheterozygosity”; Lack ). For example, the DGRP data set is estimated to have just 13% genuine heterozygosity (Lack ). Previous analysis has shown that most genuine residual heterozygosity is associated with inversions (Grenier ; Lack ). Inversion genotypes based on prior published calls and the method of Corbett-Detig are given in supplementary table S2, Supplementary Material online. Genomes from the Bergman and Haddrill (2015) data set, which were sequenced from isofemale lines, were estimated to be 99% heterozygous. DGN provides VCFs but not heterozygosity-filtered consensus sequences for these genomes.
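The indel-proximity masking described above can be illustrated with a minimal sketch. It assumes a consensus sequence held as a Python string and indels given as 0-based, half-open intervals in reference coordinates, and it masks the indel interval itself along with the 3 bp on each side. The function names and the interval representation are assumptions made for illustration; this is not the DGN pipeline code, and the same interval masker could equally be applied to heterozygosity, IBD, or admixture intervals.

```python
def mask_intervals(sequence, intervals, mask_char="N"):
    """Return `sequence` with every 0-based, half-open (start, end) interval masked."""
    seq = list(sequence)
    n = len(seq)
    for start, end in intervals:
        # Clamp to the sequence bounds so intervals near the ends are safe.
        for i in range(max(0, start), min(n, end)):
            seq[i] = mask_char
    return "".join(seq)


def mask_near_indels(consensus, indel_intervals, flank=3):
    """Mask each indel interval plus `flank` bp on either side.

    `indel_intervals` is a hypothetical representation: an iterable of
    0-based, half-open (start, end) pairs in reference coordinates.
    """
    padded = [(start - flank, end + flank) for start, end in indel_intervals]
    return mask_intervals(consensus, padded)


if __name__ == "__main__":
    print(mask_near_indels("ACGTACGTACGTACGT", [(8, 9)]))
```

With a single 1 bp indel at position 8 of a 16 bp sequence, positions 5 through 11 are replaced by "N" and the rest of the sequence is left unchanged.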
Fig. 2

The extent of genomic data annotated for masking due to heterozygosity, relatedness, and admixture is shown per 119 Mb genome (filtered in that order).

Figure 2 also shows the proportion of data sets that can be masked for relatedness IBD. These IBD tracts can allow the estimation of an average coefficient of relationship (Wright 1922) for each population sample, which may be viewed as the probability that two random genomes are IBD at a given site due to recent relatedness. Focusing on population samples with at least 15 genomes, we estimate that for most population samples, a random pair of individuals has a coefficient of relationship between 0.001 and 0.005 (supplementary table S4, Supplementary Material online), or roughly the relatedness of fourth cousins. A few populations have lower values (EG, RG, SP, and the DPGP3 Zambia ZI sample). Relatedness in one population (Netherlands N) was an order of magnitude higher than any other; its coefficient of relationship (0.046) exceeded the expectation for second cousins. Thus, it may be important to account for relatedness IBD (e.g., by masking the provided intervals) if analysis will assume that unrelated alleles are being compared.

Pool found evidence for substantial recent gene flow from non-African populations back into sub-Saharan genomes. Masking admixed genomic regions may allow sub-Saharan genetic diversity to be studied more directly, with fewer departures from typical assumptions of well-mixed populations. Admixture levels are known to vary drastically between sub-Saharan populations, partly as a function of urbanization (Pool ). Of the data sets shown in figure 2, “Pool” is composed mostly of sub-Saharan genomes (62% from Ethiopia or South Africa), whereas one sixth of “GDL” consists of Zimbabwe genomes. “DPGP3” is a sample of 197 genomes from a single Zambia population with very low levels of admixture (Lack ).
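The cousin comparisons above can be checked with a short calculation. The sketch below assumes the standard pedigree expectation for full k-th cousins, r = (1/2)^(2k+1), and a simple estimator that divides the total length of pairwise IBD tracts by the number of genome pairs times the genome length. The function names and the tract tuple layout are hypothetical; they are not taken from the DGN files or scripts, and the estimator actually used for supplementary table S4 may differ.

```python
def cousin_expectation(degree):
    """Expected coefficient of relationship for full cousins of the given degree.

    degree=1 is first cousins (1/8), degree=2 second cousins (1/32), and so on.
    """
    return 0.5 ** (2 * degree + 1)


def mean_pairwise_ibd_fraction(ibd_tracts, n_genomes, genome_length):
    """Average fraction of the genome that a random pair of genomes shares IBD.

    `ibd_tracts` is an assumed layout: (genome_a, genome_b, start, end) tuples
    with half-open coordinates, one tuple per IBD tract between a pair.
    """
    n_pairs = n_genomes * (n_genomes - 1) // 2
    total_ibd = sum(end - start for _a, _b, start, end in ibd_tracts)
    return total_ibd / (n_pairs * genome_length)


if __name__ == "__main__":
    for k in range(1, 5):
        print(k, cousin_expectation(k))
```

Under this expectation, second cousins have r = 1/32 ≈ 0.031, which the Netherlands value of 0.046 exceeds, and fourth cousins have r = 1/512 ≈ 0.002, which falls inside the 0.001 to 0.005 range reported for most population samples.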
Among the DGN 1.1 samples, 15 worldwide populations are represented by at least 10 genomes for all three euchromatic chromosomes. A summary of genetic variation within and between these populations is provided in figure 3. As previously indicated, genomic diversity is highest in Zambia and other southern African populations (Pool ; Lack ), and all sub-Saharan populations are more diverse than all others. Because North American populations have mainly European but partly African ancestry (Kao ; Pool 2015; Bergland ), they show somewhat higher diversity than European populations. Geographic structure is apparent, especially between sub-Saharan populations and all others, with the latter group showing a common reduced gene pool apparently resulting from a population bottleneck. Additional bottlenecks may have impacted the B population from China (Laurent ) and the EF population from the Ethiopian highlands (Pool ; Lack ), leading to mild population-specific reductions in diversity and increases in genetic differentiation (fig. 3).
Fig. 3

Average values of nucleotide diversity (π) within populations (on the diagonal), average pairwise distance between populations (Dxy, above the diagonal), and FST between populations (below the diagonal) are shown. Values are averaged across chromosome arms X, 2L, 2R, 3L, and 3R, each of which was analyzed using inversion-free genomes only.
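For readers who want to compute comparable summaries from consensus sequences, the following is a minimal sketch of π, Dxy, and FST on aligned haploid sequences. It assumes equal-length strings in which masked sites are "N", averages per-pair distances rather than per-site counts, and uses a simple ratio estimator, FST = 1 − (mean within-population π) / Dxy. These choices and all function names are assumptions for illustration and are not necessarily the estimators used to produce figure 3.

```python
from itertools import combinations, product


def pairwise_distance(seq_a, seq_b):
    """Per-site differences between two aligned sequences, skipping masked ('N') sites."""
    diffs = compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == "N" or b == "N":
            continue
        compared += 1
        diffs += a != b
    return diffs / compared if compared else None


def mean_distance(pairs):
    """Mean per-site distance over sequence pairs that share at least one unmasked site."""
    values = [d for d in (pairwise_distance(a, b) for a, b in pairs) if d is not None]
    return sum(values) / len(values)


def nucleotide_diversity(pop):
    """π: mean pairwise distance among sequences from one population."""
    return mean_distance(combinations(pop, 2))


def dxy(pop_x, pop_y):
    """Dxy: mean pairwise distance between sequences from two populations."""
    return mean_distance(product(pop_x, pop_y))


def fst(pop_x, pop_y):
    """Ratio-style FST: 1 - (mean within-population π) / Dxy."""
    within = (nucleotide_diversity(pop_x) + nucleotide_diversity(pop_y)) / 2
    return 1 - within / dxy(pop_x, pop_y)


if __name__ == "__main__":
    pop_x = ["AAAA", "AAAT"]
    pop_y = ["TTAA", "TTAT"]
    print(nucleotide_diversity(pop_x), dxy(pop_x, pop_y), fst(pop_x, pop_y))
```

For the toy populations in the usage example, π is 0.25 within each population, Dxy is 0.625, and FST is approximately 0.6. The sketch raises an error if a population has fewer than two sequences or if no pair shares an unmasked site.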

In addition to the above-described D. melanogaster genomes, DGN now also distributes a D. simulans sequence aligned to the same D. melanogaster reference genome. Stanley and Kulathinal (2016) produced this alignment using progressiveMauve (Darling ) to align the release 2 D. simulans genome (Hu ) to the release 5 D. melanogaster reference sequence. We provide sequence text files mirroring our D. melanogaster consensus sequences for D. simulans on the DGN web site (http://www.johnpool.net/genomes.html; last accessed September 20, 2016). Note that for all data hosted by DGN, users should cite the original publications (supplementary table S2, Supplementary Material online) in addition to this alignment resource.

This expansion of the DGN will significantly bolster researchers’ ability to examine genetic variation within and between D. melanogaster populations. Future DGN releases will entail realigning all genomes using updated methods and reference genomes, plus evaluating new formats for providing genomic data. Community input to shape the future of this population genomic resource is welcome.

Supplementary Material

Supplementary tables S1–S5 are available at Molecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).
  21 in total

1.  Approximate Bayesian analysis of Drosophila melanogaster polymorphism data reveals a recent colonization of Southeast Asia.

Authors:  Stefan J Y Laurent; Annegret Werzner; Laurent Excoffier; Wolfgang Stephan
Journal:  Mol Biol Evol       Date:  2011-02-07       Impact factor: 16.240

2.  Sequence-based detection and breakpoint assembly of polymorphic inversions.

Authors:  Russell B Corbett-Detig; Charis Cardeno; Charles H Langley
Journal:  Genetics       Date:  2012-06-05       Impact factor: 4.562

3.  Whole-genome sequencing of two North American Drosophila melanogaster populations reveals genetic differentiation and positive selection.

Authors:  D Campo; K Lehmann; C Fjeldsted; T Souaiaia; J Kao; S V Nuzhdin
Journal:  Mol Ecol       Date:  2013-09-19       Impact factor: 6.185

4.  The Drosophila melanogaster Genetic Reference Panel.

Authors:  Trudy F C Mackay; Stephen Richards; Eric A Stone; Antonio Barbadilla; Julien F Ayroles; Dianhui Zhu; Sònia Casillas; Yi Han; Michael M Magwire; Julie M Cridland; Mark F Richardson; Robert R H Anholt; Maite Barrón; Crystal Bess; Kerstin Petra Blankenburg; Mary Anna Carbone; David Castellano; Lesley Chaboub; Laura Duncan; Zeke Harris; Mehwish Javaid; Joy Christina Jayaseelan; Shalini N Jhangiani; Katherine W Jordan; Fremiet Lara; Faye Lawrence; Sandra L Lee; Pablo Librado; Raquel S Linheiro; Richard F Lyman; Aaron J Mackey; Mala Munidasa; Donna Marie Muzny; Lynne Nazareth; Irene Newsham; Lora Perales; Ling-Ling Pu; Carson Qu; Miquel Ràmia; Jeffrey G Reid; Stephanie M Rollmann; Julio Rozas; Nehad Saada; Lavanya Turlapati; Kim C Worley; Yuan-Qing Wu; Akihiko Yamamoto; Yiming Zhu; Casey M Bergman; Kevin R Thornton; David Mittelman; Richard A Gibbs
Journal:  Nature       Date:  2012-02-08       Impact factor: 49.962

5.  Circumventing heterozygosity: sequencing the amplified genome of a single haploid Drosophila melanogaster embryo.

Authors:  Charles H Langley; Marc Crepeau; Charis Cardeno; Russell Corbett-Detig; Kristian Stevens
Journal:  Genetics       Date:  2011-03-24       Impact factor: 4.562

6.  Global diversity lines - a five-continent reference panel of sequenced Drosophila melanogaster strains.

Authors:  Jennifer K Grenier; J Roman Arguello; Margarida Cardoso Moreira; Srikanth Gottipati; Jaaved Mohammed; Sean R Hackett; Rachel Boughton; Anthony J Greenberg; Andrew G Clark
Journal:  G3 (Bethesda)       Date:  2015-02-11       Impact factor: 3.154

7.  The Drosophila genome nexus: a population genomic resource of 623 Drosophila melanogaster genomes, including 197 from a single ancestral range population.

Authors:  Justin B Lack; Charis M Cardeno; Marc W Crepeau; William Taylor; Russell B Corbett-Detig; Kristian A Stevens; Charles H Langley; John E Pool
Journal:  Genetics       Date:  2015-01-27       Impact factor: 4.562

8.  The Mosaic Ancestry of the Drosophila Genetic Reference Panel and the D. melanogaster Reference Genome Reveals a Network of Epistatic Fitness Interactions.

Authors:  John E Pool
Journal:  Mol Biol Evol       Date:  2015-09-08       Impact factor: 16.240

9.  Fast and accurate long-read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2010-01-15       Impact factor: 6.937

10.  Population Genomics of sub-saharan Drosophila melanogaster: African diversity and non-African admixture.

Authors:  John E Pool; Russell B Corbett-Detig; Ryuichi P Sugino; Kristian A Stevens; Charis M Cardeno; Marc W Crepeau; Pablo Duchen; J J Emerson; Perot Saelao; David J Begun; Charles H Langley
Journal:  PLoS Genet       Date:  2012-12-20       Impact factor: 5.917
