Literature DB >> 34349118

A curated dataset of modern and ancient high-coverage shotgun human genomes.

Pierpaolo Maisano Delser1,2, Eppie R Jones3,4, Anahit Hovhannisyan5, Lara Cassidy6, Ron Pinhasi7, Andrea Manica8.   

Abstract

Over the last few years, genome-wide data for a large number of ancient human samples have been collected. Whilst datasets of captured SNPs have been collated, high coverage shotgun genomes (which are relatively few but allow certain types of analyses not possible with ascertained captured SNPs) have to be reprocessed by individual groups from raw reads. This task is computationally intensive. Here, we release a dataset including 35 whole-genome sequenced samples, previously published and distributed worldwide, together with the genetic pipeline used to process them. The dataset contains 72,041,355 sites called across 19 ancient and 16 modern individuals and includes sequence data from four previously published ancient samples which we sequenced to higher coverage (10-18x). Such a resource will allow researchers to analyse their new samples with the same genetic pipeline and directly compare them to the reference dataset without re-processing published samples. Moreover, this dataset can be easily expanded to increase the sample distribution both across time and space.
© 2021. The Author(s).

Entities:  

Mesh:

Substances:

Year:  2021        PMID: 34349118      PMCID: PMC8338957          DOI: 10.1038/s41597-021-00980-1

Source DB:  PubMed          Journal:  Sci Data        ISSN: 2052-4463            Impact factor:   6.444


Background & Summary

The number of ancient humans with genome-wide data available has increased from less than five a decade ago to more than 3,000 thanks to advancements in extraction and sequencing methods for ancient DNA (aDNA)[1]. However, there are just a few high-quality (coverage >10x) shotgun whole-genome sequenced ancient samples[2]. While genetic pipelines have been previously published[3-6], combining data processed with different approaches is hard and time consuming. Therefore, researchers have to download raw reads of published samples and reprocess them to create a dataset to compare their new samples against without pipeline-associated biases. This problem is less pronounced for modern DNA samples as the higher quality of DNA and sequencing coverage partially reduce the biases introduced by the usage of different bioinformatic tools. Panels including shotgun data for modern samples distributed worldwide have been previously published, such as the Simons Genome Diversity Program[7], 1000 Genome Project[8] and Human Genome Diversity Project (HGDP-CEPH panel)[9]. However, the same concept has not yet been applied to ancient samples or a mix of modern and ancient samples. This study aims to start filling this gap by creating a dataset including both modern and ancient samples distributed across all continents. Therefore, we fully reprocessed 15 high-quality shotgun sequenced ancient samples downloaded from the literature, generated additional new data for previously published 4 ancient samples and merged them with 16 modern samples. The final dataset includes 35 individuals and researchers can use it to quickly compare their new samples against a set of individuals distributed across time and space (Fig. 1). Moreover, we hope that researchers will add additional data processed with the pipeline that we released to increase the sample resolution both in time and space.
Fig. 1

Geographic distribution of samples included in the dataset. Population acronyms are reported in Table 2.

Geographic distribution of samples included in the dataset. Population acronyms are reported in Table 2.
Table 2

Metadata for modern samples. SGDP: Simons Genome Diversity Panel.

Sample_IDSample_acronymPopulation_IDCountryLatitudeLongitudeStudy
SS6004477AUSAustralianAustralia−13143SGDP – Mallick et al., 2016
LP6005443-DNA_B09DINDinkaSudan8.827.4SGDP – Mallick et al., 2016
LP6005443-DNA_B03ESKEskimo_SirenikiRussia64.4173.9SGDP – Mallick et al., 2016
LP6005519-DNA_D05IRUIrulaIndia13.580SGDP – Mallick et al., 2016
LP6005443-DNA_D04ITEItelmanRussia57157SGDP – Mallick et al., 2016
LP6005441-DNA_G06KARKaritianaBrazil−10−63SGDP – Mallick et al., 2016
LP6005441-DNA_E07MNDMandenkaSenegal12−12SGDP – Mallick et al., 2016
LP6005443-DNA_G04MNSMansiRussia63.6562.1SGDP – Mallick et al., 2016
LP6005441-DNA_F09ORQOroqenChina50.4126.5SGDP – Mallick et al., 2016
LP6005443-DNA_D08PAPPapuanPapuaNewGuinea−4143SGDP – Mallick et al., 2016
LP6005441-DNA_F10PIMPimaMexico29−108SGDP – Mallick et al., 2016
LP6005442-DNA_H12ULCUlchiRussia52.43140.42SGDP – Mallick et al., 2016
LP6005442-DNA_D01XIBXiboChina43.581.5SGDP – Mallick et al., 2016
LP6005442-DNA_F01YKTYakutRussia63129.5SGDP – Mallick et al., 2016
LP6005442-DNA_B02YRIYorubaNigeria7.43.9SGDP – Mallick et al., 2016
JHM06JHMJehaiMalaysia5.25101.17McColl et al., 2018

Methods

Sample collection

Additional sequence data were generated for four ancient samples which were previously collected and described in the following original publications: ZVEJ25 and ZVEJ31 were published in Jones et al.[10], KK1 in Jones et al.[11] and NE5 in Gamba et al.[12]. Furthermore, 15 additional ancient samples and 16 modern samples have been downloaded from the literature (see Online-only Tables 1 and 2). The final dataset includes 35 samples consisting of 19 ancient and 16 modern samples.
Online-only Table 1

Metadata for ancient samples. Samples in bold have been resequenced in this study.

SampleStudyCountySiteLatitudeLongitudeMean date BPDate (2-sigma)UDG-treated
AHUR_2064Moreno-Mayar JV et al., 2018USASpirit Cave, Nevada37.41−122.081097010770–11170 calBPyes
Anzick-1Rasmussen M et al., 2014USANear Wilsall, Montana45.97−110.661263212707–12556 calBPno
BichonJones et al. 2015SwitzerlandBichon47.16.871366513560- 13770 cal BPno
KK1Jones et al. 2015GeorgiaKotias Klde42.2543.2797129529–9895 cal BPyes
Kolyma1Sikora M et al., 2019RussiaDuvanni Yar68.6159.197869668–9904 calBPno
LoschbourLazaridis et al. 2014LuxembourgEchternach49.816.480556220–5990 calBCEyes
MotaGallego-Llorente M et al.,2015AfricaMota Cave, Gamo highlands of southwest Ethiopia6.8038.1744714524–4418 Cal BPno
NE1Gamba et al. 2014HungaryPolgar Ferenci hat47.8821.1971405310-5070 calBCyes
NE5Gamba et al. 2014HungaryKompolt-Kigyoser47.1720.8370505210-4990 calBCyes
SF12Guenther et al. 2018SwedenStora Förvar, Sweden57.281877007500-4000 cal BCyes
Sumidouro5Sikora et al. 2017BrazilCaverna do Sumidouro, Lagoa Santa, Brazil−19.54−43.941039110258–10524 (97.0%) calBPno
sunghirIIIMoreno-Mayar JV et al., 2018RussiaSunghir56.17640.5033409335154-33031 calBPyes
USR1Moreno-Mayar JV et al., 2018USAUpward Sun River site (USR)64.98−150.541143511600-11270 cal BPyes
Ust_IshimFu et al. 2014RussiaUst’-Ishim, Omsk Oblast57.4371.14500045000 calBP (46880–43210 calBP at 95.4% probability)yes
WC1Broushaki et al. 2016IranWezmeh Cave34.0546.5992197455-7082 BCEno
Yana1Sikora M et al., 2019RussiaYana RHS70.43135.253168431321–32047 calBPno
ZVEJ25Jones et al., 2017LatviaZvejnieki57.7825.2476897791-7586 calBPyes
ZVEJ31Jones et al., 2017LatviaZvejnieki57.7825.2459656179-5750 calBPyes
Stuttgart_LBKLazaridis et al. 2014GermanyViesenhäuser Hof, Stuttgart-Mühlhausen48.789.1871435308-5077 cal BCyes
Online-only Table 2

variant calling summary per sample. DP: depth of coverage in filtered intervals for variant calling.

SampleRef_Hom_sitesAlt_Hom_sitesHet_sitesTransitions (ts)Transversions (tv)Average_DPts/tv ratio
Xibo719872882240031667342021986536.61.72
Mansi719874582138432513342531964445.61.74
Oroqen7198823823119299983365219465391.73
Ulchi7198797923038303383388319493421.74
Yakut719876742267831003339221975938.11.72
Irula719865432144633366347202009252.81.73
Australian719873992495928997341741978243.51.73
Eskimo-Sireniki719891282332328904331841904343.61.74
Yoruba719739542201445387426752472634.31.73
Pima719907192477325863320221861436.31.72
Dinka7197552822325435024165624171361.72
Karitiana719921722535423829312141796944.21.74
Mandenka719743972220344755423722458633.21.72
Papuan719885012596126893335331932141.61.74
Jehai7198759623346304133397619783361.72
Itelman719888902401328452332561920947.11.73
SunghirIII719877652361429976341941939613.51.76
Kolyma171901819238241157121183302120616.35.58
AHUR_20647199071124368262763207418570201.73
USR1719893352425727763327171930319.51.69
Yana171863578226131551641584631931428.88.20
Bichon71854465232321636581669611992911.38.38
WC17181675321018203584205313192891210.64
KK1719881562243830761335851961415.71.71
ZVEJ25719898022291328640326651888823.21.73
ZVEJ31719882362214730972336781944113.51.73
Mota71753225226092655212638752425513.610.88
Anzick-171628867225493899393930921939615.420.27
NE5719873532138232620341651983720.81.72
NE171805092209152153482157772048623.910.53
UstIshim719863882156933398347962017135.21.73
Sf127198999022548288173254418821551.73
Sumidouro571064691209029557629586241804016.253.14
Loschbour719901472452526683317621944619.31.63
Stuttgart_LBK719871242149632735342981993317.11.72

DNA extraction, Library preparation and next-generation sequencing

DNA was extracted and libraries were prepared for ZVEJ25, ZVEJ31, KK1 and NE5 (Table 1), following protocols described in the original publications, with the exception that DNA extracts were incubated with USER enzyme (5 µl enzyme: 16.50 µl of extract) for 3 hours at 37 °C prior to library preparation in order to repair post-mortem molecular damage. The libraries were sequenced across 31 lanes of a HiSeq. 2,500.
Table 1

Data statistics for newly sequenced samples.

Sample IDMass sampled (g)Average autosomal coverage
Kotias (KK1)0.10112.03
Latvia_HG2 (ZVEJ25)0.09218.17
NE5 (14.6)0.1815.99
ZVEJ310.1029.97

Average autosomal coverage was estimated on bam files after mapping quality filtering (mq20), duplicates removal, indel realignment and 2 bp softclipping.

Data statistics for newly sequenced samples. Average autosomal coverage was estimated on bam files after mapping quality filtering (mq20), duplicates removal, indel realignment and 2 bp softclipping.

Bioinformatics analysis

Ancient samples

The following approach was used for both the newly sequenced ancient samples and downloaded raw fastq files from previously published ancient samples. Adapters were trimmed with cutadapt v1.9.1[13] and then raw reads were aligned to human reference sequence hg19/GRCh37 with the rCRS mitochondrial sequence using bwa aln v0.7.12[14] with seeding disabled (-l 1000), maximum edit distance set to -n 0.01 and maximum number of gap opens set to -o 2. These parameters are recommended for aDNA as they allow for more mismatches to the reference genome[15]. Sai files were converted into sam files using bwa samse v0.7.12 and the read group line was also added. Bam files were generated using samtools view v1.9[16]. Reads from multiple libraries belonging to the same sample were merged with the module MergeSamFiles within Picard v2.9.2[17]. Aligned reads were filtered for minimum mapping quality 20 with samtools view v1.9. Indexing, sorting and duplicate removal (rmdup) were performed with samtools v1.9. Indels were realigned using The Genome Analysis Toolkit v3.7[18] (module RealignerTargetCreator and IndelRealigner) and 2 bp were softclipped (phred quality score reduced to 2) at the start and ends of reads using a custom python script. Final bam files were split by chromosome using samtools view v1.9 and variant calling was performed with UnifiedGenotyper from The Genome Analysis Toolkit v3.7. All calls were filtered for minimum base quality 20 (−mbq 20) and reference-bias free priors were used (−inputPrior 0.0010 -inputPrior 0.4995). The same priors have been used for modern samples in the Simons Genome Diversity Panel[7]. Raw data was not available for four previously published samples included in this dataset and so alignment data was processed instead (Loschbour, Stuttgart_LBK, Ust_Ishim and WC1). The data for Loschbour, Stuttgart_LBK and Ust_Ishim had been aligned to GRCh37 with additional decoy sequences (hs37d5) using the same non-default bwa aln parameters. We removed reads aligning to these decoys and updated the bam file headers accordingly, before proceeding with the processing pipeline outlined above. The available alignment data from WC1 was mapped using bwa aln with default parameters and had a mapping quality filter of 25 already applied. We realigned these reads using the non-default parameters and proceeded with the processing pipeline. For those who wish to follow this pipeline with newly produced ancient DNA data, we recommend a final data authentication step. Characteristic patterns of aDNA post-mortem damage (e.g. short read lengths and cystosine deamination) can be verified using mapDamage software[19]. A number of methods exist to estimate contamination levels on the basis of these damage patterns, as well as other measures, including heterozygosity at haploid loci and the breakdown of linkage disequilibrium[20-23] We focused on selecting a subset of the genome representing neutral genomic variation for demographic inferences[24,25]. Therefore, specific filters were applied to discard: recombination hotspots (filter_hotspot1000g), poor mapping quality regions (filter_Map20), recent duplication (recent duplications, RepeatMasker score <20), recent segmental duplication (filter_segDups), simple repeats (filter_simpleRepeat), gene exons together with 1000 bp flanking and conserved elements together 100 bp flanking (filter_selection_10000_100) and positions with systematic sequencing errors (filter_SysErrHCB and filter_SysErr.starch). All CpG sites were removed as well as C and G sites with an adjacent missing genotype. Genotypes were filtered by minimum coverage 8x and maximum coverage defined as twice the average coverage. Vcf files per chromosome belonging to the same sample were concatenated using vcf-concat from vcftools v0.1.152 [26].

Modern samples

Bam files were downloaded from the Simons Genome Diversity Panel[7] and from McColl et al.[27]. (Table 2). Bam files were split by chromosome and variant calling, filtering for GC sites and coverage were performed as described above for the ancient samples with the same options and thresholds. Metadata for modern samples. SGDP: Simons Genome Diversity Panel.

Final dataset

Per sample vcf files were compressed with bgzip and indexed with tabix from htslib v1.6[16]. The final dataset was assembled by merging filtered compressed vcf files for all modern and ancient samples with bcftools merge v1.6[16]. Only sites with called genotypes for all samples were kept using vcftools v0.1.15 (--max-missing 1). Tri-allelic sites were also discarded using bcftools view v1.6 (-m1 -M2). Final vcf statistics were generated with bcftools stats v1.6. Downstream analysis and plotting were performed in R v3.6.3[28].

Data Records

All newly generated sequencing raw reads have been deposited in the NCBI Sequence Read Archive Bioproject PRJNA670050[29]. Both filtered and unfiltered vcf files have been uploaded to figshare[30].

Technical Validation

Summary of newly generated data

DNA was extracted for four previously published samples (ZVEJ25, ZVEJ31, KK1 and NE5) and sequence data were generated with an average coverage between 10x and 18x (Table 1). Endogenous DNA was estimated between 0.48 and 0.71 across all libraries (Table 3). Each library generated between 150 and 425 millions of reads corresponding to 15.2 and 42.9 Gb respectively (Table 3).
Table 3

Raw data statistics for the newly sequenced libraries.

SampleTotal BasesRead CountGC (%)Q20 (%)Q30 (%)Reads AlignedEndogenous DNA
KK1_132,085,537,489317,678,58949.396.694.5226,739,8420.71
KK1_231,821,488,543315,064,24349.796.994.8221,241,4350.70
KK1_330,903,010,501305,970,40147.896.694.4218,378,5290.71
KK1_428,374,056,452280,931,25248.596.694.5200,616,5890.71
KK1_527,051,061,997267,832,29747.496.894.8187,070,4430.70
KK1_626,428,490,321261,668,22149.796.794.5182,602,7570.70
NE5_115,230,188,243150,793,94348.496.794.6113,866,8660.76
NE5_222,443,822,868222,216,06847.896.794.6167,444,3170.75
NE5_319,414,144,957192,219,25747.796.794.6145,145,7850.76
NE5_435,602,627,361352,501,26148.996.894.7257,297,4240.73
NE5_539,509,022,440391,178,44049.596.794.5285,303,0060.73
NE5_638,119,633,918377,422,11847.796.894.7275,284,9260.73
ZVEJ25_122,502,142,793222,793,49348.296.894.6173,630,4410.78
ZVEJ25_226,264,479,451260,044,35147.596.894.6202,756,8100.78
ZVEJ25_319,884,007,259196,871,35948.196.894.6153,807,3480.78
ZVEJ25_430,314,118,184300,139,78447.096.994.8234,102,0910.78
ZVEJ25_534,172,785,511338,344,41148.296.994.7264,070,0110.78
ZVEJ25_632,515,172,804321,932,40448.296.994.7251,187,4530.78
ZVEJ31_142,951,382,412425,261,21252.096.994.7215,656,4790.51
ZVEJ31_241,717,115,447413,040,74750.796.994.8209,910,9860.51
ZVEJ31_336,806,312,233364,418,93353.896.794.4185,131,9890.51
ZVEJ31_434,986,764,509346,403,60951.396.994.6166,115,7370.48
ZVEJ31_534,797,229,121344,527,02153.896.894.5164,914,1580.48
ZVEJ31_639,275,860,102388,869,90252.096.894.6185,999,3140.48
Raw data statistics for the newly sequenced libraries.

Summary of the whole dataset including ancient and modern samples

The final dataset includes 35 samples with 509,351,727 sites in neutral regions before filtering (see Methods section for a detailed description of which regions were considered for variant calling). Sites not called across all samples (0% missing data allowed) were then discarded and 72,045,170 were retained. Multi-allelic sites (3815) were also removed bringing the final number of filtered sites to 72,041,355 (Online-only Table 2). Minimum and maximum coverage per sample within the final dataset is 11.3x and 55x respectively (within filtered intervals) with an average coverage across all samples of 29.7x (Online-only Table 2). We calculated the number of transitions (ts), transversions (tv) and the ts/tv ratio per sample (Online-only Table 2). As expected, all eight ancient samples that were not subjected to UDG-treatment showed a higher ts/tv ratio than their UDG-treated counterparts (see Fig. 2), consistent with higher levels of DNA damage in these samples. The Brazilian sample Sumidouro 5 shows the highest excess of transition, possibly due to poor DNA preservation caused by environmental conditions. All other samples (both modern and UDG-treated ancient) showed similar ts/tv ratio with an average of 1.72, maximum and minimum of 1.76 and 1.63 respectively (see Online-only Table 2, Fig. 2).
Fig. 2

(a) Transitions/Transversions ratio (ts/tv) per sample. Ancient and modern samples are represented by triangles and circles respectively. UDG and non-UDG treated samples are in blue and orange respectively. (b) same as in a) but with a different y axis to focus on the ts/tv ratio among modern and UDG-treated ancient samples. (c) Number of transitions (ts) and transversions (tv) per sample.

(a) Transitions/Transversions ratio (ts/tv) per sample. Ancient and modern samples are represented by triangles and circles respectively. UDG and non-UDG treated samples are in blue and orange respectively. (b) same as in a) but with a different y axis to focus on the ts/tv ratio among modern and UDG-treated ancient samples. (c) Number of transitions (ts) and transversions (tv) per sample.
Measurement(s)genome
Technology Type(s)DNA sequencing
Factor Type(s)modern/ancient human
Sample Characteristic - OrganismHomo sapiens
  23 in total

1.  The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data.

Authors:  Aaron McKenna; Matthew Hanna; Eric Banks; Andrey Sivachenko; Kristian Cibulskis; Andrew Kernytsky; Kiran Garimella; David Altshuler; Stacey Gabriel; Mark Daly; Mark A DePristo
Journal:  Genome Res       Date:  2010-07-19       Impact factor: 9.043

2.  The prehistoric peopling of Southeast Asia.

Authors:  Hugh McColl; Fernando Racimo; Lasse Vinner; Fabrice Demeter; Takashi Gakuhari; J Víctor Moreno-Mayar; George van Driem; Uffe Gram Wilken; Andaine Seguin-Orlando; Constanza de la Fuente Castro; Sally Wasef; Rasmi Shoocongdej; Viengkeo Souksavatdy; Thongsa Sayavongkhamdy; Mohd Mokhtar Saidin; Morten E Allentoft; Takehiro Sato; Anna-Sapfo Malaspinas; Farhang A Aghakhanian; Thorfinn Korneliussen; Ana Prohaska; Ashot Margaryan; Peter de Barros Damgaard; Supannee Kaewsutthi; Patcharee Lertrit; Thi Mai Huong Nguyen; Hsiao-Chun Hung; Thi Minh Tran; Huu Nghia Truong; Giang Hai Nguyen; Shaiful Shahidan; Ketut Wiradnyana; Hiromi Matsumae; Nobuo Shigehara; Minoru Yoneda; Hajime Ishida; Tadayuki Masuyama; Yasuhiro Yamada; Atsushi Tajima; Hiroki Shibata; Atsushi Toyoda; Tsunehiko Hanihara; Shigeki Nakagome; Thibaut Deviese; Anne-Marie Bacon; Philippe Duringer; Jean-Luc Ponche; Laura Shackelford; Elise Patole-Edoumba; Anh Tuan Nguyen; Bérénice Bellina-Pryce; Jean-Christophe Galipaud; Rebecca Kinaston; Hallie Buckley; Christophe Pottier; Simon Rasmussen; Tom Higham; Robert A Foley; Marta Mirazón Lahr; Ludovic Orlando; Martin Sikora; Maude E Phipps; Hiroki Oota; Charles Higham; David M Lambert; Eske Willerslev
Journal:  Science       Date:  2018-07-06       Impact factor: 47.728

3.  A likelihood method for estimating present-day human contamination in ancient male samples using low-depth X-chromosome data.

Authors:  J Víctor Moreno-Mayar; Thorfinn Sand Korneliussen; Jyoti Dalal; Gabriel Renaud; Anders Albrechtsen; Rasmus Nielsen; Anna-Sapfo Malaspinas
Journal:  Bioinformatics       Date:  2020-02-01       Impact factor: 6.937

Review 4.  Beyond broad strokes: sociocultural insights from the study of ancient genomes.

Authors:  Fernando Racimo; Martin Sikora; Marc Vander Linden; Hannes Schroeder; Carles Lalueza-Fox
Journal:  Nat Rev Genet       Date:  2020-03-03       Impact factor: 53.242

5.  mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters.

Authors:  Hákon Jónsson; Aurélien Ginolhac; Mikkel Schubert; Philip L F Johnson; Ludovic Orlando
Journal:  Bioinformatics       Date:  2013-04-23       Impact factor: 6.937

6.  Bayesian inference of ancient human demography from individual genome sequences.

Authors:  Ilan Gronau; Melissa J Hubisz; Brad Gulko; Charles G Danko; Adam Siepel
Journal:  Nat Genet       Date:  2011-09-18       Impact factor: 38.330

7.  The variant call format and VCFtools.

Authors:  Petr Danecek; Adam Auton; Goncalo Abecasis; Cornelis A Albers; Eric Banks; Mark A DePristo; Robert E Handsaker; Gerton Lunter; Gabor T Marth; Stephen T Sherry; Gilean McVean; Richard Durbin
Journal:  Bioinformatics       Date:  2011-06-07       Impact factor: 6.937

8.  The Neolithic Transition in the Baltic Was Not Driven by Admixture with Early European Farmers.

Authors:  Eppie R Jones; Gunita Zarina; Vyacheslav Moiseyev; Emma Lightfoot; Philip R Nigst; Andrea Manica; Ron Pinhasi; Daniel G Bradley
Journal:  Curr Biol       Date:  2017-02-02       Impact factor: 10.834

9.  Fast and accurate short read alignment with Burrows-Wheeler transform.

Authors:  Heng Li; Richard Durbin
Journal:  Bioinformatics       Date:  2009-05-18       Impact factor: 6.937

10.  EAGER: efficient ancient genome reconstruction.

Authors:  Alexander Peltzer; Günter Jäger; Alexander Herbig; Alexander Seitz; Christian Kniep; Johannes Krause; Kay Nieselt
Journal:  Genome Biol       Date:  2016-03-31       Impact factor: 13.583

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.