Enrico Cappellini1,2, Frido Welker3,4, Luca Pandolfi5, Jazmín Ramos-Madrigal3, Diana Samodova6, Patrick L Rüther6, Anna K Fotakis3, David Lyon6, J Víctor Moreno-Mayar7, Maia Bukhsianidze8, Rosa Rakownikow Jersie-Christensen6, Meaghan Mackie3,6, Aurélien Ginolhac9, Reid Ferring10, Martha Tappen11, Eleftheria Palkopoulou12, Marc R Dickinson13, Thomas W Stafford14, Yvonne L Chan15, Anders Götherström16, Senthilvel K S S Nathan17, Peter D Heintzman18,19, Joshua D Kapp18, Irina Kirillova20, Yoshan Moodley21, Jordi Agusti22,23, Ralf-Dietrich Kahlke24, Gocha Kiladze25, Bienvenido Martínez-Navarro22,23,26, Shanlin Liu3,27, Marcela Sandoval Velasco3, Mikkel-Holger S Sinding3,28, Christian D Kelstrup6, Morten E Allentoft7, Ludovic Orlando7,29, Kirsty Penkman13, Beth Shapiro18,30, Lorenzo Rook5, Love Dalén15, M Thomas P Gilbert3,31, Jesper V Olsen32, David Lordkipanidze8, Eske Willerslev33,34,35,36. 1. Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark. ecappellini@bio.ku.dk. 2. Evolutionary Genomics Section, Globe Institute, University of Copenhagen, Copenhagen, Denmark. ecappellini@bio.ku.dk. 3. Evolutionary Genomics Section, Globe Institute, University of Copenhagen, Copenhagen, Denmark. 4. Department of Human Evolution, Max Planck Institute for Evolutionary Anthropology, Leipzig, Germany. 5. Dipartimento di Scienze della Terra, Università degli Studi di Firenze, Florence, Italy. 6. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark. 7. Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark. 8. Georgian National Museum, Tbilisi, Georgia. 9. Life Sciences Research Unit, University of Luxembourg, Belvaux, Luxembourg. 10. Department of Geography and Environment, University of North Texas, Denton, TX, USA. 11. Department of Anthropology, University of Minnesota, Minneapolis, MN, USA. 12. Department of Genetics, Harvard Medical School, Cambridge, MA, USA. 13. Department of Chemistry, University of York, York, UK. 14. Stafford Research, Lafayette, CO, USA. 15. Department of Bioinformatics and Genetics, Swedish Museum of Natural History, Stockholm, Sweden. 16. Department of Archaeology and Classical Studies, Stockholm University, Stockholm, Sweden. 17. Sabah Wildlife Department, Kota Kinabalu, Malaysia. 18. Department of Ecology and Evolutionary Biology, University of California Santa Cruz, Santa Cruz, CA, USA. 19. Tromsø University Museum, The Arctic University of Norway (UiT), Tromsø, Norway. 20. Ice Age Museum, National Alliance of Shidlovskiy 'Ice Age', Moscow, Russia. 21. Department of Zoology, University of Venda, Thohoyandou, South Africa. 22. Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain. 23. Institut Català de Paleoecologia Humana i Evolució Social, Universitat Rovira i Virgili, Tarragona, Spain. 24. Senckenberg Research Station of Quaternary Palaeontology, Weimar, Germany. 25. Geology Department, Tbilisi State University, Tbilisi, Georgia. 26. Departament d'Història i Geografia, Universitat Rovira i Virgili, Tarragona, Spain. 27. BGI Shenzhen, Shenzen, China. 28. Greenland Institute of Natural Resources, Nuuk, Greenland. 29. Laboratoire d'Anthropobiologie Moléculaire et d'Imagerie de Synthèse, Université Paul Sabatier, Toulouse, France. 30. Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA, USA. 31. University Museum, Norwegian University of Science and Technology, Trondheim, Norway. 32. Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen, Denmark. jesper.olsen@cpr.ku.dk. 33. Lundbeck Foundation GeoGenetics Centre, Globe Institute, University of Copenhagen, Copenhagen, Denmark. ewillerslev@bio.ku.dk. 34. Department of Zoology, University of Cambridge, Cambridge, UK. ewillerslev@bio.ku.dk. 35. Wellcome Trust Sanger Institute, Hinxton, UK. ewillerslev@bio.ku.dk. 36. Danish Institute for Advanced Study, University of Southern Denmark, Odense, Denmark. ewillerslev@bio.ku.dk.
Abstract
The sequencing of ancient DNA has enabled the reconstruction of speciation, migration and admixture events for extinct taxa1. However, the irreversible post-mortem degradation2 of ancient DNA has so far limited its recovery-outside permafrost areas-to specimens that are not older than approximately 0.5 million years (Myr)3. By contrast, tandem mass spectrometry has enabled the sequencing of approximately 1.5-Myr-old collagen type I4, and suggested the presence of protein residues in fossils of the Cretaceous period5-although with limited phylogenetic use6. In the absence of molecular evidence, the speciation of several extinct species of the Early and Middle Pleistocene epoch remains contentious. Here we address the phylogenetic relationships of the Eurasian Rhinocerotidae of the Pleistocene epoch7-9, using the proteome of dental enamel from a Stephanorhinus tooth that is approximately 1.77-Myr old, recovered from the archaeological site of Dmanisi (South Caucasus, Georgia)10. Molecular phylogenetic analyses place this Stephanorhinus as a sister group to the clade formed by the woolly rhinoceros (Coelodonta antiquitatis) and Merck's rhinoceros (Stephanorhinus kirchbergensis). We show that Coelodonta evolved from an early Stephanorhinus lineage, and that this latter genus includes at least two distinct evolutionary lines. The genus Stephanorhinus is therefore currently paraphyletic, and its systematic revision is needed. We demonstrate that sequencing the proteome of Early Pleistocene dental enamel overcomes the limitations of phylogenetic inference based on ancient collagen or DNA. Our approach also provides additional information about the sex and taxonomic assignment of other specimens from Dmanisi. Our findings reveal that proteomic investigation of ancient dental enamel-which is the hardest tissue in vertebrates11, and is highly abundant in the fossil record-can push the reconstruction of molecular evolution further back into the Early Pleistocene epoch, beyond the currently known limits of ancient DNA preservation.
The sequencing of ancient DNA has enabled the reconstruction of speciation, migration and admixture events for extinct taxa1. However, the irreversible post-mortem degradation2 of ancient DNA has so far limited its recovery-outside permafrost areas-to specimens that are not older than approximately 0.5 million years (Myr)3. By contrast, tandem mass spectrometry has enabled the sequencing of approximately 1.5-Myr-old collagen type I4, and suggested the presence of protein residues in fossils of the Cretaceous period5-although with limited phylogenetic use6. In the absence of molecular evidence, the speciation of several extinct species of the Early and Middle Pleistocene epoch remains contentious. Here we address the phylogenetic relationships of the Eurasian Rhinocerotidae of the Pleistocene epoch7-9, using the proteome of dental enamel from a Stephanorhinus tooth that is approximately 1.77-Myr old, recovered from the archaeological site of Dmanisi (South Caucasus, Georgia)10. Molecular phylogenetic analyses place this Stephanorhinus as a sister group to the clade formed by the woolly rhinoceros (Coelodonta antiquitatis) and Merck's rhinoceros (Stephanorhinus kirchbergensis). We show that Coelodonta evolved from an early Stephanorhinus lineage, and that this latter genus includes at least two distinct evolutionary lines. The genus Stephanorhinus is therefore currently paraphyletic, and its systematic revision is needed. We demonstrate that sequencing the proteome of Early Pleistocene dental enamel overcomes the limitations of phylogenetic inference based on ancient collagen or DNA. Our approach also provides additional information about the sex and taxonomic assignment of other specimens from Dmanisi. Our findings reveal that proteomic investigation of ancient dental enamel-which is the hardest tissue in vertebrates11, and is highly abundant in the fossil record-can push the reconstruction of molecular evolution further back into the Early Pleistocene epoch, beyond the currently known limits of ancient DNA preservation.
Phylogenetic placement of extinct species increasingly relies on aDNA sequencing. Efforts to improve the molecular tools underlying aDNA recovery have enabled the reconstruction of ~0.4 Ma and ~0.7 Ma old DNA sequences from temperate deposits[3] and subpolar regions[12], respectively. However, no aDNA data have so far been generated from species that became extinct beyond this time range. In contrast, ancient proteins represent a more durable source of genetic information, reported to survive, in eggshell, up to 3.8 Ma[13]. Ancient protein sequences can carry taxonomic and phylogenetic information useful to trace the evolutionary relationships between extant and extinct species[14,15]. However, so far, the recovery of ancient mammal proteins from sites too old or too warm to be compatible with aDNA preservation is mostly limited to collagen type I (COL1). Being highly conserved[16], this protein is not an ideal phylogenetic marker. For example, regardless of endogeneity[17], collagen-based phylogenetic placement of Dinosauria in relation to extant Aves appears to be unstable[6]. This suggests that the exclusive use of COL1 in deep-time molecular phylogenetics is constraining. Here, we sought to overcoming these limitations by testing whether dental enamel can better preserve a richer set of ancient proteins that are preserved longer than COL1.Dated to ~1.77 Ma by a combination of 40Ar/39Ar dating, paleomagnetism and biozonation[18,19], the archaeological site of Dmanisi (Georgia, South Caucasus; Fig. 1a) represents a context currently considered outside the scope of aDNA recovery. This site has been excavated since 1983, resulting in the discovery, along with stone tools and contemporaneous fauna (Table S1), of almost one hundred hominin fossils, including five skulls representing the georgicus paleodeme within Homo erectus
[10]. These are the earliest fossils of the genus Homo outside Africa.
Figure 1
Dmanisi location, stratigraphy, and Stephanorhinus specimen GNM Dm.5/157-16635.
a, Geographic location of Dmanisi in the South Caucasus. The base map was generated using public domain data from www.naturalearthdata.com. b, Generalised stratigraphic profile indicating origin and age of the analysed specimens. c, Isolated left lower molar (m1 or m2) of Stephanorhinus ex gr. etruscus-hundsheimensis, from Dmanisi (labial view). Scale bar: 1 cm.
The geology of the Dmanisi deposits favours the preservation of faunal materials (Supplementary Information: Extended Methods and Results), as the primary aeolian deposits provide rapid burial in fine-grained, calcareous sediments. We studied 12 bone and 14 enamel+dentine samples from 23 specimens of large mammals from multiple excavation units within stratum B1 (Fig. 1b, Extended Data Fig. 1, Extended Data Table 1, Table S3). This is an ashfall deposit that contains faunal remains in different geomorphic contexts. All of these are firmly dated between 1.85-1.76 Ma[19]. High-resolution tandem MS was used to confidently sequence ancient protein residues from the set of faunal remains, after digestion-based (protocols A and B), or digestion-free (protocol C), sample preparation (Methods and Supplementary Information). Ancient DNA analysis was unsuccessfully attempted on a subset of five bone and dentine specimens (Methods).
Extended Data Figure 1
Generalized stratigraphic profiles for Dmanisi, indicating specimen origins.
a, Type section of the Dmanisi M5 Excavation block. b, Stratigraphic profile of excavation area M6. M6 preserves a larger gully associated with the pipe-gully phase of stratigraphic-geomorphic development in Stratum B1. The thickness of Stratum B1 gully fill extends to the basalt surface, but includes “rip-ups” of Strata A1 and A2, showing that B1 deposits post-date Stratum A. c, Stratigraphic section of excavation area M17. Here, Stratum B1 was deposited after erosion of Stratum A deposits. The stratigraphic position of the Stephanorhinus sample Dm.5/157-16635 is highlighted with a red diamond. The Masavara basalt is ca. 50 cm below the base of the shown profile. d, Northern section of Block 2. Following collapse of a pipe and erosion to the basalt, the deeper part of this area was filled with local gully fill of Stratum B1/x/y/z. Note the uniform burial of all Stratum B1 deposits by Strata B2-B4. Sampled specimens are indicated by CGG five-digit numbers. See Extended Data Table 1 for both CGG and GNM specimen numbers.
Extended Data Table 1
Genome and proteome survival in 23 Dmanisi fossil fauna specimens.
For each specimen, the Centre for GeoGenetics (CGG) reference number and the Georgian National Museum (GNM) specimen field number are reported. *or the narrowest possible taxonomic identification achievable using comparative anatomy methods. †Only collagens survive. B = Bone, D = Dentine, E = Enamel. Extractions of enamel might include some residual dentine. Accordingly, both tissues are either listed separately (○D, ●E, in case of no collagen preservation), or together (●E+D, in case of collagen preservation). Open circles (○) indicate no molecular preservation; (●) closed circles indicate molecular preservation.
CGG ref. numb.
GNM specimen number
Morphological identification*
Anatomy
Ancient DNA
Protein extr. Method A
Protein extr. Method B
Protein extr. Method C
Phylogenetic analysis
16486
Dm.bXl.sqA6.V._.
Canis etruscus
P4 sin.
○E+D
16626
Dm.6/154.2/4.A4.17
Artiodactyla
tibia sin.
○B
16628
Dm.7/154.2.A2.27
Cervidae
me Ill&IV dex.
●B†
16629
Dm.5/154.3.A4.32
Cervidae
hem imandible sin. with dp2, dp3, dp4, m1
○B
●E+D
16630
Dm.6/151.4.A4.12
Pseudodama nestii
hemimandible dex. with p2-m3
○B
○D, ●E
16631
Dm.69/64.3.81.53
Cervidae
maxilla sin. with P3
○B
○D, ●E
16632
Dm.5/154.2.A4.38
Equus stenonis
i3 dex.
●E+D
Fig. S10
16633
Dm.5/153.3.A2.33
Equus stenonis
mc Ill & mc II sin.
○B
16634
Dm.7/151.2.81/A4.1
Equus stenonis
m/1 or m/2 dex.
○D, ●E
16635
Dm.5/157.profile cleaning
Stephanorhinus sp.
m/1 sin.
○
○D, ●E
Fig. 4, Fig. S11
16636
Dm.6/153.1.A4.13
Rhinocerotidae
tibia dex.
○B
16637
Dm.7/154.2.A4.8
Bovidae
mt lll&IV sin.
●B†
16638
Dm.5/154.1.B1.1
Bovidae
hemimandible dex. with p3-m3
○B
○D, ●E
Fig. S12
16639
Dm.8/154.4.A4.22
Bovidae
maxilla dex. with P2-M2
○D, ●E
Fig. S13
16640
Dm.6/151.2.A4.97
Bison georgicus
mt lll&IV sin.
○B
16641
Dm.8/152.3.B1.2
Bison georgicus
m3 dex.
○D, ●E
Fig. S14
16642
Dm.8/153.4.A4.5
Canis etruscus
hemimandible sin. with p1-m2
○D, ●E
16856
Dm.M6/7.Il.296
Cervidae
m2 sin.
○
●D†
○D, ●E
●E+D
16857
Dm.bXl.profile cleaning
lndet.
long bone fragment of a herbivore
○
●B†
○B
○B
Fig. S15, EDF6
16858
Dm.bXl.North.B1a.collection
Cervidae
metapodium fragment
○B
○B
○B
16859
D4.collection
lndet.
fragments of pelvis and ribs of a large mammal
○
○B
○B
○B
16860
Dm.65/62.1.A1.
Cervidae
P4 sin.
○
○D, ●E
○D, ●E
16861
Dm.64/63.1.B1z.collection
Equus stenonis
fragment of an upper tooth
○D, ●E
○D, ●E
Neg. contr. (blank)
NC
NC
NC
We recovered endogenous proteins from 15 out of 23 studied specimens. Digestion-based peptide extraction from bone, dentine and enamel specimens led to the sporadic recovery (6/19) of a limited number of collagen fragments. In contrast, digestion-free peptide extraction of enamel+dentine and bone specimens resulted in high rates of enamel proteome recovery (13/14 specimens, Extended Data Table 1).The small proteome[20,21] of mature dental enamel consists of structural enamel proteins, i.e. amelogenin (AMELX), enamelin (ENAM), amelotin (AMTN), and ameloblastin (AMBN), and enamel-specific proteases secreted during amelogenesis, i.e. matrix metalloproteinase-20 (MMP20) and kallikrein 4 (KLK4). The presence of non-specific proteins, such as serum albumin (ALB), has also been previously reported in mature dental enamel[20] (Extended Data Table 2). The depth of coverage for these proteins varied considerably across their sequence, with some positions covered by over 1000 peptide spectrum matches (Extended Data Fig. 2). The high depth of coverage also allows to identify multiple isoforms of AMELX (Extended Data Fig. 3).
Extended Data Table 2
Proteome composition and coverage.
Aggregated data from different extraction methods and/or tissues from the same specimen. In those cells reporting two values separated by the “|” symbol, the first value refers to MaxQuant (MQ) searches performed selecting unspecific digestion, while the second value refers to MQ searches performed selecting trypsin digestion. For those cells including one value only, it refers to MQ searches performed selecting unspecific digestion. Final amino acid coverage, incorporating both MQ and PEAKS searches, is reported in the last column. *supporting all peptides. See Extended Data Table 1 for tissue sources per specimen and both CGG and GNM specimen numbers.
Specimen
Protein Name
Sequence length
Razor and unique peptides
Matched spectra*
Coverage after MaxQuant searches (%)
Final coverage after MaxQuant and PEAKS searches (%)
Final coverage (aa)
16628
Collagen alpha-1(I)
1158
5
8
3.2
3.2
37
16629
Amelogenin X
209
79
190
36.8
36.8
77
Ameloblastin
440
51
84
25.0
25.0
110
Enamelin
1129
58
133
6.2
6.5
73
Collagen alpha-1(I)
1453
3
3
2.0
2.0
29
Collagen alpha-1(III)
1464
2
3
1.4
1.4
20
Amelotin
212
2
2
4.7
4.7
10
16630
Enamelin
1129
180 | 3
530 | 5
11.8 | 2.7
15.4
174
Ameloblastin
440
105
231
30.9
31.4
138
Amelogenin X
213
116
529
62.0
62.9
134
AmelogeninY
192
4
9
13.0
22.9
44
Amelotin
212
5
6
8.0
8.0
17
16631
Enamelin
916
175
751
11.0
11.7
107
Amelogenin X
213
156
598
48.8
61.5
131
AmelogeninY
90
5
18
15.6
25.6
23
Ameloblastin
440
71
133
24.1
25.2
111
MMP 20
482
2
2
3.9
3.9
19
16632
Enarnelin
1144
401
2160
17.9
19.1
219
Amelogenin X
192
280
960
84.4
84.4
162
MMP 20
424
49
67
33.3
33.3
141
Serum albumin
607
11
18
6.1
6.1
37
Collagen alpha-1(1)
1513
4
4
2.6
2.6
40
16634
Amelogenin X
185
68
157
53.5
53.5
99
Ameloblastin
440
47
58
23.4
23.4
103
Enamelin
920
33
87
4.5
4.5
41
MMP 20
483
4
4
5.6
5.6
27
16635
Amelogenin X
206
394 | 3
2793 | 5
73.8 | 7.8
85.9
177
Enamelin
1150
382 | 2
2966 | 2
18.3 | 1.6
25.1
289
Ameloblastin
442
131
463
31.3
39.3
166
Amelotin
267
26
148
9.9
9.9
20
Serum albumin
607
34
64
18.5
24.5
149
MMP20
483
15
25
11.8
15.3
74
16637
Collagen alpha-1(I)
1453
2
2
1.7
1.7
25
Collagen alpha-1(II)
1421
2
2
1.9
1.9
27
Collagen alpha-1(III)
1464
2
2
1.6
1.6
23
16638
Enamelin
1129
235 | 7
1155 | 13
11.8 | 4.7
12.9
146
Amelogenin X
192
185 | 3
734 | 5
52.0 | 10.9
60.4
116
Ameloblastin
440
64 | 2
120 | 4
30.0 | 5.7
36.4
160
MMP 20
481
6
7
8.1
9.1
44
16639
Enamelin
1129
202
726
12.0
12.6
142
Amelogenin X
213
167
624
59.2
67.6
144
Ameloblastin
440
88
155
26.8
30.5
134
AmelogeninY
192
13
13
18.8
18.8
36
16641
Amelogenin X
213
91
251
64.3
65.3
139
Ameloblastin
440
69
122
28.9
28.9
127
Enamelin
1129
24
75
7.8
7.8
88
Amelotin
212
3
3
7.1
7.1
15
16642
Amelogenin X
185
89
245
42.7
42.7
79
Enarnelin
733
14
19
2.5
2.5
18
Ameloblastin
421
3
3
7.1
7.1
30
MMP20
483
2
2
3.5
3.5
17
16856
Amelogenin X
209
66 | 4
365 | 25
38.8
45.5
95
Enamelin
916
58 | 13
153 | 70
8.2
10.2
93
Ameloblastin
440
21
31
14.8
14.8
65
Collagen alpha-1(I)
1047
8 | 10
9 | 11
14.5
16.9
177
Collagen alpha-2(1)
1054
4 | 8
51 9
10.6
10.6
112
Serum albumin
583
0 | 8
0 | 12
16.6
16.6
97
AmelogeninY
90
3
7
10.0
10.0
9
16857
Collagen alpha-1(I)
1047
18 | 14
24 | 18
21.7
23.4
245
Collagen alpha-2(1)
1274
16 | 11
17 | 11
17.7
24.3
310
16860
Amelogenin X
192
46
98
30.7
32.3
62
Ameloblastin
440
19
37
9.1
9.1
40
Enamelin
900
15
25
3.8
3.8
34
16861
Amelogenin X
185
14
15
36.8
38.9
72
Ameloblastin
343
2
2
4.4
4.4
15
Enarnelin
915
2
2
1.2
1.2
11
Neg.Contr. Gr. 1:
ND
235, 275, 706
Neg.Contr. Gr. 2:
ND
630, 875, 889
Neg. Contr. Gr. 3:
Amelogenin X
122
5
7
18.0
18.0
22
1214, 1218
Extended Data Fig. 2
Proteomic sequence coverage for specimen Dm.5/157-16635 (Stephanorhinus).
a, c, e, g, i, j, PSM sequence coverage of proteins AMBN, ENAM, AMELX, AMTN, MMP20 and ALB, respectively. Annotations include: “amino acid position, amino acid called in that position (number of PSMs/peptides covering that position)” for the phylogenetically informative SAPs within Rhinocerotidae. b, d, f, h, Frequency (%) of phosphorylated (green) and non-phosphorylated (red) PSMs per amino acid position for AMBN, ENAM, AMELX and AMTN, respectively. Numbers within the bars provide the PSM counts. k, Violinplot of PSM coverage distribution for all covered sites (n=693) and those of phylogenetic relevance (SAPs, n=30). The boxplots define the range of the data, with whiskers extending to 1.5 the interquartile range, 25th and 75th percentiles (boxes), and medians (dots). All panels based on MQ results only. Supplementary File “Key MS-MS Spectra” contains spectral examples and fragment ion series alignments for each of the marked SAPs.
Extended Data Figure 3
Peptide and ion fragment coverage of amelogenin X (AMELX) isoforms 1 and 2 from specimen Dm.M6/7.II.296-16856 (Cervidae).
Peptides specific to amelogenin X (AMELX) isoforms 1 and 2 appear in the upper and lower parts of the figure, respectively. No amelogenin X isoform 2 is currently reported in public databases for the Cervidae group. Accordingly, the amelogenin X isoform 2-specific peptides were identified by MaxQuant spectral matching against bovine (Bos Taurus) amelogenin X isoform 2 (UniProt accession number P02817-2). Amelogenin X isoform 2, also known as leucine-rich amelogenin peptide (LRAP), is a naturally occurring amelogenin X isoform from the translation product of an alternatively spliced transcript.
Multiple lines of evidence support the authenticity and the endogenous origin of the sequences recovered. Dental enamel proteins are extremely tissue-specific and confined to the dental enamel mineral matrix[20]. The amino acid composition of the intra-crystalline protein fraction, measured by amino acid racemisation analysis, indicates that the dental enamel behaves as a closed system, unaffected by amino acid and protein residues exchange with the burial environment (Extended Data Fig. 4). The measured rate of asparagine and glutamine deamidation, a spontaneous form of hydrolytic damage consistently observed in ancient samples[22], is particularly advanced. Deamidation in Dmanisi enamel is higher than in the control enamel sample, supporting the antiquity of the peptides recovered (Fig. 2a, Supplementary Information). Other forms of non-enzymatic modifications are also present. Tyrosine (Y) experienced mono- and di-oxidation while tryptophan (W) was extensively converted into multiple oxidation products (Fig. 2b, Supplementary Information). Oxidative degradation of histidine (H) and conversion of arginine (R) leading to ornithine accumulation were also observed (Supplementary Information). These modifications are absent, or much less frequent, in the control sample. Similarly, unlike in the control, the peptide length distribution in the Dmanisi dataset is dominated by shorter fragments, generated by advanced, diagenetically-induced, terminal hydrolysis[23] (Fig. 2c, d). Together all these independent lines of evidence clearly define the substantial biomolecular damage affecting the proteomes retrieved and independently support the authenticity of the sequences reconstructed. To demonstrate beyond reasonable doubt the correct peptide sequence assignments of our MS2 spectra, we performed manual validation of peptide-spectrum-matches, conducted fragment ion intensity predictions, and generated synthetic peptides, for a range of phylogenetically informative and phosphorylated peptides (Methods and Supplementary Information: Key MS2 Spectra).
Extended Data Figure 4
Amino Acid Racemisation.
Extent of intra-crystalline racemization in enamel for the free amino acid (FAA, x-axis) fraction and the total hydrolysable amino acids (THAA, y-axis) fraction for four amino acids (Asx, Glx, Ala and Phe). Note differences in axis scale. Intra-crystalline data from Proboscidea enamel from a range of UK sites[64] has been shown for comparison (black crosses). Both taxa from Dmanisi and the UK exhibit a similar relationship between FAA and THAA racemization and R2 values have been calculated based on a polynomial relationship (order = 2, all >0.93).
Figure 2
Enamel proteome degradation.
a, Deamidation of asparagine (N) and glutamine (Q). Violin plots based on 1000 bootstrap replicates. The boxplots define the range of the data, with whiskers extending to 1.5 the interquartile range, 25th and 75th percentiles (boxes), and medians (dots). Tissue source (B = Bone, D = Dentine, E = Enamel) and the number of peptides used for the calculation are shown at the bottom. b, Extent of tryptophan (W) oxidation leading to several diagenetic products, measured as relative spectral counts. c, Alignment of peptides (positions 124-137, Enamelin) retrieved by digestion-free acid demineralisation from Pleistocene Stephanorhinus ex gr. etruscus-hundsheimensis specimen (GNM Dm.5/157-16635). d, Barplot of peptide length distribution of specimen Dm.5/157-16635 and Medieval (CTRL) undigested ovicaprine dental enamel proteomes.
We confidently detect phosphorylation (Fig. 3, Extended Data Figs. 2, 5), a stable and tightly in vivo regulated physiological post-translational modification (PTM) previously detected in dental enamel proteins[24,25]. Most of the phosphorylated sites we identified belong to the S-x-E/phS motif, recognised by the secreted kinases of the Fam20C family, which are involved in phosphorylation of extracellular proteins and regulation of biomineralization[26]. Spectra supporting the identification of serine phosphorylation were validated manually and by comparison with MS2 obtained from synthetic peptides (Supplementary Information), confirming the automated MaxQuant identifications. Phosphorylated serine and threonine residues may be subjected to spontaneous dephosphorylation. However, by complexing with the Ca2+ ions in the enamel hydroxyapatite matrix, the peptide-bound phosphate groups can remain stable over millennia, as recently observed in ancient bone[27]. Previous studies demonstrated that, when complexed with mineral matrix, ~3.8 Ma protein residues can be retrieved from sub-tropical environments[13]. Limited availability of free water in the enamel matrix further reduces spontaneous dephosphorylation via beta-elimination. Altogether, these observations demonstrate that the heavily modified dental enamel proteome retrieved from the ~1.77 Ma old Dmanisi faunal material is endogenous and almost complete.
Figure 3
Sequence motif analysis of ancient enamel proteome phosphorylation.
Indicated is the overrepresentation of specific amino acids within six positions N- and C-terminal of the phosphorylated amino acids (position 0). See Extended Data Figure 5 for MS2 examples of both S-x-E and S-x-phS phosphorylated motifs.
Extended Data Figure 5
Ancient enamel proteome phosphorylation.
Annotated spectra including phosphorylated serine (phS). a, Phosphorylation in the S-x-E motif (AMEL). b, Phosphorylation in the S-x-phS motif (AMBN). Phosphorylation was independently observed in all three separate analyses of Dm.5/157-16635, including multiple spectra and peptides (see Extended Data Fig. 2).
Next, we used the palaeoproteomic sequence information to improve taxonomic assignment and achieve sex attribution for some of the Dmanisi faunal remains. Phylogenetic analysis of the five largest enamel+dentine proteomes, and of a moderately large bone proteome, allowed to confirm or improve the morphological identification of their specimens of origin (Extended Data Fig. 6; Figs. S10-15). In addition, confident identification of peptides specific for the isoform Y of amelogenin, coded on the non-recombinant portion of the Y chromosome, indicates that four tooth specimens, namely Dm.6/151.4.A4.12-16630 (Pseudodama), Dm.69/64.3.B1.53-16631 (Cervidae), Dm.8/154.4.A4.22-16639 (Bovidae), and Dm.M6/7.II.296-16856 (Cervidae), belonged to male individuals[21] (Extended Data Fig. 7a-d).
Extended Data Figure 6
Phylogenetic relationships between the comparative reference dataset and specimen Dm.bXI-16857.
Consensus tree from Bayesian inference. The posterior probability of each bipartition is shown as a percentage to the left of each node.
Extended Data Figure 7
Amelogenin Y-specific matches.
a) Specimen Dm.6/151.4.A4.12-16630 (Pseudodama). b) Specimen Dm.69/64.3.B1.53-16631 (Cervidae). c) Specimen Dm.8/154.4.A4.22-16639 (Bovidae). d) Specimen Dm.M6/7.II.296-16856 (Cervidae). Note the presence of deamidated glutamine (deQ) and asparagine (deN), oxidated methionine (oxM), and phosphorylated serine (phS).
An enamel+dentine fragment, from the lower molar of a Stephanorhinus ex gr. etruscus-hundsheimensis (Dm.5/157-16635; Fig. 1c, Supplementary Information), returned the highest proteomic sequence coverage, encompassing a total of 875 amino acids, across 987 peptides (6 proteins; Extended Data Fig. 2; Supplementary Information). Following alignment of the enamel protein sequences retrieved from Dm.5/157-16635 against their homologues from all the extant rhinoceros species, plus the extinct woolly rhinoceros (†Coelodonta antiquitatis) and Merck’s rhinoceros (†Stephanorhinus kirchbergensis), phylogenetic reconstructions place the Dmanisi specimen closer to the extinct woolly and Merck’s rhinoceroses than to the extant Sumatran rhinoceros (Dicerorhinus sumatrensis), as an early divergent sister lineage (Fig. 4; Extended Data Fig. 8).
Figure 4
Phylogenetic relationships between the comparative enamel proteome dataset and specimen Dm.5/157-16635 (Stephanorhinus ex gr. etruscus-hundsheimensis).
Consensus tree from Bayesian inference on the concatenated alignment of six enamel proteins, using Homo sapiens as an outgroup. For each bipartition, we show the posterior probability obtained from the Bayesian inference. Additionally, for bipartitions where the Bayesian and the Maximum-likelihood inference support are different, we show (right) the support obtained in the latter. Scale indicates estimated branch lengths.
Extended Data Figure 8
Effect of the missingness in the tree topology.
a, Maximum-likelihood phylogeny obtained using PhyML and the protein alignment excluding the ancient Dmanisi rhinoceros Dm.5/157-16635. b, Topologies obtained from 100 random replicates of the Woolly rhinoceros (Coelodonta antiquitatis). In each replicate the amount of missing sites was similar to the one observed in the Dm.5/157-16635 specimen (72.4% missingness). The percentage shown for each topology indicates the number of replicates in which that particular topology was recovered. c, Similar to b, but for the Javan rhinoceros (Rhinoceros sondaicus). d, Similar to b, but for the black rhinoceros (Diceros bicornis).
Our phylogenetic reconstruction confidently recovers the expected differentiation of the Rhinoceros genus from other genera considered, in agreement with previous cladistic[28] and genetic analyses[29] (Supplementary Information). This topology defines two-horned rhinoceroses as monophyletic and the one-horned condition as plesiomorphic, as previously proposed (Supplementary Information). We caution, however, that the higher-level relationships we observe between the rhinoceros monophyletic clades might be affected by demographic events, such as incomplete lineage sorting[30] and/or gene flow between groups[31], due to the limited number of markers considered. A confident and stable reconstruction of the structure of the Rhinocerotidae family needs the strong support only high-resolution whole-genome sequencing can provide. Regardless, the highly supported placement of the Dmanisi rhinoceros in the (Stephanorhinus, Woolly, Sumatran) clade will remain unaffected, should deeper phylogenetic relationships between the Rhinoceros genus and other family members be revised (Extended Data Fig. 8).The phylogenetic relationships of the genus Stephanorhinus within the family Rhinocerotidae, as well as those of the several species recognized within this genus, are contentious. Stephanorhinus was initially included in the extant South-East Asian genus Dicerorhinus represented by the Sumatran rhinoceros species (D. sumatrensis)[32]. This hypothesis has been rejected and, based on morphological data, Stephanorhinus has been identified as a sister taxon of the woolly rhinoceros[33]. Furthermore, ancient DNA analysis supports a sister relationship between the woolly rhinoceros and D. sumatrensis
[7,34,35].As the Stephanorhinus ex gr. etruscus-hundsheimensis sequences from Dmanisi branch off basal to the common ancestor of the woolly and Merck’s rhinoceroses, these two species most likely derived from an early Stephanorhinus lineage expanding eastward from western Eurasia. Throughout the Plio-Pleistocene, Coelodonta adapted to continental and later to cold-climate habitats in central Asia. Its earliest representative, C. thibetana, displayed some clear Stephanorhinus-like anatomical features[33]. The presence in eastern Europe and Anatolia of the genus Stephanorhinus
[35] is documented at least since the late Miocene, and the Dmanisi specimen most likely represents an Early Pleistocene descendent of the Western-Eurasian branch of this genus.Ultimately, our phylogenetic reconstructions show that, as currently defined, the genus Stephanorhinus is paraphyletic, in line with previous morphological and palaeobiogeographical evidence (Supplementary Information). Accordingly, a systematic revision of the genera Stephanorhinus and Coelodonta, as well as their closest relatives, is needed.In this study, we show that enamel proteome sequencing can overcome the time limits of ancient DNA preservation and the reduced phylogenetic content of COL1 sequences. Given the abundance of teeth in the palaeontological record, the approach presented here holds the potential to address a wide range of questions pertaining to the Early and Middle Pleistocene evolutionary history of a large number of mammals, including hominins, at least in temperate climates.
Methods
Dmanisi & sample selection
Dmanisi is located about 65 km southwest of the capital city of Tbilisi in the Kvemo Kartli region of Georgia, at an elevation of 910 meters above sea level (Lat: 41° 20’ N, Lon: 44° 20’ E)[10,18]. The 23 fossil specimens we analysed were retrieved from stratum B1, in excavation blocks M17, M6, block 2, and area R11 (Extended Data Table 1, Extended Data Fig. 1). Stratum B deposits date between 1.78 Ma and 1.76 Ma[19]. All the analysed specimens were collected between 1984 and 2014 and their taxonomic identification was based on traditional comparative anatomy.After the sample preparation and data acquisition for all the Dmanisi specimens was concluded, we applied the whole experimental procedure to a medieval ovicaprine (sheep/goat) dental enamel+dentine specimen that was used as control. For this sample, we used extraction protocol “C”, and generated tandem MS data using a Q Exactive HF mass spectrometer (Thermo Fisher Scientific). The data were searched against the goat proteome, downloaded from the NCBI Reference Sequence Database (RefSeq) archive on 31st May 2017 (Supplementary Information). The ovicaprine specimen was found at the “Hotel Skandinavia” site in the city of Århus, Denmark and stored at the Natural History Museum of Denmark, Copenhagen.
Biomolecular preservation
We assessed the potential of ancient protein preservation prior to proteomic analysis by measuring the extent of amino acid racemisation in a subset of samples (6/23)[36]. Enamel chips, with all dentine removed, were powdered, and two subsamples per specimen were subject to analysis of their free (FAA) and total hydrolysable (THAA) amino acid fractions. Samples were analysed in duplicate by RP-HPLC, with standards and blanks run alongside each one of them (Supplementary Information). The D/L values of aspartic acid/asparagine, glutamic acid/glutamine, phenylalanine and alanine (D/L Asx, Glx, Phe, Ala) were assessed (Extended Data Fig. 4) to provide an overall estimate of intra-crystalline protein decomposition (IcPD).
Proteomics
All the sample preparation procedures for palaeoproteomic analysis were conducted in laboratories dedicated to the analysis of ancient DNA and ancient proteins in clean rooms fitted with filtered ventilation and positive pressure, in line with recent recommendations for ancient protein analysis[37]. A mock “extraction blank”, containing no starting material, was prepared, processed and analysed together with each batch of ancient samples.
Sample preparation
The external surface of bone samples was gently removed, and the remaining material was subsequently powdered. Enamel fragments, occasionally mixed with small amounts of dentine, were removed from teeth with a cutting disc and subsequently crushed into a rough powder. Ancient protein residues were extracted from approximately 180-220 mg of mineralised material, unless otherwise specified, using three different extraction protocols, hereafter referred to as “A”, “B” and “C” (Supplementary Information):
Extraction Protocol A - FASP
Tryptic peptides were generated using a filter-aided sample preparation (FASP) approach[38], as previously performed on ancient samples[39].
Extraction Protocol B - GuHCl Solution and Digestion
Bone or enamel+dentine powder was demineralised in 1 mL 0.5 M EDTA pH 8.0. After removal of the supernatant, all demineralised pellets were re-suspended in a 300 μL solution containing 2 M guanidine hydrochloride (GuHCl, Thermo Scientific), 100 mM Tris pH 8.0, 20 mM 2-Chloroacetamide (CAA), 10 mM Tris (2-carboxyethyl)phosphine (TCEP) in ultrapure H2O[40,41]. A total of 0.2 μg of mass spectrometry-grade rLysC (Promega P/N V1671) enzyme was added before the samples were incubated for 3-4 hours at 37°C with agitation. Samples and negative controls were subsequently diluted to 0.6 M GuHCl, and 0.8 μg of mass spectrometry-grade Trypsin (Promega P/N V5111) was added. Next, samples and negative controls were incubated overnight under mechanical agitation at 37°C. On the following day, samples were acidified, and the tryptic peptides were purified on C18 Stage-Tips, as previously described[42].
Extraction Protocol C - Digestion-Free ACID Demineralisation
Dental enamel powder, with possible trace amounts of dentine, was demineralised in 1.2 M HCl at room temperature, after which the solubilised protein residues were directly cleaned and concentrated on Stage-Tips, as described above. The sample prepared on Stage-Tip “#1217” was processed with 10% TFA instead of 1.2 M HCl. All the other parameters and procedures were identical to those used for all the other samples extracted with protocol “C”.
Tandem mass spectrometry
Different sets of samples (Supplementary Information §5.1, 5.2) were analysed by nanoflow liquid chromatography coupled to tandem mass spectrometry (nanoLC-MS/MS) on an EASY-nLC™ 1000 or 1200 system connected to a Q-Exactive, a Q-Exactive Plus, or to a Q-Exactive HF (Thermo Scientific, Bremen, Germany) mass spectrometer. Before and after each MS/MS run measuring ancient or extraction blank samples, two successive MS/MS runs were included in the sample queue in order to prevent carryover contamination between the samples. These consisted, first, of a MS/MS run ("MS/MS blank" run) with an injection exclusively of the buffer used to re-suspend the samples (0.1% TFA, 5% ACN), followed by a second MS/MS run ("MS/MS wash" run) with no injection.
Data analysis
Raw data files generated during MS/MS spectral acquisition were searched using MaxQuant[43], version 1.5.3.30, and PEAKS[44], version 7.5. A two-stage peptide-spectrum matching approach was adopted (Supplementary Information §5.3). Raw files were initially searched against a target/reverse database of collagen and enamel proteins retrieved from the UniProt and NCBI Reference Sequence Database (RefSeq) archives[45,46], taxonomically restricted to mammalian species. A database of partial “COL1A1” and “COL1A2” sequences from cervid species[47] was also included. The results from the preliminary analysis were used for a first, provisional reconstruction of protein sequences (MaxQuant search 1, MQ1).For specimens whose dataset resulted in a narrower, though not fully resolved, initial taxonomic placement, a second MaxQuant search (MQ2) was performed using a new protein database taxonomically restricted to the “order” taxonomic rank as determined after MQ1. For the MQ2 matching of the MS/MS spectra from specimen Dm.5/157-16635, partial sequences of serum albumin and enamel proteins from Sumatran (Dicerorhinus sumatrensis), Javan (Rhinoceros sondaicus), Indian (Rhinoceros unicornis), woolly (Coelodonta antiquitatis), Mercks (Stephanorhinus kirchbergensis), and Black rhinoceros (Diceros bicornis), were also added to the protein database. All the protein sequences from these species were reconstructed from draft genomes for each species (Dalen and Gilbert, unpublished data, Supplementary Information).For each MaxQuant and PEAKS search, enzymatic digestion was set to “unspecific” and the following variable modifications were included: oxidation (M), deamidation (NQ), N-term Pyro-Glu (Q), N-term Pyro-Glu (E), hydroxylation (P), phosphorylation (S). The error tolerance was set to 5 ppm for the precursor and to 20 ppm, or 0.05 Da, for the fragment ions in MaxQuant and PEAKS respectively. For searches of data generated from sample fractions partially or exclusively digested with trypsin, another MaxQuant and PEAKS search was conducted using the “enzyme” parameter set to “Trypsin/P”. Carbamidomethylation (C) was set: (i) as a fixed modification, for searches of data generated from sets of sample fractions exclusively digested with trypsin, or (ii) as a variable modification, for searches of data generated from sets of sample fractions partially digested with trypsin. For searches of data generated exclusively from undigested sample fractions, carbamidomethylation (C) was not included as a modification, neither fixed nor variable.The datasets re-analysed with MQ2 search, were also processed with the PEAKS software using the entire workflow (PEAKS de novo to PEAKS SPIDER) in order to detect hitherto unreported single amino acid polymorphisms (SAPs). Any amino acid substitution detected by the “SPIDER” homology search algorithm was validated by repeating the MaxQuant search (MQ3). In MQ3, the protein database used for MQ2 was modified to include the amino acid substitutions detected by the “SPIDER” algorithm.
Ancient protein sequence reconstruction
The peptide sequences confidently identified by the MQ1, MQ2, MQ3 were aligned using the software Geneious[48] (v. 5.4.4, substitution matrix BLOSUM62). The peptide sequences confidently identified by the PEAKS searches were aligned using an in-house R-script. A consensus sequence for each protein from each specimen was generated in FASTA format, without filtering on depth of coverage. Amino acid positions that were not confidently reconstructed were replaced by an “X”. Novel SAPs discovered through PEAKS were only accepted if these were further validated by repeating the MaxQuant search (MQ3). All isoleucine were converted into leucines, as standard MS/MS cannot differentiate between these two isobaric amino acids. For possible deamidated sites, we checked whether there were positions in our reference sequence database where both Q and E or both N and D occurred on the same position, and where we also had ancient sequences matching. For sample Dm.5/157-16635, only one such position existed, and this was replaced by an “X” in our consensus sequence. Based on parsimony, for other Q, E, N, and D positions we called the amino acid present in the reference proteome, regardless of their phylogenetic relevance. The output of the MQ2 and 3 searches was used to extend the coverage of the ancient protein sequences initially identified in the MQ1 iteration. For specimen DM.5/157-16335, all the experimentally identified peptides, as well as the respective best matching MS/MS spectra covering the sites informative for Rhinocerotidae phylogenetic inference, are provided as Supplementary Information (“Key MS-MS Spectra” file). All the reported MS/MS spectra are annotated using the advanced annotation mode of MaxQuant. Selected spectra matching to peptides covering phylogenetically informative amino acid positions were manually inspected, validated and annotated by an experienced mass spectrometrist, in all cases in full agreement with bioinformatic sequence assignment (Supplementary Information, “Key MS-MS Spectra” file). We utilized MS[2]PIP fragment ion spectral intensity prediction[49] (version: v20190107; model: HCD) to demonstrate that the experimentally observed fragment ion intensities are highly correlated with the theoretical ones (Fig. S3). Finally, we generated synthetic peptides for 19 selected peptides covering Rhinocerotidae SAPs in DM.5/157-16635.
Post translational modifications
Deamidation
After removal of likely contaminants, the extent of glutamine and asparagine deamidation was estimated for individual specimens, by using the MaxQuant output files as previously published[41] (Supplementary Information).
Other Spontaneous Chemical Modifications
Spontaneous post-translational modifications (PTMs) associated with chemical protein damage were searched using the PEAKS PTM tool and the dependent peptides search mode[50] in MaxQuant. In the PEAKS PTM search, all modifications in the Unimod database were considered. The mass error was set to 5.0 ppm and 0.5 Da for precursor and fragment, respectively. For PEAKS, the de novo ALC score was set to a threshold of 15 % and the peptide hit threshold to 30. The results were filtered by an FDR of 5 %, de novo ALC score of 50 %, and a protein hit threshold of ≥ 20. The MaxQuant dependent peptides search was carried out with the same search settings as described above and with a dependent peptide FDR of 1 % and a mass bin size of 0.0065 Da.
Phosphorylation
Class I phosphorylation sites were selected with localisation probabilities of ≥0.98 in the Phosph(ST)Sites MaxQuant output file. Sequence windows of ±6 aa from all identified sites were compared against a background file containing all non-phosphorylated peptides using a linear kinase sequence motif enrichment analysis in IceLogo (version 1.3.8)[51].
Phylogenetic Analysis
Reference datasets
We assembled a reference dataset consisting of publicly available protein sequences from representative ungulate species belonging to the following families: Equidae, Rhinocerotidae, Suidae and Bovidae (Supplementary Information §7 and §8). As Cervidae and carnivores are absent from protein sequence databases to a various extent, we did not attempt phylogenetic placement of samples from these taxa. Instead, we conducted our phylogenetic analysis on the five best-performing enamel proteomes (Dm.5/154.2.A4.38-16632), Dm.5/157-16635, Dm.5/154.1.B1.1-16638, Dm.8/154.4.A4.22-16639, Dm.8/152.3.B1.2-16641) and the largest bone proteome (Dm.bXI.North.B1a.collection-16658) we recovered (see Extended Data Table 2).We extended this dataset with the protein sequences from extinct and extant rhinoceros species including: the woolly rhinoceros (†Coelodonta antiquitatis), the Merck’s rhinoceros (†Stephanorhinus kirchbergensis), the Sumatran rhinoceros (Dicerorhinus sumatrensis), the Javan rhinoceros (Rhinoceros sondaicus), the Indian rhinoceros (Rhinoceros unicornis), and the Black rhinoceros (Diceros bicornis). Their corresponding protein sequences were obtained following translation of high-throughput DNA sequencing data, after filtering reads with mapping quality lower than 30 and nucleotides with base quality lower than 20, and calling the majority rule consensus sequence using ANGSD[52] For the woolly and Merck’s rhinoceroses we excluded the first and last five nucleotides of each DNA fragment in order to minimize the effect of post-mortem ancient DNA damage[53]. Each consensus sequence was formatted as a separate blast nucleotide database. We then performed a tblastn[54] alignment using the corresponding white rhinoceros sequence as a query, favouring ungapped alignments in order to recover translated and spliced protein sequences. Resulting alignments were processed using ProSplign algorithm from the NCBI Eukaryotic Genome Annotation Pipeline[55] to recover the spliced alignments and translated protein sequences.
Construction of phylogenetic trees
For each specimen, multiple sequence alignments for each protein were built using MAFFT[56] and concatenated onto a single alignment per specimen. These were inspected visually to correct obvious alignment mistakes, and all the isoleucine residues were substituted with leucine ones to account for indistinguishable isobaric amino acids at the positions where the ancient protein carried one of such amino acids. Based on these alignments, we inferred the phylogenetic relationship between the ancient samples and the species included in the reference dataset by using three approaches: distance-based neighbour-joining, maximum likelihood and Bayesian phylogenetic inference (Supplementary Information).Neighbour-joining trees were built using the phangorn[57] R package, restricting to sites covered in the ancient samples. Genetic distances were estimated using the JTT model, considering pairwise deletions. We estimated bipartition support through a non-parametric bootstrap procedure using 500 pseudoreplicates. We used PHyML 3.1[58] for maximum likelihood inference based on the whole concatenated alignment. For likelihood computation, we used the JTT substitution model with two additional parameters for modelling rate heterogeneity and the proportion of invariant sites. Bipartition support was estimated using a non-parametric bootstrap procedure with 500 replicates. Bayesian phylogenetic inference was carried out using MrBayes 3.2.6[59] on each concatenated alignment, partitioned per gene. While we chose the JTT substitution model in the two approaches above, we allowed the Markov chain to sample parameters for the substitution rates from a set of predetermined matrices, as well as the shape parameter of a gamma distribution for modelling across-site rate variation and the proportion of invariable sites. The MCMC algorithm was run with 4 chains for 5,000,000 cycles. Sampling was conducted every 500 cycles and the first 25% were discarded as burn-in. Convergence was assessed using Tracer v. 1.6.0, which estimated an ESS greater than 5,500 for each individual, indicating reasonable convergence for all runs.
Ancient DNA Analysis
The samples were processed using strict aDNA guidelines in a clean lab facility at the Natural History Museum of Denmark, University of Copenhagen. DNA extraction was attempted on five of the ancient animal samples (Supplementary Information §9, §13). Powdered samples (120-140 mg) were extracted using a silica-in-solution method[12,60]. To prepare the samples for NGS sequencing, 20 μL of DNA extract was built into a blunt-end library using the NEBNext DNA Sample Prep Master Mix Set 2 (E6070) with Illumina-specific adapters. The libraries were PCR-amplified with inPE1.0 forward primers and custom-designed reverse primers with a 6-nucleotide index[61]. Two extracts (MA399 and MA2481, from specimens 16859 and 16635 respectively) yielded detectable DNA concentrations (Table S9). The libraries generated from specimen 16859 and 16635 were processed on different flow cells. They were pooled with others for sequencing on an Illumina 2000 platform (MA399_L1, MA399_L2), using 100bp single read chemistry, and on an Illumina 2500 platform (MA2481_L1), using 81bp single read chemistry.The data were base-called using the Illumina software CASAVA 1.8.2 and sequences were demultiplexed with a requirement of a full match of the six nucleotide indexes that were used. Raw reads were processed using the PALEOMIX pipeline following published guidelines[62], mapping against the cow nuclear genome (Bos taurus 4.6.1, accession GCA_000003205.4), the cow mitochondrial genome (Bos taurus), the red deer mitochondrial genome (Cervus elaphus, accession AB245427.2), and the human nuclear genome (GRCh37/hg19), using BWA backtrack[63] v0.5.10 with the seed disabled. All other parameters were set as default. PCR duplicates from mapped reads were removed using the picard tool MarkDuplicate [http://picard.sourceforge.net/].
Sample Dm.5/157-16635 Morphological Measurements
We followed the methodology introduced by Guérin[32]. The maximal length of the tooth is measured with a digital calliper at the lingual side of the tooth and parallel to the occlusal surface. All measurements are given in mm (Supplementary Information §3).
Generalized stratigraphic profiles for Dmanisi, indicating specimen origins.
a, Type section of the Dmanisi M5 Excavation block. b, Stratigraphic profile of excavation area M6. M6 preserves a larger gully associated with the pipe-gully phase of stratigraphic-geomorphic development in Stratum B1. The thickness of Stratum B1 gully fill extends to the basalt surface, but includes “rip-ups” of Strata A1 and A2, showing that B1 deposits post-date Stratum A. c, Stratigraphic section of excavation area M17. Here, Stratum B1 was deposited after erosion of Stratum A deposits. The stratigraphic position of the Stephanorhinus sample Dm.5/157-16635 is highlighted with a red diamond. The Masavara basalt is ca. 50 cm below the base of the shown profile. d, Northern section of Block 2. Following collapse of a pipe and erosion to the basalt, the deeper part of this area was filled with local gully fill of Stratum B1/x/y/z. Note the uniform burial of all Stratum B1 deposits by Strata B2-B4. Sampled specimens are indicated by CGG five-digit numbers. See Extended Data Table 1 for both CGG and GNM specimen numbers.
Proteomic sequence coverage for specimen Dm.5/157-16635 (Stephanorhinus).
a, c, e, g, i, j, PSM sequence coverage of proteins AMBN, ENAM, AMELX, AMTN, MMP20 and ALB, respectively. Annotations include: “amino acid position, amino acid called in that position (number of PSMs/peptides covering that position)” for the phylogenetically informative SAPs within Rhinocerotidae. b, d, f, h, Frequency (%) of phosphorylated (green) and non-phosphorylated (red) PSMs per amino acid position for AMBN, ENAM, AMELX and AMTN, respectively. Numbers within the bars provide the PSM counts. k, Violinplot of PSM coverage distribution for all covered sites (n=693) and those of phylogenetic relevance (SAPs, n=30). The boxplots define the range of the data, with whiskers extending to 1.5 the interquartile range, 25th and 75th percentiles (boxes), and medians (dots). All panels based on MQ results only. Supplementary File “Key MS-MS Spectra” contains spectral examples and fragment ion series alignments for each of the marked SAPs.
Peptide and ion fragment coverage of amelogenin X (AMELX) isoforms 1 and 2 from specimen Dm.M6/7.II.296-16856 (Cervidae).
Peptides specific to amelogenin X (AMELX) isoforms 1 and 2 appear in the upper and lower parts of the figure, respectively. No amelogenin X isoform 2 is currently reported in public databases for the Cervidae group. Accordingly, the amelogenin X isoform 2-specific peptides were identified by MaxQuant spectral matching against bovine (Bos Taurus) amelogenin X isoform 2 (UniProt accession number P02817-2). Amelogenin X isoform 2, also known as leucine-rich amelogenin peptide (LRAP), is a naturally occurring amelogenin X isoform from the translation product of an alternatively spliced transcript.
Amino Acid Racemisation.
Extent of intra-crystalline racemization in enamel for the free amino acid (FAA, x-axis) fraction and the total hydrolysable amino acids (THAA, y-axis) fraction for four amino acids (Asx, Glx, Ala and Phe). Note differences in axis scale. Intra-crystalline data from Proboscidea enamel from a range of UK sites[64] has been shown for comparison (black crosses). Both taxa from Dmanisi and the UK exhibit a similar relationship between FAA and THAA racemization and R2 values have been calculated based on a polynomial relationship (order = 2, all >0.93).
Ancient enamel proteome phosphorylation.
Annotated spectra including phosphorylated serine (phS). a, Phosphorylation in the S-x-E motif (AMEL). b, Phosphorylation in the S-x-phS motif (AMBN). Phosphorylation was independently observed in all three separate analyses of Dm.5/157-16635, including multiple spectra and peptides (see Extended Data Fig. 2).
Phylogenetic relationships between the comparative reference dataset and specimen Dm.bXI-16857.
Consensus tree from Bayesian inference. The posterior probability of each bipartition is shown as a percentage to the left of each node.
Amelogenin Y-specific matches.
a) Specimen Dm.6/151.4.A4.12-16630 (Pseudodama). b) Specimen Dm.69/64.3.B1.53-16631 (Cervidae). c) Specimen Dm.8/154.4.A4.22-16639 (Bovidae). d) Specimen Dm.M6/7.II.296-16856 (Cervidae). Note the presence of deamidated glutamine (deQ) and asparagine (deN), oxidated methionine (oxM), and phosphorylated serine (phS).
Effect of the missingness in the tree topology.
a, Maximum-likelihood phylogeny obtained using PhyML and the protein alignment excluding the ancient Dmanisi rhinoceros Dm.5/157-16635. b, Topologies obtained from 100 random replicates of the Woolly rhinoceros (Coelodonta antiquitatis). In each replicate the amount of missing sites was similar to the one observed in the Dm.5/157-16635 specimen (72.4% missingness). The percentage shown for each topology indicates the number of replicates in which that particular topology was recovered. c, Similar to b, but for the Javan rhinoceros (Rhinoceros sondaicus). d, Similar to b, but for the black rhinoceros (Diceros bicornis).
Genome and proteome survival in 23 Dmanisi fossil fauna specimens.
For each specimen, the Centre for GeoGenetics (CGG) reference number and the Georgian National Museum (GNM) specimen field number are reported. *or the narrowest possible taxonomic identification achievable using comparative anatomy methods. †Only collagens survive. B = Bone, D = Dentine, E = Enamel. Extractions of enamel might include some residual dentine. Accordingly, both tissues are either listed separately (○D, ●E, in case of no collagen preservation), or together (●E+D, in case of collagen preservation). Open circles (○) indicate no molecular preservation; (●) closed circles indicate molecular preservation.
Proteome composition and coverage.
Aggregated data from different extraction methods and/or tissues from the same specimen. In those cells reporting two values separated by the “|” symbol, the first value refers to MaxQuant (MQ) searches performed selecting unspecific digestion, while the second value refers to MQ searches performed selecting trypsin digestion. For those cells including one value only, it refers to MQ searches performed selecting unspecific digestion. Final amino acid coverage, incorporating both MQ and PEAKS searches, is reported in the last column. *supporting all peptides. See Extended Data Table 1 for tissue sources per specimen and both CGG and GNM specimen numbers.
Supplementary Material
Supplementary information is available in the online version of the paper.
Authors: David Lordkipanidze; Marcia S Ponce de León; Ann Margvelashvili; Yoel Rak; G Philip Rightmire; Abesalom Vekua; Christoph P E Zollikofer Journal: Science Date: 2013-10-18 Impact factor: 47.728
Authors: Ludovic Orlando; Aurélien Ginolhac; Guojie Zhang; Duane Froese; Anders Albrechtsen; Mathias Stiller; Mikkel Schubert; Enrico Cappellini; Bent Petersen; Ida Moltke; Philip L F Johnson; Matteo Fumagalli; Julia T Vilstrup; Maanasa Raghavan; Thorfinn Korneliussen; Anna-Sapfo Malaspinas; Josef Vogt; Damian Szklarczyk; Christian D Kelstrup; Jakob Vinther; Andrei Dolocan; Jesper Stenderup; Amhed M V Velazquez; James Cahill; Morten Rasmussen; Xiaoli Wang; Jiumeng Min; Grant D Zazula; Andaine Seguin-Orlando; Cecilie Mortensen; Kim Magnussen; John F Thompson; Jacobo Weinstock; Kristian Gregersen; Knut H Røed; Véra Eisenmann; Carl J Rubin; Donald C Miller; Douglas F Antczak; Mads F Bertelsen; Søren Brunak; Khaled A S Al-Rasheid; Oliver Ryder; Leif Andersson; John Mundy; Anders Krogh; M Thomas P Gilbert; Kurt Kjær; Thomas Sicheritz-Ponten; Lars Juhl Jensen; Jesper V Olsen; Michael Hofreiter; Rasmus Nielsen; Beth Shapiro; Jun Wang; Eske Willerslev Journal: Nature Date: 2013-06-26 Impact factor: 49.962
Authors: Enrico Cappellini; Ana Prohaska; Fernando Racimo; Frido Welker; Mikkel Winther Pedersen; Morten E Allentoft; Peter de Barros Damgaard; Petra Gutenbrunner; Julie Dunne; Simon Hammann; Mélanie Roffet-Salque; Melissa Ilardo; J Víctor Moreno-Mayar; Yucheng Wang; Martin Sikora; Lasse Vinner; Jürgen Cox; Richard P Evershed; Eske Willerslev Journal: Annu Rev Biochem Date: 2018-04-25 Impact factor: 23.643
Authors: Elena R Schroeter; Caroline J DeHart; Timothy P Cleland; Wenxia Zheng; Paul M Thomas; Neil L Kelleher; Marshall Bern; Mary H Schweitzer Journal: J Proteome Res Date: 2017-01-23 Impact factor: 4.466
Authors: Eske Willerslev; M Thomas P Gilbert; Jonas Binladen; Simon Y W Ho; Paula F Campos; Aakrosh Ratan; Lynn P Tomsho; Rute R da Fonseca; Andrei Sher; Tatanya V Kuznetsova; Malgosia Nowak-Kemp; Terri L Roth; Webb Miller; Stephan C Schuster Journal: BMC Evol Biol Date: 2009-05-11 Impact factor: 3.260