Gerard C P Schaafsma, Mauno Vihinen.
Abstract
BACKGROUND: Benchmark datasets are essential for both method development and performance assessment. These datasets have numerous requirements, representativeness being one. In the case of variant tolerance/pathogenicity prediction, representativeness means that the dataset covers the space of variations and their effects.
Keywords: Benchmark datasets; Mutation; Representativeness; Variation; Variation interpretation
Year: 2018 PMID: 30497376 PMCID: PMC6267811 DOI: 10.1186/s12859-018-2478-6
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
General properties of the investigated benchmark datasets
| dataset | collection | subset of VariBench dataset | original filename | no. of variants | no. of variants mapped to PDB | % mapped to PDB | no. of variants in ExAC | % in ExAC |
|---|---|---|---|---|---|---|---|---|
| DS1 | VariSNP | | Neutral single nucleotide variants | 446,013 | 39,081 | 8.76 | | |
| DS2 | VariBench Dataset 1 | | Neutral_dbSNP_build_131_mapped | 23,671 | 2358 | 9.96 | | |
| DS3 | VariBench Dataset 1 | | Pathogenic_SNP_mapped | 19,335 | 10,242 | 52.97 | 263 | 1.36 |
| DS4 | VariBench Dataset 2 | 1 | Neutral_dataset_Olatubosun_et_al_with_mapping_annotated | 19,459 | 2245 | 11.54 | | |
| DS5 | VariBench Dataset 2 | 1 | Pathogenic_training_dataset_from_PONP | 14,610 | 7261 | 49.70 | 221 | 1.51 |
| DS6 | VariBench Dataset 4 | 1 | Neutral_dataset_from_Thusberg_et_al_clustered_with_mapping | 17,623 | 1743 | 9.89 | | |
| DS7 | VariBench Dataset 4 | 1 | Pathogenic_dataset_Thusberg_et_al_clustered_with_mapping | 17,525 | 9519 | 54.32 | 227 | 1.30 |
| DS8 | VariBench Dataset 5 | 2 | Neutral_dataset_Olatubosun_et_al_clustered_with_mapping | 14,647 | 1706 | 11.65 | | |
| DS9 | VariBench Dataset 5 | 2 | Pathogenic_dataset_Olatubosun_et_al_clustered_with_mapping | 13,096 | 6652 | 50.79 | 195 | 1.49 |
| DS10 | VariBench Dataset 7 | 2 | Neutral_PON-P2_training_data | 13,063 | 1731 | 13.25 | | |
| DS11 | VariBench Dataset 7 | 2 | Pathogenic_PON-P2_training_data | 12,584 | 6420 | 51.02 | 173 | 1.37 |
| DS12 | VariBench Dataset 7 | 2 | Neutral_PON-P2_test_data | 1605 | 150 | 9.35 | | |
| DS13 | VariBench Dataset 7 | 2 | Pathogenic_PON-P2_test_data.csv | 1301 | 481 | 36.97 | 23 | 1.77 |
| DS14 | VariBench Dataset 7 | 2 | Neutral_PON-P2_c95_training | 8664 | 953 | 11.00 | | |
| DS15 | VariBench Dataset 7 | 2 | Pathogenic_PON-P2_c95_training | 7151 | 3728 | 52.13 | 81 | 1.13 |
| DS16 | VariBench Dataset 7 | 2 | Neutral_PON-P2_c95_test | 1053 | 82 | 7.79 | | |
| DS17 | VariBench Dataset 7 | 2 | Pathogenic_PON-P2_c95_test | 751 | 272 | 36.22 | 12 | 1.60 |
| DS18 | VariBench Dataset 9 | | predictSNP_selected_tool_scores | 16,098 | 4494 | 27.92 | | |
| DS19 | VariBench Dataset 9 | | varibench_selected_tool_scores | 10,266 | 3418 | 33.29 | | |
| DS20 | VariBench Dataset 9 | | exovar_filtered_tool_scores | 8850 | 2985 | 33.73 | | |
| DS21 | VariBench Dataset 9 | | humvar_filtered_tool_scores | 40,389 | 10,990 | 27.21 | | |
| DS22 | PolyPhen-2 | | humvar-2011_12.neutral.humvar.output | 21,151 | 2169 | 10.25 | | |
| DS23 | PolyPhen-2 | | humvar-2011_12.deleterious.humvar.output | 22,196 | 10,290 | 46.36 | 342 | 1.54 |
| DS24 | SwissVar | | SwissVar_latest | 75,042 | 12,749 | 16.99 | 16,049 | 21.39 |
Mapping of the datasets to UniProt protein sequences
| dataset | no. of unique UniProt protein sequences | no. of variants mapped to a UniProt sequence | % variants mapped | maximum no. of variants mapped to a UniProt sequence | UniProt ID with maximum no. of variants | protein name | gene |
|---|---|---|---|---|---|---|---|
| DS1 | 17,571 | 378,706 | 84.9 | 1451 | Q8WZ42 | Titin | |
| DS2 | 7230 | 18,660 | 78.8 | 71 | P20929 | Nebulin | |
| DS3 | 1182 | 19,318 | 99.9 | 2294 | P04637 | Cellular tumor antigen p53 | |
| DS4 | 6541 | 15,880 | 81.6 | 56 | P46013 | Proliferation marker protein Ki-67 | |
| DS5 | 1093 | 14,597 | 99.9 | 382 | P00451 | Coagulation factor VIII | |
| DS6 | 4895 | 13,811 | 78.4 | 71 | P20929 | Nebulin | |
| DS7 | 953 | 17,514 | 99.9 | 2294 | P04637 | Cellular tumor antigen p53 | |
| DS8 | 4517 | 11,847 | 80.9 | 56 | P46013 | Proliferation marker protein Ki-67 | |
| DS9 | 884 | 13,096 | 100.0 | 382 | P00451 | Coagulation factor VIII | |
| DS10 | 4997 | 10,882 | 83.3 | 27 | Q86WI1 | Fibrocystin-L | |
| DS11 | 979 | 12,584 | 100.0 | 378 | P00451 | Coagulation factor VIII | |
| DS12 | 545 | 1288 | 80.2 | 14 | Q13576 | Ras GTPase-activating-like protein IQGAP2 | |
| DS13 | 90 | 1301 | 100.0 | 100 | P04839 | Cytochrome b-245 heavy chain | |
| DS14 | 3799 | 7185 | 82.9 | 26 | Q86WI1 | Fibrocystin-L | |
| DS15 | 785 | 7151 | 100.0 | 196 | P00439 | Phenylalanine-4-hydroxylase | |
| DS16 | 424 | 848 | 80.5 | 11 | Q8NEM0 | Microcephalin | |
| DS17 | 72 | 751 | 100.0 | 89 | P04839 | Cytochrome b-245 heavy chain | |
| DS18 | 3278 | 12,056 | 74.9 | 363 | P00451 | Coagulation factor VIII | |
| DS19 | 4129 | 10,154 | 98.9 | 1799 | P04637 | Cellular tumor antigen p53 | |
| DS20 | 3509 | 8662 | 97.9 | 137 | P68871 | Hemoglobin subunit beta | |
| DS21 | 9038 | 39,735 | 98.4 | 460 | P00451 | Coagulation factor VIII | |
| DS22 | 8791 | 21,151 | 100.0 | 48 | P20930, Q7Z442 | Filaggrin, Polycystic kidney disease protein 1-like 2 | |
| DS23 | 1852 | 22,196 | 100.0 | 472 | P00451 | Coagulation factor VIII | |
| DS24 | 12,735 | 75,042 | 100.0 | 1338 | P04637 | Cellular tumor antigen p53 | |
Analysis of the chromosomal distribution of variants in dataset DS1
| Chromosome | no. of genes | CDS length | no. of observed variants | no. of expected variants (no. of genes) | no. of expected variants (CDS length) | p-value (no. of genes)^a | p-value (CDS length)^a |
|---|---|---|---|---|---|---|---|
| 1 | 2037 | 3,483,903 | 45,856 | 45,915 | 45,339 | 0.773155 | 0.010565 |
| 2 | 1238 | 2,517,642 | 31,391 | 27,905 | 32,765 | < 10⁻⁴ | < 10⁻⁴ |
| 3 | 1071 | 1,965,098 | 24,735 | 24,141 | 25,574 | < 10⁻⁴ | < 10⁻⁴ |
| 4 | 745 | 1,365,661 | 16,936 | 16,793 | 17,773 | 0.260634 | < 10⁻⁴ |
| 5 | 882 | 1,601,648 | 19,148 | 19,881 | 20,844 | < 10⁻⁴ | < 10⁻⁴ |
| 6 | 1035 | 1,735,760 | 22,495 | 23,330 | 22,589 | < 10⁻⁴ | 0.523159 |
| 7 | 901 | 1,609,177 | 21,764 | 20,309 | 20,942 | < 10⁻⁴ | < 10⁻⁴ |
| 8 | 668 | 1,135,640 | 16,239 | 15,057 | 14,779 | < 10⁻⁴ | < 10⁻⁴ |
| 9 | 770 | 1,382,150 | 19,117 | 17,356 | 17,987 | < 10⁻⁴ | < 10⁻⁴ |
| 10 | 727 | 1,322,286 | 17,489 | 16,387 | 17,208 | < 10⁻⁴ | 0.0292 |
| 11 | 1278 | 2,005,315 | 28,704 | 28,807 | 26,097 | 0.532354 | < 10⁻⁴ |
| 12 | 1033 | 1,776,908 | 20,797 | 23,284 | 23,125 | < 10⁻⁴ | < 10⁻⁴ |
| 13 | 324 | 634,435 | 7401 | 7303 | 8257 | 0.247573 | < 10⁻⁴ |
| 14 | 614 | 1,079,560 | 13,972 | 13,840 | 14,049 | 0.254342 | 0.511939 |
| 15 | 589 | 1,189,858 | 14,846 | 13,276 | 15,485 | < 10⁻⁴ | < 10⁻⁴ |
| 16 | 858 | 1,451,775 | 22,351 | 19,340 | 18,893 | < 10⁻⁴ | < 10⁻⁴ |
| 17 | 1184 | 1,971,211 | 26,518 | 26,688 | 25,653 | 0.284589 | < 10⁻⁴ |
| 18 | 268 | 534,152 | 6644 | 6041 | 6951 | < 10⁻⁴ | 0.000187 |
| 19 | 1467 | 2,277,812 | 34,032 | 33,067 | 29,643 | < 10⁻⁴ | < 10⁻⁴ |
| 20 | 540 | 811,690 | 11,340 | 12,172 | 10,563 | < 10⁻⁴ | < 10⁻⁴ |
| 21 | 233 | 342,226 | 5194 | 5252 | 4454 | 0.424789 | < 10⁻⁴ |
| 22 | 439 | 712,404 | 10,412 | 9895 | 9271 | < 10⁻⁴ | < 10⁻⁴ |
| X | 840 | 1,296,174 | 8557 | 18,934 | 16,868 | < 10⁻⁴ | < 10⁻⁴ |
| Y | 45 | 67,500 | 51 | 1014 | 878 | < 10⁻⁴ | 0.010565 |

^a Results of binomial test
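
The expected counts in the table are consistent with distributing the total number of observed variants over the chromosomes in proportion to gene count (the CDS-length column would work the same way). The following is a minimal sketch of that calculation using the gene counts and observed counts from the table. The function names are illustrative, and the p-value uses a normal approximation with continuity correction rather than the exact binomial test, so it is not expected to reproduce the published p-values digit for digit.

```python
import math

# Gene counts and observed variant counts for chromosomes 1-22, X, Y,
# copied from the table above.
GENES = [2037, 1238, 1071, 745, 882, 1035, 901, 668, 770, 727, 1278, 1033,
         324, 614, 589, 858, 1184, 268, 1467, 540, 233, 439, 840, 45]
OBSERVED = [45856, 31391, 24735, 16936, 19148, 22495, 21764, 16239, 19117,
            17489, 28704, 20797, 7401, 13972, 14846, 22351, 26518, 6644,
            34032, 11340, 5194, 10412, 8557, 51]


def expected_count(weight, total_weight, total_observed):
    """Expected variants if they were spread in proportion to `weight`."""
    return total_observed * weight / total_weight


def binomial_p_normal_approx(observed, n, p):
    """Two-sided p-value for a binomial count (normal approximation).

    Assumption: this approximation stands in for the exact binomial test.
    """
    mean = n * p
    sd = math.sqrt(n * p * (1.0 - p))
    z = max(abs(observed - mean) - 0.5, 0.0) / sd  # continuity correction
    return math.erfc(z / math.sqrt(2.0))


total_genes = sum(GENES)
total_observed = sum(OBSERVED)

# Chromosome 1 is index 0, chromosome X is index 22.
exp_chr1 = expected_count(GENES[0], total_genes, total_observed)
p_chr1 = binomial_p_normal_approx(OBSERVED[0], total_observed,
                                  GENES[0] / total_genes)
exp_chrx = expected_count(GENES[22], total_genes, total_observed)
p_chrx = binomial_p_normal_approx(OBSERVED[22], total_observed,
                                  GENES[22] / total_genes)

print(round(exp_chr1), round(exp_chrx))
```

Under this approximation chromosome 1 is not flagged as biased, whereas chromosome X, with far fewer observed than expected variants, is.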
Summary of the chromosomal distributions in the datasets: number of chromosomes with non-biased distribution in each dataset
| dataset | no. of chromosomes with non-biased distribution |
|---|---|
| DS1 | 7 |
| DS2 | 8 |
| DS3 | 3 |
| DS4 | 9 |
| DS5 | 5 |
| DS6 | 9 |
| DS7 | 3 |
| DS8 | 7 |
| DS9 | 6 |
| DS10 | 9 |
| DS11 | 5 |
| DS12 | 9 |
| DS13 | 3 |
| DS14 | 11 |
| DS15 | 5 |
| DS16 | 13 |
| DS17 | 6 |
| DS18 | 6 |
| DS19 | 5 |
| DS20 | 11 |
| DS21 | 4 |
| DS22 | 11 |
| DS23 | 3 |
| DS24 | 2 |

Number of datasets with non-biased distribution for each chromosome
| chromosome | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | X | Y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| no. of datasets | 9 | 4 | 5 | 5 | 2 | 3 | 6 | 7 | 8 | 3 | 8 | 11 | 12 | 10 | 19 | 12 | 5 | 5 | 1 | 6 | 5 | 7 | 0 | 7 |
Mapping of the datasets to PDB structures and CATH domains
| dataset | no. of variants mapped to PDB | no. of variants mapped to CATH domain | % mapped to CATH domain (of mapped to PDB) | no. of variants with a CATH classification | % with a CATH classification (of mapped to PDB) | no. of unique CATH superfamilies |
|---|---|---|---|---|---|---|
| DS1 | 39,081 | 23,303 | 59.63 | 21,853 | 55.92 | 700 |
| DS2 | 2358 | 1387 | 58.82 | 1319 | 55.94 | 319 |
| DS3 | 10,242 | 6580 | 64.25 | 6396 | 62.45 | 239 |
| DS4 | 2245 | 1325 | 59.02 | 1262 | 56.21 | 306 |
| DS5 | 7261 | 4687 | 64.55 | 4556 | 62.75 | 227 |
| DS6 | 1743 | 991 | 56.86 | 941 | 53.99 | 277 |
| DS7 | 9519 | 6100 | 64.08 | 5920 | 62.19 | 234 |
| DS8 | 1706 | 973 | 57.03 | 928 | 54.40 | 269 |
| DS9 | 6652 | 4301 | 64.66 | 4170 | 62.69 | 223 |
| DS10 | 1731 | 865 | 49.97 | 826 | 47.72 | 253 |
| DS11 | 6420 | 4350 | 67.76 | 4212 | 65.61 | 220 |
| DS12 | 150 | 66 | 44.00 | 62 | 41.33 | 32 |
| DS13 | 481 | 142 | 29.52 | 135 | 28.07 | 18 |
| DS14 | 953 | 478 | 50.16 | 454 | 47.64 | 186 |
| DS15 | 3728 | 2557 | 68.59 | 2486 | 66.68 | 188 |
| DS16 | 82 | 38 | 46.34 | 36 | 43.90 | 21 |
| DS17 | 272 | 78 | 28.68 | 73 | 26.84 | 12 |
| DS18 | 4494 | 2980 | 66.31 | 2862 | 63.68 | 274 |
| DS19 | 3418 | 2081 | 60.88 | 2035 | 59.54 | 210 |
| DS20 | 2985 | 2086 | 69.88 | 2031 | 68.04 | 235 |
| DS21 | 10,990 | 7051 | 64.16 | 6786 | 61.75 | 402 |
| DS22 | 2169 | 1301 | 59.98 | 1217 | 56.11 | 291 |
| DS23 | 10,290 | 6566 | 63.81 | 6353 | 61.74 | 307 |
| DS24 | 12,749 | 7828 | 61.40 | 7499 | 58.82 | 347 |
Fig. 1 Number of unique CATH domains in relation to the log number of variants mapped to a PDB structure in each dataset. +: neutral datasets, *: pathogenic datasets, x: mixed datasets. The largest dataset, DS1, also had the largest number of unique CATH superfamilies, and there appears to be a positive correlation between the number of mapped variants and the number of unique CATH superfamilies.
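
The relationship described in the caption can be checked against the table above. The sketch below computes a Pearson correlation between the log10 number of variants mapped to PDB and the number of unique CATH superfamilies for DS1 to DS24. The choice of Pearson correlation and the helper name `pearson` are assumptions made for illustration; the source does not state how, or whether, the correlation was quantified.

```python
import math

# Per-dataset values for DS1..DS24, copied from the PDB/CATH mapping table.
MAPPED_TO_PDB = [39081, 2358, 10242, 2245, 7261, 1743, 9519, 1706, 6652,
                 1731, 6420, 150, 481, 953, 3728, 82, 272, 4494, 3418,
                 2985, 10990, 2169, 10290, 12749]
SUPERFAMILIES = [700, 319, 239, 306, 227, 277, 234, 269, 223, 253, 220, 32,
                 18, 186, 188, 21, 12, 274, 210, 235, 402, 291, 307, 347]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


log_mapped = [math.log10(v) for v in MAPPED_TO_PDB]
r = pearson(log_mapped, SUPERFAMILIES)
print(f"Pearson r = {r:.2f}")
```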
Kolmogorov-Smirnov 2-sample test statistics (KS) for each dataset on the Class, Architecture, Topology and Homology levels of CATH superfamilies
| dataset | KS Class | KS Architecture | KS Topology | KS Homology |
|---|---|---|---|---|
| DS1 | 0.25 (0.99688)^a | 0.17 (0.76005) | 0.30 (< 10⁻⁴) | 0.36 (< 10⁻⁴) |
| DS2 | 0.25 (0.99688) | 0.20 (0.53720) | 0.60 (< 10⁻⁴) | 0.65 (< 10⁻⁴) |
| DS3 | 0.25 (0.99688) | 0.33 (0.05499) | 0.68 (< 10⁻⁴) | 0.74 (< 10⁻⁴) |
| DS4 | 0.25 (0.99688) | 0.23 (0.34203) | 0.61 (< 10⁻⁴) | 0.66 (< 10⁻⁴) |
| DS5 | 0.25 (0.99688) | 0.30 (0.10884) | 0.69 (< 10⁻⁴) | 0.75 (< 10⁻⁴) |
| DS6 | 0.25 (0.99688) | 0.17 (0.76005) | 0.65 (< 10⁻⁴) | 0.70 (< 10⁻⁴) |
| DS7 | 0.25 (0.99688) | 0.33 (0.05499) | 0.68 (< 10⁻⁴) | 0.74 (< 10⁻⁴) |
| DS8 | 0.25 (0.99688) | 0.20 (0.53720) | 0.65 (< 10⁻⁴) | 0.70 (< 10⁻⁴) |
| DS9 | 0.25 (0.99688) | 0.30 (0.10884) | 0.69 (< 10⁻⁴) | 0.75 (< 10⁻⁴) |
| DS10 | 0.25 (0.99688) | 0.20 (0.53720) | 0.67 (< 10⁻⁴) | 0.72 (< 10⁻⁴) |
| DS11 | 0.25 (0.99688) | 0.30 (0.10884) | 0.70 (< 10⁻⁴) | 0.76 (< 10⁻⁴) |
| DS12 | 0.25 (0.99688) | 0.50 (0.00062) | 0.94 (< 10⁻⁴) | 0.96 (< 10⁻⁴) |
| DS13 | 0.50 (0.53344) | 0.73 (< 10⁻⁴) | 0.97 (< 10⁻⁴) | 0.98 (< 10⁻⁴) |
| DS14 | 0.25 (0.99688) | 0.23 (0.34203) | 0.75 (< 10⁻⁴) | 0.79 (< 10⁻⁴) |
| DS15 | 0.25 (0.99688) | 0.33 (0.05499) | 0.73 (< 10⁻⁴) | 0.79 (< 10⁻⁴) |
| DS16 | 0.25 (0.99688) | 0.67 (< 10⁻⁴) | 0.96 (< 10⁻⁴) | 0.98 (< 10⁻⁴) |
| DS17 | 0.50 (0.53344) | 0.80 (< 10⁻⁴) | 0.98 (< 10⁻⁴) | 0.99 (< 10⁻⁴) |
| DS18 | 0.25 (0.99688) | 0.17 (0.76005) | 0.64 (< 10⁻⁴) | 0.70 (< 10⁻⁴) |
| DS19 | 0.25 (0.99688) | 0.27 (0.20033) | 0.72 (< 10⁻⁴) | 0.77 (< 10⁻⁴) |
| DS20 | 0.25 (0.99688) | 0.17 (0.76005) | 0.68 (< 10⁻⁴) | 0.74 (< 10⁻⁴) |
| DS21 | 0.25 (0.99688) | 0.17 (0.76005) | 0.49 (< 10⁻⁴) | 0.56 (< 10⁻⁴) |
| DS22 | 0.25 (0.99688) | 0.20 (0.53720) | 0.61 (< 10⁻⁴) | 0.68 (< 10⁻⁴) |
| DS23 | 0.25 (0.99688) | 0.27 (0.2003) | 0.60 (< 10⁻⁴) | 0.66 (< 10⁻⁴) |
| DS24 | 0.25 (0.99688) | 0.23 (0.34203) | 0.56 (< 10⁻⁴) | 0.62 (< 10⁻⁴) |

^a p-value in brackets
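
The two-sample Kolmogorov-Smirnov statistic reported above is the largest vertical gap between two empirical cumulative distribution functions. A minimal sketch of that statistic follows. How the two samples were constructed from the CATH counts is not stated in this section, so the inputs below are toy values, the function name `ks_statistic` is illustrative, and no p-value is computed.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all x."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    # The gap between two step functions can only change at observed values.
    points = sorted(set(a_sorted) | set(b_sorted))
    largest_gap = 0.0
    for x in points:
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Toy samples (hypothetical values, not taken from the datasets).
same = ks_statistic([1, 2, 3, 4], [1, 2, 3, 4])      # identical samples
shifted = ks_statistic([1, 2, 3, 4], [1, 2, 3, 5])   # one value differs
disjoint = ks_statistic([1, 2, 3, 4], [5, 6, 7, 8])  # no overlap
print(same, shifted, disjoint)
```

A statistic near 0 means the two distributions are nearly indistinguishable, while values approaching 1, as for the small test sets at the Topology and Homology levels, mean the distributions barely overlap.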
Coverage of proteins and all features compared to reference (%)
| dataset | UniProt | CATH 1st level | CATH 2nd level | CATH 3rd level | CATH 4th level | Pfam | EC 1st level | EC 2nd level | EC 3rd level | EC 4th level | GO |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DS1 | 86.98 | 100.00 | 100.00 | 82.48 | 77.18 | 90.77 | 100.00 | 98.18 | 99.41 | 99.15 | 98.33 |
| DS2 | 35.79 | 100.00 | 83.33 | 40.35 | 35.17 | 36.01 | 100.00 | 90.91 | 85.29 | 61.61 | 72.97 |
| DS3 | 5.85 | 100.00 | 70.00 | 32.48 | 26.35 | 13.85 | 100.00 | 76.36 | 61.18 | 24.92 | 46.03 |
| DS4 | 32.38 | 100.00 | 83.33 | 39.17 | 33.74 | 34.16 | 100.00 | 89.09 | 84.71 | 60.14 | 70.93 |
| DS5 | 5.41 | 100.00 | 70.00 | 30.91 | 25.03 | 12.94 | 100.00 | 76.36 | 60.59 | 23.99 | 44.55 |
| DS6 | 24.23 | 100.00 | 83.33 | 35.24 | 30.54 | 33.10 | 100.00 | 89.09 | 83.53 | 49.23 | 63.13 |
| DS7 | 4.72 | 100.00 | 70.00 | 32.09 | 25.80 | 13.52 | 100.00 | 76.36 | 60.59 | 22.99 | 42.61 |
| DS8 | 22.36 | 100.00 | 83.33 | 34.84 | 29.66 | 31.64 | 100.00 | 87.27 | 82.94 | 48.45 | 61.76 |
| DS9 | 4.38 | 100.00 | 70.00 | 30.51 | 24.59 | 12.68 | 100.00 | 76.36 | 60.00 | 22.21 | 41.55 |
| DS10 | 24.74 | 100.00 | 80.00 | 32.87 | 27.89 | 28.46 | 100.00 | 87.27 | 81.76 | 51.24 | 62.60 |
| DS11 | 4.85 | 100.00 | 70.00 | 30.31 | 24.26 | 11.65 | 100.00 | 72.73 | 58.24 | 23.22 | 42.15 |
| DS12 | 2.70 | 75.00 | 50.00 | 5.51 | 3.53 | 2.56 | 83.33 | 27.27 | 14.71 | 2.79 | 18.11 |
| DS13 | 0.45 | 75.00 | 26.67 | 2.95 | 1.98 | 1.40 | 66.67 | 18.18 | 8.24 | 1.01 | 9.00 |
| DS14 | 18.81 | 100.00 | 76.67 | 25.20 | 20.51 | 20.93 | 100.00 | 78.18 | 70.00 | 37.38 | 51.87 |
| DS15 | 3.89 | 100.00 | 66.67 | 27.17 | 20.73 | 9.61 | 100.00 | 70.91 | 55.88 | 20.43 | 38.21 |
| DS16 | 2.10 | 75.00 | 33.33 | 3.54 | 2.32 | 2.02 | 66.67 | 25.45 | 12.94 | 2.48 | 14.72 |
| DS17 | 0.36 | 75.00 | 20.00 | 1.77 | 1.32 | 1.12 | 66.67 | 14.55 | 7.06 | 0.77 | 8.27 |
| DS18 | 16.23 | 100.00 | 83.33 | 36.02 | 30.21 | 22.29 | 100.00 | 85.45 | 80.00 | 38.24 | 58.66 |
| DS19 | 20.44 | 100.00 | 76.67 | 28.15 | 23.15 | 21.99 | 100.00 | 85.45 | 74.12 | 35.22 | 58.93 |
| DS20 | 17.37 | 100.00 | 83.33 | 31.89 | 25.91 | 19.24 | 100.00 | 89.09 | 75.88 | 38.78 | 59.35 |
| DS21 | 44.74 | 100.00 | 90.00 | 50.98 | 44.32 | 42.19 | 100.00 | 94.55 | 90.59 | 71.13 | 82.17 |
| DS22 | 43.52 | 100.00 | 90.00 | 39.17 | 32.08 | 36.55 | 100.00 | 90.91 | 86.47 | 61.53 | 75.51 |
| DS23 | 9.17 | 100.00 | 80.00 | 39.96 | 33.85 | 18.77 | 100.00 | 87.27 | 75.29 | 35.76 | 55.21 |
| DS24 | 63.04 | 100.00 | 90.00 | 44.49 | 38.26 | 58.14 | 100.00 | 96.36 | 94.71 | 85.53 | 91.74 |
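
Coverage here can be read as the share of reference categories that occur at least once in a dataset. The sketch below shows that calculation under this reading. The function name `coverage_percent` and the class labels are hypothetical, and the reference of four CATH classes is an assumption chosen to match the 75.00 reported for DS12 at the first CATH level.

```python
def coverage_percent(observed, reference):
    """Percentage of reference categories that occur in the dataset."""
    reference = set(reference)
    if not reference:
        raise ValueError("reference must not be empty")
    covered = set(observed) & reference  # ignore categories outside the reference
    return 100.0 * len(covered) / len(reference)


# Hypothetical labels: a reference of four classes, three of which are observed.
reference_classes = {"class1", "class2", "class3", "class4"}
observed_classes = {"class1", "class2", "class3"}

coverage = coverage_percent(observed_classes, reference_classes)
print(f"{coverage:.2f}")
```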
Mapping of the datasets to Pfam domains
| dataset | number of unique Pfam domains | number of variants with a Pfam domain | % variants with a Pfam domain of total number of variants in dataset | no. of variants mapped to a UniProt sequence | % variants with a Pfam domain of number of variants mapped to UniProt | KS statistic^a |
|---|---|---|---|---|---|---|
| DS1 | 5213 | 148,681 | 33.34 | 378,706 | 39.26 | 0.25 (< 10⁻⁴) |
| DS2 | 2065 | 7307 | 30.87 | 18,660 | 39.16 | 0.64 (< 10⁻⁴) |
| DS3 | 794 | 14,228 | 73.59 | 19,318 | 73.65 | 0.86 (< 10⁻⁴) |
| DS4 | 1954 | 6589 | 33.86 | 15,880 | 41.49 | 0.66 (< 10⁻⁴) |
| DS5 | 742 | 10,997 | 75.27 | 14,597 | 75.34 | 0.87 (< 10⁻⁴) |
| DS6 | 1898 | 5293 | 30.03 | 13,811 | 38.32 | 0.67 (< 10⁻⁴) |
| DS7 | 775 | 12,842 | 73.28 | 17,514 | 73.32 | 0.86 (< 10⁻⁴) |
| DS8 | 1810 | 4833 | 33.00 | 11,847 | 40.80 | 0.68 (< 10⁻⁴) |
| DS9 | 727 | 9796 | 74.80 | 13,096 | 74.80 | 0.87 (< 10⁻⁴) |
| DS10 | 1632 | 4396 | 33.65 | 10,882 | 40.40 | 0.72 (< 10⁻⁴) |
| DS11 | 668 | 9641 | 76.61 | 12,584 | 76.61 | 0.88 (< 10⁻⁴) |
| DS12 | 147 | 579 | 36.07 | 1288 | 44.95 | 0.97 (< 10⁻⁴) |
| DS13 | 80 | 897 | 68.95 | 1301 | 68.95 | 0.99 (< 10⁻⁴) |
| DS14 | 1197 | 2656 | 30.66 | 7185 | 36.97 | 0.79 (< 10⁻⁴) |
| DS15 | 551 | 5619 | 78.85 | 7151 | 80.31 | 0.90 (< 10⁻⁴) |
| DS16 | 116 | 354 | 33.62 | 848 | 42.22 | 0.98 (< 10⁻⁴) |
| DS17 | 64 | 526 | 70.04 | 751 | 70.04 | 0.99 (< 10⁻⁴) |
| DS18 | 1265 | 7190 | 44.66 | 12,056 | 59.64 | 0.78 (< 10⁻⁴) |
| DS19 | 1172 | 4859 | 47.33 | 10,154 | 47.85 | 0.80 (< 10⁻⁴) |
| DS20 | 1046 | 4818 | 54.44 | 8662 | 55.62 | 0.82 (< 10⁻⁴) |
| DS21 | 2301 | 20,415 | 50.55 | 39,735 | 51.38 | 0.60 (< 10⁻⁴) |
| DS22 | 2090 | 7727 | 36.53 | 21,151 | 36.53 | 0.64 (< 10⁻⁴) |
| DS23 | 1073 | 16,309 | 73.48 | 22,196 | 73.48 | 0.81 (< 10⁻⁴) |
| DS24 | 3325 | 41,997 | 55.94 | 75,042 | 55.96 | 0.61 (< 10⁻⁴) |

^a p-value in brackets
Mapping of datasets to EC classification at 4 levels
| dataset | number of variants with EC numbers | % of total number of variants | KS 1st level | KS 2nd level | KS 3rd level | KS 4th level |
|---|---|---|---|---|---|---|
| DS1 | 92,063 | 20.64 | 0.17 (0.99996) | 0.16 (0.41923) | 0.15 (0.04553) | 0.41 (< 10⁻⁴) |
| DS2 | 4665 | 19.71 | 0.17 (0.99996) | 0.15 (0.57158) | 0.22 (0.00050) | 0.43 (< 10⁻⁴) |
| DS3 | 7190 | 37.19 | 0.33 (0.80956) | 0.27 (0.02676) | 0.42 (< 10⁻⁴) | 0.81 (< 10⁻⁴) |
| DS4 | 4754 | 24.43 | 0.17 (0.99996) | 0.15 (0.57158) | 0.23 (0.00020) | 0.44 (< 10⁻⁴) |
| DS5 | 6951 | 47.58 | 0.33 (0.80956) | 0.27 (0.02676) | 0.43 (< 10⁻⁴) | 0.82 (< 10⁻⁴) |
| DS6 | 3911 | 22.19 | 0.17 (0.99996) | 0.11 (0.88044) | 0.16 (0.01740) | 0.54 (< 10⁻⁴) |
| DS7 | 6744 | 38.48 | 0.33 (0.80956) | 0.27 (0.02676) | 0.43 (< 10⁻⁴) | 0.83 (< 10⁻⁴) |
| DS8 | 3552 | 24.25 | 0.17 (0.99996) | 0.13 (0.73544) | 0.17 (0.01232) | 0.55 (< 10⁻⁴) |
| DS9 | 6485 | 49.52 | 0.33 (0.80956) | 0.27 (0.02676) | 0.44 (< 10⁻⁴) | 0.83 (< 10⁻⁴) |
| DS10 | 3035 | 23.23 | 0.17 (0.99996) | 0.13 (0.73544) | 0.18 (0.00596) | 0.53 (< 10⁻⁴) |
| DS11 | 6445 | 51.22 | 0.33 (0.80956) | 0.31 (0.00785) | 0.45 (< 10⁻⁴) | 0.83 (< 10⁻⁴) |
| DS12 | 455 | 28.35 | 0.17 (0.99996) | 0.73 (< 10⁻⁴) | 0.85 (< 10⁻⁴) | 0.97 (< 10⁻⁴) |
| DS13 | 320 | 24.60 | 0.50 (0.31803) | 0.82 (< 10⁻⁴) | 0.92 (< 10⁻⁴) | 0.99 (< 10⁻⁴) |
| DS14 | 1758 | 20.29 | 0.17 (0.99996) | 0.22 (0.12644) | 0.30 (< 10⁻⁴) | 0.65 (< 10⁻⁴) |
| DS15 | 3880 | 54.26 | 0.33 (0.80956) | 0.29 (0.01477) | 0.44 (< 10⁻⁴) | 0.81 (< 10⁻⁴) |
| DS16 | 264 | 25.07 | 0.50 (0.31803) | 0.75 (< 10⁻⁴) | 0.87 (< 10⁻⁴) | 0.98 (< 10⁻⁴) |
| DS17 | 207 | 27.56 | 0.50 (0.31803) | 0.85 (< 10⁻⁴) | 0.93 (< 10⁻⁴) | 0.99 (< 10⁻⁴) |
| DS18 | 4585 | 28.48 | 0.33 (0.80956) | 0.18 (0.29309) | 0.29 (< 10⁻⁴) | 0.65 (< 10⁻⁴) |
| DS19 | 2283 | 22.24 | 0.33 (0.80956) | 0.20 (0.19638) | 0.26 (< 10⁻⁴) | 0.67 (< 10⁻⁴) |
| DS20 | 3142 | 35.50 | 0.50 (0.31803) | 0.13 (0.73544) | 0.24 (< 10⁻⁴) | 0.64 (< 10⁻⁴) |
| DS21 | 12,723 | 31.50 | 0.33 (0.80956) | 0.13 (0.73544) | 0.19 (0.00407) | 0.60 (< 10⁻⁴) |
| DS22 | 4841 | 22.89 | 0.17 (0.99996) | 0.11 (0.88044) | 0.22 (0.0032) | 0.43 (< 10⁻⁴) |
| DS23 | 8710 | 39.24 | 0.33 (0.80956) | 0.16 (0.41923) | 0.35 (< 10⁻⁴) | 0.72 (< 10⁻⁴) |
| DS24 | 24,218 | 32.27 | 0.17 (0.99996) | 0.09 (0.97024) | 0.16 (0.01740) | 0.57 (< 10⁻⁴) |

p-value in brackets
Number of unique Gene Ontology (GO) terms allocated to each dataset, Kolmogorov-Smirnov 2-sample test statistics (KS) on term level and on GO aspect level (molecular function, cellular component, biological process)
| dataset | number of unique GO terms | KS statistic term level | KS statistic aspect level |
|---|---|---|---|
| DS1 | 17,343 | 0.27 (< 10⁻⁴)^a | 0.33 (0.97621) |
| DS2 | 12,869 | 0.40 (< 10⁻⁴) | 0.33 (0.97621) |
| DS3 | 8118 | 0.62 (< 10⁻⁴) | 0.67 (0.31972) |
| DS4 | 12,510 | 0.29 (< 10⁻⁴) | 0.33 (0.97621) |
| DS5 | 7858 | 0.60 (< 10⁻⁴) | 0.67 (0.31972) |
| DS6 | 11,134 | 0.37 (< 10⁻⁴) | 0.33 (0.97621) |
| DS7 | 7515 | 0.64 (< 10⁻⁴) | 0.67 (0.31972) |
| DS8 | 10,893 | 0.38 (< 10⁻⁴) | 0.33 (0.97621) |
| DS9 | 7329 | 0.62 (< 10⁻⁴) | 0.67 (0.31972) |
| DS10 | 11,041 | 0.37 (< 10⁻⁴) | 0.33 (0.97621) |
| DS11 | 7434 | 0.63 (< 10⁻⁴) | 0.67 (0.31972) |
| DS12 | 3194 | 0.82 (< 10⁻⁴) | 0.33 (0.97621) |
| DS13 | 1587 | 0.91 (< 10⁻⁴) | 0.67 (0.31972) |
| DS14 | 9149 | 0.48 (< 10⁻⁴) | 0.33 (0.97621) |
| DS15 | 6739 | 0.62 (< 10⁻⁴) | 0.67 (0.31972) |
| DS16 | 2597 | 0.85 (< 10⁻⁴) | 0.33 (0.97621) |
| DS17 | 1459 | 0.92 (< 10⁻⁴) | 0.67 (0.31972) |
| DS18 | 10,345 | 0.54 (< 10⁻⁴) | 0.33 (0.97621) |
| DS19 | 10,393 | 0.58 (< 10⁻⁴) | 0.67 (0.31972) |
| DS20 | 10,468 | 0.41 (< 10⁻⁴) | 0.33 (0.97621) |
| DS21 | 14,492 | 0.41 (< 10⁻⁴) | 0.33 (0.97621) |
| DS22 | 13,318 | 0.38 (< 10⁻⁴) | 0.33 (0.97621) |
| DS23 | 9739 | 0.54 (< 10⁻⁴) | 0.67 (0.31972) |
| DS24 | 16,180 | 0.36 (< 10⁻⁴) | 0.33 (0.97621) |

^a p-value in brackets
Summary of all the test results
| dataset | no. of chromosomes^a | CATH Class level | CATH Architecture level | CATH Topology level | CATH Homology level | EC 1st level | EC 2nd level | EC 3rd level | EC 4th level | Pfam | GO terms level | GO aspect level | score without chromosomes^b |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DS1 | 7 | 1^c | 1 | 0^d | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS2 | 8 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS3 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS4 | 9 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS5 | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS6 | 9 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS7 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS8 | 7 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS9 | 6 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS10 | 9 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS11 | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS12 | 9 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| DS13 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| DS14 | 11 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS15 | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS16 | 13 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| DS17 | 6 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 |
| DS18 | 6 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS19 | 5 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS20 | 11 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS21 | 4 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS22 | 11 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |
| DS23 | 3 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 5 |
| DS24 | 2 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 6 |

^a Number of chromosomes with unbiased distribution of variants
^b Sum of scores in all categories tested
^c Category has score 1 if distribution was unbiased
^d Category has score 0 if distribution was biased
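
The final column is the sum of the binary category scores, excluding the chromosome count. A minimal sketch of that tally for two rows of the table follows; the names `CATEGORIES` and `summary_score` are illustrative and not part of the source.

```python
# Category order follows the table columns (the chromosome count is excluded).
CATEGORIES = [
    "CATH Class", "CATH Architecture", "CATH Topology", "CATH Homology",
    "EC 1st", "EC 2nd", "EC 3rd", "EC 4th", "Pfam", "GO terms", "GO aspect",
]

# 1 = distribution unbiased, 0 = distribution biased (values from the table).
FLAGS = {
    "DS1": [1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1],
    "DS12": [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1],
}


def summary_score(flags):
    """Sum of the binary scores over all tested categories."""
    if len(flags) != len(CATEGORIES):
        raise ValueError("expected one flag per category")
    return sum(flags)


scores = {name: summary_score(flags) for name, flags in FLAGS.items()}
print(scores)
```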