Houcemeddine Turki, Dariusz Jemielniak, Mohamed A Hadj Taieb, Jose E Labra Gayo, Mohamed Ben Aouicha, Mus'ab Banat, Thomas Shafee, Eric Prud'hommeaux, Tiago Lubiana, Diptanshu Das, Daniel Mietchen.
Abstract
Urgent global research demands real-time dissemination of precise data. Wikidata, a collaborative and openly licensed knowledge graph available in RDF format, provides an ideal forum for exchanging structured data that can be verified and consolidated using validation schemas and bot edits. In this research article, we catalog a set of automatable tasks necessary to assess and validate the portion of Wikidata relating to COVID-19 epidemiology. These tasks assess statistical data and are implemented in SPARQL, a query language for semantic databases. We demonstrate the efficiency of our methods for evaluating structured non-relational information on COVID-19 in Wikidata, and their applicability to collaborative ontologies and knowledge graphs more broadly. We show the advantages and limitations of our proposed approach by comparing it to the features of other methods for the validation of linked web data as revealed by previous research.
Keywords: COVID-19 epidemiology; Collaborative curation; Data quality; Knowledge graph refinement; Public Health Emergency of International Concern; Public health surveillance; SPARQL; Shape Expressions; Validation constraints; Wikidata
Year: 2022 PMID: 36262159 PMCID: PMC9575845 DOI: 10.7717/peerj-cs.1085
Source DB: PubMed Journal: PeerJ Comput Sci ISSN: 2376-5992
Figure 2. Example of a Wikidata property and its annotations.
Wikidata page of a clinical property (Source: https://w.wiki/aeF, Derived from: https://w.wiki/aeG, License: CC0). It includes the labels, descriptions, and aliases of the property in multiple languages (red), the object data type (blue), statements where the property is the subject (green) as well as property constraints (brown).
Figure 1. RefB workflow.
Process of RefB, a bot that adds scholarly references to biomedical Wikidata statements based on PubMed Central (Source: https://w.wiki/an$, License: CC BY 4.0). The source code of RefB is available at https://github.com/Data-Engineering-and-Semantics/refb/.
Table 1. Constraint types for the usage of Wikidata properties.
Each property constraint is given with its Wikidata identifier, an English label and an English description.
| Wikidata ID | Constraint type | Description |
|---|---|---|
| Q19474404 | Single value constraint | Constraint used to specify that this property generally contains a single value per item |
| Q21502404 | Format constraint | Constraint used to specify that the value for this property has to correspond to a given pattern |
| Q21502408 | Mandatory constraint | Status of a Wikidata property constraint: indicates that the specified constraint applies to the subject property without exception and must not be violated |
| Q21502410 | Distinct values constraint | Constraint used to specify that the value for this property is likely to be different from all other items |
| Q21510852 | Commons link constraint | Constraint used to specify that the value must link to an existing Wikimedia Commons page |
| Q21510854 | Difference within range constraint | Constraint used to specify that the value of a given statement should only differ in the given way. Use with qualifiers minimum quantity/maximum quantity |
| Q21510856 | Mandatory qualifier constraint | Constraint used to specify that the listed qualifier has to be used |
| Q21510862 | Symmetric constraint | Constraint used to specify that the referenced entity should also link back to this entity |
| Q21510863 | Used as qualifier constraint | Constraint used to specify that a property must only be used as a qualifier |
| Q21510864 | Value requires statement constraint | Constraint used to specify that the referenced item should have a statement with a given property |
| Q21510495 | Relation of type constraint | Relation establishing dependency between types/meta-levels of its members |
| Q21510851 | Allowed qualifiers constraint | Constraint used to specify that only the listed qualifiers should be used. Novalue disallows any qualifier |
| Q21510865 | Value type constraint | Constraint used to specify that the referenced item should be a subclass or instance of a given type |
| Q21514353 | Allowed units constraint | Constraint used to specify that only listed units may be used |
| Q21510857 | Multi-value constraint | Constraint used to specify that a property generally contains more than one value per item |
| Q21510859 | One-of constraint | Constraint used to specify that the value for this property has to be one of a given set of items |
| Q21510860 | Range constraint | Constraint used to specify that the value must be between two given values |
| Q21528958 | Used for values only constraint | Constraint used to specify that a property can only be used as a property for values, not as a qualifier or reference |
| Q21528959 | Used as reference constraint | Constraint used to specify that a property must only be used in references or instances of citation (Q1713) |
| Q25796498 | Contemporary constraint | Constraint used to specify that the subject and the object have to coincide or coexist at some point in history |
| Q21502838 | Conflicts-with constraint | Constraint used to specify that an item must not have a given statement |
| Q21503247 | Item requires statement constraint | Constraint used to specify that an item with this statement should also have another given property |
| Q21503250 | Type constraint | Constraint used to specify that the item described by such properties should be a subclass or instance of a given type |
| Q54554025 | Citation needed constraint | Constraint specifies that a property must have at least one reference |
| Q62026391 | Suggestion constraint | Status of a Wikidata property constraint: indicates that the specified constraint merely suggests additional improvements, and violations are not as severe as for regular or mandatory constraints |
| Q64006792 | Lexeme value requires lexical category constraint | Constraint used to specify that the referenced lexeme should have a given lexical category |
| Q42750658 | Value constraint | Class of constraints on the value of a statement with a given property. For constraint: use specific items ( |
| Q51723761 | No bounds constraint | Constraint specifies that a property must only have values that do not have bounds |
| Q52004125 | Allowed entity types constraint | Constraint used to specify that only listed entity types are valid for this property |
| Q52060874 | Single best value constraint | Constraint used to specify that this property generally contains a single “best” value per item, though other values may be included as long as the “best” value is marked with a preferred rank |
| Q52558054 | None of constraint | Constraint specifying values that should not be used for the given property |
| Q52712340 | One-of qualifier value property constraint | Constraint used to specify which values can be used for a given qualifier when used on a specific property |
| Q52848401 | Integer constraint | Constraint used when values have to be integer only |
| Q53869507 | Property scope constraint | Constraint to define the scope of the property (main value, qualifier, references, or combination); only supported by KrBot currently |
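To make a few of these constraint types concrete, the following minimal sketch checks the single value, format, and range constraints against a handful of toy statements. The item and property identifiers (`Q_A`, `P_count`, `P_code`), the pattern, and the bounds are hypothetical placeholders chosen for illustration; this is not how Wikidata itself evaluates constraints.

```python
import re
from collections import Counter

# Hypothetical toy statements as (item, property, value) triples.
# These are illustrative placeholders, not real Wikidata data.
statements = [
    ("Q_A", "P_count", "120"),
    ("Q_A", "P_count", "125"),   # second value for the same item
    ("Q_B", "P_count", "-3"),    # below the allowed range
    ("Q_C", "P_code", "AB-12"),
    ("Q_D", "P_code", "ab12"),   # does not match the expected pattern
]

def single_value_violations(stmts, prop):
    """Items holding more than one value for `prop` (single value constraint)."""
    counts = Counter(item for item, p, _ in stmts if p == prop)
    return sorted(item for item, n in counts.items() if n > 1)

def format_violations(stmts, prop, pattern):
    """Statements whose value does not fully match `pattern` (format constraint)."""
    regex = re.compile(pattern)
    return [(i, v) for i, p, v in stmts if p == prop and not regex.fullmatch(v)]

def range_violations(stmts, prop, minimum, maximum):
    """Statements whose numeric value lies outside [minimum, maximum] (range constraint)."""
    return [(i, v) for i, p, v in stmts
            if p == prop and not (minimum <= float(v) <= maximum)]

print(single_value_violations(statements, "P_count"))
print(format_violations(statements, "P_code", r"[A-Z]{2}-\d{2}"))
print(range_violations(statements, "P_count", 0, float("inf")))
```

Each function returns the offending items or statements rather than a boolean, which mirrors how constraint reports list individual violations for curators to fix.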
Figure 3. Example of a property constraint violation indicated via the Wikidata user interface.
On the page of the Wikidata item Q3603152 (flash blindness), a constraint violation is indicated by the encircled exclamation mark. Clicking on it opens a popup with further explanation (File available on Wikimedia Commons: https://w.wiki/ZuJ, License: CC0).
Figure 4. Entity Schema example.
Entity Schema for COVID-19 dashboards, search engines and datasets (Source: https://www.wikidata.org/wiki/EntitySchema:E205. File available on Wikimedia Commons: https://w.wiki/4rg5, License: CC0).
Figure 5. Web interface of the Wikidata Query Service.
The interface comprises a query field (black), a query builder (red), a short link button (pink), a Run button (blue), a visualization mode button (purple), a download button (brown), an embedding code generation button (grey), a results field (green), and a sample query button (yellow) (Source: https://w.wiki/aeH, Derived from: https://query.wikidata.org, License: CC0).
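Queries entered in this interface can also be submitted programmatically. The sketch below only builds a request URL and sends nothing over the network. It assumes the public endpoint `https://query.wikidata.org/sparql` with its `query` and `format` parameters, and it assumes that P1603 is "number of cases" and P585 is "point in time"; the query itself is an illustrative guess at a qualifier check, not one of the article's published queries.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"

# Illustrative query (an assumption, not taken from the article):
# case-count statements (assumed P1603) lacking a date qualifier (assumed P585).
QUERY = """
SELECT ?item ?cases WHERE {
  ?item p:P1603 ?statement .
  ?statement ps:P1603 ?cases .
  FILTER NOT EXISTS { ?statement pq:P585 ?date . }
}
LIMIT 10
"""

def build_request_url(query, endpoint=ENDPOINT):
    """Return a GET URL that asks the endpoint for JSON results."""
    params = {"query": query.strip(), "format": "json"}
    return endpoint + "?" + urlencode(params)

url = build_request_url(QUERY)
print(url)
```

Building the URL separately from sending it keeps the query text easy to inspect and to paste back into the web interface for debugging.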
Figure 6. Sample statistical data available through Wikidata.
The item about the COVID-19 pandemic in Tunisia is shown (Adapted from: https://www.wikidata.org/wiki/Q87343682, Source: https://w.wiki/uUr, License: CC0).
Table 2. Tasks for the heuristics-based evaluation of epidemiological data using the Wikidata SPARQL endpoint.
Each validation task is given with its identifier, a brief description of the heuristic validation criteria and an example where the data does not fit them. See the section “Constraint-driven heuristics-based validation of epidemiological data” for definitions of the epidemiological variables.
| Task | Description | Sample filtered deficient statement |
|---|---|---|
| Validating qualifiers of COVID-19 epidemiological statements | ||
| V1 | Verify | |
| V2 | Verify | |
| Ensuring the cumulative pattern of | ||
| V3 | Identify | ( |
| V4 | Find missing values of | ( |
| Validating values of epidemiological data for a given date | ||
| V5 | Identifying | |
| V6 | Identify | ( |
| V7 | Identify | ( |
| V8 | Identify | ( |
| V9 | Identify | ( |
| V10 | Comparing the epidemiological variables of a general outbreak with the ones of its components | ( |
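The kind of heuristic these tasks rely on can be sketched on a toy daily series. The functions below are written in the spirit of the cumulative-pattern check, the missing-value check, and the comparison of a general outbreak with its components; the exact criteria, the sample figures, and all function names are assumptions made for illustration, and the article implements its tasks in SPARQL rather than Python.

```python
from datetime import date, timedelta

# Hypothetical cumulative counts per day (illustrative values only).
series = {
    date(2020, 8, 1): 100,
    date(2020, 8, 2): 110,
    date(2020, 8, 3): 105,   # lower than the previous day
    date(2020, 8, 5): 130,   # 4 August has no value
}

def non_cumulative_dates(series):
    """Dates whose value is lower than the value of the preceding dated entry."""
    ordered = sorted(series.items())
    return [day for (_, prev), (day, cur) in zip(ordered, ordered[1:])
            if cur < prev]

def missing_dates(series):
    """Days between the first and last entry that have no value."""
    days = sorted(series)
    if not days:
        return []
    span = (days[-1] - days[0]).days
    candidates = (days[0] + timedelta(days=n) for n in range(span + 1))
    return [day for day in candidates if day not in series]

def components_exceeding_total(total, components):
    """Component outbreaks whose value is larger than that of the general outbreak."""
    return sorted(name for name, value in components.items() if value > total)

print(non_cumulative_dates(series))
print(missing_dates(series))
print(components_exceeding_total(130, {"region_x": 90, "region_y": 140}))
```

Each function returns the deficient entries themselves, so the output can be used directly as a worklist for manual or bot-assisted correction.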
Table 3. Matrix overview of data quality issues identified per validation task and epidemiological Wikidata property.
Rows represent validation tasks as defined in Table 2, columns the corresponding epidemiological Wikidata properties, and the value in a given cell represents the number of deficient statements identified by the row’s specific task for the column’s epidemiological Wikidata property on a given date (August 8, 2020).
| Task | | | | | | Overall |
|---|---|---|---|---|---|---|
| V1 | 18 | 9 | 10 | 2 | 1 | 40 |
| V2 | 2 | 91 | 6 | 0 | 0 | 99 |
| V3 | 660 | 92 | 6 | 5 | | 763 |
| V4 | 2,081 | 2,247 | 149 | 1 | | 4,478 |
| V5 | 0 | 0 | 0 | 0 | 0 | 0 |
| V6 | 8 | | | | 8 | 8 |
| V7 | 1 | | | 1 | | 1 |
| V8 | 9 | 9 | | | | 9 |
| V9 | 17 | | 17 | | | 17 |
| V10 | 60 | 19 | 1 | 0 | 1 | 81 |
| | 2,856 | 2,467 | 189 | 9 | 10 | 5,496 |
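The totals in this matrix can be cross-checked with a few lines of arithmetic. The sketch below copies the printed counts and confirms that, for the tasks reported per property, the row values add up to the "Overall" column, and that the "Overall" values add up to the grand total of 5,496. Tasks V6 to V9 are included only through their "Overall" values, since each of them repeats a single count rather than listing independent per-property counts.

```python
# Per-property counts and the printed "Overall" value for tasks
# whose counts are reported separately for each property.
per_property = {
    "V1": ([18, 9, 10, 2, 1], 40),
    "V2": ([2, 91, 6, 0, 0], 99),
    "V3": ([660, 92, 6, 5], 763),
    "V4": ([2081, 2247, 149, 1], 4478),
    "V5": ([0, 0, 0, 0, 0], 0),
    "V10": ([60, 19, 1, 0, 1], 81),
}

# "Overall" column for every task, as printed in the table.
overall = {
    "V1": 40, "V2": 99, "V3": 763, "V4": 4478, "V5": 0,
    "V6": 8, "V7": 1, "V8": 9, "V9": 17, "V10": 81,
}

GRAND_TOTAL = 5496  # bottom-right cell of the table

def row_sum_mismatches(rows):
    """Tasks whose per-property counts do not add up to the printed total."""
    return [task for task, (counts, total) in rows.items()
            if sum(counts) != total]

print(row_sum_mismatches(per_property))
print(sum(overall.values()) == GRAND_TOTAL)
```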
Figure 7. Distribution statistics.
Confidence intervals for different p-values (p) when using a normal distribution (Source: https://w.wiki/aKT, License: Public Domain) (after Ward & Murray-Ward, 1999).
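The relation between a p-value and the width of the corresponding interval under a normal distribution can be computed directly. The following sketch uses `statistics.NormalDist` from the Python standard library; the function names and the chosen p-values are illustrative assumptions, and the two-sided convention (probability p split equally between both tails) is assumed rather than taken from the figure.

```python
from statistics import NormalDist

def two_sided_z(p):
    """Number of standard deviations that leaves probability p in both tails combined."""
    return NormalDist().inv_cdf(1 - p / 2)

def confidence_interval(mean, sd, p):
    """Symmetric interval around `mean` that covers probability 1 - p."""
    z = two_sided_z(p)
    return (mean - z * sd, mean + z * sd)

for p in (0.05, 0.01, 0.001):
    low, high = confidence_interval(0.0, 1.0, p)
    print(f"p={p}: z={two_sided_z(p):.2f}, interval=({low:.2f}, {high:.2f})")
```

Values falling outside such an interval are candidates for review rather than confirmed errors, since real epidemiological data need not be normally distributed.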
Figure 8. Key elements of data quality workflows on Wikidata.
Interactions between consistency rules, property statements, and RDF validation languages (Source: https://w.wiki/ao5, License: CC BY 4.0).