| Literature DB >> 35534084 |
Ben Gordon1, Clara Fennessy1, Susheel Varma1, Jake Barrett1, Enez McCondochie2, Trevor Heritage2, Oenone Duroe2, Richard Jeffery2, Vishnu Rajamani2, Kieran Earlam3, Victor Banda4, Neil Sebire5.
Abstract
OBJECTIVES: To objectively evaluate freely available data profiling software tools using healthcare data.Entities:
Keywords: health informatics; information management; information technology
Mesh:
Year: 2022 PMID: 35534084 PMCID: PMC9086620 DOI: 10.1136/bmjopen-2021-054186
Source DB: PubMed Journal: BMJ Open ISSN: 2044-6055 Impact factor: 3.006
Detailed scoring criteria per feature
| Feature tree | Score | |||||
| Data ingestion and integration | → | Data consolidation | → | Connectivity to N data sources | 5 | |
| → | Data extraction, transformation and loading (ETL) and ETL support | 5 | ||||
| → | Data modelling | 5 | ||||
| → | Data propagation | → | Data flow orchestration, enterprise application integration, exchange of messages and transactions | 5 | ||
| → | Enterprise data replication, transfer large amounts of data between databases | 5 | ||||
| → | Versioning and file management | 5 | ||||
| → | Data virtualisation | → | Data access | 5 | ||
| → | Data federation | → | Enterprise information integration | 5 | ||
| | |
|
| ||||
| Data preparation and cleaning | → | Parsing and standardisation | → | Tagging data with keywords, descriptions or categories | 5 | |
| → | (Data scrubbing)/cleansing/handling blank values/reformatting values/threshold checking | 5 | ||||
| → | Data enhancement/enrichment/curation | 5 | ||||
| → | Natural language processing | 5 | ||||
| → | Address validation/geocoding | 5 | ||||
| → | Master data management | 5 | ||||
| → | Data masking | 5 | ||||
| → | Identity resolution, linkage, merging and consolidation | → | Data deduping | 5 | ||
| → | Machine learning/training a statistical model | 5 | ||||
| → | Data aggregation | 5 | ||||
| → | Data binning | 5 | ||||
| → | Grouping similar data/clustering | 5 | ||||
| → | Outlier detection and removal | 5 | ||||
| → | Master reference data management | → | ‘Hub’ infrastructure to source and distribute master/reference data | 5 | ||
| → | Master data versioning based on data history and timelines | 5 | ||||
| → | Workflow integrations to steward and publish the master/reference data | 5 | ||||
| → | Graph data stores to define relationships for creating a flexible knowledge graph | 5 | ||||
| → | Accessible API (Application Programming Interface) for real-time access to shared reference data | 5 | ||||
| | |
|
| ||||
| Data profiling, exploration/ pattern detection | → | Relationship discovery | → | Cross-table redundancy analysis | 5 | |
| → | Performing data quality assessment, risk of performing joins on the data | 5 | ||||
| → | Identifying distributions, key candidates, foreign-key candidates, functional dependencies, embedded value dependencies and performing intertable analysis. | 5 | ||||
| → | Content discovery | → | Data pattern discovery | 5 | ||
| → | Domain analysis | 5 | ||||
| → | Discovering metadata and assessing its accuracy | 5 | ||||
| → | Structure discovery | → | Column value frequency analysis and statistics, collecting descriptive statistics like min, max, count and sum. | 5 | ||
| → | Table structure analysis, collecting data types, length and recurring patterns. | 5 | ||||
| → | Drill-through analysis | 5 | ||||
| | |
|
| ||||
| Data monitoring | → | Monitoring and alerting | → | Time series data identified and collection by metric name and key/value pairs | 5 | |
| → | Flexible query language to leverage this dimensionality | 5 | ||||
| → | Graphing and dashboarding support | 5 | ||||
| | |
|
| ||||
| Data use | → | Metadata management | → | Concept identification and naming | 5 | |
| → | Data categorisation | 5 | ||||
| → | Lineage | 5 | ||||
| → | Relationship with other metadata | 5 | ||||
| → | Comments and remarks | 5 | ||||
| → | Data statistics (profiles) | 5 | ||||
| → | Knowledge graph | 5 | ||||
| → | Privacy and security | → | Data anonymisation | 5 | ||
| → | Role based access control | 5 | ||||
| → | Secure environment setup and deployment | 5 | ||||
| → | Container-based deployment | 5 | ||||
| → | Data mining | → | Interactive data visualisation | 5 | ||
| → | Visual programming and analysis | 5 | ||||
| → | Visual illustrations and training documentation | 5 | ||||
| → | Sample data/generate fake data | 5 | ||||
| → | Add-ons and extension functionality | 5 | ||||
Figure 1Main results of documentation based functionality for data quality categories by tool.
Specific data profiling tool functionalities evaluated
| Functionality type | Function | Data profiling tools capable of natively executing function | |||||||
| K | DC | O | W | PP | AP | TOS | WR | ||
| No of rows | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| No of nulls | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Percentage of nulls | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
| No of distinct values (cardinality) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Percentage of distinct values (No of distinct values divided by the no of rows) | ✓ | ✓ | ✓ | ✓ | |||||
| Frequency histograms (equi-width, equi-depth, etc.) | ✓ | ✓ | |||||||
| Minimum and maximum values in a numeric column | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Constancy (Frequency of most frequent value divided by number of rows) | ✓ | ✓ | ✓ | ||||||
| Quartiles (three points that divide the numeric values into four equal groups) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
| Distribution of first digit in numeric values (to check Benford’s law) | ✓ | ✓ | ✓ | ||||||
| Basic types (eg, numeric, alphanumeric, date, time) | ✓ | ✓ | |||||||
| Database Management Systems-specific data type (eg, varchar, timestamp) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
| Measurement of Value length (minimum, maximum, average, median) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |||
| Maximum number of digits in numeric values | ✓ | ✓ | ✓ | ✓ | |||||
| Maximum number of decimals in numeric values | ✓ | ✓ | ✓ | ||||||
| Histogram of value patterns (Aa9…) | ✓ | ✓ | ✓ | ✓ | |||||
| Generic semantic data type (eg, code, date/time, quantity, identifier) | ✓ | ✓ | ✓ | ✓ | |||||
| Semantic domain (eg, credit card, first name, city) | ✓ | ✓ | ✓ | ✓ | |||||
| Unique column combinations (key discovery) | ✓ | ||||||||
| Relaxed unique column combinations | ✓ | ||||||||
| Inclusion dependencies (foreign key discovery) | ✓ | ||||||||
| Relaxed inclusion dependencies | ✓ | ||||||||
| Functional dependencies | ✓ | ||||||||
| Conditional functional dependencies | ✓ | ||||||||
| Correlation analysis | ✓ | ✓ | ✓ | ||||||
| Association rule mining | ✓ | ||||||||
| Cluster analysis | ✓ | ||||||||
| Outlier detection | ✓ | ✓ | ✓ | ||||||
| Exact duplicate tuple detection | ✓ | ✓ | ✓ | ||||||
| Relaxed duplicate tuple detection | ✓ | ✓ | ✓ | ||||||
| Total | 19 | 13 | 8 | 5 | 30 | 10 | 15 | 8 | |
AP, Aggregate Profiler; DC, DataCleaner; K, Knime; O, orange; PP, Pandas Profiling (Python); TOS, Talend Open Studio for Data Quality; W, WEKA; WR, WhiteRabbit.
Figure 2Results of profiling tasks using synthetic datasets. Knime and Pandas performed best for overall data profiling tasks for this healthcare dataset.