Ben Scott, Ed Baker, Matt Woodburn, Sarah Vincent, Helen Hardy, Vincent S Smith.
Abstract
The Natural History Museum, London (NHM), generates and holds some of the largest global data sets relating to the biological and geological diversity of the natural world. A majority of these data were, until 2015, not widely accessible, and, even when published, were typically hard to find, poorly documented and in formats that impede discovery and integration. To better serve the bespoke needs of user communities outside and within the NHM, a dedicated data portal was developed to surface these data sets and provide a sustainable platform to encourage their citation and reuse. This paper describes the technical development of the data portal, from its inception to beta launch in December 2015, its first 2 years of operation, and future plans for the project. It outlines the development principles adopted for this prototypical project, which subsequently informed new digital project management methodologies at the NHM. The process of developing the data portal acted as a driver to implement policies necessary to encourage a culture of data sharing at the NHM.
Year: 2019 PMID: 30985890 PMCID: PMC6459053 DOI: 10.1093/database/baz038
Source DB: PubMed Journal: Database (Oxford) ISSN: 1758-0463 Impact factor: 3.451
Comparison of open source data portal platforms
| Name | Version tested | Technology | Author | Website | Example sites | Technology stack | Documentation | Functionality/extensibility | Popularity and sustainability |
|---|---|---|---|---|---|---|---|---|---|
| CKAN | 1.8.1/2.0 Alpha | Python, Pylons and PostgreSQL | Open Knowledge Foundation | | | Extensive experience of Python and PostgreSQL within Informatics Group developers. Pylons framework has been superseded by Pyramid. | Good | Modular and extensible: plenty of examples of open-source plug-ins to build upon. Has a module to integrate with Drupal/Scratchpads. Flexible data storage, so we can change to cloud storage systems. Provides support for any type of data set. | Very popular; is becoming the de facto standard for data portals, little risk of obsolescence. Has EU backing and used to build |
| CKAN Version 2 is near Beta now, so we would need to build against the new, less well-documented version. | |||||||||
| Open Data Catalog | 1.0 | Django, Python and PostgreSQL | Azavea | | | Django is an extremely active and popular Python framework. GIS built in. | None | Based on Django, so easily customizable. Supports all types of data sets. | Open Data Catalog does not seem to be widely used and is not even included on the software company's product pages: |
| Open Government Platform (Version: Alpha) | Alpha | Drupal, PHP, MySQL | Indian and US government | | | Building on top of Drupal fits well with museum/Scratchpads. Still Drupal version 6 though, and MySQL only; limited geospatial capabilities. | Very limited | Very customizable with Drupal. Supports all types of data sets. | Popularity waning? US moving to CKAN anyway: |
| It’s the open source version of the US’s | |||||||||
| Customized IPT | 1.0 | GBIF IPT, Java | Canadensys | | | Built upon the GBIF IPT, so GBIF ingestion is much simpler. Java: limited experience within Informatics Group developers. | Good | Quick win: does have all the functionality we need for the collections data set on the data portal. However, it is very much customized for the needs of Canadensys; extending it to allow depositing data sets/cloud storage will be a lot of work. | Not future proof: judging from the git repo logs, it looks to have just one developer maintaining it, who was not the original creator. Not a common approach: seems to be the only portal built with a customized IPT. |
Figure 1. Data set FileStore and DataStore model.
Figure 2. Overview of the technical architecture for publishing collections data and digital media.
Figure 3. Interactive map visualizing over one million geocoded collection objects.
Figure 4. Sketchfab 3D model of southern right whale cranium http://data.nhm.ac.uk/dataset/3d-cetaceanscanning/resource/63a6168b-4594-4998-964e-86b8f7398e9c.
Figure 5. The data portal homepage.
Linked open data before processing using GBIF
| Subject | Object | Predicate |
|---|---|---|
Linked open data after processing using GBIF
| Subject | Object | Predicate |
|---|---|---|
CKAN packages used by the NHM Data Portal
| Package | Description |
|---|---|
| | Provides a user interface to download resources using |
| | Contact form. |
| | SOLR to index and search data sets (used for specimen collection). |
| | Adds geospatial searches within the datastore. |
| | Developer and debugger tools. |
| | Integration with DataCite to create DOIs. |
| | Data set resource image galleries. |
| | Loads the GBIF data set back into the portal. |
| | Server-side graph rendering. |
| | LDAP integration: allows staff to log in with their museum account. |
| | List view of resource records, displaying a subset of fields. |
| | Geospatial visualization of records. |
| | Main NHM extension, providing theming and generic customizations. |
| | Embedding Sketchfab 3D models. |
| | API for accessing data portal metrics. |
| | Status banner for system alerts. |
| | Twitter integration, for tweeting when data sets are created and updated. |
| | Allows users with the 'member' role within an organization to create/edit/delete their own data sets. |
| | Embedded YouTube and Vimeo video players. |
Figure 6. The old web search interface to the Entomology collections of the NHM.
EMu exports and record counts per annum (high total record counts for 2015, 2016 and 2017 were caused by repeated full data reload events).
| Calendar year | Number of exports | Total records exported | Mean records per export |
|---|---|---|---|
| 2015 | 31 | 4 302 239 | 138 781 |
| 2016 | 52 | 6 983 021 | 134 288 |
| 2017 | 163 | 6 030 596 | 36 997 |
| 2018 | 257 | 367 625* | 1430 |
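
The mean column can be reproduced from the other two figures in each row. The sketch below assumes that the first count in each row (31, 52, 163, 257) is the number of exports in that year, and that the reported mean is rounded down to a whole record; both are inferences from the numbers rather than statements from the paper. The names `ROWS`, `mean_records_per_export` and `check_rows` are illustrative.

```python
# Rows copied from the table:
# (calendar year, number of exports, total records exported, reported mean).
# Reading the first count as "number of exports" is an assumption.
ROWS = [
    (2015, 31, 4_302_239, 138_781),
    (2016, 52, 6_983_021, 134_288),
    (2017, 163, 6_030_596, 36_997),
    (2018, 257, 367_625, 1_430),
]


def mean_records_per_export(total_records: int, exports: int) -> int:
    """Return the mean records per export, rounded down to a whole record."""
    return total_records // exports


def check_rows(rows):
    """Return (year, computed mean, reported mean) for every table row."""
    return [
        (year, mean_records_per_export(total, exports), reported)
        for year, exports, total, reported in rows
    ]


if __name__ == "__main__":
    for year, computed, reported in check_rows(ROWS):
        print(year, computed, reported, computed == reported)
```

Under these assumptions the computed and reported means agree for all four years, which also illustrates the drop in export size as the number of exports per year grew.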
Figure 7. The Luigi ETL pipeline for loading KE EMu collection records into the data portal.
Figure 8. View of NHM specimens on the NHM Data Portal showing DQIs from GBIF (green, no known errors; orange, minor errors; red, major errors).
Example of metadata for the BioAcoustica contributed dataset (20)
| Field | Description | Example |
|---|---|---|
| Title | The name of the data set | BioAcoustica |
| Abstract | Short description of the data set | A worldwide collection of scientific recordings of animal sounds from the NHM and our collaborators |
| Keywords | Keywords | Bioacoustics, biodiversity, sound, taxonomy |
| Data set category | Broad theme of the data set | Research |
| License | How is the content licensed? | License not specified (BioAcoustica has a fine-grained system of licensing individual items of content) |
| Visibility | Public or private | Public |
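
As a rough illustration, the six metadata fields in this table can be pictured as one small record. The sketch below is a hypothetical Python representation: the class name `DatasetMetadata`, its attribute names and the choice of required fields are assumptions made for the example, not the portal's or CKAN's actual schema. The values are copied from the table.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Hypothetical container mirroring the six fields in the table."""

    title: str
    abstract: str
    category: str
    keywords: List[str] = field(default_factory=list)
    license: Optional[str] = None  # None stands for "License not specified"
    public: bool = True            # Visibility: public (True) or private (False)

    def missing_fields(self) -> List[str]:
        """Names of empty text fields (title, abstract, category are assumed required)."""
        required = ("title", "abstract", "category")
        return [name for name in required if not getattr(self, name).strip()]


# Values taken from the BioAcoustica example in the table.
bioacoustica = DatasetMetadata(
    title="BioAcoustica",
    abstract=(
        "A worldwide collection of scientific recordings of animal sounds "
        "from the NHM and our collaborators"
    ),
    category="Research",
    keywords=["Bioacoustics", "biodiversity", "sound", "taxonomy"],
    license=None,
    public=True,
)

if __name__ == "__main__":
    print(asdict(bioacoustica))
    print(bioacoustica.missing_fields())
```

Keeping `license` optional reflects the BioAcoustica row, where no single license applies because individual items are licensed separately.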
Figure 9. (A) Treemap of data sets hosted on the NHM Data Portal; size reflects the number of records. (B) Records downloaded from the NHM Data Portal each month. (C) NHM Data Portal Web traffic (page views and sessions). (D) Country of origin for users of the NHM Data Portal since launch.
Comparison of open source data portal platforms (continued)
| Name | Version tested | Technology | Author | Website | Example sites | Technology stack | Documentation | Functionality/extensibility | Popularity and sustainability |
|---|---|---|---|---|---|---|---|---|---|
| Nodes Portal Toolkit | NPT 1.0 | Drupal, PHP, MySQL | GBIF | | | Drupal (currently version 6; 7 in development) | Limited | Built with Drupal so easily extensible. Designed for biodiversity data sets; extending to support any type of data set will require extensive customization. | Appears to have just one developer performing the upgrade from Drupal 6 to Drupal 7. |
| DataVerse | 3.5.0 | Java, PostgreSQL | IQSS, Harvard Library, Harvard University Information Technology | | | Java: limited knowledge within Informatics Group | Good | More focused on publication data; most sites are universities and libraries. UI and design are awful. Aimed at researchers. | Under active development, with analytical tools planned for future versions. Limited take-up outside of initial partners. Very few portals are being built with it. |
| DSpace | 1.8.1 | Java, PostgreSQL/Oracle | | | 1000s of universities | Java: limited knowledge within Informatics Group | Good | Designed for academic and research libraries/universities as an open access repository. Has a lot of the functionality we need, and more besides. Is probably overkill for what we want to do: with collection management tools etc. built in, it is much more than an open access portal. | Lots of UK institutions using it (Imperial etc.) |