Exploring open data quality
To contribute to discussions around open data quality, the ODI has been working with Experian to investigate quality in several UK Government open datasets. Here, ODI Associate Leigh Dodds introduces the project and its initial findings.
A number of initiatives are currently exploring data quality, with particular reference to describing, measuring and improving the quality of open data.
For example, the W3C Data on the Web Best Practices Working Group are producing a vocabulary for publishing and describing data quality metrics. There is also related work capturing best practices for sharing public sector data.
Various open data projects and communities are working to improve the quality of their open data and have started to share guidance. For example, data.gov.sg have recently shared their data quality guide for tabular data. Mark Frank and Johanna Walker at Southampton University have also recently published a paper exploring a user-centred view of data quality.
To contribute to this ongoing discussion, we recently undertook a small project with Experian to explore data quality in some open datasets.
The project had several goals:
to identify the types of data quality issues we might find in some existing open datasets
to suggest some common data quality checks that both publishers and users could apply to data
We worked with the data quality team at Experian to run the datasets through their Pandora data quality tool. Pandora is a data-profiling tool designed to support exploration of datasets, highlight data quality issues and enrich data against other sources. For this project we used Pandora to generate some quality metrics for each of the datasets we reviewed.
You can recreate a number of the checks we carried out using the free version of the tool.
Our key insights are as follows:
There is still scope to improve how well datasets are documented and published to data.gov.uk and beyond
Even large, well-used and well-maintained datasets would benefit from a number of basic data quality checks
Defining and using standard schemas for datasets would benefit both data publishers and users
Being able to quickly summarise and explore a dataset offers a powerful way to understand its structure and highlight potential data quality issues
The use of standard, open registers will be a significant boost to the quality of many open datasets
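To give a flavour of what quickly summarising a dataset can look like, here is a minimal sketch in Python using only the standard library. It is not how Pandora works and it is not the method used in this project. The function names (profile_csv, guess_type) and the sample rows are hypothetical, made up purely for illustration. For each column it counts blank cells, distinct values and a very rough guess at the type of each value.

```python
import csv
import io
from collections import Counter


def guess_type(value):
    """Very rough type guess: anything float() accepts counts as a number."""
    try:
        float(value)
        return "number"
    except ValueError:
        return "text"


def profile_csv(text):
    """Summarise each column of a CSV string.

    Returns a dict keyed by column name with the row count, the number of
    blank cells, a Counter of distinct values and a Counter of guessed types.
    """
    reader = csv.DictReader(io.StringIO(text))
    summary = {
        name: {"rows": 0, "blank": 0, "values": Counter(), "types": Counter()}
        for name in (reader.fieldnames or [])
    }
    for row in reader:
        for name, stats in summary.items():
            # Short rows yield None for missing cells, so treat those as blank.
            value = (row.get(name) or "").strip()
            stats["rows"] += 1
            if not value:
                stats["blank"] += 1
                continue
            stats["values"][value] += 1
            stats["types"][guess_type(value)] += 1
    return summary


# Hypothetical sample data, invented for this illustration only.
sample = """\
name,postcode,spend
Library A,SW1A 1AA,1200
Library B,,n/a
Library A,SW1A 1AA,1200
"""

report = profile_csv(sample)
for column, stats in report.items():
    print(
        column,
        "rows:", stats["rows"],
        "blank:", stats["blank"],
        "distinct:", len(stats["values"]),
        "types:", dict(stats["types"]),
    )
```

Even a summary this simple tends to surface the kinds of issues discussed above: a column that mixes numbers and text, unexpected blanks, or repeated rows. A dedicated profiling tool goes much further, but the basic idea is the same.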
If you have any feedback on the findings or suggestions for how to build on the work further, then please get in touch with our labs team.