Open data publishing: towards better tools and processes

Publishing high-quality open data can still be costly and ad hoc. The ODI is going to build and improve publishing tools to speed up and automate the process.

Photo: The Open Data Institute

By Olivier Thereaux and Dave Tarrant

In 2013, McKinsey estimated a global market powered by open data from all sectors would create an additional $3tn to $5tn a year. Another study by Capgemini concluded that by 2020, the use of open data will have reduced public administration costs across the EU28+ by €1.7bn.

That said, an updated report from McKinsey estimates that only a small fraction (between 10% and 20%) of the value to the EU has so far been realised.

We at the ODI believe that alongside efforts to improve data access and usability, one of the ways to improve the situation is to look at pain points in how open data is currently being published, and develop potential solutions.

We are starting a new research and development project aiming to make it easier, faster and less costly to publish reliable, high-quality open data. The project will work on three core hypotheses around quality, speed and automation, and cost.

Quality

Ensuring open data is of a high quality is a prime concern amongst publishers, and understandably so: bad data is often not useful, can be unusable, wastes time and money, and may even lead to reputational damage.

As we are finding in the early stages of our research, this often leads to a perception that ‘no data is better than bad data’ (or at least less risky). This lack of confidence in data quality appears to be a major blocker to open data being published across many governments.

One of the great benefits of open data is that the community can access it easily and help fix quality problems. However, enabling publishers to improve data quality before publication and use brings greater benefit still.

Our hypothesis is that better tools – for cleaning or describing data, for example – and better integration of existing tools will improve data quality, increase publishers' confidence and remove blockers to publishing more openly.

Speed and automation

Publishing quality open data can take a long time. One of the main reasons for this is that the publishing process doesn’t fit naturally with existing tools and workflows, particularly in government. Too much time is often spent manually repeating processes that could be automated.

In 2014, the Open Data Institute worked with the Mexican Government to help them publish 100 datasets in just 42 days. Integrating a data quality tool with the publishing platform was key to achieving this.

Our hypothesis is that integrating data tools more strongly can help automate and speed up the process of publishing high-quality open data.
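To make the idea of integration concrete, here is a minimal sketch of a quality check wired in as a gate before publication. It is an illustration only, not the tool used in the Mexican project: the function names (check_csv_quality, publish_if_clean) and the particular checks are assumptions chosen for the example, and a real publishing platform would apply far richer validation.

```python
import csv
import io


def check_csv_quality(text):
    """Return a list of human-readable problems found in CSV text.

    Hypothetical checks for illustration: empty file, empty or duplicate
    column names, blank rows, and rows whose cell count differs from the header.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]
    if any(not name.strip() for name in header):
        problems.append("header contains an empty column name")
    if len(set(header)) != len(header):
        problems.append("header contains duplicate column names")

    # Data rows are numbered from 2 because the header is row 1.
    for row_no, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"row {row_no} is blank")
        elif len(row) != len(header):
            problems.append(
                f"row {row_no} has {len(row)} cells, expected {len(header)}"
            )
    return problems


def publish_if_clean(text, publish):
    """Call publish(text) only when no problems are found.

    `publish` is any callable supplied by the caller (hypothetical here);
    the list of problems is returned either way so it can be reported.
    """
    problems = check_csv_quality(text)
    if not problems:
        publish(text)
    return problems
```

Because the check runs automatically every time a dataset is submitted, the publisher gets a list of problems to fix instead of repeating a manual review, and only clean data reaches the publishing step.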

Cost

Publishing open data is not meant, in itself, to be a profitable activity, but it should at least be as cost-effective as possible. This is not the case today.

The cost of online hosting is going down, and the growing landscape of open source portals and publishing tools makes it possible to make open data available online virtually for free.

And yet, publishers of open data still incur significant costs for their programmes, as seen for instance in this 2012 NAO analysis of the Government Open Data initiatives. This is largely because publishing open data still requires experts to perform technically complex activities in order to gather, clean, describe, link and prepare it for publication.

Our hypothesis is that better tools could help automate more of the publishing process and that better integration between them would reduce the complexity and number of steps required, ultimately reducing cost. For example, building publishing workflows into the commissioning of new data systems would make it much less costly to publish data.

What we are doing, and how you can get involved

To test our three hypotheses, we have started a phase of research to iteratively audit what tools are currently available for gathering, preparing and publishing open data.

We are also trying to understand where the pain points are, and what roles tools have in the problems as well as the solutions.

We are organising a number of workshops across the UK over the next few months. We have already met with open data publishers in Exeter, Liverpool and Southampton, and plan to hold at least one more workshop, in Leeds, in the next few weeks.

If you publish data (or don’t, but would like to or feel you should), we encourage you to join these workshops. If you cannot, you can get in touch through this survey to let us know your pain points in publishing open data.

Finally, as well as understanding the problem, we want to help shape the solutions. In the next six months, we will run three strands of open source development to create new tools, improve existing ones and create better-integrated workflows from tools that the community of data publishers routinely uses. For this, we will collaborate closely with the successful bidders on our recent tender – to be revealed soon – and integrate our research to improve open data publishing in the UK as a whole.

Olivier Thereaux is Head of Tech and Dave Tarrant is Learning Skills Lead at the Open Data Institute. Follow @olivierthereaux and @davetax on Twitter