Four Challenges for Open Data
We've seen an amazing growth of interest in open data within the UK and worldwide over the past half-dozen years, but in my opinion there are still some challenges ahead if we want to get to a situation in which open data is being used to its full potential, by government and businesses and individuals. No doubt there are others that you can come up with now, and that will emerge as we move forward, but these are the four that are top of my list.
Free is not Always Open
There has been a welcome push recently to make data, particularly government data, freely available to whoever wants to use it. But making data freely available is only the first step in making it open for all to reuse. There are layers of barriers between information that anyone can get and data that anyone can use. I'm not just talking about licences, or the format issues addressed by the five stars of open data, or even the community involvement tackled by the five stars of open data engagement, but barriers around comprehension and access and reliability. For example:
- The COINS dataset released by the government several years ago is a good example of freely available data that is essentially closed to anyone without the time, resources and expertise to comprehend it. (Fortunately there are some people who do!)
- Datasets that are only available as batch downloads raise barriers for those who are only interested in a portion of the dataset; those that are only available through an API make accessing the entire dataset for global analysis difficult.
- When organisations want to build businesses on top of open data, they don't just need the data itself, but also guarantees around persistence, reliability, accuracy and general ongoing maintenance that enable them to invest in it without being concerned about the world shifting beneath their feet.
The star schemes and checklists for open data serve to highlight places where we have found that open data publication hasn't achieved the outcomes that we were hoping for, and to offer ideas about how to address that. Which leads on to the second challenge.
Open is not Always Free
Raising the standard of open data releases, to lower the barriers for reusers of that data as described above, can be costly, and organisations are rarely motivated by benefits that come to other people. If they are going to publish open data, they need to find an incentive to do so that is closer to home. I wrote recently about Open Data Business Models that help public sector and third-sector organisations save money and deliver on their primary task, and for-profits make money, by publishing open data. Open data does not have to be all burden and no benefit for publishers.
In addition, where resources are limited, organisations need to choose where to invest in increasing the quality of the open data they publish. Theories about what generally makes open data useful may be "true but useless" in the context of a specific organisation. Data owners need to connect directly with current and potential reusers, to understand what they need and focus on those requirements. This doesn't preclude unanticipated reuse, but it does help ensure that effort isn't wasted in development work that only might be useful.
Analysis is not Always Easy
Open data, and the plethora of tools that make it easy to visualise, can lure us into a false sense of comprehension. We can take a list of areas with associated numbers of cancer deaths, throw them into a map and pick out the lightest and darkest areas as being best and worst. But all too often we do this without thinking about whether the differences between the figures are either statistically or practically significant.
This matters when people make decisions based on these kinds of analyses: how much money should be assigned to each area, what traffic calming facilities should be introduced, which head teachers should be sacked. Statisticians have known how to handle figures responsibly for more than a century; as open data grows up, we need to embed that rigour into the analyses and visualisations we produce.
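As a rough illustration of the kind of check that is often skipped, here is a minimal sketch of a two-proportion z-test comparing death rates in two areas. The function name and all the figures are hypothetical, invented purely for illustration, and a real analysis would also need to account for things like age structure and multiple comparisons across many areas.

```python
import math


def rate_difference_test(deaths_a, pop_a, deaths_b, pop_b):
    """Two-proportion z-test for the difference between two areas' rates.

    Returns (z, p_value), where p_value is two-sided and uses the normal
    approximation. The name and interface are illustrative assumptions.
    """
    rate_a = deaths_a / pop_a
    rate_b = deaths_b / pop_b
    # Pooled rate under the null hypothesis that both areas share one rate.
    pooled = (deaths_a + deaths_b) / (pop_a + pop_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / pop_a + 1 / pop_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical figures: area B's rate is 50% higher than area A's,
# first with small populations and then with large ones.
small = rate_difference_test(12, 10_000, 18, 10_000)
large = rate_difference_test(1_200, 1_000_000, 1_800_000 // 1_000, 1_000_000)

print("small populations: z=%.2f, p=%.3f" % small)
print("large populations: z=%.2f, p=%.3g" % large)
```

Both comparisons would be shaded identically on a map of rates, but with these made-up numbers only the second difference is one that chance alone is unlikely to explain. Whether either difference is large enough to matter in practice is a separate question that the test does not answer.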
Open Data is not Always Good
The final challenge for our aspirations for open data is that open data is purely a tool: it is not good in and of itself. Open data can help people make more informed decisions, but it can equally mislead people into making poor decisions, or enable individuals to make good decisions for themselves that, in aggregate, lead to a more divided society.
I recently read "Everything is Obvious" by Duncan J. Watts. He talks about the Music Lab experiment, in which users of an online music store were able to rate and download music tracks, and to see other users' aggregated ratings of each track. As expected, some tracks were vastly more popular than others. The twist of the experiment was that it was run in eight parallel worlds, each with exactly the same songs, and in each world the final ratings of the songs were vastly different: each world had a different track that was far more popular than all the rest. Users would choose to listen to songs with higher ratings, and rate those more highly themselves, leading those same songs to get even better ratings, and so on: a feedback cycle leading to a popularity measure whose relationship to the actual musical quality of the track was almost non-existent.
It doesn't seem much of a stretch to imagine that the same pattern applies to making data available about pupil attainment, say:
- parents make decisions about what school to send their children to based on pupil attainment scores
- those who can afford to move house do so, to give their children more of a chance of attending a good school
- these richer parents spend money on tutoring, on school trips, on extra-curricular activities and so on; the school has more resources to spend on the children
- the children do better at school
- the school's pupil attainment marks go up
Run the cycle a few times and some schools blossom while others nosedive. We don't have the luxury of running this in eight parallel worlds, but it's not hard to see how initial minor deviations between schools can be exaggerated through this feedback cycle.
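To make the amplification concrete, here is a deliberately crude toy model of the cycle above. It assumes, purely for illustration, that in each round a school's score moves away from the average in proportion to its current lead or deficit. The function name, the `strength` parameter and the starting scores are all hypothetical, and nothing here is calibrated against real schools.

```python
def run_feedback(scores, rounds, strength=0.1):
    """Toy feedback cycle: each round, every score drifts away from the
    mean by `strength` times its current distance from the mean.

    This is an illustrative assumption, not an empirical model.
    """
    scores = list(scores)
    for _ in range(rounds):
        mean = sum(scores) / len(scores)
        # Schools above the mean attract resources and rise;
        # schools below the mean lose them and fall.
        scores = [s + strength * (s - mean) for s in scores]
    return scores


# Two hypothetical schools that start one point apart.
start = [60.0, 61.0]
end = run_feedback(start, rounds=10)

print("starting gap:", start[1] - start[0])
print("gap after 10 rounds:", round(end[1] - end[0], 2))
```

Under these assumptions the average score never changes, but the gap between the schools grows by a factor of `1 + strength` every round, so a one-point difference more than doubles after ten rounds. The real mechanism is far messier, which is exactly why the impact needs to be measured rather than assumed.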
Open data can be a powerful force for social, environmental and economic change, but we shouldn't be blind to the fact that it can lead to bad outcomes as well as good -- or even no changes at all! We need to be clear about the goals that we want to achieve when we release open data, and monitor the impact of open data towards those goals, preferably experimentally, to build our evidence base.
Meeting the Challenges
With these challenges on my mind, I am really delighted to be taking up the position of Technical Director of the Open Data Institute. I am looking forward to addressing the challenges by:
- studying and spreading the word about open data successes and failures, to help us all learn from each other
- helping data reusers to innovate around open data to drive good decision making
- helping data owners to publish their data in ways that maximise the benefits to themselves, the reusers of their data, and society as a whole
- contributing to the development of tools, services and standards to support this work
Most of all, I am excited about working with the open data community to magnify the impact of their extensive work on the individual concerns of new and existing businesses and public-sector organisations. We are as yet a small team at ODI, but it's one that I'm already proud to be part of.