Reliable Open Data
The past couple of months have seen two outages in the open data space. Early in September, in the UK, data.gov.uk suffered a DNS issue that rendered it unavailable at its usual location over the weekend. More recently, in the US, data.gov has been temporarily taken offline as part of the US government shutdown.
The unavailability of data catalogs such as these can make data harder to find, but catalogs merely list data hosted elsewhere. The true problems arise when the data itself is no longer accessible. Eric Mill from the Sunlight Foundation called for government agencies to:
- Publish downloadable bulk data before or concurrently with building an API.
- Explicitly encourage reuse and republishing of their data. (Considering public reuse of data [a risk to the public]() is not recommended.)
- Document what data will remain during a shutdown, and keep this up all the time. Don't wait until the day before (or of) a shutdown.
- Link to alternative sources for their data. Keep these links up during a shutdown.
The emphasis here, as in David Megginson's post on routing around the damage, is on replication of data. Replication is easy for open data because openness (per the Open Definition) lowers both the legal and the technical barriers to republishing data, which is necessary when providing an alternative source.
But how do we make sure that open data does get replicated? There are three approaches that I've seen suggested:
- storing data on third-party services rather than government services in the first place, though this makes data availability depend on the third party staying in operation rather than on government, whose funding for services is generally fairly secure
- relying on web archives, though depending on their archiving strategy these may omit recent data, and archives are generally set up to aid future access rather than to act as a data backup service
- relying on citizens and civil society organisations to take copies of public data and make them available themselves, though this activity will naturally focus on the data of most interest to the individuals who take the time to copy it; unless there is organised activity to back up all data and keep that backup current, coverage is likely to be patchy
So while I think Eric's set of recommendations is correct, I would add a fifth:
- Proactively ensure there are alternative sources for the data.
While it seems to have fallen out of favour more recently, it used to be the case that code was routinely mirrored across multiple servers, through bilateral agreements between the organisations hosting that code. Governments and other organisations could enter into similar arrangements with each other, or with specialist mirroring services, as part of their contingency planning for service interruptions.
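As a rough sketch of how a mirroring arrangement might look from a reuser's side, the Python below tries a list of sources in order and accepts the first copy whose checksum matches. It assumes the publisher lists several mirror URLs and a SHA-256 digest for each dataset. That published digest, the function names and any URLs passed in are illustrative assumptions, not something a particular portal is known to offer.

```python
import hashlib
from typing import Callable, Iterable
from urllib.request import urlopen


def fetch_over_http(url: str, timeout: float = 10.0) -> bytes:
    """Download a URL and return its body as bytes."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_from_mirrors(
    urls: Iterable[str],
    expected_sha256: str,
    fetch: Callable[[str], bytes] = fetch_over_http,
) -> bytes:
    """Return the first copy whose SHA-256 digest matches the expected one.

    `urls` is a hypothetical list of the primary source and its mirrors;
    `expected_sha256` is assumed to be published alongside the dataset.
    """
    errors = []
    for url in urls:
        try:
            data = fetch(url)
        except OSError as exc:  # URLError and socket timeouts subclass OSError
            errors.append(f"{url}: {exc}")
            continue
        if hashlib.sha256(data).hexdigest() == expected_sha256:
            return data
        # A mirror that responds but is stale or corrupted is skipped too.
        errors.append(f"{url}: checksum mismatch")
    raise RuntimeError("no usable source: " + "; ".join(errors))
```

The checksum matters as much as the fallback: a mirror is only a useful alternative source if a reuser can confirm that its copy is the same data the original publisher released.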
I would like to see some experimentation with more radical options as well. For example, for key datasets that don't change frequently, BitTorrent, which spreads the storage of data across a network of peers, may be a useful alternative to the more standard but centralised HTTP.
Not all data can be easily replicated, which is why, as Eric says, publishers should provide bulk downloads and not just APIs. But there are some services where the value of the data comes from its timeliness: for example, how the buses are running, the latest weather forecast, or new legislation. Replicating old data in these cases only helps to ensure that historical analyses can be carried out; most applications that rely on this kind of data will need something current.
Even in the US government shutdown, the Library of Congress preserved access to some services, namely the legislative sites THOMAS.gov and beta.congress.gov. The US government has judged these information services as excepted from the general shutdown, as essential to operations as air traffic control.
We should be asking for more data to be recognised as essential in this way, for more guarantees from governments about the availability of data in all circumstances. Only with such guarantees in place can we truly treat government information as infrastructure.