

Thu Oct 4, 2018

How hard is it to publish good-quality open data? Two publishers share their honest experiences

The process of publishing data is notoriously difficult, with many pitfalls to avoid. But help is at hand – if you know where to look

Here two data publishers – Paul McGuinness from the Food Standards Agency and Neil Lawrence from ‎Oxford City Council – share their travails with Oliver Pickup so that others can glean knowledge from their experiences.

We also list useful data publishing resources.

Meet the data publishers

Paul McGuinness, Data Technician, Food Standards Agency

Paul McGuinness is a Data Technician in the Information Knowledge Management team at the Food Standards Agency (FSA). He has published a blog about the FSA’s data publishing developments.

When did you start publishing open data?

“Our open data journey at the FSA began back in 2016 when the decision was made that our approach to data would be ‘open by default’, thus openly publishing our data whenever possible. It is important to us that we are not simply publishing large amounts of data, but are routinely publishing data of a high quality and updating it frequently.

“Data publication in itself is not tricky; it is the development of the publication process that can be. Once a successful process has been developed it can be rolled out to others via training and coaching.”

Paul notes that the process is now being rolled out to data holders, adding that they will be able to self-publish regular updates to the datasets. “This is another step in embedding the culture of openness across the FSA,” he says.

As well as publishing the data, the data team supports data holders to prepare the data “…in line with the principles of open data, which in turn supports our commitment to transparency.”

How many datasets does the FSA manage now, and what benefits have you seen?

“We currently have 164 datasets published openly, which represents over 70% of the FSA’s data. However, we do not just aim to publish large amounts of data, but also to focus on data quality and regular updates.

“With regard to benefits, feedback about our open data has allowed us to further develop how we make the data available, more closely fitting the needs of the end user. An example of this is our Alerts API which has gone live recently.”
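Paul mentions the FSA's new Alerts API. As a minimal sketch of how a consumer might work with an alerts-style JSON feed, the snippet below parses a payload and pulls out the most recent alerts. The payload and field names here are invented for illustration and are not the FSA's actual schema.

```python
import json

# Hypothetical payload in the style of an open alerts feed; the field
# names are invented for illustration, not the FSA's actual schema.
payload = """
{
  "items": [
    {"title": "Product recall: example item A", "modified": "2018-10-01"},
    {"title": "Allergy alert: example item B", "modified": "2018-10-03"}
  ]
}
"""

def latest_alerts(raw, limit=5):
    """Return the titles of the most recently modified alerts."""
    items = json.loads(raw)["items"]
    # ISO 8601 dates sort correctly as strings, newest first
    items.sort(key=lambda a: a["modified"], reverse=True)
    return [a["title"] for a in items[:limit]]

print(latest_alerts(payload))
```

Serving data through an API like this, rather than as static files alone, is one way a publisher can respond to end-user feedback, as the FSA describes.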

What were the main barriers you found, and how did you overcome them?

“Open data publication can be a bit of a daunting concept when it is first explored, so one of our biggest barriers was trying to get buy-in from our ‘data owners’ to embrace the culture of openness.

“The further we get along our open data journey the easier this becomes, though. [They] can see what others have published and share experiences. To date we have managed this without any of the initial fears being realised.

“We have also had technical barriers to overcome, particularly in relation to publication tools. The tools made available to us via the ODI have been a great help. We took the ODI’s free Octopub product on board quite early in its development – it is still our primary interface for publishing open data – and subsequently identified a number of issues. We then had to develop workarounds to deal with bugs and glitches in its functionality. This was time consuming and often quite frustrating.

“Other open-source resources such as the JSON Schema Tool and CSV Lint, another ODI tool, have been used by our data publishers and now form part of our data publication process.”
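To give a flavour of what schema-based validation adds to a publication process like the FSA's, here is a toy sketch of the kind of check a JSON Schema validator automates: confirming that required fields are present with the expected types. The schema and records are invented for illustration; real tooling such as the JSON Schema Tool does far more.

```python
# A toy sketch of schema validation: each field maps to its expected
# Python type. The schema and records below are invented for illustration.
schema = {
    "title": str,
    "publisher": str,
    "rows": int,
}

def validate(record, schema):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, expected in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"wrong type for {field}: expected {expected.__name__}")
    return problems

good = {"title": "Food hygiene ratings", "publisher": "FSA", "rows": 1000}
bad = {"title": "Food hygiene ratings", "rows": "1000"}

print(validate(good, schema))
print(validate(bad, schema))
```

Running checks like this on every update, rather than by hand, is what makes routine high-quality publication sustainable.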

What do you wish you knew at the start of the process that you know now?

“Having seen a demonstration of the new version of the Octopub product, I wish that it had been available when we started using it as our primary publication tool. Identifying and isolating bugs in the functionality and developing workarounds for them has been very time consuming. It’s good to see that our feedback has been taken on board in the development of this new version.”

What advice would you give to people starting out on their data publishing journey?

Paul explains that it’s important to encourage a culture of openness as early as possible. “Getting everybody on board and working towards a common goal is really important. Having good governance procedures in place is also necessary to ensure that all appropriate questions around data protection, commercial data etc are being raised early in the publication process.

“Also, the role that good data leadership plays in this space should not be overlooked – we have been very lucky to have a chief executive and key directors who were incredibly supportive of our open data approach. Without this support it would not have been possible to make as much progress as we have.”

Commenting on Paul’s story, Leigh Dodds, Data Infrastructure Programme Lead at the ODI, said: “It’s great to learn more about FSA’s data publishing journey and get honest feedback on some of our prototypes and tools. We recognise many of the successes, challenges and frustrations voiced and our aim is to align our resources to this journey – resources such as the second release of Octopub and our Open Standards for Data guidebook.”

Neil Lawrence, Digital Transformation Manager, Oxford City Council
© Neil Lawrence

Neil Lawrence is Digital Transformation Manager at ‎Oxford City Council. He has kept a record of his data publishing journey.

When – and why – did you start publishing open data?

“September 2016. Our council co-chairs a body called Smart Oxford that wanted to pursue the publication of open data to get people more involved with technology. Separately, our council agreed a digital strategy in November 2016 which has a commitment to publish open data. For me it is about committing to transparency and being open with the public we serve.”

Where did you initially go for information, and how useful was it?

“We relied to a degree on our original platform provider, Socrata. We also sought out other platform users to gain insights, such as Camden Council. Socrata organised a workshop with customers where we met other people and swapped success stories. Personal contacts through LocalGov Digital were helpful, as was the ODI website and training course.

“For open data there is no real equivalent resource. There is information about what makes data open, the five-star levels and so on. If you work at it you can get your head around the certification types (the ODI course I went on was the most useful for this).”

What were the main barriers you found, and how did you overcome them?

“In terms of schemas there is nothing – you either know everything or nothing. The technical aspects of making a valid data table are another mind-boggling area (UTF-8, anyone?!). If you want a recommendation about a platform there is no useful comparison resource, and even if there were, it would be too long to know where to start.

“I also think there is a degree of snobbery with open data. Data professionals will sometimes sneer at the idea of spending money on a data platform (they are the sort of people that wrote websites using Notepad and a browser), or the idea of using visualisations. It feels too clique-ish – you have to earn your stripes before you get to drink the Kool-Aid.

“In the end I just had to work through these things by myself until some of the cogs slipped into place (a data catalogue can be separate to data storage, for example). I tried some things out and kept a record of what worked and what didn’t.”
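On Neil's UTF-8 point: declaring the encoding explicitly when writing a CSV removes most of the guesswork for downstream users and machines. Below is a minimal sketch using Python's standard csv module; the ward names and figures are invented for illustration.

```python
import csv

# A minimal sketch: write a CSV explicitly as UTF-8, with a header row,
# so readers don't have to guess the encoding. The data is invented.
rows = [
    {"ward": "Headington", "recycling_rate": "48%"},
    {"ward": "St Clement's", "recycling_rate": "51%"},
]

with open("recycling.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["ward", "recycling_rate"])
    writer.writeheader()   # a header row makes the file self-describing
    writer.writerows(rows)

# Read it back with the same encoding to confirm a clean round trip.
with open("recycling.csv", encoding="utf-8", newline="") as f:
    print(list(csv.DictReader(f)))
```

Passing `newline=""` to `open()` is the documented way to let the csv module handle line endings itself, which keeps the file valid across platforms.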

Were there any other barriers?

“Yes. In terms of resource, I was the only person actively working on open data in the organisation for a long time, and I assumed no one else cared. But then our grants coordinator signed up to the 360Giving standard and suddenly we’re using a schema. I’m also starting to work directly with people who produce the data to meet statutory requirements but without much regard for format. By enlisting them to change their reports, we can end up with data that is more readily re-usable.


“From a cultural perspective, it’s still really hard to convince anyone that going to the effort to produce open data has any return. If you are Transport for London you can do this, but this is an overused example about the benefits of open data that most organisations can’t relate to.

“A lack of knowledge has been a barrier. There’s still an assumption that data is open if you are able to search for it on the web, or is shown in a visualisation. The training course I went on at ODI was helpful because it challenged that. Also, being able to point to other similar organisations that achieve more than us is a helpful spur to action.

“On the technological front, understanding how to format CSVs was tough – I’m getting there now, with CSVLint, but I’m still uncertain about how to resolve some issues. Also, I have no idea how we use FME, so we can’t automate production.

“Finally, the biggest single barrier for me has been legislative, and specifically the use of data derived from Ordnance Survey open data. There is no easy guide that explains this (despite what OS say). It [legislation] can also be an incentive – Transparency Code and INSPIRE mean we have to take action to produce data, which gets people on the case.”

What do you wish you knew at the start of the process that you know now?

“I still don’t think I know that much, but one of the things that would have helped would have been not starting with the platform but starting with the data, and making sure it is in the right format, dependable and useful. Plus I wish I had got other people on board quicker, and had them commit to more than just their support.”

What advice would you give to people starting out in open data publishing?

“The ODI is a great resource for training, finding tools and getting access to weekly talks on open data. Find open data people on Twitter and follow them. Go to events to ask questions and meet people who can help. Use free trials from platforms to get to know them and if they suit you.”

Leigh Dodds, Data Infrastructure Programme Lead at the ODI, said: “This case study demonstrates the sometimes complex path towards open data publishing, and also highlights the range of tools, resources, support and training available, from the ODI and others. We second the advice to start with the data rather than the platform and to make use of social media, events and training to help steer decisions around data publishing.”

Read more in our blog: To improve open data, help publishers.

For data publishers of all levels, the links below offer useful advice and guidance. Consult them, bookmark them, use them.

  • Data.gov.uk – Showcases data published by government departments and agencies, public bodies and local authorities. You can use this data to learn more about how government works, carry out research, or build applications and services.
  • Octopub – A GitHub-based tool, developed by the ODI around the needs of publishers, that provides a simple, frictionless way to publish data quickly and correctly.
  • Standards Guidebook – Resources to help people find and choose open standards for data.
  • Frictionless Data Field Guide – Provides step-by-step instructions for improving data publishing workflows, and introduces new ways of working informed by Frictionless Data software.
  • Lintol – An open data validation tool, supported by the ODI, akin to a grammar checker for open data. The project aims to create a new generation of tools to help individuals and teams test the quality of their data.
  • CSVLint – CSV (comma-separated values) looks easy, but it can be hard to make a CSV file that other people and, more importantly, machines can read easily. The ODI-backed CSVLint helps publishers check that their CSV files are readable.
  • goodtables – Offers continuous data validation for spreadsheets, and monitors files in multiple formats, including CSV, Excel, LibreOffice, and more.

Oliver is a London-based writer. He specialises in tech, business, sport and culture. Follow @OliverPickup on Twitter