Exploring how data can improve PDFs: our work so far

Having recently released research on how PDFs can work better with data, ODI Leeds Head of Data Tom Forth and ODI Head of Tech Olivier Thereaux share their working so far and their aims for how the work may develop with input from the community.

Since spring 2017, we at the ODI, in close collaboration with our friends at ODI Leeds, have been working with Adobe to look at how PDF can work better with data. The research concluded a few weeks ago, and a report summarising its findings so far was published by ODI Leeds on 17 October.

PDF has a bad reputation in the open data world, often for good reasons. Releasing PDFs has too often been used to obfuscate data in public releases and responses to FOIs. Our research suggests that this remains a particular issue in the USA, while it is less of a problem elsewhere. In no way should our work on PDF and data be seen as a justification for this practice: we’ve heard the feedback from the community, and agree that we need to be vigilant in how this work is presented, lest it be used to justify bad practices rather than to push for better uses of PDF.

Adobe, which commissioned this work, wanted to get a better understanding of how PDFs are used, and was keen for us to document good and bad practices and explore possibilities for improving PDFs where they are valuable.

Exploring how PDFs could be improved with data

As the report says, PDFs are for documents, not for open data publishing. Publishing data by printing out a spreadsheet is a bad idea. But that doesn’t mean that PDF documents cannot play a useful role alongside open data.

The report also explores how PDF – an open standard supported by an ecosystem of open and closed-source tools – can include attachments. We looked at how such PDFs might usefully improve documents about data, by including the data referred to in the document as an attachment. In some fields, especially those related to archiving and the law, adding attachments to these documents may add value. We have been collecting examples of such use cases, good and bad, on a public spreadsheet, open to all.

Of the report’s seven suggested next steps, the one that has caused the most controversy is its proposal to explore whether in some cases a PDF document with attached data might qualify for a 3-star data rating. We’ve listened to feedback and we agree that the suggestion that example PDFs with data attachments may already be 3-star data was premature and confusing. We note that the 5-star open data scheme only considers formats. Other ranking systems, such as our Open Data Certificates, aim to provide a broader and deeper view of how well an open dataset is published.

We embrace all feedback

We’ve had lots of feedback – positive and negative – in the past week. We’re listening, and we’re keen to put the energy to good use. Our public spreadsheet already has use cases where PDFs are unsuitable, and use cases where PDFs might add value to data. We’d love to see more use cases that can inform discussion on the PDF and Open Data W3C community group. The group is open to all, and you’re welcome to join us.
