Code of Practice (Datasets)

General Comments

The Open Data Institute believes that publicly owned data should be open data — freely available without restrictions on its use — so that both public authorities themselves and the rest of society can gain the maximum benefits from that data.

The Protection of Freedoms Act 2012 introduces a “right to data” for the first time, which is an important step forward in enabling individuals and organisations to get hold of data that they need. We would have welcomed the embedding of more substantive rights in the legislation, such as including a wider range of datasets, more exacting requirements on the formats that the data is provided in by public authorities, and a requirement for open licensing of published data, as we feel these would have better supported the government’s stated transparency and open data policies.

In this context, the Code of Practice (datasets) is an opportunity to make a clear distinction between the low bar of minimum permissible practice under law, and good practice that public authorities are encouraged to aim towards in all their dataset publication practice.

The current drafting of the Code too often emphasises the minimum that public authorities are required to do, when it should emphasise what public authorities should do, in the view of the Secretary of State, to adhere to good practice. The remainder of our response suggests new wording along these lines where appropriate.

We strongly recommend that the Code:

  1. clearly states that by default, public authorities should publish their data in a re-usable format and as open data under the Open Government Licence
  2. covers good practice for all datasets; public authorities will use the Code to determine their policies for many datasets, not just those covered by the limited scope of the Act
  3. spells out the good reasons for making data available as open data, as described in the Open Data White Paper
  4. consolidates existing guidance for the publication of datasets by public authorities, including the Public Data Principles
  5. is clear about what the “exceptional circumstances” that might result in not using the Open Government Licence might be
  6. describes the additional burdens on public authorities that would result from using a Charged Licence
  7. requires public authorities that use a Charged Licence to publish the cost/benefit analysis they carry out in selecting that licence over the Open Government Licence
  8. recommends mediation when there are complaints

Comments on i. Introduction

The introduction should draw the distinction early on between the minimum requirements under the law and desirable activity under the Code. It should state that, to adhere to good practice in the view of the Secretary of State, public authorities should publish the data that they create and manage as open data, which we define as “information that is available for anyone to use, for any purpose, at no cost”, in a re-usable format and under the Open Government Licence. Any exceptions from this rule, in particular charging for data, should be justified, as they both increase the administrative burden on the public authority and raise substantial barriers to the use and reuse of the data.

Paragraph 3 should be rephrased in a more positive way to encourage public authorities to publish open datasets, for example:

  1. The Act and the Code are intended to increase regular publication of up-to-date datasets, in a re-usable format, and licensed to encourage their reuse. This Code requires public authorities to publish data that they manage in a reusable format and as open data. The Act does not require datasets to be maintained or updated if they would not otherwise be updated as part of the public authority’s function.

Comments on ii. Scope

Paragraph 7 states that the Code of Practice applies only to those datasets that are within the scope of the Act. The definition of the term “dataset” within the Act excludes a large amount of potentially useful data that could be published by a public authority, such as that resulting from modelling or analysis, any kind of official statistics, and any datasets that result from converting data from one form to another.

Given the need for public authorities to adopt general policies around publishing datasets, which are unlikely to neatly match the scope of dataset in the legislation, it would be useful for the Code of Practice to be written so that it can be used as if it had a wider scope and to apply to any dataset owned by the public authority. It will avoid confusion and conflicting requirements for public authorities if the same good practice can apply to all these datasets as well as those included within the definition of “dataset” used by the Act.

In Paragraph 9, the Code says “The purpose of releasing these datasets is to increase transparency and accountability of a public authority’s decisions and functions.” Public authorities should be encouraged to publish datasets, both in response to Freedom of Information requests and proactively, in order to reap the wider benefits from their release, as well as for transparency and accountability purposes. For example, publishing datasets could:

  • increase the efficiency of communication with service providers, partners and across the public sector
  • enable the public authority to get feedback which improves the quality of its own data and therefore the decisions that it makes
  • facilitate the entrepreneurial development of citizen-oriented services that improve the decisions that citizens make and support the creation of social, environmental and economic value

Paragraphs 9 to 13 serve to highlight the difficulties with the definition of “dataset” used within the Act. For example, neither the Act nor the Explanatory Notes that accompany it make it clear that aggregation is classed as “calculation” rather than “analysis”, and therefore aggregated datasets are included within the definition of “dataset”. It is good to have this clarification, and important that it appears within the Code of Practice, as this is missing from the legislation.

The Code of Practice should describe good practice for publishing datasets whether or not they fall under the Act’s definition of “dataset”. It should be made clear that the exclusions detailed within this section are there to enable the public authority to identify those datasets for which they are obliged to adhere to the Act and the Code. Where there is doubt, public authorities should be encouraged to err on the side of assuming that a dataset is covered by the Act and follow the best practice set out by the Code.

The description of datasets covered by the Act should also be reframed more positively in the Code, to emphasise that many datasets are covered. For example, Paragraph 10 could be rephrased as:

Raw or source data is included in the scope of a dataset, as is aggregation of data to form high-level datasets (such as the creation of regional figures that were collected at a district level, or the creation of annual figures from data that were collected weekly).
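
To illustrate the kind of aggregation this wording refers to, the following minimal Python sketch sums district-level figures into regional totals. The region names, district names and counts are invented for illustration and do not come from any real dataset.

```python
from collections import defaultdict

# Hypothetical district-level records: (region, district, count).
# All names and figures are invented for illustration.
district_figures = [
    ("North", "District A", 120),
    ("North", "District B", 80),
    ("South", "District C", 200),
]


def aggregate_by_region(rows):
    """Sum district-level counts into one figure per region."""
    totals = defaultdict(int)
    for region, _district, count in rows:
        totals[region] += count
    return dict(totals)


regional_figures = aggregate_by_region(district_figures)
```

The same pattern would apply to producing annual figures from weekly data, with the year taking the place of the region as the grouping key.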

In Paragraph 12, it would also be useful to clarify whether datasets that are modified through data cleansing (eg reformatting dates to a consistent format) or correction (eg revising an incorrect date) are also excluded from the definition of “dataset”. In our experience, this processing is commonly required on data to be released as open data, and increases the data quality for the public authority itself as well as for potential reusers. It would also be useful, in Paragraph 12, to describe whether redaction (eg removing columns containing personal data) is classified as the material alteration of a dataset.
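
As a rough sketch of the processing we have in mind, the Python below reformats dates to a consistent format and removes a column containing personal data before publication. The column names, sample rows and accepted date formats are assumptions made for this example only.

```python
import csv
import io
from datetime import datetime

# Hypothetical input: column names and values are invented for illustration.
RAW_CSV = """name,visit_date,ward
Alice Example,03/01/2013,East
Bob Example,2013-01-04,West
"""

DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")  # assumed input date formats
PERSONAL_COLUMNS = {"name"}              # assumed to contain personal data


def normalise_date(value):
    """Return the date in ISO 8601 form, or the original value if unrecognised."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value


def cleanse_and_redact(text):
    """Reformat dates consistently and drop columns containing personal data."""
    reader = csv.DictReader(io.StringIO(text))
    kept = [c for c in reader.fieldnames if c not in PERSONAL_COLUMNS]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=kept, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["visit_date"] = normalise_date(row["visit_date"])
        writer.writerow({c: row[c] for c in kept})
    return out.getvalue()


published = cleanse_and_redact(RAW_CSV)
```

Neither step changes what the data records; both only change its form or remove fields, which is why it would help for the Code to say how such processing is classified.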

Paragraph 14 highlights that some datasets are not covered by the Act or Code because they are not relevant copyright works. This paragraph should note, for the avoidance of doubt, that datasets that are subject to Crown Copyright must be published under the Open Government Licence.

Comments on iii. Disclosing datasets in an electronic form which is capable of reuse

Paragraph 11 defines “reusable format” as a “machine-readable format”, but publishing data in a machine-readable format does not necessarily make it reusable. In particular, if the machine-readable format is a proprietary format that is only readable when using expensive proprietary software from a single vendor, particularly if it cannot be easily exported to other formats, then it is not practically re-usable. The Code should encourage data publication in formats defined by open standards, such as CSV, XML, JSON and RDF, where there is good availability of open source tooling, such as readers, parsers and converters, for processing the data. This is consistent with government’s policy favouring the use of open standards and the appropriate use of open source tools.
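
To show how little tooling open formats need, the following minimal sketch converts CSV into JSON using only the Python standard library. The sample columns and values are invented for illustration, and a real conversion might also need to handle data types and character encodings.

```python
import csv
import io
import json

# Hypothetical CSV export; the columns and values are invented for illustration.
CSV_EXPORT = "site,year,visitors\nLibrary,2012,1500\nMuseum,2012,2300\n"


def csv_to_json(text):
    """Convert CSV text into a JSON array of objects, one per row."""
    rows = list(csv.DictReader(io.StringIO(text)))
    return json.dumps(rows, indent=2)


as_json = csv_to_json(CSV_EXPORT)
```

Note that every value is carried across as a string in this sketch, because CSV itself does not record data types.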

Paragraph 12 states that datasets are often created in formats that are capable of re-use. Unfortunately, this is often not the case: datasets are often created in proprietary applications which use closed data formats. It is more accurate to say that many of the applications that public authorities use to capture and manage data, such as Microsoft Access or Excel, support export into re-usable formats such as CSV.

The final sentence of Paragraph 12 could be more positively framed as “To gain the maximum benefits from the reuse of datasets, public authorities should investigate the possibility of converting data to re-usable formats for publication. The Act does not oblige public authorities to carry out this conversion if it is impractical or overly burdensome.”

Comments on iv. Standards applicable to public authorities in connection with the disclosure of a dataset

Paragraph 14 refers to the Public Data Principles. It would be useful if the Public Data Principles were combined with this Code of Practice, so that public authorities only need to look at one document to work out what to do, and to avoid conflicting advice being given in the two documents. The Code of Practice should stand in its own right as a statement of good practice.

Paragraph 15 refers to Sir Tim Berners-Lee’s Five Star ranking system but provides a link to his Linked Data Design Principles, which is altogether different. The most helpful link for the five-star scheme is http://5stardata.info/. The final sentence of this paragraph talks about “applicants … assessing the suitability for re-use of a dataset they have received”, which the five-star scheme does not help with. Instead, the five-star scheme should be used by public authorities to guide their decisions about how to publish datasets. The final sentence should be replaced by:

This Code recommends that datasets are published at the three-star level at a minimum, and that public authorities aim to publish datasets that are likely to have a wide impact at a five-star level.

Paragraph 16 could again be reworded more positively as:

Published datasets should be accompanied by a sufficient amount of metadata and contextual information about how and why the dataset was compiled or created, and the processing that it has gone through, in order that users may fully comprehend the dataset they are dealing with and as part of compliance with Section 16 (duty to advise and assist) of the Act. Only where there are good reasons for this information not to be provided should it be omitted.

Paragraph 17 is excellent in giving strong guidance for the use of open standards that emphasises what public authorities should do rather than what they can get away with not doing. This paragraph should highlight that when private-sector organisations collect and manage data on behalf of public authorities, the rights to that data should be transferred to the public authority. This reduces the amount of third-party intellectual property within datasets managed by the public authority, and therefore increases the potential for publication of those datasets as open data.

Comments on v. Giving permission for datasets to be reused

We are pleased to see the emphasis in this section on publishing the licence under which a dataset is made available alongside the dataset itself. This alleviates the uncertainty that otherwise exists about whether a particular dataset can be reused. It should be made clear from the outset of this section that the Code recommends publication of datasets under the Open Government Licence.

Paragraph 18 mentions third-party intellectual property rights. Many users of the Code will not be familiar with the subtleties of copyright, database rights, third-party intellectual property rights, or derived data. It would be good for this section to include a definition for these concepts and/or point to other sources that define them such as the Intellectual Property Office.

The Code should encourage public authorities to publish the details of any third-party intellectual property rights that prevent a dataset from being published under the Open Government Licence, as this will help policy makers trace the impact of fundamental datasets in preventing the publication of public sector information as open data. These details should be published as open data, so that the impact can be aggregated across the public sector by policy makers.

It would be helpful for this section to specifically address the issue of Ordnance Survey and Royal Mail data included both directly and as derived data within datasets, as this is the most common source of third-party intellectual property rights within datasets managed by public authorities.

The Code should indicate that the Act states that the public authority has a legal obligation to publish the full, unmodified dataset with a licence that highlights the third-party intellectual property rights within the data and therefore limits its potential for reuse. The Code should also advise that the public authority should remove any data that cannot be openly licensed and publish the resulting dataset as open data, under the Open Government Licence. This would enable those parts of the data that aren’t covered by third-party intellectual property rights to be re-used.

In Paragraph 21, it would be worth highlighting that the ongoing business-as-usual information risk assessment practices carried out by the public authority should already include the assessment of the rights within the data as the public authority needs to ensure that it complies with the licence it has been granted in its own internal use of that data. The assessment of suitability for publication should be part of this assessment.

Comments on Licensing

Paragraph 22 should be reworded to more strongly guide public authorities towards publication of datasets as open data. We suggest:

Public authorities are strongly advised and encouraged to use the UK Open Government Licence to license the datasets they publish. The Act permits public authorities to use other licences, as described in this Code, but their use is strongly discouraged. They should only be used in exceptional circumstances, and the reasons for doing so openly documented.

In Paragraph 23, the UK Open Government Licence should be described as the recommended licensing model for the UK Government.

Paragraphs 24 and 25 should be grouped together to indicate that neither the Non-Commercial Government Licence nor the Charged Licence is a recommended licence. For example, it could say:

  1. In exceptional circumstances, public authorities may publish datasets under one of the following licences:
  • Non-Commercial Government Licence: this was developed to meet circumstances where data cannot be reused for commercial purposes. As with the Open Government Licence, public authorities can link to the Non-Commercial Government Licence on The National Archives website (http://www.nationalarchives.gov.uk/doc/non-commercial-government-licence).

  • Charged Licence (Beta): this is a transactional licence being developed for circumstances where reuse for commercial purposes requires the payment of a fee and/or royalties by the reuser. The licence uses standard licensing terms and forms part of the UK Government Licensing Framework and is available on the National Archives website (http://www.nationalarchives.gov.uk/information-management/government-licensing/charged-licence.htm).

This section should make it clear what “exceptional circumstances” might entail using one of these licences. It should also point out the cost involved in administering transactional licences.

Comments on vi. Costs and fees

In this section, the Code should point out that, given that a dataset is made available in machine-readable form (as any relevant dataset must be, under the Act), the public authority will usually incur no additional cost in making that data available under an open licence. On the contrary, publishing the data under a transactional Charged Licence will be more costly for the public authority, because of the administrative overhead involved in its publication, such as:

  • costs involved in restricting use of data, both legal (eg suing infringers) and technical (eg through digital rights management technologies)
  • costs involved in issuing and renewing licences
  • the cost of legal advice
  • the cost of handling enquiries about the licensing arrangement

The Code should make these costs clear and explain the consequent rationale for the recommendation to use the Open Government Licence.

We would like the Code to require that any public authority that charges for reuse of a dataset publish, as open data, an assessment of the cost/benefit of publishing under a Charged Licence compared to publishing under the Open Government Licence. For data managed by public authorities, this assessment should include the benefits to the wider economy rather than simply the benefits to the authority that manages the dataset.

In Paragraph 32, we see the potential for the Secretary of State to specify that certain classes of dataset must be made available as open data. This power should be invoked if public authorities do not routinely follow the Code and best practice, by allowing their data to be reused as open data. The power should be presented as one that, in extremis, can be used by the Secretary of State to ensure that datasets are published as open data.

Comments on vii. Considering publication of datasets as part of a publication scheme

This section contains some very good guidance for public authorities and we are wholly supportive of its content.

In Paragraph 36, it would be useful if the Code described how to check whether a dataset is available through the UK Government Web Archive (ie go to http://webarchive.nationalarchives.gov.uk/*/{dataset-uri}).
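
As a small illustration of that check, the Python sketch below builds the Web Archive lookup address for a dataset from the pattern given above. The dataset address used is a made-up example, and the code only constructs the address; whether the archive actually holds a copy still has to be confirmed by visiting it.

```python
# Pattern taken from the paragraph above; the dataset address passed in
# below is a made-up example, not a real published dataset.
ARCHIVE_PATTERN = "http://webarchive.nationalarchives.gov.uk/*/{dataset_uri}"


def archive_lookup_url(dataset_uri):
    """Build the Web Archive address that lists captures of dataset_uri."""
    return ARCHIVE_PATTERN.format(dataset_uri=dataset_uri)


lookup = archive_lookup_url("http://example.gov.uk/data/spending.csv")
```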

Paragraph 38 could be rephrased more positively as:

When the authority publishes the dataset under its publication scheme, to maximise the benefit of making the data available, it must (as for responding to a request) provide it in an electronic form that is capable of reuse, where it is reasonably practicable to do so. What is reasonably practicable will again depend on the circumstances in each case and include the same considerations as when dealing with a request, e.g. the cost of doing so, the work involved in doing any conversion and whether any specialist equipment or software is required.

Comments on ix. Complaints

The Code should set out the roles of the Information Commissioner, the Office of Public Sector Information and the Open Data User Group in handling requests and complaints about the availability of datasets. The Open Data Institute would like to see a single point of contact to which the public can address complaints about the licensing and general availability of data and datasets for reuse.

The Open Data Institute would also like to see a mandatory mediation process between public authorities and reusers as a way of resolving disputes, particularly around licensing and charging issues.