We explain why the decision about which licence to use is one of the most important steps in publishing a dataset.
Increasingly, government and commercial organisations are interested in sharing their data as "open data". For data to be open it must be published under an open licence. A clearly indicated licence sets out what people can and can’t do with the data, enabling use, adaptation and redistribution by anyone or any organisation.
While lots of licences aim to be "open", the terms they include may mean that the data falls short of being "open", or can make reuse of the data very difficult. This guide explains some common licence terms publishers use that may not fit the definition of "open data" or which make open reuse practically impossible.
In order to make an informed decision, publishers must have a clear understanding of what they hope to achieve by opening up their data, the ways they hope it will be used, and the types of users they wish to engage.
The ODI Publisher’s Guide to Open Data Licensing outlines the key considerations that inform licensing decisions, including how to select between the existing standard open licences.
The Open Definition provides an agreed standard definition of open works and open licences. It clearly describes the licensing terms that must be included in an open licence. There are still many datasets published under licences that do not meet the terms of this definition. Publishers should have an awareness of the impact of specific licensing terms on whether their data is open and whether it can be easily reused by others.
The Open Definition differentiates between required and acceptable terms that may be included in licences applied to open works and datasets:
- Required terms are licensing permissions that must be present in order for a licence to be considered open.
- Acceptable terms are those which may be included in a licence, so long as they don’t otherwise conflict with the basic definitions of openness. These terms typically restrict the forms that reuse of the data can take.
For example, the abilities to use and redistribute data are required provisions, while attribution and sharealike provisions are acceptable.
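The distinction between required and acceptable terms can be sketched as a simple check. This is a minimal illustration only, not an implementation of the Open Definition: the term labels, the `Licence` class and the `is_open` helper are hypothetical names chosen for this example, and the three sample licences are simplified stand-ins rather than real licence texts.

```python
from dataclasses import dataclass

# Hypothetical labels for this sketch. The Open Definition itself is the
# authoritative list of required permissions and acceptable conditions.
REQUIRED_PERMISSIONS = frozenset({"use", "adapt", "redistribute"})
ACCEPTABLE_CONDITIONS = frozenset({"attribution", "sharealike"})


@dataclass(frozen=True)
class Licence:
    name: str
    permissions: frozenset
    conditions: frozenset = frozenset()


def is_open(licence):
    """True if all required permissions are granted and no condition
    falls outside the set this sketch treats as acceptable."""
    grants_required = REQUIRED_PERMISSIONS <= licence.permissions
    only_acceptable = licence.conditions <= ACCEPTABLE_CONDITIONS
    return grants_required and only_acceptable


# Illustrative licences, not real licence texts.
attribution_only = Licence(
    "attribution-only",
    permissions=frozenset({"use", "adapt", "redistribute"}),
    conditions=frozenset({"attribution"}),
)
no_derivatives = Licence(
    "no-derivatives",
    permissions=frozenset({"use", "redistribute"}),  # adaptation withheld
    conditions=frozenset({"attribution"}),
)
non_commercial = Licence(
    "non-commercial",
    permissions=frozenset({"use", "adapt", "redistribute"}),
    conditions=frozenset({"attribution", "non-commercial"}),
)

for licence in (attribution_only, no_derivatives, non_commercial):
    print(licence.name, "open" if is_open(licence) else "not open")
```

Under these assumptions only the attribution-only licence passes: the other two either withhold a required permission or add a condition that is not on the acceptable list.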
This document provides guidance for open data publishers on some of the issues that both they and consumers of their data will face if data is published under a licence, or via other terms and conditions, that include restrictive terms.
The guidance should be useful to publishers who are considering opening up their first dataset, and those who may want to understand why their data isn't being used as widely as possible.
Each of the following sections discusses the impacts of a different licensing term or method of granting rights to reusers.
No derivatives
A "no-derivatives" clause on a licence means that a reuser of the data cannot redistribute any changes that they make to a dataset. This provision is often added to ensure that a redistributed dataset is not materially different from its original source, addressing concerns that data may be misrepresented or distorted.
However, a publisher would need to enforce this clause for it to have real weight, requiring time and effort to police reuse. Concerns over misrepresentation are adequately addressed by the "non-endorsement" clauses common to many open licences: these prohibit reusers from suggesting that derived data or analysis has any official status and clarify that the data publisher does not endorse the actions of its reusers.
More seriously, no-derivatives clauses will stop reusers from performing a number of useful activities and then making the results available to others, such as:
- fixing errors in the data
- enriching a dataset with additional data, eg data that they are contributing themselves as additional open data
- mixing or mashing up the dataset with other sources in order to create a more useful dataset, eg by adding extra columns or identifiers
The ability to take advantage of this type of activity is a key benefit of publishing open data: reusers are able to transform and mash up a dataset without requiring the original publisher to invest effort in supporting every possible format or linking their data with lots of different sources.
Discouraging network effects around their data is not in the best interests of data publishers.
Non-commercial
A non-commercial provision on a licence requires that the user of a dataset may only use the dataset for "non-commercial" purposes. This provision is often added by publishers seeking to restrict the ability of others to exploit the publisher’s investment in data by creating business models or products around a dataset.
A non-commercial provision is problematic primarily because of a lack of clarity around what constitutes "commercial" usage. The Creative Commons licences themselves do not offer a clear definition, with the result that the meaning may vary depending on the jurisdiction within which the licence is being used or contested. Definitions of commercial usage often refer to the gaining of "commercial advantage" or financial remuneration, without regard to the moral goals of the reuse.
For example, academic research that is partially supported by private funding, even where undertaken without the intention to exploit any discoveries commercially, may not be considered "non-commercial". Without a clear definition these reusers of the data cannot be sure that their intended use is compliant with the terms of a licence.
Some or all of the following may be considered commercial usage:
- A free-to-access not-for-profit application or service built using the data, but which uses advertising revenue or sponsorship to help cover operational costs
- A company using the data in a demonstration or showcase of the features for a commercial application or tool
- A company that uses the data to provide a mixture of for-profit and not-for-profit services to customers
- A company using the data to provide a useful paid-for service to not-for-profit organisations, researchers or government
- A research organisation using a dataset in a product that may be later commercialised
- Use of a dataset, eg embedding or visualisation, on a personal blog whose free hosting is covered by advertising revenue, such as on WordPress.com
- Use of a dataset by an organisation, eg a university, that draws revenues from commercial services, including training
- Reporting of key figures from a dataset in a print or online newspaper
- Use of a dataset by a company in a way that isn’t tied to revenue generation, eg to improve internal reporting
- Use of a dataset by a commercial organisation to enrich other data that may then be published openly
- Use of a dataset by a charity or other non-research organisation that has a commercial arm
As an example of the confusion that relates to non-commercial provisions, a ruling in a 2014 court case in Germany found that a non-commercial provision meant that content (and data) could only be used for personal use. As a result, in Germany a non-commercial provision might limit all organisational uses of a dataset.
Data licensed with a non-commercial provision cannot legitimately be mixed with datasets licensed with certain other requirements. As noted in the next section, it is not possible to combine a dataset with a non-commercial provision with one that uses a sharealike requirement (unless that licence also contains a non-commercial restriction).
A non-commercial provision is also "viral" in the sense that derived datasets will automatically inherit the non-commercial clause.
Ultimately, as with no-derivatives clauses, non-commercial clauses greatly limit the ability of reusers to remix, enrich and share a dataset.
Sharealike
A "sharealike" provision on a licence requires that anyone creating a derived dataset must release that new dataset under exactly the same or similar terms as the original. This provision is often added to try to force others to also release their data under an open licence.
A downside to sharealike provisions is that they can actually hinder open data adoption as they can result in data being less open than intended. For example, if a dataset is used to create an analysis that results in a new dataset, the analyst cannot place their results into the public domain or choose only attribution: they must use the original licence terms. The user is restricted in how they can share the results of their work.
Licences that use both sharealike terms and other restrictive terms, eg non-commercial, impose further limitations on reusers. For example, it is not possible to legally create a derived dataset that mixes a dataset published under the CC-BY-SA licence with one published under the CC-BY-NC-SA licence. The licences are fundamentally incompatible.
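The incompatibility can be illustrated with a deliberately simplified model. The sketch below assumes that a derived dataset inherits every condition from every source, and that a sharealike source insists on its own terms with nothing added. The names `BY`, `BY_SA`, `BY_NC`, `BY_NC_SA` and `combined_conditions` are hypothetical, and real compatibility depends on the actual licence texts, versions and jurisdiction, so this is not legal guidance.

```python
# Each licence is modelled only as a set of conditions (hypothetical labels).
BY = frozenset({"attribution"})
BY_SA = frozenset({"attribution", "sharealike"})
BY_NC = frozenset({"attribution", "non-commercial"})
BY_NC_SA = frozenset({"attribution", "non-commercial", "sharealike"})


def combined_conditions(*sources):
    """Return the conditions a derived dataset would carry, or None if the
    sources cannot be combined under this simplified model."""
    # The derived dataset inherits every condition from every source.
    combined = frozenset().union(*sources)
    for conditions in sources:
        # A sharealike source demands its own terms, with nothing added.
        if "sharealike" in conditions and conditions != combined:
            return None
    return combined


print(combined_conditions(BY_SA, BY_NC_SA))  # sharealike vs sharealike + NC
print(combined_conditions(BY_NC, BY_NC_SA))  # both carry the NC restriction
```

In this model the first combination fails because the non-commercial restriction would have to be added to the attribution-sharealike data, while the second succeeds because both sources already carry it.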
Attribution
Attribution terms in a licence are an acceptable part of publishing open data. They require that users of the data acknowledge the source of the data, typically by providing an agreed attribution statement and a relevant link in their application or service.
While attribution requirements are not usually onerous, there are occasions where attribution can restrict the freedom of data users:
- The attribution requirements may be undocumented or unclear, leaving users uncertain how to comply with the licensing terms
- The requirements may be very prescriptive, eg requiring that users display the attribution statement in a specific font size or display a logo, which may be difficult to comply with across all devices (including mobile and embedded devices) and products that use the data
- When an application uses many different datasets, the list of attribution statements may be very long, so displaying them all, eg on every page or screen in an application, isn't practical
Attribution norms often differ between communities. Within the scientific community data attribution is often specified as a formal citation, as used in an academic paper, that lists all of the authors or contributors to the data. In other communities only the name of the publisher and a link to the dataset are required.
While we encourage attribution as a norm for open data reusers, publishers are encouraged to be both clear and flexible around their requirements.
For example, the UK Open Government Licence was recently updated to include a default attribution statement and additional guidance for reusers publishing attribution statements for multiple datasets. The ODI guide to publishing machine-readable rights statements also includes some suggestions on designing attribution requirements.
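Where a publisher's requirements are flexible enough to allow it, a reuser working with many datasets might gather the attribution statements into a single credits section rather than repeating them on every screen. The sketch below shows one possible approach. The `Attribution` class, the `credits_section` helper, the publisher names and the example.org links are all hypothetical, and it assumes that each publisher's terms permit a consolidated credits list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribution:
    statement: str  # the publisher's agreed attribution statement
    link: str       # a link back to the dataset or publisher


def credits_section(attributions):
    """Render one plain-text credits section, listing each distinct
    attribution once, in the order it was first seen."""
    distinct = dict.fromkeys(attributions)  # keeps order, drops duplicates
    lines = ["Data sources"]
    lines += [f"- {a.statement} ({a.link})" for a in distinct]
    return "\n".join(lines)


# Hypothetical publishers and links, for illustration only.
used = [
    Attribution("Contains data from Example Transport Authority",
                "https://example.org/transport"),
    Attribution("Contains data from Example Mapping Agency",
                "https://example.org/maps"),
    Attribution("Contains data from Example Transport Authority",
                "https://example.org/transport"),
]
print(credits_section(used))
```

Because the attribution records are immutable and hashable, the same source used in several places is listed only once.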
API terms and conditions
Publishers providing APIs that allow developers to query a dataset may consider adopting an open approach by amending their API to provide a "free tier" of usage. Free tiers typically allow anyone to use an API to make a fixed number of queries, eg per day or per hour. Higher usage tiers often require commercial agreements to be in place. This method of access is potentially a component of an open data business model.
API terms and conditions are typically bespoke and lengthy. While there are a number of off-the-shelf open licences, there are no off-the-shelf API terms. Developers signing up to use a free access tier in an API are obliged to understand how the specific terms and conditions will impact their intended use of the data.
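A free tier of this kind can be sketched as a fixed daily quota per API key. This is a minimal in-memory illustration under stated assumptions: the `FreeTierQuota` class and its `allow` method are hypothetical names, and a production API would normally keep its counts in shared storage and handle concurrent requests.

```python
from collections import defaultdict
from datetime import date


class FreeTierQuota:
    """Allow each API key a fixed number of queries per calendar day."""

    def __init__(self, limit_per_day):
        self.limit_per_day = limit_per_day
        self._used = defaultdict(int)  # (api_key, day) -> queries made

    def allow(self, api_key, day):
        """Record one query and return True, or return False if the key
        has already used its allowance for that day."""
        key = (api_key, day)
        if self._used[key] >= self.limit_per_day:
            return False
        self._used[key] += 1
        return True


quota = FreeTierQuota(limit_per_day=2)
first_day = date(2024, 1, 1)
second_day = date(2024, 1, 2)
results = [quota.allow("demo-key", first_day) for _ in range(3)]
results.append(quota.allow("demo-key", second_day))
print(results)
```

The third query on the first day is refused, and the allowance resets on the following day because the count is keyed by date.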
Ideally the terms and conditions for an API would identify the data licence for the delivered data separately from the terms of service. Whatever tier of usage, whether free or paid, the retrieved data would then still be covered by an open licence.
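One way to make that separation visible to developers is to state the data licence and the terms of service as distinct fields in every API response. The sketch below is an assumption about how this might look, not a description of any existing API: the `wrap_response` helper, the field names and the example.org URLs are all hypothetical.

```python
import json

# Hypothetical URLs, for illustration only.
DATA_LICENCE_URL = "https://example.org/licences/open-data-licence"
TERMS_OF_SERVICE_URL = "https://example.org/api/terms"


def wrap_response(records, tier):
    """Wrap query results with the data licence stated separately from
    the terms that govern use of the API itself."""
    return {
        "data": records,
        "data_licence": DATA_LICENCE_URL,          # same for every tier
        "terms_of_service": TERMS_OF_SERVICE_URL,  # covers API access only
        "tier": tier,
    }


records = [{"station": "Example Central", "platforms": 4}]
free = wrap_response(records, tier="free")
paid = wrap_response(records, tier="paid")
print(json.dumps(free, indent=2))
```

Here the free and paid responses differ only in the `tier` field, so the retrieved data carries the same open licence whichever tier delivered it.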
If the API places restrictions on how the data is used such that it doesn't conform to the Open Definition then data exposed by the API cannot be considered to be open.
Some types of data, in particular data that is updated in real time or near-real time, can only be published effectively via an API or feed. However, "static" datasets that are only accessible via an API, and not available as bulk downloads, are unlikely to be compliant with the Open Definition.
Very many thanks to Owen Boswarva and the anonymous commenters who contributed comments on this draft.