open standards for data

This guide describes the key features or processes of an identifier scheme – the tool, its governance (including how identifiers are assigned) and use.

Authors: Leigh Dodds, Libby Young

It describes how these features vary between schemes and how this affects the costs, and therefore the sustainability, of different schemes. It also explores how schemes combine these features in different ways, which affects how open or closed a scheme is and, in turn, its sustainability.

What is an identifier?

Identifiers are part of how we make sense of the world and communicate. They act as labels to help us uniquely identify physical and digital objects and services, and as pointers to information available online or stored in a variety of databases and systems.

Every day we all benefit from identifiers. Passports and driver's licences enable us to access services or cross borders. Website addresses help us to find information online. Barcodes on food, books and medicines help us to make purchases and find information that will help us to stay healthy and safe.

Identifiers are part of our regional, national and global data infrastructure. Data infrastructure consists of data assets, like identifiers, technologies that help us manage and use them, policies that govern how they are used, and the organisations that curate and maintain them.

Identifiers are a key building block for integrating data within organisations, between business partners and across sectors and industries. Use of common, standard identifiers helps to stop fraud and create transparency around government spending. They support international markets and the supply chains that help to keep supermarket shelves stocked. They help to manage and organise the scholarly record and find our favourite music.

This guide can be read in conjunction with the ‘Anatomy of an Identifier Scheme’ from the 2014 paper, focused on technical aspects of identifiers.

Creating identifiers

Creating a shared vocabulary

An identifier scheme includes the design decisions, policies and governance that describe how a specific set of identifiers are assigned and used. For example, a scheme will describe things like the syntax of the identifiers (e.g. whether it is a human-readable label or bar code), how and when they are assigned, and how they are licensed.

An identifier scheme is an agreed shared vocabulary of identifiers used to describe people, places, things or concepts. It is also a holistic term referring to the data standards (authoritative lists like registers), organisations (such as registration agencies) and processes (like assigning identifiers) required to create, implement and maintain identifiers.

Identifier schemes exist because shared vocabularies need to be managed. The complexity of a scheme’s shared vocabulary depends on who uses it and how it is used. Schemes used in a specific way, like GLIDE numbers which identify disasters, are less complex to manage and may even be automated. Schemes used by many people or organisations in many contexts, like PermID which identifies organisations, financial instruments and people, are more complex and require active management beyond automation, while at the same time needing to be machine-readable. Schemes used in more contexts usually need more data to meet different users’ needs. This all impacts a scheme’s running costs.

Building identifiers

Identifiers are the labels or tags assigned to specific people or things through an identifier scheme. The syntax of an identifier can be made up of words, letters, numbers, symbols, or a combination of these. For instance, Reuters Instrument Code or RIC identifiers abbreviate a company’s name and the stock exchange it is listed on. An example is IBE.MC, which identifies Iberdrola listed on the Madrid Stock Exchange.
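As a rough illustration of how syntax can carry meaning, the sketch below splits the IBE.MC example into its two parts. The function and field names are hypothetical, and the rule used here (split at the last dot) is a simplifying assumption based only on the example above, not a description of the full RIC syntax.

```python
from typing import NamedTuple


class ParsedRic(NamedTuple):
    ticker: str         # abbreviated company name, e.g. "IBE"
    exchange_code: str  # stock exchange suffix, e.g. "MC"


def parse_ric(ric: str) -> ParsedRic:
    """Split a RIC-style identifier at its last dot (illustrative only)."""
    ticker, sep, exchange_code = ric.rpartition(".")
    if not sep or not ticker or not exchange_code:
        raise ValueError(f"expected '<ticker>.<exchange>', got {ric!r}")
    return ParsedRic(ticker, exchange_code)


parsed = parse_ric("IBE.MC")
print(parsed.ticker, parsed.exchange_code)
```

Because the identifier encodes context, a person or a program can read something about the object directly from the label, without looking it up in a register.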

The syntax of an identifier depends on how many unique identifiers a scheme needs, whether the scheme needs to be human-readable or machine-readable (or both), and how the scheme is governed and identifiers are used. For instance, Universally Unique Identifiers or UUIDs are 128-bit machine-readable identifiers which are effectively guaranteed to be unique and are cheap to create because the only coordination needed is an algorithm. However, they are not suitable for many schemes because they are random with no encoding system, so they give no context about the objects they identify, are not human-readable, and leave no auditable record of how identifiers were assigned. They also often take up more space than the objects being identified: a UUID is 16 bytes where a data point is often 2, 4 or 8 bytes, and a UUID is too long to label small physical objects, unlike, say, 8-digit barcodes, which are designed to be unique and physically small.
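The size trade-off can be seen with Python's standard `uuid` module. This is a minimal sketch: the 8-digit string is a made-up placeholder rather than a real barcode, and the variable names are illustrative.

```python
import uuid

# A random (version 4) UUID: generated locally, with no registry involved.
identifier = uuid.uuid4()

size_in_bytes = len(identifier.bytes)  # raw binary form
text_form = str(identifier)            # hyphenated hexadecimal form

# Hypothetical 8-digit label, used only to compare lengths.
short_label = "12345678"

print(size_in_bytes)     # 16 bytes, i.e. 128 bits
print(len(text_form))    # 36 characters when written out
print(len(short_label))  # 8 characters
print(text_form)         # carries no information about what it identifies
```

The printed UUID differs on every run, which shows both why no coordination is needed and why the value says nothing about the object it labels.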

Governance

Maintaining identifiers

Any group of people or organisations can work together as a community to set up and maintain an identifier scheme. The community needs a shared understanding of the value of common identifiers (often including a register of the identifiers), and sufficient agreement, funding and resources to manage it. A community can be organised ‘top-down’, as the DUNS community is, or ‘bottom-up’, as the permanent identifier community group for URLs and the MusicBrainz identifier scheme are. A community can also be closed or proprietary, as DUNS is, or open, as the perma-id group is.

In many data ecosystems, the shared vocabulary of an identifier scheme becomes so important that it is, in effect, infrastructure, and dedicated organisations emerge to govern it, often with a focus on openness and independence. For example, the LEI scheme, set up as a result of the global financial crisis, now identifies over 1.7 million entities in over 200 countries and is overseen by the Global LEI Foundation (GLEIF). Alternatively, an existing organisation can take on the governance of essential identifiers. For instance, the Institute of Electrical and Electronics Engineers (IEEE) oversees identifiers for the electronics industry, with a dedicated Registration Authority for this purpose.

The organisation or organisations governing an identifier scheme are broadly either for-profit, non-profit or public sector. But there are many different potential configurations within these categories, and boundaries between them can blur. Examples include:

  • for profit (eg Dun & Bradstreet Inc)
  • for profit and public benefit (eg OpenCorporates and the OpenCorporates Trust)
  • non-profit (eg the Global LEI Foundation and Registration Agencies)
  • non-profit, with registration by a for profit organisation (eg the Object Management Group and Bloomberg LP for OpenFIGI identifiers)
  • non-profit, with for-profit members (eg GS1 identifier schemes)
  • non-profit, with academic members (eg ORCID)
  • public sector (eg UK Companies House)
  • public sector limited liability partnership (eg Geoplace LLP)

At the ODI, we recognise organisations that develop and maintain data infrastructure like identifiers as ‘data institutions’. Data institutions are organisations whose purpose involves stewarding data on behalf of others, often towards public, educational or charitable aims. They do a number of different things in practice - including protecting sensitive data and making it available under restricted conditions (like UK Biobank) or creating open datasets that anyone can access, use and share (like 360Giving). They have an important role to play in steering us towards a future where data is used to drive positive economic, societal and environmental impact.

Lastly, we note that the governance of some identifier schemes has no institutional form, with stakeholders contributing in-kind and/or financial support instead. For instance, GLIDE numbers are managed by a non-profit community of 20 humanitarian and academic institutions led by ReliefWeb, La Red and ADRC. The SameAs service, which identifies co-referent URIs, is run by individuals with technical expertise and access to some financial support.

Identifier assignment

A primary aspect of governing an identifier scheme is managing how identifiers are assigned, which can also mean managing how entities proactively register for identifiers. Hence most identifier schemes have ‘registration agencies’ or ‘registration authorities’, which we look at specifically later in this paper.

Just as data sits on a spectrum from open to closed, identifier assignment sits on a spectrum from centralised to decentralised (or federated). The standards and syntax governing a scheme need central coordination of some sort, but the assignment of identifiers using those standards can be decentralised. Centralising the assignment of identifiers can create bottlenecks if you need to assign lots of identifiers quickly, and it also creates a dependence on a single central entity, whereas decentralised assignment shares the work between organisations.
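The difference between the two ends of this spectrum can be sketched in a few lines. Everything here is hypothetical (the class names, the prefixes and the identifier format) and does not model any real scheme. It assumes one common federated pattern: a central coordinator hands out unique prefixes, and each agency then assigns identifiers under its own prefix without further coordination.

```python
from itertools import count


class CentralRegistry:
    """Centralised: every identifier is issued by a single body."""

    def __init__(self):
        self._next = count(1)

    def assign(self):
        return f"ID-{next(self._next):06d}"


class Agency:
    """Federated: an agency assigns identifiers locally under its prefix."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._next = count(1)

    def assign(self):
        return f"{self.prefix}-{next(self._next):06d}"


class Coordinator:
    """Central coordination is limited to allocating unique prefixes."""

    def __init__(self):
        self._prefixes = set()

    def register_agency(self, prefix):
        if prefix in self._prefixes:
            raise ValueError(f"prefix {prefix!r} is already allocated")
        self._prefixes.add(prefix)
        return Agency(prefix)


central = CentralRegistry()
central_ids = [central.assign(), central.assign()]

coordinator = Coordinator()
agency_a = coordinator.register_agency("AA")
agency_b = coordinator.register_agency("BB")
federated_ids = [agency_a.assign(), agency_b.assign(), agency_a.assign()]

print(central_ids)
print(federated_ids)
```

In the federated case, uniqueness across the whole scheme depends only on prefixes being unique, so agencies can work in parallel and the central body is not a bottleneck for each individual assignment.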

Assignment of identifiers in public sector (eg CAGE), commercial (eg PermID) or community initiatives tends to be more centralised, whereas ISO-governed schemes tend to be federated (eg DOI), often with the international standard being implemented by national agencies with local knowledge and relationships (eg LEI). Assignment can be even more decentralised, for instance to the local or individual level (eg UPRNs or ISRCs), through crowdsourcing (eg DBpedia) or automation (eg UUIDs, which are arguably centralised in a different way, by algorithms).

Using identifiers

Identifier use cases

Every sector, from disaster relief and humanitarianism to shipping, banking and music, uses identifiers in some way to locate, count, track and link objects and ‘metadata’ about them. There are identifiers which organisations must use to be eligible for things like government procurement processes (eg DUNS or CAGE IDs), identifiers which play a key function in business operations (eg BIC or GS1 codes), and identifiers which capture revenue (eg ISRCs or GS1 codes) or enable evaluation (eg DOIs and ORCIDs). These use cases highlight the role of identifiers as infrastructure.

Identifier licensing and access

How an identifier is licensed determines how it can be used. Many identifier schemes have open licences, for instance LEI, PermID, OpenCorporates, OpenFIGI or ROR. Some of these also offer open access registers and/or APIs. Some schemes charge for ‘bulk’ data access or access for commercial use. Some identifiers are proprietary or closed, with users paying for access regardless of the use case. An identifier may be licensed differently to the metadata it connects to. For instance, DOIs are open but the research papers they link to are often behind an academic publisher’s paywall. Many identifier schemes lack clear licensing altogether.

Fitting all the features together

Identifier schemes combine these features of governance, assignment and use in many different ways. For instance, closed governance, centralised assignment and closed identifier data schemes are often designed by for-profit businesses (eg Dun & Bradstreet) or used by schemes which are solely for commercial use (eg BIC codes). But some identifier schemes with closed governance and centralised assignment may choose to make identifiers open (eg PermID).

Some schemes may have open governance and decentralised assignment but closed identifier data, for instance non-profit schemes with for-profit users who locally assign identifiers and exercise database rights (as is the case for many GS1 identifiers). Similar schemes may have open governance, decentralised assignment and open identifier data but closed metadata, for instance DOIs which identify research papers often behind academic publisher paywalls.

Schemes which have open governance, decentralised assignment and both open identifier data and metadata are typically non-profit schemes with a public interest use case (eg LEIs for financial stability) or non-profit users (eg RORs for academia), or schemes that are community-driven (eg MusicBrainz and Discogs, which are driven by music fans, or DBpedia and SameAs, which are driven by volunteer technologists).