Aim to be boring: lessons for data infrastructure
“Anything you think is infrastructure probably isn’t - infrastructure that is useful is invisible”
We discussed data infrastructure, the lessons learned from building academic data infrastructure through organisations such as CrossRef, ORCID and DataCite, and whether data infrastructure should aim to be boring and invisible. This last part is key as we need to articulate how important data infrastructure is, what it is and why we are arguing for greater investment and focus on it.
Talking to Geoff certainly helped me understand the challenges that strong data infrastructure is helping to solve for academia and how the same need exists in other sectors.
When our roads, railway and energy infrastructures were being built we learnt how to maintain them for the good of everyone. We need to learn the same lesson for our data infrastructure.
What are CrossRef, ORCID and DataCite?
Geoff described CrossRef’s mission as fighting “linkrot” and breaking down barriers to encourage reuse of scholarly literature or, to put it another way, to maximise use of data.
To some people a broken link on a website might just be an annoyance but in the academic world it can be an existential issue: being able to authoritatively refer to historic evidence, research and thinking is a key part of the scientific method. To overcome this problem you need linked identifiers.
CrossRef makes reference linking throughout online scholarly literature efficient and reliable. The organisation is funded by the publishers of academic research, who pay a membership fee based on the size of their publishing revenues.
CrossRef’s citation-linking network covers over 75 million journal articles and other content items. The network doesn’t contain the full text; instead it contains metadata sufficient to describe and link to the content stored by the publishers.
ORCID provides a registry of unique researcher identifiers and a transparent method of linking research activities and outputs to these identifiers. Recently ORCID issued its 1.5 millionth identifier.
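To make the idea of linked identifiers concrete, here is a minimal sketch in Python. It is not how CrossRef is implemented: the `Registry` class, the identifier `10.0000/example` and the URLs are all made up for illustration. It only shows why a citation that stores an identifier, rather than a web address, survives the content being moved.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Metadata describing one content item; the full text stays with the publisher."""
    title: str
    url: str  # current location of the content on the publisher's site


class Registry:
    """Hypothetical in-memory stand-in for a central identifier registry."""

    def __init__(self):
        self._records = {}

    def register(self, identifier, title, url):
        self._records[identifier] = Record(title, url)

    def update_url(self, identifier, new_url):
        # The publisher moves the content and updates one record;
        # every citation that stores the identifier keeps working.
        self._records[identifier].url = new_url

    def resolve(self, identifier):
        return self._records[identifier].url


registry = Registry()
registry.register(
    "10.0000/example",  # made-up identifier, for illustration only
    "An example article",
    "https://publisher.example/old/article",
)

citation = "10.0000/example"  # what a citing paper stores instead of a URL

# The publisher reorganises its website and tells the registry.
registry.update_url("10.0000/example", "https://publisher.example/new/article")

print(registry.resolve(citation))
```

The citing paper never changes. Only the single record held centrally is updated, which is the part of the system that has to be maintained for everyone else.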
DataCite’s purpose is to develop and support methods to locate, identify and cite data and other research objects.
They all contribute to building a strong data infrastructure for academic research.
Distributed networks benefit from some centralised governance
The CrossRef team came to an early realisation that their data infrastructure needed to be managed centrally, even if the publication of the data was distributed.
As Geoff puts it “distributed begets centralised”. To paraphrase his argument: every distributed network has a centralised component that has a goal of making the rest of the network successful.
This led Geoff to the conclusion that strong and trusted governance needed to be built in from the outset.
As other parts of the scholarly data infrastructure, such as ORCID and the Open Access movement, are now emerging, Geoff and his colleagues Jennifer Lin and Cameron Neylon have published a paper titled ‘Principles for Open Scholarly Infrastructure’ to ensure that the debate over governance and principles is public and open.
This paper is a close companion to the Open Data Institute’s first piece, which asks: who owns our data infrastructure? Both of the papers leave out a few questions: what is infrastructure? Who is the community? And who funds a data infrastructure?
Emu Analytics said in response to our initial paper “infrastructure almost by definition requires a leap of faith”.
The academic world is starting to take this leap but there is clearly still an unwillingness to invest in infrastructure: people don’t want to take money out of research; they don’t like open ended financial commitments; and infrastructure bores them. Infrastructure often gets built in a silo as a side output from a project.
This is out of step with the evolution of academic research, which is increasingly interdisciplinary, international and collaborative. Meanwhile, the infrastructure built around it confines research to institutions, countries, and closed networks. Other sectors are experiencing similar change and face the same infrastructure challenge.
The funding challenge is even larger when you consider that for a governance organisation, sustainability is more than covering the cost of day-to-day activities. We need to generate a surplus so that the data infrastructure can respond to new challenges and changing user needs.
Organisations that maintain data infrastructure need to build a good funding model and avoid funding models that lead to failure in either of these types of activity. For example, a regular 3-year funding cycle that distracts the organisation from serving the community in year 3, whilst raising concerns about whether the organisation will even exist in year 4, is neither helpful nor useful. Building a funding model that mixes different revenue streams (for example membership revenues and grants) can help avoid such problems.
The strategy is to be boring
We discussed how organisations have typically chosen one of two strategies to secure funding:
- make infrastructure sound exciting and focus on the eye-catching services
- make infrastructure sound boring whilst stressing its importance
One theory is that the first strategy is bound to fail. Infrastructure is fundamentally boring. It requires investment over the long-term and needs to scale before it can provide the benefits of the eye-catching services. Funders will lose interest if they are sold a “quick win” when the benefits don’t get realised for many years.
If we accept this theory then we should be pitching good data infrastructure as being boring and invisible. We don’t talk about our roads unless they have potholes, or our water supply unless it is contaminated. We have learnt that roads and water need to be maintained and over the years we have discovered good ways to do that.
As a community of data users and publishers we need to create sustainable long-term funding models to ensure that at local, national and global levels and across multiple sectors we build data infrastructure that gets data as widely used as possible.
We want to know what you think. Do you think infrastructure should be boring and invisible? How would you approach funding for data infrastructure?
 In the academic world this would be referenced using CrossRef as: Bilder G, Lin J, Neylon C (2015) Principles for Open Scholarly Infrastructure-v1, retrieved 2015-08-17, http://dx.doi.org/10.6084/m9.figshare.1314859