With the support of the Patrick J. McGovern Foundation, the ODI investigated how the development and maintenance of global data infrastructure can enable access to data and facilitate collaboration to support research and innovation aimed at addressing pressing global challenges. After an in-depth mapping of challenges in this space, we focused on enabling access to social media data to support public-interest research.

This page presents some of the findings and resources produced during the project and serves as a call to action for people, communities and organisations interested in continuing this research with us.

What is global data infrastructure and why is it important?

Data such as statistics, maps, real-time sensor readings, and experiment results help us to make decisions, build services, gain insight, and develop new scientific theories and innovations. As our economies and societies become ever more reliant on generating value from data, it is becoming increasingly important to build and maintain the vital data infrastructure that makes it possible to effectively collect, manage, use and share this valuable data, and to do so in responsible ways. In this interconnected world, our data infrastructure will need to become increasingly global as well.

One area where the development of global data infrastructure could have a major impact on people, communities and societies is research and innovation. As the sharing of health data during the Covid-19 pandemic made clear, there is enormous potential value in increasing access to data and insights for research and innovation – not just in areas like health, where data helped to track the spread of Covid, but in taking action to address the impacts of climate change, supporting evidence-based policy-making, combating exploitation and the spread of disinformation or harmful content online, and confronting democratic and societal polarisation and fragmentation.

However, as the recent pandemic also made clear, building and maintaining global data infrastructure to increase access to data for research and innovation is complex and challenging. Doing so requires working across geopolitical boundaries, sectors, industries, disciplines, technical standards, and legal regimes. Bringing together these different contexts requires coordinating across a large number of stakeholders, each with their own requirements, goals, and legacy systems. It sometimes involves breaking down silos and often exposes contradictions and competing interests that complicate the development and maintenance of global data infrastructure.

To address the pressing challenges of our time, it is imperative that we understand the best ways to build and maintain global data infrastructure.

What we focused on

In order to drive progress in this area, we set out to investigate how the development and maintenance of global data infrastructure can enable access to data and facilitate collaboration across boundaries to support research aimed at addressing pressing global challenges.

This is a very large topic, so we began by conducting desk research and expert interviews to identify and prioritise challenges and research questions worthy of investigation. See below for some of the topics we identified. Ultimately, we chose to focus on the question of how to enable public-interest researchers to access data currently siloed within private entities. There are many different approaches to enabling access to privately held data, such as supporting private-public partnerships, mandating access through legislation, paying for access, building data institutions to facilitate safe access, and using privacy-enhancing technologies to enable access while protecting sensitive information. We wanted to understand the benefits and limitations of those approaches and identify ways of increasing access where possible.

Specifically, we set out to identify ways of enabling public-interest researchers to access data held by social media companies. The importance of social media companies is demonstrated not just by their reach (Meta had 2.9 billion monthly active users in 2023 and over half of the global population uses some form of social media), but by their impact on everything from political movements and elections to social interactions and the physical and mental health of users. As a result, it is imperative that researchers are able to access data held by social media companies in order to investigate these impacts. This is true not just for academic research, but for journalism, open-source investigations, advocacy, policymaking and regulatory oversight. In past years, the value of accessing this data was demonstrated by the range of important research and findings that it enabled, such as work on the causes and impacts of teacher resignations during Covid-19, electoral disinformation and the prosecution of potential war criminals in international tribunals.

Unfortunately, many social media platforms have recently rolled back access to data that was previously made available to researchers. These include X, Meta and Reddit. As a result, collecting and using this data is increasingly challenging for researchers worldwide. Many are forced to rely on data that is self-reported by platforms, such as transparency reports, or to risk being pursued for breach of contract by collecting data through means that may violate platform terms and conditions (eg web scraping).

This project aimed to identify ways for different types of researchers from different countries and regions to access important data held by social media companies to support public-interest research.

What we investigated

Within the project, we investigated three different aspects of this challenge. First, we sought to understand how different countries across the globe are attempting to enable access to social media data. Recently, governments in the United States, United Kingdom, and Brazil (to name but a few) have introduced legislative proposals to enable researcher access to social media data, but to date, the only policy mandating researcher access to platform data that has actually come into effect is the Digital Services Act (DSA) in the EU. Article 40 of the DSA mandates that providers of prominent social media platforms grant European researchers access to data for research aiming to detect, identify, and understand systemic risks within the EU. But even the DSA has some gaps and uncertainties: there is a lack of clarity regarding the application process that researchers must navigate, what constitutes ‘commercial ties’, which research questions are in or out of scope, what the vetting process should look like and how access will be enforced. We sought to compare and contrast these efforts in different countries in order to understand which emerging approaches can and cannot be transposed to countries with different social contexts, regulatory bodies, and sociotechnical infrastructure.

You can read more about our research on this topic in our short exploratory report, based on interviews with experts worldwide about the challenges of accessing platform data in their respective regions. We believe this will help researchers and policymakers working in different regions identify shared challenges and potentially identify areas for future collaboration.

Second, we sought to understand how public-interest researchers are seeking to access social media data outside of systems mandated by regulations. Not only are any new regulations mandating access to social media data likely to take years to be drafted, debated, ratified and enacted, but even those mandates are likely to leave out many people conducting research in the public interest – eg journalists and open-source investigators who work outside academia and more formal research organisations. We therefore set out to map the different approaches available to researchers who are attempting to gain access to social media data. Our research identified four main ways that researchers can access data about social media platforms: directly from the social media companies; directly from the platform/app; from users of those platforms/apps; and from third parties, which may have originally collected the data via any of the previous three routes.

A typology of different means of accessing data about social media platforms

You can find out more about our efforts to map these approaches in our annotated slide deck which introduces a work-in-progress typology of different access models. This work builds on the ODI Data Access Map. Feedback from presenting this typology at the International AAAI Conference on Web and Social Media suggests that this type of guidance is needed within research circles. The typology can help public-interest researchers understand the range of different ways they can already gain access to certain types of social media data as well as explore approaches that might serve as inspiration for future access initiatives.
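For readers who want to catalogue access initiatives against the four routes described above, the routes can be treated as a small data structure. The following is a minimal sketch only, not part of the ODI typology or slide deck: the names `AccessRoute`, `AccessModel` and `upstream`, and the example entries, are hypothetical and chosen purely for illustration. It assumes that a third party records, where known, which of the other three routes it originally used to collect the data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class AccessRoute(Enum):
    """The four access routes named in the text (identifiers are illustrative)."""
    COMPANY = "directly from the social media company"
    PLATFORM = "directly from the platform/app"
    USERS = "from users of the platform/app"
    THIRD_PARTY = "from a third party"


@dataclass(frozen=True)
class AccessModel:
    """One way of accessing data, classified by route (hypothetical structure)."""
    name: str
    route: AccessRoute
    # Only used for third parties: the route by which the third party
    # originally collected the data, or None if that is not known.
    upstream: Optional[AccessRoute] = None

    def __post_init__(self) -> None:
        if self.route is AccessRoute.THIRD_PARTY:
            if self.upstream is AccessRoute.THIRD_PARTY:
                raise ValueError("upstream must be one of the three direct routes")
        elif self.upstream is not None:
            raise ValueError("upstream only applies to third-party access")

    def original_route(self) -> Optional[AccessRoute]:
        """Return the route by which the data first left the platform, if known."""
        if self.route is AccessRoute.THIRD_PARTY:
            return self.upstream
        return self.route


# Hypothetical entries, for illustration only.
examples = [
    AccessModel("company-run researcher programme", AccessRoute.COMPANY),
    AccessModel("collection of publicly visible posts", AccessRoute.PLATFORM),
    AccessModel("data donation study", AccessRoute.USERS),
    AccessModel("archive of donated data", AccessRoute.THIRD_PARTY, AccessRoute.USERS),
]

for model in examples:
    origin = model.original_route()
    print(f"{model.name}: {model.route.value} (origin: {origin.name if origin else 'unknown'})")
```

Recording the upstream route separately reflects the point made above that third-party data may have been gathered via any of the other three routes, which affects the legal and ethical questions a researcher needs to ask about it.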

Third, our work on the typology of access models helped us identify one major approach to accessing social media data that researchers seem to be increasingly gravitating toward: scraping or crawling ‘public data’ – eg data that is published on news websites and social media platforms. A challenge frequently raised in our initial discussions with researchers and stakeholders was the lack of clarity around what constitutes ‘public data’, and therefore a lack of clarity around what constitutes fair collection and use of that data. They felt that in some ways this lack of clarity is limiting – or intentionally being used to limit – access to important data. The lack of clarity around what researchers are allowed to scrape also potentially leaves them in a legal and ethical grey area when collecting or accessing this type of data. This can lead to a chilling effect, deterring important public-interest research due to fears of costly litigation. To help address some of these challenges, we worked to develop a proof-of-concept Delphi survey that could serve as the foundation for future consensus-generation exercises. In order to continue gathering viewpoints and evidence, the survey will remain open for the time being. Please feel free to submit a response and share it with your communities if interested.

You can learn more about our development of the Delphi survey in our short project write-up. We aim to conduct further consensus-generation exercises with partners across the globe, with the long-term goal of helping different stakeholders (eg different types of researchers, regulators, policymakers, digital platforms/publishers, industry bodies and customers/users) begin to agree on what should and shouldn’t be considered ‘public data’ and establish early-stage guidelines for ethical use.

What we produced

Written outputs

Conference sessions and presentations

  • Panel proposal selected for UN World Data Forum 2024: ‘Global perspectives on enabling access to platform data for public-interest research’ (scheduled for November 2024)
  • Co-delivered a tutorial at the International AAAI Conference on Web and Social Media
    • Tutorial web page: ‘Scraping Reddit the Right Way: A Guide to Legal and Ethical Data Collection with RedditHarbor’
    • Tutorial slide deck: ‘Tutorial at the International AAAI Conference on Web and Social Media’
  • Participated in the ‘Right To Research’ panel at the Computers, Privacy and Data Protection conference
  • Presented an examination of global data sharing trends in health at the AI Executive Program: Digital Healthcare conference
  • Contributed to the co-design of global data sharing systems at ‘Leveraging EUDR as an opportunity to build more inclusive and sustainable supply chains’
  • Submitted a proposal for a panel at Internet Governance Forum 2024 comparing data governance models across the globe (pending)

What we plan to do next

We see multiple avenues for this work to continue. First, the positive feedback we received on our public data survey suggests a strong desire amongst researchers for further clarity on how to ethically and legally collect and use public data. Work in this area could have a long-term goal of generating consensus amongst a wide range of different stakeholders, eg platform representatives, researchers and policymakers. However, given that our initial research suggests that different research communities hold differing views about public data and its legal and ethical use, an initial phase of further work could focus on generating consensus amongst researchers first. Ultimately, by generating consensus, the research community may be able to come together to more effectively petition platforms and/or policymakers for clearer access protocols and licensing with regard to public data.

Finally, we see the need for research into other pressing global challenges that were out of scope for this project, including:

  1. How to enable researchers to access data currently held in disparate regions and jurisdictions by a range of different actors. For instance, through developing networks and infrastructure to enable natural history museums to connect their collections to facilitate climate research; clarifying the provenance and lineage of important datasets used in research and AI training; or facilitating the sharing of datasets produced through academic research experiments with researchers working in other fields and other regions.
  2. How to enable researchers to connect, share findings and trace impact across academia and the private, public and third sectors. For instance, through supporting the citation of ‘grey literature’ within academic research and supporting the development of infrastructure such as identifiers, concept schemas and search platforms that are interoperable across diverse research sectors.
  3. How to break down silos created when research findings and data are held behind paywalls. For instance, through supporting open science and open access movements or initiatives aimed at increasing transparency around licensing or developing free alternatives to paywalled research.
  4. How to help fill in gaps in important research datasets to ensure that people and communities across the world are able to benefit from research and innovation. For instance, by supporting projects aimed at ensuring that less-used languages are included in the datasets used to train natural language models, or increasing the representativeness and use of spatial demographic datasets.