At the ODI, our mission is to ‘make data work for everyone’. But what does making AI data work for everyone look like in practice? We share our experiences and reflections from a year of our data-centric AI programme, and our plans and expectations for the year to come.
This year we have been ‘doing the data work’ in AI – a famously sidelined task – and we have had the pleasure of meeting and collaborating with many people who understand its importance. Data-centric AI has historically been framed as a largely technical concept. However, we know from over a decade of experience that the data behind our societies’ systems requires thoughtful attention and collaborative work. This work needs a wide range of professionals and communities to develop policies, resources and, ultimately, data infrastructure. That need is especially clear for the data foundational to AI systems, given how strongly it shapes the real-world impacts of AI.
What we’ve done
The first step in launching the programme in November 2023 was to diagnose the problems we needed to address in the world of data and AI. We developed and shaped our programme around a roadmap for what we think needs to happen: to make data AI-ready, to make AI data accessible and usable, and to make AI systems use data responsibly. We then set about researching, designing and building some of the approaches that will help make these a reality. This work includes:
- Producing a literature review to understand the current landscape of research on the role of data in AI systems.
- Developing an understanding of the transparency of AI data needed by different users and launching the first iteration of our AI Data Transparency Index.
- Researching how data is represented in AI policies across the world and sharing recommendations to address emerging gaps and challenges.
- Establishing how generative AI tools can play a role in data discovery and use.
- Researching the current and potential role of government as an AI data provider.
- Working together with MLCommons on a metadata standard for making machine learning datasets interoperable and usable in responsible and trustworthy ways across different contexts.
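Standards like the one in the last point typically express dataset descriptions as structured, machine-readable records. As a rough illustration only – the field layout loosely follows the schema.org vocabulary that MLCommons’ Croissant format builds on, and the dataset and helper function are invented – such a record might look like:

```python
# A minimal, illustrative dataset-metadata record in the schema.org-flavoured
# JSON-LD style used by machine learning metadata formats such as MLCommons'
# Croissant. The dataset and all of its values here are hypothetical.
dataset_metadata = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-sentiment-corpus",
    "description": "A hypothetical corpus of labelled product reviews.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "creator": {"@type": "Organization", "name": "Example Org"},
    "distribution": [
        {
            "@type": "FileObject",
            "name": "reviews.csv",
            "encodingFormat": "text/csv",
        }
    ],
}


def required_fields_present(record, fields=("name", "license", "creator")):
    """Check that a metadata record carries the basic fields a reuser needs."""
    return all(field in record for field in fields)


print(required_fields_present(dataset_metadata))  # prints True
```

Because records like this use shared vocabularies, the same dataset description can be read by catalogues, search tools and training pipelines alike – which is what makes datasets interoperable across contexts.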
Through this work, we also learned more about the enablers and key elements for a data-centric approach to AI in action.
A second priority was to bring together diverse audiences to collaborate on tackling our shared challenges, all of which require cross-sector input. This process has helped us to build a better understanding of the evolving AI data landscape, and of what intervention should look like as we look ahead to the new year. It led us to collate our findings in outputs including:
- Our data-centric AI white paper, based around five main policy interventions.
- Input to national and international policy plans, including the UK’s AI Action Plan, and sharing our vision for the emerging National Data Library.
- A ‘data for AI taxonomy’ to help people define and understand the different types of data used in AI systems and the roles they play.
- Data-centric AI webinars (which we are excited to continue into the new year).
- An expert event to discuss and challenge perspectives.
What we’ve seen and what needs to happen
We have seen rising attention to the importance of data in AI systems. Data issues have been a central theme at AI events and conferences across sectors, in headlines about data deals struck by major AI companies, and in the growing emergence of policy and practical solutions.
There is both a greater need, and a growing capability, to research the state of the AI data ecosystem, enabling us to build more ambitious and targeted remedies. For example, AI developers increasingly relay information about their work on platforms like Hugging Face, through the sharing of model cards and data cards, and emerging policies – such as the mandating of the UK government’s Algorithmic Transparency Recording Standard in the public sector – add to this. Together this creates better evidence of AI development work, which opens up new possibilities for research: understanding what data supply chains do and could look like, who is participating and represented in the ecosystems that underlie our AI data, and where intervention is already taken up or still needed.
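Much of this evidence is already machine-readable: Hugging Face model and data cards, for instance, open with a YAML-style front-matter block declaring fields such as the license and training datasets. A minimal sketch of how a researcher might extract those declarations at scale – the card text below is invented, and the parser handles only the simple flat fields and lists shown:

```python
# Hugging Face model cards are markdown files that begin with a front-matter
# block (between "---" lines) declaring fields such as `license` and
# `datasets`. This sketch extracts those fields using only simple string
# handling; it covers flat "key: value" pairs and "- item" lists, not full YAML.
def parse_front_matter(card_text):
    """Return front-matter key/value pairs from a model card, or {} if absent."""
    lines = card_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta, current_key = {}, None
    for line in lines[1:]:
        if line.strip() == "---":          # end of the front-matter block
            break
        if line.startswith("- ") and current_key:
            meta.setdefault(current_key, []).append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current_key = key.strip()
            if value.strip():
                meta[current_key] = value.strip()
    return meta


card = """---
license: apache-2.0
datasets:
- example-corpus-v1
---
# A hypothetical model
"""

print(parse_front_matter(card))
# prints {'license': 'apache-2.0', 'datasets': ['example-corpus-v1']}
```

Run over many cards, declarations like these begin to map which datasets underpin which models – exactly the kind of supply-chain evidence the research above depends on.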
We see important work on building, documenting, maintaining and improving critical AI datasets. This can and should be a community effort, built on governance models that empower communities. Incorporating diverse perspectives into these processes helps to surface biases, limitations and other areas for improvement. Done well, and without being extractive, this work takes time. There are also many issues to address so that everyone can access and benefit from the technology – for example, ensuring that datasets for under-represented languages are curated to support the development of AI.
Meanwhile, the demand for high-quality data to keep producing better-performing AI models across different applications is forcing changes in approaches to data access and data stewardship. With model providers seeking to purchase data from the companies that hold it, arrangements such as data agreements and new stewardship models are already starting to emerge. Barring the unlikely arrival of a technological breakthrough that makes AI systems drastically more data-efficient, data access models will need to change radically if companies want to keep seeing ‘AI progress’ at a similar rate (and if new solutions are to be equitable and responsible). Alternatively, we will need to think carefully and differently about what AI progress should look like.
Our plans for 2025
Beyond an ambition to address some of the above demands, we want to share a few more specific plans for the new year. Now is a good time to create real traction in transparency and accountability across AI ecosystems. Recent years have seen momentum in the development and adoption of different transparency mechanisms, from shared ways of classifying and documenting ‘AI incidents’ to machine-readable data and model documentation approaches and beyond. There also seems to be increased motivation from those developing AI systems to respond – so this is a moment when new methods of assessment, solutions and analyses can have real-world impact. We plan to build on our existing research exploring user-centric approaches to transparency, extending the approach to other user types. We look forward to discussing this research at a tutorial at a top AI conference in February and a webinar in the same month. We’d love to collaborate with partners and prospective fellows who are interested in this challenge.
Another area of focus will be what making data ‘AI-ready’ means across different contexts and for different data holders, something we have already started to explore in the context of the National Data Library. Meanwhile, we are also looking to develop further research into data workers in the AI supply chain.
Conclusion
With some of the dust settling on ‘AI hype’ and the emergence of new and important models and tools, we need to be ready to fully turn back to the nuts and bolts of what really makes AI systems work for us. A lot of this work is in carefully building the data ecosystems underlying them, while keeping an eye on the horizon as things change.
Thanks to everyone who has been along for the ride with us this year. We’re proud of the progress we have made and always looking for collaborators and funding to help us keep moving forward.
As always, if you’d like to learn more about our work, or chat about a potential collaboration please get in touch at [email protected].