The volume and variety of data on the web are growing enormously, and with them so-called ‘big data’ technologies and services. However, we are not yet able to use all data to its full potential. This can be attributed, at least in part, to the numerous barriers people encounter when trying to find and use data.
Data is typically grouped into datasets, many of which are published on the web. However, these are scattered across data catalogues hosted on many different data portals.
No centralised way of searching for data has yet been developed that would compensate for the siloed nature of data portals. The structure of data catalogues varies: searching within them depends on inconsistent metadata, and searching within the data itself is rarely, if ever, possible. Regional and national data catalogues are hardly ever comprehensive in their coverage, often focusing on government and public sector data at the expense of commercial data, third-sector data and the ‘long tail’ of data fragments published on the web. It is difficult, therefore, to get an overview of where data exists and what it can be used for.
Finding and selecting data can be a complicated, tedious and very inconsistent process. In the course of my research, I will map the specific problems in this area, keeping the whole process of interacting with data – from a user’s perspective – in mind.
When trying to narrow down the requirements for using web data, the first barrier is identifying whether relevant data exists and, if so, where to find it. Therefore, the initial focus of this project is on how people discover data, be that entire datasets or individual data points within a dataset. This starts by thinking about the type of information people might want when trying to find data.
What types of questions do we ask of data? What types of answers might we want (if we were not limited by technical barriers)? How does this differ by individual, by task or by context?
How do different people search for data on the web?
I will develop a number of personas, representing typical types of users of web data, to illustrate the different approaches they may take to data discovery, and how they may be better supported.
One example of a search scenario in which data sources could provide very useful results is ‘exploratory search’ – a search process in which the information need, and therefore the specific data requirements, might not be completely predefined. For example, a person might not know exactly what to search for because they are not familiar enough with the topic. This type of search is typically open-ended and involves learning through comparing, synthesising and evaluating results.
Once an adequate dataset is found, it needs to be accessed in order to be used. Access itself can be a question of data openness, the provision of certain formats, or an understanding of the connected licences. Only once a dataset is accessible can people start to think about how to find meaning in the data itself, though they will often face issues such as poor or inconsistent data quality, misleading naming conventions, inaccurate metadata content or datasets that are not updated.
To understand the breadth of issues associated with finding, selecting and using data, I will take an interdisciplinary approach. This will involve drawing on literature and techniques from various academic disciplines, such as Cognitive Science, Psychology, Computer Science, Information Science, Human-Computer Interaction and Information Retrieval.
This project is not likely to provide a perfect solution that supports data discovery for all types of users and search scenarios, but it will contribute to solving the associated issues, with a focus on the user side of the system. This can be an important step in the development of more effective question answering with web data.
Laura Koesten is a PhD student at the ODI. Laura’s PhD research project is part of ‘Answering questions using web data’ (WDAqua ITN), a collaborative project that aims to develop question answering services for the public and private sector, using data available on the web. It is part of the European Commission’s H2020 programme for research and innovation.