The WDAqua project (Answering Questions using Web Data) is a collaborative research project across different European countries, researching how to turn web data into successful services. It is part of the European Commission’s H2020 programme for research and innovation.
Two PhD students from WDAqua, Laura and Emilia, are based at the ODI's offices, and are supervised by the ODI (Dr Jeni Tennison) together with the University of Southampton (Dr Elena Simperl).
About the research
Laura's research – Human computer interaction and data search
The use of structured and semi-structured data as a source of information and as a basis for decision-making, alongside other sources, is not yet fully realised. Discovering, filtering and ranking within the web of data requires technologies and principles different from those of the traditional web. With increasing efforts to use web data for information retrieval, challenges connected to human interaction with data arise.
Published data is heterogeneous in format, structure, licence, hosting portal, metadata, quality and size, and is therefore not always easy to find. Discovering data and datasets can be difficult for non-technical users, and can also challenge experts. Across the whole process of a user’s interaction with data, numerous factors potentially contribute to the success of the user’s task. The user may be someone constructing or designing tools, or someone trying to “get an answer to a question”.
As a first step, a literature survey is looking at the process of Question Answering and the discovery of data as a whole, considering the user at three distinct stages: before the query is asked, during query processing and after the query. Building on this higher-level picture, a gap analysis will determine the importance of the user’s perspective at each of these stages. Subsequently we will use an experimental mixed-methods design consisting of semi-structured interviews and a search log analysis for dataset search, and possibly other observational methods, depending on the focus of the resulting research questions.
This work will create a better understanding of the challenges of data search and use, particularly how users assess data or an answer to a question. It can inform the development of data discovery tools, such as question answering systems, especially in deciding how results or answers are presented.
Emilia Kacprzak's research – improving dataset searchability
Open datasets are typically catalogued on official portals under defined lists of categories, and accompanied by short metadata. Despite the search functions such catalogues provide, an ordinary user often cannot find relevant information quickly. This can be caused by non-intuitive or limited data descriptions, misleading naming conventions, incorrect assignment of categories to datasets, a user’s lack of in-depth knowledge of the subject, or simply because the search is conducted only over the metadata records rather than the data itself. As a result, current solutions do not sufficiently meet functional and non-functional requirements.
To improve the current process of data discovery, the data itself should be indexed and ranked in a meaningful way when a search is run. A first investigation will explore how to apply lossless compression/summarisation techniques to tabular data to create input records for a search index.
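As a rough illustration of turning tabular data into an input record for a search index, the Python sketch below summarises each column of a CSV table. It is not the project's method: the function name `summarise_table`, the record layout and the sample rows are all hypothetical, and the per-column summary used here is lossy (value ranges and most frequent values), not the lossless compression the investigation proposes.

```python
import csv
import io
from collections import Counter


def summarise_table(csv_text, top_k=3):
    """Summarise a CSV table into one flat record (hypothetical layout)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = {}
    for name in (rows[0].keys() if rows else []):
        # Ignore empty or missing cells when profiling a column.
        values = [row[name] for row in rows if row[name] not in ("", None)]
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            numbers = None
        if numbers:
            # Numeric column: keep only the value range.
            columns[name] = {"type": "numeric",
                             "min": min(numbers), "max": max(numbers)}
        else:
            # Text column: keep the distinct count and most frequent values.
            top = [v for v, _ in Counter(values).most_common(top_k)]
            columns[name] = {"type": "text",
                             "distinct": len(set(values)), "top_values": top}
    # Searchable text drawn from the data itself, not only from metadata.
    terms = list(columns)
    for summary in columns.values():
        terms.extend(summary.get("top_values", []))
    return {"row_count": len(rows), "columns": columns, "text": " ".join(terms)}


# Invented sample rows, for illustration only.
SAMPLE = (
    "city,year,population\n"
    "Leeds,2015,774060\n"
    "York,2015,206856\n"
    "Leeds,2016,781087\n"
)
record = summarise_table(SAMPLE)
```

A keyword index built over the `text` field would match a query such as "Leeds" even if the dataset's metadata never mentions that city, which is the gap between metadata-only search and search over the data described above.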
The final outcome of the proposed improvements is a better understanding of the data structure. Aggregation and summarisation methods, supported by indexing and ranking algorithms, can improve the process of obtaining information at every level and result in a better quality of service both for end users and for the authorities releasing data.
Read more about the WDAqua project