Bringing order to data search

Have you ever tried to find a particular piece of data about your neighbourhood? ODI PhD student Emilia Kacprzak sets out the aims of WDAqua, a project she is working on to help people find data more easily by improving how datasets can be searched.

Applying techniques used in Web search engines could help improve accuracy of data search engine results. CC BY 2.0, uploaded by Gary J. Wood.

In recent years, there has been significant growth in the amount of data produced by organisations such as companies and public institutions, and also by devices connected to the web, like smartphones and sensors. Data generated by devices has started a new chapter in Internet history – the Internet of Things (IoT).

It all sounds very exciting, but have you ever tried to find a particular piece of data about your neighbourhood or city? Wouldn’t it be easier to have all data from your area open and accessible? Data about everything: city facilities or the location of police stations, pharmacies, and hospitals open during the night? How about a list of shops open on Sundays? Meticulous lists of spending and investments within the world’s capital cities?

WDAqua is a collaborative project funded by the European Commission’s Horizon 2020 programme, with the aim of helping people find answers to their questions more easily, using data from the Web. WDAqua brings together researchers and a group of 15 students across five institutions around Europe. Each student focuses on a different approach to answering questions. The hope is that WDAqua will encourage more organisations to open their data, and more people to use open data and ask for data that should be open to them.

What history has taught us

This new data phenomenon echoes the history of the Web. The very first approach to searching for webpages was to group them into catalogues. To find the information they needed, users had to go through the catalogue they expected it to belong to and judge the content for themselves. This was a slow, laborious process.

Catalogues were created by people, and the way that webpages were organised into categories was judged subjectively. Over time, the size and number of catalogues grew, and browsing them to find information became more difficult.

Data growth forced the invention of new solutions, creating the Web in the shape we currently know. Archie, the first internet search engine, was created in 1990. It searched through FTP archives which could be readily browsed manually. Through the years, Web search engines improved their algorithms to finally get to the stage they are at today.

Data search today

The catalogue structure applied by open data portals (for example data.gov.uk) is reminiscent of the beginnings of the Web. Datasets are catalogued under different keywords and can be searched based on brief descriptions provided by the person sharing data on a portal. The data itself is typically not taken into consideration in the searching process. This shows the potential and direction of possible improvements in data searching and information retrieval.

Applying the techniques currently used in Web search engines could be the next step towards improving the accuracy of the search results presented by data search engines.

Techniques that could be adapted for data search

To find content corresponding to a user’s queries, search engines need to perform three basic actions: crawling, indexing and ranking.

Crawling

Crawling is the process of data discovery. It looks for new or updated content so that the graph of all the information available on each resource is always up to date. For datasets, crawling is only necessary at the beginning, when existing data is updated, and when new files are uploaded.
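As a rough illustration of crawling only when something has changed, the sketch below compares a hash of each dataset file with the hash recorded on the previous pass. The `portal` and `seen` dictionaries, the file names and the use of SHA-256 are all assumptions made for this example; a real portal would expose its listing through its own API.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a stable hash of a dataset file's raw bytes."""
    return hashlib.sha256(content).hexdigest()


def datasets_to_crawl(portal: dict[str, bytes], seen: dict[str, str]) -> list[str]:
    """Return names of datasets that are new or changed since the last pass.

    `portal` maps a dataset name to its current bytes and `seen` maps a
    dataset name to the fingerprint recorded last time. Both are
    hypothetical stand-ins for a portal listing and a crawl log.
    """
    pending = []
    for name, content in sorted(portal.items()):
        digest = fingerprint(content)
        if seen.get(name) != digest:
            pending.append(name)
            seen[name] = digest  # remember what we have crawled
    return pending


# Made-up example data.
portal = {
    "pharmacies.csv": b"name,open_at_night\nCity Pharmacy,yes\n",
    "spending.csv": b"year,amount\n2015,100\n",
}
seen: dict[str, str] = {}

first_pass = datasets_to_crawl(portal, seen)    # everything is new
second_pass = datasets_to_crawl(portal, seen)   # nothing has changed
portal["spending.csv"] += b"2016,120\n"         # simulate an update
third_pass = datasets_to_crawl(portal, seen)    # only the updated file
```

Only the first pass and the later update trigger any work, which matches the idea that datasets need crawling far less often than webpages.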

Indexing

Indexing is the process of creating a list of all the words and phrases relevant to people searching for a resource, so that matching resources can be retrieved faster. The indexing process can include all the content of the webpage or dataset.
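One common way to build such a list is an inverted index, which maps each word to the resources that contain it. The sketch below is a minimal version that indexes both the description and the cell contents of a dataset. The dataset names, descriptions, CSV contents and the simple tokeniser are invented for illustration and are not taken from any real portal.

```python
import csv
import io
import re
from collections import defaultdict


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(datasets: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map every word to the set of dataset names it appears in.

    Words are taken from the description (metadata) and from every
    cell of the CSV content, including the header row.
    """
    index: defaultdict[str, set[str]] = defaultdict(set)
    for name, dataset in datasets.items():
        for token in tokenize(dataset["description"]):
            index[token].add(name)
        for row in csv.reader(io.StringIO(dataset["csv"])):
            for cell in row:
                for token in tokenize(cell):
                    index[token].add(name)
    return dict(index)


# Made-up example datasets.
datasets = {
    "pharmacies": {
        "description": "Pharmacies open during the night",
        "csv": "name,street\nCity Pharmacy,Market Street\n",
    },
    "shops": {
        "description": "Shops open on Sundays",
        "csv": "name,street\nCorner Shop,Market Street\n",
    },
}
index = build_index(datasets)
matches = index.get("market", set())
```

Here a search for "market" finds both datasets even though the word appears in neither description, which is the benefit of taking the data itself into account.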

Ranking

Creating indexes and keyword maps is not enough to search effectively. Every phrase associated with a resource may have a different relevance. To capture this, every piece of data that is extracted has to be assigned a weight that can be used in ranking.

For datasets, keywords should be weighted according to how often and where they occur within the dataset or its metadata, compared with how often they occur in the whole data corpus.
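A well-known weighting scheme from Web search that compares occurrences in one resource with occurrences across the corpus is TF-IDF (term frequency times inverse document frequency). The sketch below applies it to a toy corpus of keyword lists. It is only one possible choice, not necessarily the scheme WDAqua uses, and it ignores the position of a keyword within the dataset. The corpus contents are invented.

```python
import math
from collections import Counter


def tf_idf(corpus: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Weight each keyword of each dataset with a basic TF-IDF score.

    tf  = occurrences of the keyword in the dataset / keywords in the dataset
    idf = log(number of datasets / datasets containing the keyword)
    """
    n_datasets = len(corpus)
    doc_freq: Counter[str] = Counter()
    for tokens in corpus.values():
        doc_freq.update(set(tokens))  # count each dataset once per keyword

    weights: dict[str, dict[str, float]] = {}
    for name, tokens in corpus.items():
        counts = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_datasets / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Made-up keyword lists, one per dataset.
corpus = {
    "pharmacies": ["pharmacy", "night", "open", "pharmacy"],
    "shops": ["shop", "sunday", "open"],
    "hospitals": ["hospital", "night", "open"],
}
weights = tf_idf(corpus)
```

In this toy corpus "open" appears in every dataset, so its weight is zero, while "pharmacy" is frequent in one dataset and absent elsewhere, so it receives the highest weight there.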

The characteristics of data gathered in a dataset are different from those of an ordinary webpage. Datasets have a structured form which allows search engines to generate various numerical statistics, whereas semi-structured or unstructured data is more difficult to summarise in the same way, because it is more unpredictable.

For example, depending on the type of data in a column, different summaries can be computed and stored with the keyword map. If the data is a date or a numeric value, the range of values can be stored; if the data is a value describing spending or savings, the sum of all the values could be stored, and so on.
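A minimal sketch of such type-dependent summaries might look like the following. The column kinds ("date", "number", "amount"), the function name and the sample values are assumptions made for this example; how a column's type is detected is left out.

```python
from datetime import date


def summarise_column(name: str, values: list[str], kind: str) -> dict:
    """Compute a small summary for one column, depending on its kind.

    "date" and "number" columns store the range of values, while
    "amount" columns (spending, savings) store the sum.
    """
    if kind == "date":
        parsed = [date.fromisoformat(v) for v in values]
        return {"column": name, "min": min(parsed), "max": max(parsed)}
    if kind == "number":
        numbers = [float(v) for v in values]
        return {"column": name, "min": min(numbers), "max": max(numbers)}
    if kind == "amount":
        return {"column": name, "sum": sum(float(v) for v in values)}
    raise ValueError(f"unsupported column kind: {kind}")


# Made-up column values, as they might be read from a CSV file.
paid_on = summarise_column(
    "paid_on", ["2015-01-03", "2015-03-01", "2015-02-14"], "date"
)
spending = summarise_column("spending", ["100.5", "200", "50"], "amount")
```

Summaries like these could be stored next to the keyword map, so that a search engine can answer questions about a date range or a total without reading the whole file again.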

I am really excited to be working on improving the understanding of general data structure in order to improve the effectiveness of dataset search. Creating meaningful data about data (metadata) is a key step towards improving the process of data discovery. With the support of already known concepts, the process of obtaining information can be improved on every level and result in a better quality of service for the end-users and for authorities releasing data.

Emilia Kacprzak is a PhD student at the ODI. Emilia's PhD research project is part of ‘Answering questions using web data’ (WDAqua ITN), a collaborative project that aims to develop question answering services for the public and private sector, using data available on the Web. It is part of the European Commission’s H2020 programme for research and innovation.
