Friday lunchtime lectures are for everyone and are free to attend. You bring your lunch, we provide tea and coffee, an interesting talk, and enough time to get back to your desk.
Data standards almost always have two fundamental components: syntax and semantics.
The syntax is established by a schema, which specifies how the data should be structured. Should it be saved as XML, JSON, CSV? Should dates be written YYYY-MM-DD? What order should the columns be in? Questions like these are essential, and most well-established data standards cater for them.
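As an illustration of the kind of rule a schema encodes, here is a minimal sketch (in Python, purely for illustration) of checking that dates follow the YYYY-MM-DD format a schema might demand:

```python
# A toy syntax check for one schema rule: dates must be YYYY-MM-DD.
from datetime import datetime

def is_valid_date(value: str) -> bool:
    """Return True if value parses as a YYYY-MM-DD date, False otherwise."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False

print(is_valid_date("2016-05-13"))  # True
print(is_valid_date("13/05/2016"))  # False: wrong format
```

Real standards express rules like this declaratively, in a schema language, rather than in ad-hoc code; the sketch only shows what such a rule means.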
Semantics are more abstract. The questions are of a different kind: what does this data tell me about the world? What does this number mean? What is the purpose of that transaction? Answering them is more variable, and often less formal, than settling the syntax. The most common and trusted technique is to create a codelist – a dictionary of terms that establishes the semantics for a given data point. Codelists range from simple sets of terms – such as the transaction types that can be published under a standard – to rich hierarchical taxonomies. All of them associate a code with a meaning, so that people can communicate using them.
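At its simplest, a codelist is just a mapping from short codes to meanings. A minimal sketch (the codes below are made up for illustration, not drawn from any particular standard):

```python
# A codelist as a plain mapping: each code carries an agreed meaning.
TRANSACTION_TYPES = {
    "C": "Commitment",
    "D": "Disbursement",
    "E": "Expenditure",
}

def describe(code: str) -> str:
    """Resolve a code to its agreed meaning, flagging codes outside the list."""
    return TRANSACTION_TYPES.get(code, f"Unknown code: {code!r}")

print(describe("D"))  # Disbursement
```

The value of the codelist is exactly this shared lookup: two publishers who both write "D" can be understood to mean the same thing.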
This talk explores the changing role of codelists at a time when textual data analysis is advancing rapidly. The root question is this: is it still best to start giving meaning to your data by looking for a codelist and a code, or can we use the description text itself, trusting modern data analysis to do the heavy lifting?
First, we’ll consider and contextualise the use of codelists within standards like Open Contracting, IATI, and 360Giving. Then we’ll look at how meaning can be established in text where no standard exists – online reviews for restaurants, for example – by applying machine learning (live demo alert!). Finally, we’ll apply the same approach to description text found in established open data standards and compare our results.
About the speaker
Rory Scott is a member of the Open Data Services Cooperative, working with national governments, multilateral organisations, and civil society to better share, understand, and use open data about international development and humanitarian financing. He does this primarily by talking to people about their data, writing Python and R code, and wrangling spreadsheets.