What is Data Science, Really?

When I told friends and family about my new job as a ‘data scientist’ at the Open Data Institute, most of them said “what’s a data scientist?”. Most of them also asked “what’s open data?”, but that’s already well covered elsewhere.

Much has been written in an attempt to define the term ‘data science’. Unfortunately, even seminal (and otherwise excellent) articles bury any concise definition among descriptions of the characteristics, methods or responsibilities of data scientists, rather than defining the discipline itself. With so much already written, I hesitate to add more. But I read these definitions, find them lacking, and feel compelled to attempt my own. Deep in the article linked above is this passage:

“Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid.”

Unfortunately, this simply describes ‘science’, and in that redundancy lies the problem with the term ‘data science’ as it is typically used.

For me there is a precedent in naming sciences that follows the pattern ‘{topic} science’. Thus ‘biological science’ is the scientific study of organisms and biological processes; it’s not ‘doing science using biology’. Similarly, ‘computer science’ is the scientific study of computing rather than simply ‘doing science using computers’.

If we apply the same analysis to data science, the discipline should amount to the scientific study of data, not ‘doing science with data’. There is nothing new in doing science with data, yet this is how many seem to interpret the term. There are certainly many data scientists delivering novel insights to their organisations through rigorous data analysis, often using tools, techniques or data volumes that were previously inconceivable. But even when done rigorously, this amounts to ‘behavioural science’ more than the science of data itself.

This ambiguity is one of the reasons that I love the term ‘growth hacker’ by contrast. It captures not only the aims of the role but also something of the methods used. Unfortunately, the same can’t necessarily be said of ‘data scientist’.

So where does that leave me as Data Scientist at the ODI? If I apply the common interpretation of data science then there is a whole world of opportunity in the wealth of open data being released by government and others. ‘Doing science’ with that data is an opportunity to study issues and answer questions that could change the lives of all members of our society for the better, which is an incredible challenge and privilege. It also raises some interesting challenges compared with data science in the corporate world, where you may own and control the data through its entire lifecycle, but I’ll deal with that in a future post.

If I apply my alternative interpretation of data science as the scientific study of data and how it manifests in the 21st century, then a very different range of possibilities emerges, where we study the characteristics, interconnections, and applications of data in order to maximise the value that can be derived from it.

Naturally I’m delighted that both of these directions are on the agenda at the ODI. These are exciting times to be a data scientist, whatever your interpretation!