Fig Data: 11 Tips on How to Handle Big Data in R (and 1 Bad Pun)
In our latest project, Show me the Money, we used close to 14 million rows to analyse regional activity of peer-to-peer lending in the UK. Conventional tools such as Excel fail (limited to 1,048,576 rows), which is sometimes taken as the definition of Big Data. Then again, I recently heard one of the leading experts, Ken Cukier, say: “there is no definition of big data”.
Image: Flickr (CC BY 2.0)-NaN Palmero
What follows is a technical collection of tips. It assumes you are familiar with base R: installing packages and performing basic operations. Two interactive introductions to R are DataMind and Code School.
In this particular case I managed to load the complete data into RAM for a standard analysis in R, i.e. no Hadoop or similar. Here are 11 tips for when you deal with f–ing irritating, granular data. This blog post is about Fig Data: data that may not be “big” but whose problems of size are thinly veiled. I won’t cover parallel techniques in R, but you can find a starting point with programming with big data in R and a mind map. Just because you can load your data into RAM does not guarantee a smooth workflow. Computers are powerful, but it is easy to reach their limits. The scale factor, that is, a rough estimate of the potential time you could save, is sometimes up to 1,000 or even beyond. A factor of 60 means your code takes 1 minute to run instead of 1 hour.
The 11 tips
Think in vectors. This means that for-loops (“do this x times”) are generally a bad idea in R. The R Inferno has a chapter on why and how to “vectorise”. Especially if you’re a beginner or come from another programming language, for-loops might be tempting. Resist and try to speak R. The apply family may be a good starting point, but of course do not avoid loops simply for the sake of avoiding loops.
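To make the point concrete, here is a toy comparison (the function names are mine, purely for illustration): growing a result inside a for-loop versus letting R operate on the whole vector at once.

```r
# Loop version: grows the result one element at a time (slow on large inputs)
squares_loop <- function(x) {
  out <- numeric(0)
  for (i in seq_along(x)) out <- c(out, x[i]^2)
  out
}

# Vectorised version: one operation on the entire vector
squares_vec <- function(x) x^2

x <- 1:10
identical(squares_loop(x), squares_vec(x))  # TRUE, but the vectorised call scales far better
```

Both return the same result; on millions of rows the difference is dramatic, because the loop version repeatedly reallocates the output vector.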
Use the fantastic data.table package. Its advantage lies in speed and efficiency. Developed by Matthew Dowle, it introduces a way of handling data similar to the data.frame class. In fact, it extends data.frame, but some of the syntax is different. Luckily, the documentation (and the FAQ) are excellent.
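A small taste of the syntax difference, assuming data.table is installed (the sample data here is made up):

```r
library(data.table)

dt <- data.table(region = c("London", "London", "Wales"),
                 amount = c(100, 250, 75))

# Aggregate by group using data.table's j/by syntax:
# "for each region, sum the amount column"
totals <- dt[, .(total = sum(amount)), by = region]
totals  # two rows: London 350, Wales 75
```

The `dt[i, j, by]` form replaces many combinations of subset, apply and aggregate calls, and it is where much of the speed comes from.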
Read csv files with the fread function instead of read.csv (read.table). It is faster at reading a file in table format and gives you feedback on progress. However, it comes with a big warning sign: “not for production use yet”. One trick it uses is to read the first, middle, and last 5 rows to determine column types. Side note: ALL functions that take longer than 5 seconds should have progress bars. Writing your own function? Use txtProgressBar.
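A minimal sketch of both points, using a temporary file so it is self-contained:

```r
library(data.table)

# Write a small csv, then read it back with fread
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = letters[1:5]), tmp, row.names = FALSE)
dt <- fread(tmp)  # on large files, fread reports its progress as it reads

# And a home-made progress bar for your own long-running functions
pb <- txtProgressBar(min = 0, max = 50, style = 3)
for (i in 1:50) setTxtProgressBar(pb, i)  # update inside your real loop
close(pb)
```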
Parse POSIX dates with the very fast package fasttime. Let me say that again: very fast. (Though the dates have to be in a standard format.)
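For example, assuming fasttime is installed, its fastPOSIXct function parses ISO-style timestamp strings (which it interprets as GMT):

```r
library(fasttime)

ts <- c("2015-03-01 12:00:00", "2015-03-01 12:00:01")

# fastPOSIXct only understands the fixed "YYYY-MM-DD hh:mm:ss" layout,
# which is exactly why it can skip the flexible (and slow) parsing
# machinery of as.POSIXct / strptime
parsed <- fastPOSIXct(ts, tz = "GMT")
```

On millions of timestamps the speed-up over as.POSIXct is substantial, precisely because no format guessing takes place.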
Avoid copying data.frames, and remove (rm(yourdatacopy)) those you no longer need in your workflow. You’ll figure this out anyway when you run out of space.
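In practice that looks like this (object names are illustrative):

```r
big <- data.frame(x = runif(1e6))
copy <- big       # R duplicates the data as soon as the copy is modified

# ... work with the copy ...

rm(copy)          # drop the object once it has served its purpose
invisible(gc())   # optionally ask R to release the freed memory sooner
```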
Combine data.frames with the superior rbindlist – data.table, we meet again. As the community knows: “rbindlist is an optimized version of do.call(rbind, list(...)), which is known for being slow when using rbind.data.frame.”
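A typical use case is stacking many chunks read from disk; with made-up data it looks like this:

```r
library(data.table)

chunks <- list(
  data.frame(id = 1:2, v = c("a", "b")),
  data.frame(id = 3:4, v = c("c", "d"))
)

# One call stacks the whole list, far faster than do.call(rbind, chunks)
combined <- rbindlist(chunks)
nrow(combined)  # 4
```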
Regular expressions can be slow, too. On one occasion I found a simpler way and used the stringr package instead; it was a lot faster.
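One pattern worth knowing (the example strings are invented): when you are really searching for a literal substring, stringr lets you say so with fixed(), which bypasses the regex engine entirely.

```r
library(stringr)

x <- c("loan_2014_uk", "loan_2015_uk", "deposit_2014_uk")

# fixed() treats the pattern as a plain string, not a regular expression
str_detect(x, fixed("loan_"))  # TRUE TRUE FALSE
```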
No R tips collection would be complete without a hat tip to Hadley Wickham. Besides stringr, I point you to bigvis, a package for visualising big data sets.
Use a random sample for your exploratory analysis or to test code. Here is a code snippet that will give you a convenient function: row.sample(yourdata, 1000) will reduce your massive file to a random sample of 1,000 observations.
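The linked snippet boils down to something like the following; this is my own sketch, and the exact body of the original row.sample may differ:

```r
# Return n randomly chosen rows of a data.frame (sketch, not the
# original snippet verbatim)
row.sample <- function(dta, n) {
  dta[sample(nrow(dta), n), , drop = FALSE]
}

d <- data.frame(id = 1:5000, v = runif(5000))
sampled <- row.sample(d, 1000)
nrow(sampled)  # 1000
```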
Related to the previous point: read only a subset of the data you want to analyse. read.csv(), for example, has an nrows option, which reads only the first x number of lines. This is also a good way of getting your header names. The preferred option, a random sample, is more difficult and probably needs a ‘workaround’ as described here.
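Peeking at the top of a file then looks like this (again with a throwaway file so the snippet is self-contained):

```r
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10000, v = runif(10000)), tmp, row.names = FALSE)

# Load only the first 5 rows: enough to inspect column names and types
head_only <- read.csv(tmp, nrows = 5)
names(head_only)  # "id" "v"
```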
Export your data set directly as gzip. Writing a compressed csv file is not entirely trivial, but stackoverflow has the answer. Revolution Analytics also has some benchmark times for compressions in R.
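The essence of the stackoverflow answer is to hand write.csv a gzfile connection instead of a file name; a minimal version:

```r
tmp <- tempfile(fileext = ".csv.gz")

# write.csv compresses on the fly when given a gzfile connection
con <- gzfile(tmp, "w")
write.csv(data.frame(id = 1:1000, v = runif(1000)), con, row.names = FALSE)
close(con)

# read.csv decompresses .gz files transparently
back <- read.csv(tmp)
nrow(back)  # 1000
```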
Image: Flickr (CC BY 2.0)-David Bleasdale