When I started my career in data science, I was in the common position of having familiarity with technologies like R, Python, and SQL, but much less with big data technologies. I remember feeling intimidated by big data; there were lots of different technologies named after animals or making some sort of pun I wasn’t clued up enough to understand.
Flash forward 18 months and with experience, some parts of the big data landscape felt a bit more familiar. I’d had a play around with Cloudera Impala using SQL syntax to query big data, and sat in on the sparklyr workshop at EARL Boston 2017, finding myself pleasantly surprised to find how much of the familiar syntax from R’s dplyr package I could use to interact with Spark from R.
Whilst I was comfortable with sparklyr, I still felt that I was in need of a bit of background knowledge about Spark and distributed computing in general. Whilst I’m a data scientist and not a data engineer, I believe that a bit of knowledge of what’s going on under the hood has really helped cement my understanding of Spark, and so I’ve created this series of blog posts to capture what I found useful.
My assumptions about anyone reading this series of blog posts:
You have a basic understanding of data science.
You are interested in big data processing but don’t know a huge amount about it.
You have basic familiarity with R and the dplyr package
A few definitions…
Before we get started, I need to define a few terms:
Node – one single computer.
Cluster – a group of computers connected via a network
Distributed computing – multiple nodes are combined in a cluster in order to increase efficiency and power. Data is partitioned and distributed across the cluster, and so can be processed in smaller chunks. Distributed computing is at the root of big data processing and allows complex processing of datasets that a single machine would simply not have enough resources to process alone.
How is big data processed?
There are a number of steps that are common across different big data processing frameworks, as defined by the MapReduce programming model. This refers to a specialised version of the split-apply-combine strategy for data analysis which you may be more familiar with.
Split – divide the file(s) into multiple partitions which are distributed across the different nodes in the cluster.
Apply – an operation such as a user-defined function is run on each node. This bit is the ‘Map’ in ‘MapReduce’.
Combine – the separate results from each node are brought back together into a single set of results. This bit is the ‘Reduce’ in ‘MapReduce’. Before the data can be reduced, it may need to be shuffled and sorted to make sure the relevant data points are adjacent, if, for example, we want to group the data.
And that’s the basics of the theory covered. In my next post, I’m going to talk about the technologies used to implement these frameworks.