Basic Spark Part 1 - Umm, remind me about big data please?

Nic Crane

2018/02/02

When I started my career in data science, I was in the common position of having familiarity with technologies like R, Python, and SQL, but much less with big data technologies. I remember feeling intimidated by big data; there were lots of different technologies named after animals or making some sort of pun I wasn’t clued up enough to understand.

Fast forward 18 months, and with a bit of experience under my belt, some parts of the big data landscape felt more familiar. I’d had a play around with Cloudera Impala, using SQL syntax to query big data, and sat in on the sparklyr workshop at EARL Boston 2017, where I was pleasantly surprised by how much of the familiar syntax from R’s dplyr package I could use to interact with Spark from R.
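To give a flavour of what that looks like, here’s a minimal sketch, assuming sparklyr and dplyr are installed and a local Spark installation is available (for example via spark_install()); the dataset is just an illustration:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy a small built-in dataset into Spark for illustration
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# The usual dplyr verbs are translated into Spark SQL behind the scenes
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl)

spark_disconnect(sc)
```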

Although I was comfortable with sparklyr, I still felt I needed a bit more background knowledge about Spark and distributed computing in general. Whilst I’m a data scientist and not a data engineer, a bit of knowledge about what’s going on under the hood has really helped cement my understanding of Spark, and so I’ve created this series of blog posts to capture what I found useful.

My assumptions about anyone reading this series of blog posts:

A few definitions…

Before we get started, I need to define a few terms:

How is big data processed?

There are a number of steps that are common across different big data processing frameworks, as defined by the MapReduce programming model: the data is split into chunks, a map function is applied to each chunk in parallel, the intermediate results are shuffled so that values belonging to the same key end up together, and a reduce function combines each group into the final output. MapReduce is essentially a specialised version of the split-apply-combine strategy for data analysis, which you may be more familiar with.
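To make the analogy concrete, here’s a minimal sketch in R (the tiny data frame and its column names are made up purely for illustration). The dplyr version is the split-apply-combine pattern most R users will recognise, and the base R version makes the grouping (“shuffle”) and aggregation (“reduce”) stages a little more explicit:

```r
library(dplyr)

# A made-up toy dataset: each row is already a (carrier, delay) pair,
# which plays the role of the key-value pairs a map step would emit
flights <- data.frame(
  carrier = c("AA", "AA", "BB", "BB", "BB"),
  delay   = c(10, 25, 5, 30, 15)
)

# Split-apply-combine with dplyr: group_by() splits the data by key,
# summarise() applies mean() to each group and combines the results
flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(delay))

# The same computation in MapReduce-ish terms: split() groups the values
# by key (the shuffle step), then sapply() reduces each group to a single
# value (the reduce step)
grouped <- split(flights$delay, flights$carrier)
sapply(grouped, mean)
```

On a single machine this all runs in one R process; frameworks built on the MapReduce model follow the same pattern, but spread the chunks of data, and the map and reduce work, across many machines in a cluster.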

And that’s the basics of the theory covered. In my next post, I’m going to talk about the technologies used to implement these frameworks.