What is tidy eval and why should I care?

Nic Crane

2018/03/29

I’m going to begin this post somewhat backwards, and start with the conclusion: tidy eval is important to anyone who writes R functions and uses dplyr and/or tidyr.

I’m going to load a couple of packages, and then show you exactly why.

library(dplyr)
library(rlang)

Data wrangling with base R

Here’s an example function I have written in base R. Its purpose is to take a data set, and extract values from a single column that match a specific value, with both input and output both being in data frame format.

wrangle_data <- function(data, column, val){
  
  data[data[[column]]==val, column, drop = FALSE]
  
}

wrangle_data(iris, "Species", "versicolor") %>%
  head()
##       Species
## 51 versicolor
## 52 versicolor
## 53 versicolor
## 54 versicolor
## 55 versicolor
## 56 versicolor

It works, but it’s not great; the code is clunky and hard to decipher at a quick glance. This is where using dplyr can help.

Data wrangling with dplyr

If I was to run the same code outside of the context of a function, I might do something like this:

one_col <- select(iris, Species)
filter(one_col, Species == "versicolor")  %>%
  head()
##      Species
## 1 versicolor
## 2 versicolor
## 3 versicolor
## 4 versicolor
## 5 versicolor
## 6 versicolor

This has worked, but how can we turn this into a function?

I may naively attempt the solution below

wrangle_data <- function(data, column, val){
  one_col <- select(data, column)
  filter(one_col, column == val)
}

wrangle_data(iris, "Species", "versicolor")  %>%
  head()
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(column)` instead of `column` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
## [1] Species
## <0 rows> (or 0-length row.names)

However, this doesn’t work and returns 0 rows. This is due to a special quirk of dplyr that makes typical usage easier, but we need to be aware of when writing functions. This snippet from the programming vignette in dplyr explains it best:

"Most dplyr functions use non-standard evaluation (NSE). This is a catch-all term that means they don’t follow the usual R rules of evaluation. Instead, they capture the expression that you typed and evaluate it in a custom way. This has two main benefits for dplyr code:

  • Operations on data frames can be expressed succinctly because you don’t need to repeat the name of the data frame. For example, you can write filter(df, x == 1, y == 2, z == 3) instead of df[df\(x == 1 & df\)y ==2 & df$z == 3, ].
  • dplyr can choose to compute results in a different way to base R. This is important for database backends because dplyr itself doesn’t do any work, but instead generates the SQL that tells the database what to do."

In other words, because dplyr functions evaluate things differently to base R, by using a concept called quoting, we have to work with them a bit differently. I’d recommend checking out this RStudio webinar and the dplyr programming vignette for more detail.

Let’s go back to the previous example.

wrangle_data <- function(data, column, val){
  one_col <- select(data, column)
  filter(one_col, column == val)
}

wrangle_data(iris, "Species", "versicolor")  %>%
  head()
## [1] Species
## <0 rows> (or 0-length row.names)

This doesn’t work as select() and filter() are looking for a column called “column” in their inputs, and failing to find them. Therefore we must use tidy evaluation to override this behaviour.

Tidy evaluation with dplyr

Using the !! (bang bang) operator and sym() from the rlang package, we can change this behaviour to make a version of our function which will run.

wrangle_data <- function(x, column, val){
  
  one_col <- select(x, !!sym(column))
  filter(one_col, !!sym(column) == val)
}

wrangle_data(iris, "Species", "versicolor") %>%
  head()
##      Species
## 1 versicolor
## 2 versicolor
## 3 versicolor
## 4 versicolor
## 5 versicolor
## 6 versicolor

I’m not going to go into detail about these functions here, but if you want more information, check out my blog posts on using tidy eval in dplyr here and here.

In conclusion, whilst tidy eval is not necessary for all uses of dplyr or tidyr, it quickly becomes an extremely handy tool when working with these packages within the context of function. There are some great resources about tidy eval out there, and as ever, I welcome feedback on this blog post via Twitter.