The arrow package allows you to take advantage of the power of the Acero execution engine for data manipulation and analysis. The code in arrow provides bindings to dplyr verbs and tidyverse functions, so that you can use these interfaces without having to understand the inner workings of Acero. But what if you actually want to know more?
In the latest release of arrow, version 9.0.0, the show_exec_plan() function is introduced. This function allows you to see the execution plan generated from your code. For example:
library(arrow)
library(dplyr)
mtcars %>%
  arrow_table() %>%
  filter(mpg > 20) %>%
  mutate(wt_kg = (wt * 1000) * 0.453592) %>%
  show_exec_plan()
## ExecPlan with 4 nodes:
## 3:SinkNode{}
## 2:ProjectNode{projection=[mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, "wt_kg": multiply_checked(multiply_checked(wt, 1000), 0.453592)]}
## 1:FilterNode{filter=(mpg > 20)}
## 0:TableSourceNode{}
This functionality is similar to that of dplyr::explain() and you’ll get the same result whether you call arrow::show_exec_plan(), dplyr::explain() or dplyr::show_query() on an arrow_dplyr_query object.
Shout out to Dragoș Moldovan-Grünfeld who made the PR to implement this function!
See sections below for definitions of the terms mentioned above.
What is Acero?
Acero is an Arrow-native query execution engine, developed as part of the Arrow C++ / R libraries.
What is a query execution engine?
A query execution engine is a piece of software which allows users to execute queries against a data source. It typically has multiple steps, which could include things like taking the query and parsing it into an algebraic format, re-ordering and optimising the query in order to run it in the most efficient way, and actually running the query.
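As a loose analogy for those steps, the base R sketch below captures a query as an unevaluated expression, inspects its structure, and then runs it against mtcars. This is not how Acero works internally; the variable names and the example query are made up for illustration, and the optimisation step is left out.

```r
# A toy analogy in base R only; nothing here is part of arrow or Acero.

# Step 1: capture the query as an unevaluated expression (a tree, not a result).
query <- quote(mpg > 20 & cyl == 4)

# The expression can be inspected like a list: the operator comes first,
# followed by its operands.
parts <- as.list(query)
top_level_call <- as.character(parts[[1]])

# Step 2 (re-ordering and optimising) is skipped in this toy example.

# Step 3: run the query by evaluating the expression against the data.
keep <- eval(query, envir = mtcars)
matched <- mtcars[keep, ]

print(top_level_call)
print(nrow(matched))
```

A real engine does far more work between capturing the query and running it, but the separation between describing a query and executing it is the same idea.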
What is an ExecPlan?
Queries to be run on Acero are specified as execution plans aka ExecPlans. These are directed graphs which express what operations need to take place via nodes - called ExecNodes in this case. The graph for the ExecPlan in the code example earlier in this post looks like this.
The ExecNodes in the plan above are:
TableSourceNode - used to input the data
FilterNode - where we apply our filter to only retain rows where the value of mpg is greater than 20
ProjectNode - this then specifies which columns we want, including the new column wt_kg as a compute expression
SinkNode - this provides the output
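To make the flow of data through these four nodes more concrete, here is a rough base R sketch that mimics each step on mtcars. It does not call Acero at all: the functions source_node(), filter_node(), project_node() and sink_node() are hypothetical stand-ins written for this example, and real ExecNodes process data in batches rather than as one data frame.

```r
# Hypothetical stand-ins for the four ExecNodes; none of this is Acero's API.

# TableSourceNode: provides the input data.
source_node <- function() mtcars

# FilterNode: keeps only the rows where mpg is greater than 20.
filter_node <- function(batch) batch[batch$mpg > 20, , drop = FALSE]

# ProjectNode: keeps the existing columns and adds the computed wt_kg column.
project_node <- function(batch) {
  batch$wt_kg <- (batch$wt * 1000) * 0.453592
  batch
}

# SinkNode: hands the result back to the caller.
sink_node <- function(batch) batch

# Data flows from node 0 (source) up to node 3 (sink), as in the printed plan.
result <- sink_node(project_node(filter_node(source_node())))
head(result[, c("mpg", "wt", "wt_kg")])
```

Reading the printed ExecPlan from the bottom up gives the same order as the nested calls here: source first, then filter, then project, then sink.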