NEDs Workshop (18th July 2024)
Arrow contributor!
Today we’re going to cover:
Repository URL: https://github.com/thisisnic/introtoarrowneds
If your download stops partway through, you can stop and resume from the same place.
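A resumable download like this can be done with curl::multi_download(); a minimal sketch, assuming a local data/ folder (the URL below is a placeholder for the one given in the workshop repo):
curl::multi_download(
  urls = "https://example.com/seattle-library-checkouts.csv",  # placeholder, see the repo for the real URL
  destfiles = "./data/seattle-library-checkouts.csv",
  resume = TRUE  # restart a partial download from where it stopped
)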
A multi-language toolbox for accelerated data interchange and in-memory processing
Arrow is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another
In-memory columnar format: a standardized, language-agnostic specification for representing structured, table-like datasets in-memory.
Arrow’s Columnar Format is Fast
Arrow has a rich data type system, including direct analogs of many R data types
<dbl> == <double>
<chr> == <string> or <utf8>
<int> == <int32>
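To see these mappings in action, you can build a small Arrow table from R vectors and inspect its schema; a minimal sketch (the column names are illustrative):
tbl <- arrow::arrow_table(
  x = c(1.5, 2.5),   # <dbl> maps to double
  y = c("a", "b"),   # <chr> maps to string (utf8)
  z = 1:2            # <int> maps to int32
)
tbl$schema           # prints x: double, y: string, z: int32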
Arrow scans 👀 1MB of the file(s) to infer or “guess” the data types
📚 arrow vs readr blog post: https://thisisnic.github.io/2022/11/21/type-inference-in-readr-and-arrow/
library(arrow)
library(dplyr)

seattle_csv <- open_dataset(
  sources = "./data/seattle-library-checkouts.csv",
  col_types = schema(ISBN = string()),
  format = "csv"
)
schema(seattle_csv)
Schema
UsageClass: string
CheckoutType: string
MaterialType: string
CheckoutYear: int64
CheckoutMonth: int64
Checkouts: int64
Title: string
ISBN: string
Creator: string
Subjects: string
Publisher: string
PublicationYear: string
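The query print below is what you see when the pipeline is built lazily, before calling collect(); a minimal sketch of code that would produce it (with arrow and dplyr loaded as above):
seattle_csv |>
  mutate(IsBook = endsWith(MaterialType, "BOOK")) |>
  select(MaterialType, IsBook)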
FileSystemDataset (query)
MaterialType: string
IsBook: bool (ends_with(MaterialType, {pattern="BOOK", ignore_case=false}))
See $.data for the source Arrow object
Nothing is pulled into memory yet!
seattle_csv |>
  head(20) |>
  mutate(IsBook = endsWith(MaterialType, "BOOK")) |>
  select(MaterialType, IsBook) |>
  collect()
# A tibble: 20 × 2
   MaterialType IsBook
   <chr>        <lgl>
 1 BOOK         TRUE
 2 BOOK         TRUE
 3 EBOOK        TRUE
 4 BOOK         TRUE
 5 SOUNDDISC    FALSE
 6 BOOK         TRUE
 7 BOOK         TRUE
 8 EBOOK        TRUE
 9 BOOK         TRUE
10 EBOOK        TRUE
11 BOOK         TRUE
12 BOOK         TRUE
13 BOOK         TRUE
14 AUDIOBOOK    TRUE
15 BOOK         TRUE
16 EBOOK        TRUE
17 SOUNDDISC    FALSE
18 VIDEODISC    FALSE
19 SOUNDDISC    FALSE
20 AUDIOBOOK    TRUE
seattle_csv |>
  filter(endsWith(MaterialType, "BOOK")) |>
  group_by(CheckoutYear) |>
  summarise(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect()
# A tibble: 18 × 2
   CheckoutYear Checkouts
          <int>     <int>
 1         2005   2129128
 2         2006   3385869
 3         2007   3679981
 4         2008   4156859
 5         2009   4500788
 6         2010   4389760
 7         2011   4484366
 8         2012   4696376
 9         2013   5394411
10         2014   5606168
11         2015   5784864
12         2016   5915722
13         2017   6280679
14         2018   6831226
15         2019   7339010
16         2020   5549585
17         2021   6659627
18         2022   6301822
seattle_csv |>
  filter(endsWith(MaterialType, "BOOK")) |>
  group_by(CheckoutYear) |>
  summarise(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect() |>
  system.time()
   user  system elapsed
 13.120   1.162  11.797
42 million rows – not bad, but could be faster…
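The Parquet copy measured below was presumably written out from the CSV dataset with write_dataset(); a minimal sketch, assuming the directory name seattle_parquet_dir:
seattle_parquet_dir <- "./data/seattle-library-checkouts-parquet"  # assumed path

seattle_csv |>
  write_dataset(path = seattle_parquet_dir, format = "parquet")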
file <- list.files(seattle_parquet_dir,
                   recursive = TRUE,
                   full.names = TRUE)
file.size(file) / 10**9
[1] 4.423348
Parquet is about half the size of the CSV file on disk 💾
open_dataset(seattle_parquet_dir,
             format = "parquet") |>
  filter(endsWith(MaterialType, "BOOK")) |>
  group_by(CheckoutYear) |>
  summarise(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect() |>
  system.time()
   user  system elapsed
  5.379   0.267   0.840
42 million rows – much better! But could be even faster…
Partitioning: dividing data into smaller pieces, making it more easily accessible and manageable
Have you partitioned your data or used partitioned data before today?
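The partitioned directory explored below was presumably written with write_dataset() and a partitioning column; a minimal sketch, assuming CheckoutYear as the partition key:
seattle_csv |>
  write_dataset(path = "./data/seattle-library-checkouts",
                format = "parquet",
                partitioning = "CheckoutYear")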
seattle_parquet_part <- "./data/seattle-library-checkouts"
sizes <- tibble(
  files = list.files(seattle_parquet_part,
                     recursive = TRUE),
  size_GB = round(file.size(file.path(seattle_parquet_part, files)) / 10**9, 3)
)
sizes
# A tibble: 18 × 2
   files                            size_GB
   <chr>                              <dbl>
 1 CheckoutYear=2005/part-0.parquet   0.114
 2 CheckoutYear=2006/part-0.parquet   0.172
 3 CheckoutYear=2007/part-0.parquet   0.186
 4 CheckoutYear=2008/part-0.parquet   0.204
 5 CheckoutYear=2009/part-0.parquet   0.224
 6 CheckoutYear=2010/part-0.parquet   0.233
 7 CheckoutYear=2011/part-0.parquet   0.25
 8 CheckoutYear=2012/part-0.parquet   0.261
 9 CheckoutYear=2013/part-0.parquet   0.282
10 CheckoutYear=2014/part-0.parquet   0.296
11 CheckoutYear=2015/part-0.parquet   0.308
12 CheckoutYear=2016/part-0.parquet   0.315
13 CheckoutYear=2017/part-0.parquet   0.319
14 CheckoutYear=2018/part-0.parquet   0.306
15 CheckoutYear=2019/part-0.parquet   0.302
16 CheckoutYear=2020/part-0.parquet   0.158
17 CheckoutYear=2021/part-0.parquet   0.24
18 CheckoutYear=2022/part-0.parquet   0.252
seattle_parquet_part <- "./data/seattle-library-checkouts"

open_dataset(seattle_parquet_part,
             format = "parquet") |>
  filter(endsWith(MaterialType, "BOOK")) |>
  group_by(CheckoutYear) |>
  summarise(Checkouts = sum(Checkouts)) |>
  arrange(CheckoutYear) |>
  collect() |>
  system.time()
   user  system elapsed
  4.853   0.245   0.520
42 million rows – not too shabby!
Filtering on the partition variable with filter() means Arrow only needs to read the files that match.
How long does it take to calculate the number of books checked out in each month of 2021?
open_dataset(sources = "./data/seattle-library-checkouts.csv",
             col_types = schema(ISBN = string()),
             format = "csv") |>
  filter(CheckoutYear == 2021, endsWith(MaterialType, "BOOK")) |>
  group_by(CheckoutMonth) |>
  summarize(TotalCheckouts = sum(Checkouts)) |>
  arrange(desc(CheckoutMonth)) |>
  collect() |>
  system.time()
   user  system elapsed
 13.460   1.611  12.442
How long does it take to calculate the number of books checked out in each month of 2021?
open_dataset("./data/seattle-library-checkouts",
format = "parquet") |>
filter(CheckoutYear == 2021, endsWith(MaterialType, "BOOK")) |>
group_by(CheckoutMonth) |>
summarize(TotalCheckouts = sum(Checkouts)) |>
arrange(desc(CheckoutMonth)) |>
collect() |>
system.time()
user system elapsed
0.353 0.034 0.085
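The partitioned Parquet query finishes in well under a second versus over 12 seconds for the CSV: the filter on CheckoutYear lets Arrow skip every partition except CheckoutYear=2021, and Parquet's columnar layout means only the columns used in the query are read.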
Currently being written - preview available online soon!
Some resources for PyArrow users (thanks to Alenka Frim for adding this content!):