This Week on the Arrow Dev Mailing List

2021/04/27

Inspired by “This Week in Ballista”, and my desire to keep up-to-date with everything happening in Arrow, here’s my first “This Week in Arrow”, summarising the main occurrences on the dev mailing list from 20th - 27th April 2021.

Release candidate 1

There were votes on a few release candidates. RC1 was blocked by a number of issues. One was a change that made the default memory pool prefer mimalloc over jemalloc on macOS, where mimalloc had been shown to perform better. The original change had not fully achieved this, so a follow-up change was made to complete it.

There was a bug in a new API whereby if there were errors during scanning, ScanBatches() hung instead of erroring. Code changes were added to ensure that errors were raised.

Gandiva LRU cache replacement

There was a discussion around replacing the LRU cache in Gandiva. Whereas an LRU (least recently used) cache evicts entries based only on when they were last used, the proposed new cache algorithm also takes into account how long each expression takes to build.
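To make the difference concrete, here is a minimal Python sketch of a cache whose eviction weighs recency against build cost. It is not the algorithm proposed for Gandiva (the thread's details aren't reproduced here); it uses the well-known GreedyDual scheme as a stand-in, and the class name, the `build_cost` parameter and the example keys are all made up for illustration.

```python
class CostAwareCache:
    """Tiny GreedyDual-style cache (illustrative only, not Gandiva's code)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.inflation = 0.0  # rises on each eviction, so stale priorities age out
        self.entries = {}     # key -> (value, build_cost, priority)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, cost, _ = entry
        # A hit refreshes the priority, much as an LRU refreshes recency.
        self.entries[key] = (value, cost, self.inflation + cost)
        return value

    def put(self, key, value, build_cost):
        if key not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry with the lowest priority, not simply the oldest.
            victim = min(self.entries, key=lambda k: self.entries[k][2])
            self.inflation = self.entries[victim][2]
            del self.entries[victim]
        self.entries[key] = (value, build_cost, self.inflation + build_cost)


cache = CostAwareCache(capacity=2)
cache.put("cheap_expr", "module A", build_cost=1.0)
cache.put("costly_expr", "module B", build_cost=50.0)
cache.get("cheap_expr")  # cheap_expr is now the most recently used entry
cache.put("new_expr", "module C", build_cost=5.0)
```

In this run a plain LRU would evict costly_expr, since it was touched least recently. The cost-aware version evicts cheap_expr instead, because rebuilding it is far cheaper than rebuilding costly_expr.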

Copying Rust to new repos

The Rust components are being copied across to the new repositories and there was discussion around filtering git history via git-filter-repo, updating the CI, adding integration tests, and getting ready to accept PRs in the new repos.

File extension

There was a discussion around registering the Arrow format with IANA as a media type. Two types were being discussed, the streaming format and the file format, and the general opinion appeared to be that they should be different types.

Random number generation

Random number generation was slow on ARM64, with significant differences between different compilers (with clang being much slower than gcc). It was caused by soft-float math, and resolved by supplying the “-ffast-math” argument to clang.

Rust parquet2

One contributor has been experimenting with rewriting the Rust parquet implementation so that it doesn’t use “unsafe” (the keyword that allows functionality whose memory safety isn’t guaranteed), performs better, and improves on other things. There was discussion of moving their implementation to an official Apache repo, with people suggesting how to go about it.

Python-datafusion

There was discussion around adding python-datafusion to the project. Python-datafusion allows DataFusion to be used from Python. Questions were asked about whether the plan was to move it into the monorepo or keep it as a separate Apache repo. The author suggested at least doing releases separately, so that it can be versioned independently of pyarrow and is not automatically bundled with it.

compute::isin rejects duplicates

One of the C++ compute functions, isin, has a parameter, value_set, which in Arrow 4.0.0 raises an error if it contains duplicate values. A user noted that this differs from the behaviour in Arrow 2.0.0, and a ticket was opened to change it.

pyarrow custom metadata

A user asked about the use of custom metadata in PyArrow; this is a feature in pandas that is not fully implemented in PyArrow. It was clarified that there are no plans to migrate all features from pandas to pyarrow, and that pyarrow is intended to be a back-end whereas pandas is both back-end and front-end. A ticket was opened to add examples of using custom metadata.

Release candidate 3

The release candidate was verified by various people in multiple formats on multiple platforms. The vote was carried, and a draft blog post about the release was started.

Miscellaneous