perf: bypass values.value(i) for inline strings in ArrowBytesViewMap#22172
perf: bypass values.value(i) for inline strings in ArrowBytesViewMap#22172RyanJamesStewart wants to merge 2 commits into
Conversation
For inline strings (len <= 12), insert_if_new_inner called values.value(i) in the new-value branch even though the same bytes are already packed in view_u128 (the ByteView inline format is [len:u32 LE][data:12 bytes zero-padded]). Add append_inline_view helper and branch on len <= 12 to push view_u128 directly, skipping the views-buffer round-trip. The stored view is byte-identical to make_view(value, 0, 0) for inline inputs; the existing test suite confirms correctness. Closes apache#20054
alamb
left a comment
There was a problem hiding this comment.
Thank you @RyanJamesStewart -- this looks like a nice find and the code looks good to me. I had a few suggestions. Let me know what you think
The unsafe bit I think we need to do before merging this in. I also think the benchmark is probably not worth checking in
| @@ -0,0 +1,91 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
There was a problem hiding this comment.
While this benchmark is cool, I am not sure we need to commit it to the repo as I think its utility will be low after this PR has been merged (I think it is unlikely to get run regularly)
| /// ByteView (length field in the low 32 bits is <= 12). | ||
| /// | ||
| /// Returns the view that was stored (identical to the argument). | ||
| fn append_inline_view(&mut self, view: u128) -> u128 { |
There was a problem hiding this comment.
Given this assumption I think this function should be marked as unsafe (I think it is correct, but the compiler can't verify it)
| fn append_inline_view(&mut self, view: u128) -> u128 { | |
| unsafe fn append_inline_view(&mut self, view: u128) -> u128 { |
Then this will force us to annotate the callsite with unsafe as well where it can be more easily verified that the input was a valid inline view so this function is safe
|
FYI @Dandandan |
|
run benchmarks |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/binary-view-map-inline-bypass (1364643) to 18a219c (merge-base) diff using: tpcds File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/binary-view-map-inline-bypass (1364643) to 18a219c (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/binary-view-map-inline-bypass (1364643) to 18a219c (merge-base) diff using: tpch File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpch — base (merge-base)
tpch — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagetpcds — base (merge-base)
tpcds — branch
File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
|
run benchmark clickbench_partitioned |
|
(I am trying to see if the last run was noise) |
|
🤖 Benchmark running (GKE) | trigger CPU Details (lscpu)Comparing perf/binary-view-map-inline-bypass (1364643) to 18a219c (merge-base) diff using: clickbench_partitioned File an issue against this benchmark runner |
|
🤖 Benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usageclickbench_partitioned — base (merge-base)
clickbench_partitioned — branch
File an issue against this benchmark runner |
Addresses alamb's review on PR apache#22172: * Mark `append_inline_view` as `unsafe fn` (binary_view_map.rs:411). The function relies on the caller passing a valid inline ByteView (len <= 12 in the low 32 bits + 12 zero-padded value bytes); the compiler can't verify that invariant from the signature. Marking it unsafe makes the safety contract explicit and forces the single call site in `insert_if_new_inner` to annotate, where the `if len <= 12` branch establishes the invariant. # Safety paragraph on the function spells out what goes wrong if a non-inline view is passed (downstream `views` consumers interpret the high 96 bits as [buffer_index, offset], which is unsound). * Drop benches/binary_view_map_insert.rs and its [[bench]] entry in the crate's Cargo.toml. alamb's note: the bench is unlikely to be run regularly post-merge and its utility drops after the fix lands. Behavior unchanged. 8/8 binary_view_map tests pass; clippy + fmt clean.
|
v2 pushed (72bea57):
Behavior unchanged. 8/8 binary_view_map tests pass; clippy + fmt clean. |
The run was noise |
alamb
left a comment
There was a problem hiding this comment.
Thank you @RyanJamesStewart
| // SAFETY: the enclosing `if len <= 12` branch establishes | ||
| // that view_u128 is a valid inline ByteView. Its low 32 | ||
| // bits encode `len` (<= 12) and the next 12 bytes are the | ||
| // value bytes the source array produced for this row, so | ||
| // every required invariant of `append_inline_view` holds. |
There was a problem hiding this comment.
this seems overly verbose -- I think the justification is that view_u128 was a valid view and len <= 12
| // SAFETY: the enclosing `if len <= 12` branch establishes | |
| // that view_u128 is a valid inline ByteView. Its low 32 | |
| // bits encode `len` (<= 12) and the next 12 bytes are the | |
| // value bytes the source array produced for this row, so | |
| // every required invariant of `append_inline_view` holds. | |
| // SAFETY: view_u128 was a valid view, and the encosing`len <= 12` | |
| // ensures it is inline |
|
Will plan to merge this in the next day or two to give some others time to review if they want |
Which issue does this PR close?
Closes #20054.
Rationale for this change
ArrowBytesViewMap::insert_if_new_inneris the hot loop behindCOUNT DISTINCTandGROUP BYoverStringViewArrayandBinaryViewArraycolumns. The new-value branch callsvalues.value(i)to reconstruct bytes from the views buffer, but for inline strings (len <= 12) those bytes are already packed in the localview_u128. The Arrow inline ByteView format stores[len:u32 LE][data:12 bytes zero-padded]inside the u128, sovalues.value(i)performs an avoidable buffer deref per first-seen distinct inline value.Dandandan flagged the same gap in the PR #19975 review thread that spawned this issue: "We should avoid this for inlined values (and it's the same as above
input_value, so now it does it twice." The fix has not landed in the five months since.The equivalence is bit-level: for an inline input,
make_view(value, 0, 0)produces a u128 byte-identical to the inputview_u128(same length in low 4 bytes, same data in high 12 bytes, zero-padded the same way). Pushingview_u128directly intoself.viewsproduces the same stored state as the round trip throughvalues.value(i)andappend_value.AI-assistance disclosure: this PR was developed with AI assistance. The equivalence argument, the choice to defer the non-inline K-collision case to a separate PR, and the decision to add a path-specific bench (since the existing
aggregate_vectorized.rsexercisesByteViewGroupValueBuilder, not this map) are decisions I understand end-to-end and can justify in review. No unknowns are masked.What changes are included in this PR?
fn append_inline_view(&mut self, view: u128) -> u128inbinary_view_map.rs.insert_if_new_innernew-value path branches onlen <= 12: inline case extracts bytes fromview_u128.to_le_bytes()[4..4 + len as usize], calls the new helper, and skips thevalues.value(i)plusmake_viewround trip. The non-inline path is unchanged.datafusion/physical-expr-common/benches/binary_view_map_insert.rsexercising the patched code path onStringViewArraywith all-distinct inline strings, plus a registration entry in the crate'sCargo.toml.The non-inline K-collision case (where
values.value(i)is called K+1 times when there are K hash and prefix collisions) is a separate concern and is deferred to a future PR if it proves worth pursuing.Are these changes tested?
cargo test -p datafusion-physical-expr-common --lib binary_view_map: 8 of 8 existing tests pass. The existing suite covers inline, non-inline, binary, and non-UTF-8 paths; any byte-order or off-by-one error in the inline view extraction would surface as a wrong hash bucket or a wrong lookup result.cargo clippy -p datafusion-physical-expr-common --all-targets -- -D warnings: clean.The new bench measures
ArrowBytesViewSet_insert_if_newon all-distinctStringViewArrayinputs at len=8 (worst case: every row hits the new-value branch). Results on a Ryzen 9 9950X:Criterion reports "Performance has improved" at all three sizes with p = 0.00. The delta narrows at 100K because the hash map itself starts dominating when all values are distinct; production workloads with repeated values benefit proportionally to the first-seen-distinct count.
Are there any user-facing changes?
No public API changes.
append_inline_viewis private to the crate;insert_if_new_innerkeeps its signature. Theapi changelabel is not required.