[SQL] Support INTERSECT/EXCEPT ALL by rewriting into JOIN of inputs with ROW_NUMBER#6438
[SQL] Support INTERSECT/EXCEPT ALL by rewriting into JOIN of inputs with ROW_NUMBER#6438mihaibudiu wants to merge 3 commits into
Conversation
3a3cdae to
63a50ba
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Two issues — one real bug, one prose nit. The rewrite rule itself is clean and well-documented; good test coverage including NULL semantics and multi-way INTERSECT ALL.
Also: commit subject says "INSERSECT" — worth fixing with git rebase -i before merge.
mythical-fred
left a comment
There was a problem hiding this comment.
Re-review for EXCEPT ALL scope expansion (prior approval was at 26fa626 for INTERSECT ALL only).
Verdict: LGTM
The EXCEPT ALL implementation mirrors the (already-reviewed) INTERSECT ALL one and is correct.
Correctness notes
ExceptAllToJoinRule: LEFT JOIN A vs B onIS NOT DISTINCT FROMover data cols +$rn, filtered byb.$rn IS NULL. A row(v, k)from A survives iff B has no row(v, k), i.e. iffk > count_B(v). Surviving k's arecount_B(v)+1 .. count_A(v), giving exactlymax(0, count_A(v) - count_B(v))copies. Multiset-difference semantics. ✓- Field-offset arithmetic checks out: left
$rnat positionfieldCount, right$rnat2*fieldCount + 1. ✓ - Nice reuse of
IntersectAllToJoinRule.augmentWithRowNumberfromExceptAllToJoinRule. ✓ ExceptAllFoldRule(n>2 → binary) is defensive: only fires if Calcite produced an n-aryMinusupstream (e.g. viaMinusMergeRule). Harmless if it never matches; the parser-emitted left-associative nested binaries are handled byExceptAllToJoinRuledirectly.CalciteToDBSPCompiler.visitMinus/visitIntersectcorrectly converted fromUnimplementedExceptiontoInternalCompilerError— good fail-loud if the rule didn't fire.
Test coverage gap (non-blocking, please consider)
SetOpTests.java has a thorough testIntersectAll* series (basic, minCounts, empty, null, nullBothSides, multiColumn, threeWay) but no symmetric testExceptAll* series. EXCEPT ALL is only covered by two inline cases in calciteExcept (basic + nulls). Worth adding tests for:
- empty left / empty right
- multi-column rows
- three-way left-associative
A EXCEPT ALL B EXCEPT ALL C count_B(v) > count_A(v)edge (result has 0 copies)count_B(v) == count_A(v)edge (all consumed)
Not a blocker — calciteExcept plus the planner symmetry with INTERSECT ALL give reasonable confidence.
Minor
IS_NOT_DISTINCT_FROMon$rnis fine (ROW_NUMBER is non-null so it reduces to=); a plain=on$rnwould be slightly clearer but won't change planning.
Approved.
| // Right side contains all left values: result is empty. | ||
| this.qst(""" | ||
| SELECT * FROM (VALUES (1), (2)) AS T(x) | ||
| EXCEPT |
There was a problem hiding this comment.
Non-blocking: please add testExceptAll* tests symmetric to the testIntersectAll* set: empty inputs, multi-column, three-way (left-associative), and the count_B(v) >= count_A(v) boundary cases. Currently EXCEPT ALL is covered only by the two inline cases in calciteExcept, which is much thinner than the INTERSECT ALL coverage in this same file.
| builder.push(inputs.get(0)); | ||
|
|
||
| // UNION ALL inputs[1..n-1] into a single right-side relation. | ||
| builder.push(inputs.get(1)); |
There was a problem hiding this comment.
FYI: Calcite's SQL parser emits A EXCEPT ALL B EXCEPT ALL C as nested binary Minus nodes, so this rule only fires if something upstream (e.g. MinusMergeRule) merges them into an n-ary node. I don't see MinusMergeRule registered in the optimizer pipeline, so this rule may be dead code today. Harmless either way — leaving it in is reasonable defensive coding — but worth a one-line comment about the precondition, or a test that exercises an n-ary Minus to make sure the path is alive.
|
Needs #6472 to be merged first |
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
…_NUMBER Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
Signed-off-by: Mihai Budiu <mbudiu@feldera.com>
mythical-fred
left a comment
There was a problem hiding this comment.
Re-APPROVE on ea2ce79.
Since the prior review at af46f21:
- Rebased onto 92f03be (EXCEPT/INTERSECT nullability fix merged into main). Rebase brings extra null-input filtering into
CalciteToDBSPCompilerfor n-ary Minus/Intersect, plus the correspondingfilterNonNullFieldssignature refactor — orthogonal to this PR, already approved on its own merit. - New
[SQL] Use stepWeightOne in more testscommit: purely mechanical replacement ofstep(... empty ...)withstepWeightOne(...)across 8 test files. Empty-expected blocks shrink fromr | weight\n----tor\n---. No behavioural change, no risk. - New SetOpTests cases pick up my prior nit asking for a symmetric
testExceptAll*ladder via thecheckSetOpNullTwoIntshelper (16-combination NULL cross-product for both EXCEPT and INTERSECT) plus targetedtestExceptMultiField,testExceptNullExcluded,testIntersectMultiField. Good coverage.
Carry forward: my earlier soft note about ExceptAllFoldRule possibly being dead code (no MinusMergeRule in the pipeline today) stands but is non-blocking.
Fixes #6436
Fixes #5483
The first commit cleans up some tests.
The second commit implements INTERSECT ALL.
The third commit implementsEXCEPT ALL.
INTERSECT ALLis eliminated using a Calcite rewrite rule, and converted into a window query for each input and a join tree.For two inputs A and B with columns c0, c1, ..., cn the transformation produces:
The algorithm for EXCEPT ALL is very similar, just using a left join.
Describe Manual Test Plan
Ran all the new Java tests.
Checklist