Fix cdc_order_by with multiple fields.#6346
Conversation
PR #6245 caused a regression for `cdc_order_by` which prevented parsing of order by clauses with multiple fields. That was because we didn't use the right DF function to parse it. This corrects the mistakes and adds a series of tests. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
mythical-fred
left a comment
There was a problem hiding this comment.
Clean fix for the #6245 regression. The added validator (parse_comma_separated(parse_order_by_expr) + expect_token(EOF)) is the right shape, the new parse_cdc_order_by honors DataFusion's nulls_max default per direction, and the test ladder — validator shape table, parser defaults table, Rust multi-key integration with a same-__feldera_ts tiebreak, Python upsert with lsn tiebreak — covers exactly the original failure mode plus the directional/null-placement edges. A couple of small nits below, none blocking.
|
|
||
| order_exprs | ||
| .into_iter() | ||
| .map(|order_expr| { |
There was a problem hiding this comment.
df.parse_sql_expr(&order_expr.expr.to_string()) round-trips the key expression through a string render of sqlparser's AST and reparses it via DataFusion. That works for everything in the test ladder, but it is a fragile contract — anything that formats ambiguously (quoting rules across dialects, function-call serialization, etc.) can silently change semantics between the two parses. Since you already have an order_expr.expr: sqlparser::ast::Expr, the more direct path is to use df.parse_sql_expr only once on the whole clause (e.g. via a tiny helper that hands the sub-Expr to DataFusion's SQL-to-logical-expr translator), or to push the per-key reparse far enough to fail loudly if the round trip drops anything. Not a blocker, but worth a TODO so future-you doesn't have to discover this the hard way.
| /// against df's schema. | ||
| /// | ||
| /// `cdc_order_by` is the body of an ORDER BY clause: a comma-separated list | ||
| /// of keys, each optionally annotated with ASC / DESC and NULLS |
There was a problem hiding this comment.
validate_sql_order_by lives in adapterlib::utils::datafusion and parse_cdc_order_by re-implements the same prefix here (Parser::new(GenericDialect).try_with_sql + parse_comma_separated + EOF check), just with deltalake's sqlparser re-export. They have to stay in lock-step for validate_cdc_order_by to actually pre-validate what build_cdc_dataframe will accept. Worth extracting the &str -> Vec<OrderByExpr> step into a shared helper in adapterlib so the two sites can never drift.
| let expect = |id: u32, label: &str| TestStruct { | ||
| id, | ||
| b: false, | ||
| i: None, |
There was a problem hiding this comment.
Soft: the test expect closure hardcodes b: false and i: None, and only varies s. Fine here, but if you ever extend TestStruct you'll thank yourself for a TestStruct::with_s(id, s) constructor in the test helper module instead.
PR #6245 caused a regression for
cdc_order_bywhich prevented parsing of order by clauses with multiple fields.That was because we didn't use the right DF function to parse it.
This corrects the mistakes and adds a series of tests, we didn't have any that run cdc_order_by with more than one field.
Describe Manual Test Plan
Checklist
Breaking Changes?
Not anymore.