Skip to content

[py]: delta input snapshot test for column mapping#6376

Merged
swanandx merged 2 commits into
mainfrom
col-mapping-test
Jun 5, 2026
Merged

[py]: delta input snapshot test for column mapping#6376
swanandx merged 2 commits into
mainfrom
col-mapping-test

Conversation

@swanandx

@swanandx swanandx commented Jun 2, 2026

Copy link
Copy Markdown
Member

Test delta input connector in snapshot mode on a table with column mapping (columnMapping.mode='name') with rename/add/drop schema evolution, asserting snapshot reads resolve values through the column-mapping metadata.

Factor the shared fixture orchestration out of the deletion-vectors test into tests.utils.ensure_delta_spark_fixture (used by both), move the Spark builders under tests/platform/fixtures/

Describe Manual Test Plan

Ran it locally

Checklist

  • Integration tests added/updated

Breaking Changes?

Adding a test, no breaking change

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice extraction. The ensure_delta_spark_fixture helper is the right level of abstraction — both the DV and column-mapping tests now read like a one-liner, the heavy uv+Spark stack stays gated behind cache-miss, and the _delta_log-last upload ordering invariant is preserved as a single method on DeltaTestLocation instead of being copy-pasted. EXPECTED_ROWS exported from fixtures/column_mapping.py so the test and the builder can't drift is exactly right.

A few non-blocking observations:

  1. The column-mapping test's default is_present is DeltaTestLocation.table_exists, i.e. "any Delta log entries here at all". If a previous, differently-shaped fixture ever lands at the same stable_subpath you'd silently reuse it. FIXTURE_VERSION = "column_mapping_v1" is the intended escape hatch, but a stronger predicate (e.g. "log v6 exists AND has a drop-column metadata op") would catch the case where someone tweaks the builder and forgets to bump the version. The DV test does exactly this via _log_has_dv_entries. Worth either documenting "you must bump FIXTURE_VERSION on any builder change" in column_mapping.py, or pointing the predicate at something the schema-evolution path uniquely produces.

  2. ALTER TABLE ... DROP COLUMN under column mapping requires Delta writer v5 + reader v2 (which you set) plus the columnMapping table feature; on Spark 4.x Delta 4.x this is fine, but the delta-spark>=4.2,<5 pin in _FIXTURE_BUILDER's docstring is what guarantees it. If anyone ever relaxes that pin, the fixture will start failing in a confusing way. A one-line note in the builder ("requires delta-spark>=4.x for ALTER COLUMN under column mapping") would help future-you.

  3. ensure_delta_spark_fixture(builder_args=(...)) stringifies every arg via str(arg), so a None becomes the literal "None". Today both callers pass ints, but a single-line note in the docstring ("args are stringified verbatim — pass primitives, not None") would head off the eventual footgun.

  4. Tiny: test_delta_input_column_mapping.py builds a SQL fragment that re-implements connector JSON wrapping that's near-identical to the DV test (json.dumps([...]).replace("'", "''")). If you have a third call site coming, that's a candidate for a tests.utils helper too — not for this PR.

Approve.

Comment thread python/tests/utils.py Outdated
raise
return dt.count()

def table_exists(self) -> bool:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the comment is right I would call this function "delta_log_exists"

Comment thread python/tests/utils.py Outdated
"""Place a locally-built Delta table tree onto this location's backend.

Some fixtures can only be produced on the local filesystem (e.g. a
PySpark-written table) yet must be read back from the configured

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"configured"? This comment seems to imply a bunch of stuff that I don't understand.

Comment thread python/tests/utils.py
Comment thread python/tests/utils.py Outdated
if not present(loc):
raise RuntimeError(
f"Delta fixture at {loc.uri} is still absent after upload — "
"partial upload, or content-stripping middleware?"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this guess about who did it help in any way?
Is it always right?

Parquet column names (``col-<uuid>``) that differ from the logical names, so a
correct snapshot read must resolve every value through the column-mapping
metadata in the Delta log. PySpark seeds the fixture (it is the only writer
that supports column-mapping rename/drop); the builder lives in

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

currently supports

f"""
CREATE TABLE {t} (id BIGINT, name STRING, amount DOUBLE)
USING delta
TBLPROPERTIES ('delta.columnMapping.mode' = 'name',

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to test the id mode as well.

swanandx added 2 commits June 4, 2026 15:45
Test delta input connector in snapshot mode on a table with column mapping
(columnMapping.mode='name') with rename/add/drop schema evolution,
asserting snapshot reads resolve values through the column-mapping
metadata.

Factor the shared fixture orchestration out of the deletion-vectors test
into tests.utils.ensure_delta_spark_fixture (used by both), move the
Spark builders under tests/platform/fixtures/

Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
@swanandx swanandx force-pushed the col-mapping-test branch from 3b98419 to bda9434 Compare June 4, 2026 11:40
@swanandx swanandx requested review from mihaibudiu and ryzhyk and removed request for gz June 4, 2026 11:41

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE on bda9434.

All four prior nits addressed:

  1. Default presence check renamed table_exists -> delta_log_exists (utils.py), and the column-mapping path now has its own FIXTURE_VERSION bumped on builder changes — the dependency between builder mutation and cache invalidation is now spelled out in the builder module docstring.
  2. Builder docstring now states the delta-spark 4.x requirement explicitly ("DROP COLUMN under column mapping requires delta-spark 4.x (Delta writer v5, reader v2)").
  3. builder_args stringification semantics documented on ensure_delta_spark_fixture — each arg goes through str() verbatim, None would render as the literal "None". Cheap, clear, no validator wedged in.
  4. Coverage expanded by adding the mode='id' case via a _run_column_mapping_snapshot_test helper rather than pytest.mark.parametrize, with an inline rationale (parametrize would collide pipeline names under pytest -n). Good call.

Bonus cleanup: --no-project on uv run keeps the builder hermetic; the failure message no longer speculates about content-stripping middleware.

LGTM.

@swanandx swanandx added this pull request to the merge queue Jun 5, 2026
Merged via the queue into main with commit b84da38 Jun 5, 2026
1 check passed
@swanandx swanandx deleted the col-mapping-test branch June 5, 2026 07:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants