Skip to content

[py]: delta input snapshot test for deletion vectors#6347

Merged
swanandx merged 1 commit into
mainfrom
delta-dv-snapshot-test
Jun 2, 2026
Merged

[py]: delta input snapshot test for deletion vectors#6347
swanandx merged 1 commit into
mainfrom
delta-dv-snapshot-test

Conversation

@swanandx

@swanandx swanandx commented May 29, 2026

Copy link
Copy Markdown
Member

check commit message plz

Describe Manual Test Plan

Ran it locally with FELDERA_HOST=http://localhost:8080 uv run pytest -vv tests/platform/test_delta_input_deletion_vectors.py

Checklist

  • Unit tests added/updated
  • Integration tests added/updated
  • Documentation updated
  • Changelog updated

Breaking Changes?

No, just adding test

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment thread python/tests/platform/test_delta_input_deletion_vectors.py Outdated
Comment thread python/tests/platform/test_delta_input_deletion_vectors.py
Comment thread python/tests/utils.py Outdated
@swanandx swanandx force-pushed the delta-dv-snapshot-test branch 2 times, most recently from f7c48f9 to 7f19065 Compare May 29, 2026 18:50
@swanandx swanandx requested a review from gz May 29, 2026 18:51

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE on 7f19065 (force-push since my prior APPROVE on 768642e).

Material changes vs the version I approved:

  • New test test_delta_input_cdc_multi_key_order_by exercises the multi-key cdc_order_by parse path with a deliberate write order that flips the winner unless the second key (lsn asc) is honored. Solid construction — same-__feldera_ts, loser-first-on-disk is exactly the right adversarial setup.
  • DeltaTestLocation.stable flag plus the updated cleanup() docstring correctly avoid wiping the cached DV fixture between runs, and the rewritten "no-op on CI/MinIO" prose addresses gz's "what is a leaked key" thread.
  • Workflow exposes CI_K8S_MINIO_* env vars so the platform job can reach the test bucket. Comment explains the mapping; OK.

Soft nits, none blocking:

  • _dv_fixture_builder.py pins delta-spark>=4.2,<5 in two places (the docstring and the uv run --with call). Easy drift target — consider a single _DELTA_SPARK_SPEC constant in the test module passed through to the subprocess.
  • The new multi-key test relies on parquet row order from write_deltalake matching the order rows were supplied. If a future deltalake bump starts row-shuffling on write, the test silently becomes tautological. A short comment "if this stops failing without the lsn asc tiebreaker, suspect deltalake row ordering" would warn future readers.

APPROVE stands.

Adds an end-to-end test that the Feldera delta_table_input connector
honors deletion vectors on snapshot reads: stage a Delta table with
200 rows, soft-delete half via a DV-aware DELETE, then assert the
connector surfaces only the 100 active rows.

PySpark is the only tool that can write DVs (delta-rs and the deltalake
wheel can't), so the fixture builder runs in a `uv run --with
delta-spark` subprocess. That subprocess only fires on cache miss, so
Merge Queue runs that reuse the cached MinIO fixture never install
Spark. Bump FIXTURE_VERSION to force a rebuild.

Also fixes DeltaTestLocation.cleanup() to leave the directory alone
when stable_subpath is set, otherwise the local cache was wiped after
every run.

if stable_subpath is set, we should not prefix it with pipeline name which
contains commit hash, hence we use _fixtures prefix if its set

Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
@swanandx swanandx force-pushed the delta-dv-snapshot-test branch from 0582fa4 to 3447159 Compare May 31, 2026 11:28

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Force-push picks up the gz/swanandx feedback cleanly: DeltaTestLocation.stable preserves the cached fixture across runs, the builder is split out as _dv_fixture_builder.py (underscore prefix keeps pytest from collecting it), the MinIO credentials are surfaced to the OSS platform job, and the staging/upload ordering (data files first, then _delta_log last) keeps a mid-upload reader from observing a log referencing a missing parquet. Fixture-versioned via FIXTURE_VERSION — bump invalidates cached MinIO copies. subprocess.run(..., check=True) is the right call for the Spark builder.

@swanandx swanandx added this pull request to the merge queue Jun 2, 2026
Merged via the queue into main with commit e83db97 Jun 2, 2026
1 check passed
@swanandx swanandx deleted the delta-dv-snapshot-test branch June 2, 2026 11:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants