[py]: delta input snapshot test for deletion vectors#6347
Conversation
f7c48f9 to
7f19065
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
Re-APPROVE on 7f19065 (force-push since my prior APPROVE on 768642e).
Material changes vs the version I approved:
- New test
test_delta_input_cdc_multi_key_order_byexercises the multi-keycdc_order_byparse path with a deliberate write order that flips the winner unless the second key (lsn asc) is honored. Solid construction — same-__feldera_ts, loser-first-on-disk is exactly the right adversarial setup. DeltaTestLocation.stableflag plus the updatedcleanup()docstring correctly avoid wiping the cached DV fixture between runs, and the rewritten "no-op on CI/MinIO" prose addresses gz's "what is a leaked key" thread.- Workflow exposes
CI_K8S_MINIO_*env vars so the platform job can reach the test bucket. Comment explains the mapping; OK.
Soft nits, none blocking:
_dv_fixture_builder.pypinsdelta-spark>=4.2,<5in two places (the docstring and theuv run --withcall). Easy drift target — consider a single_DELTA_SPARK_SPECconstant in the test module passed through to the subprocess.- The new multi-key test relies on parquet row order from
write_deltalakematching the order rows were supplied. If a future deltalake bump starts row-shuffling on write, the test silently becomes tautological. A short comment "if this stops failing without thelsn asctiebreaker, suspect deltalake row ordering" would warn future readers.
APPROVE stands.
7f19065 to
0582fa4
Compare
Adds an end-to-end test that the Feldera delta_table_input connector honors deletion vectors on snapshot reads: stage a Delta table with 200 rows, soft-delete half via a DV-aware DELETE, then assert the connector surfaces only the 100 active rows. PySpark is the only tool that can write DVs (delta-rs and the deltalake wheel can't), so the fixture builder runs in a `uv run --with delta-spark` subprocess. That subprocess only fires on cache miss, so Merge Queue runs that reuse the cached MinIO fixture never install Spark. Bump FIXTURE_VERSION to force a rebuild. Also fixes DeltaTestLocation.cleanup() to leave the directory alone when stable_subpath is set, otherwise the local cache was wiped after every run. if stable_subpath is set, we should not prefix it with pipeline name which contains commit hash, hence we use _fixtures prefix if its set Signed-off-by: Swanand Mulay <73115739+swanandx@users.noreply.github.com>
0582fa4 to
3447159
Compare
mythical-fred
left a comment
There was a problem hiding this comment.
LGTM. Force-push picks up the gz/swanandx feedback cleanly: DeltaTestLocation.stable preserves the cached fixture across runs, the builder is split out as _dv_fixture_builder.py (underscore prefix keeps pytest from collecting it), the MinIO credentials are surfaced to the OSS platform job, and the staging/upload ordering (data files first, then _delta_log last) keeps a mid-upload reader from observing a log referencing a missing parquet. Fixture-versioned via FIXTURE_VERSION — bump invalidates cached MinIO copies. subprocess.run(..., check=True) is the right call for the Spark builder.
check commit message plz
Describe Manual Test Plan
Ran it locally with
FELDERA_HOST=http://localhost:8080 uv run pytest -vv tests/platform/test_delta_input_deletion_vectors.pyChecklist
Breaking Changes?
No, just adding test