[migrations] Spark-to-Feldera migration tool PoC. by wilmaontherun · Pull Request #5837 · feldera/feldera

wilmaontherun · 2026-03-16T17:50:22Z

CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL.

Requires Anthropic API key in felderize/.env

Describe Manual Test Plan

No automated tests yet. Tested manually using examples in the demo folder.

mihaibudiu · 2026-03-19T16:27:01Z

We should build a library with compatibility functions that people can just reuse, especially if they can be written in SQL.

mihaibudiu · 2026-03-25T17:04:18Z

@anandbraman I think we should both review this PR

mihaibudiu

Maybe @anandbraman and I should make the fixes I recommend and take over this project.

anandbraman · 2026-03-27T15:38:26Z

+
+| Function | Notes |
+|----------|-------|
+| `split_part(str, delim, n)` | Feldera's `SPLIT_PART` treats the delimiter as a regex — special chars like `.` match any character and produce wrong results. Negative indices not supported. Always mark unsupported. |


Minor correction - feldera split_part delimiter is not a regex.

mihaibudiu · 2026-05-21T23:21:35Z

+| `a \| b` | Bitwise OR — `\|` is a parse error in Feldera |
+| `a & b` | Bitwise AND — parses but "Not yet implemented" |
+| `a ^ b` | Bitwise XOR — parses but "Not yet implemented" |
+| `shiftleft(a, n)` / `shiftright(a, n)` / `shiftrightunsigned(a, n)` | Bitwise shift — no equivalent |


These can be implemented using multiplication/division and concatenation for BINARY...

discussed: for now, unsupported

mihaibudiu · 2026-05-21T23:23:47Z

+
+### Spark colon-syntax (semi-structured field access)
+
+Databricks/Spark SQL allows `col:field` as shorthand for extracting a JSON sub-field from a VARCHAR/STRING column containing JSON. Translate it as PARSE_JSON bracket access:


I think this can now be supported, someone added it to Calcite, I will check whether we support it.

Discussed. We keep it as unsupported for now

mythical-fred

High-level pass on the draft. Saving the full review for when this leaves draft; here are the architectural questions worth answering before then.

1. Package layout is going to confuse people. pyproject.toml says [tool.setuptools.package-dir] felderize = "spark", so the on-disk directory is python/felderize/spark/ but the import name is felderize and every internal import is from felderize.X. That mapping works but is not discoverable — anyone landing in python/felderize/spark/translator.py will wonder why a file inside spark/ imports from felderize.constants rather than from .constants. Either:

rename the dir to python/felderize/felderize/ and make spark/ an actual felderize.spark subpackage (and leave room for felderize.trino, etc.), or
drop the multi-dialect ambition and pick one layout cleanly.

The README line "Spark SQL is currently the only supported dialect. Support for additional dialects is planned." is the tell — if a second dialect is coming, the package layout should already reflect it.

2. Docs are fetched live from main. DEFAULT_DOCS_BASE_URL = raw.githubusercontent.com/feldera/feldera/main/docs.feldera.com/docs/sql. The user has just downloaded a pinned-version compiler JAR via felderize download-compiler, and the LLM gets HEAD docs. When HEAD adds (or removes) a feature, the prompt and the validator disagree and the repair loop burns LLM tokens chasing a phantom. Pin the docs URL to the same release tag the JAR came from (or to the felderize release), and let FELDERA_DOCS_BASE_URL override for development.

3. AnthropicClient is the only impl behind the LLMClient ABC. Either add a second backend now (the project title is dialect-agnostic; the model should not be tied to a single vendor) or drop the ABC and stop pretending. Right now create_client(config) is return AnthropicClient(config) — pure ceremony.

4. max_tokens = 64000 default. Most Anthropic models reject this without the appropriate output-128k beta header; users who point FELDERIZE_MODEL at anything other than your internal endpoint will get a 400. Either default to 16k and let power users raise it, or send the beta header explicitly when above 32k. Document the gotcha.

5. Cost / token budget. Worst-case path is: pass 1 = 1 + max_retries LLM calls; pass 2 (with_docs=True) = another 1 + max_retries. With defaults that is 8 calls, each up to 64k output tokens, each carrying the full ~1275-line skills doc plus relevant Feldera docs in the prompt. On a cold cache that is real money per query. At minimum, expose a --max-cost-usd or --max-llm-calls cap and log running token totals to stderr (the per-call log is already there — sum them).

6. Zero tests. Hard rule, even on a PoC living next to python/feldera/ and python/dbt-feldera/ which both ship tests. A lot of this surface is pure-Python and trivially unit-testable with fake LLM responses:

translator.trim_schema_to_referenced (regex schema pruner) — straightforward in/out cases.
translator.split_combined_sql — round-trip with several fixtures.
translator._parse_response — markdown-fenced JSON, raw JSON, malformed.
translator._build_result — Status logic on empty query / unsupported list.
prompts/* builders — assert templates render with known inputs.
skills.load_skills and _load_user_rules — frontmatter parsing edge cases (good and bad YAML), priority ordering.
docs._SPARK_ALIASES / _EXTRA_PATTERNS — assert the right doc categories are selected for sample SQL.
feldera_client.validate_sql — fake compiler binary returning canned stderr; assert errors are parsed correctly.

The whole validate→repair loop can be exercised with a stub LLMClient that returns scripted responses; that is the highest-value test target because it is where regressions will silently degrade output quality.

7. Validation loop has a redundant tail. _translate_with_repair runs for attempt in range(max_retries): validate; if errors, repair, then outside the loop runs another validate_sql(full_sql). That tail only matters when the last iteration's repair produced fresh SQL that was never validated — which is precisely the path it covers, so the logic is correct, but the structure makes it hard to see. Refactor into a while attempts <= max_retries shape and you can collapse the duplicate. Tests (see 6) would have caught the mental load here.

8. python/felderize/.env is the only .env location. Config.from_env loads Path(__file__).resolve().parent.parent / ".env". That hardwires config to the source checkout. If anyone ever pip install felderize from PyPI, that path lands inside site-packages/. Also try Path.cwd() / ".env" (and prefer it) so per-project config works and the package is relocatable.

9. Skills doc is 1275 lines. That file is the product — the repair loop's quality ceiling is set by it. Worth saying out loud in the PR description: how was it curated, who maintains it as Feldera SQL evolves, what is the plan for keeping it in sync with the compiler? If the answer is "we eyeball it", that is fine for a PoC but should be a known limitation in the README.

10. Manual-only test plan is acceptable for PoC. No behavior-change-without-test blocker fires because nothing else in the repo depends on this code. But the PoC will not graduate without (6).

Nothing here is a draft blocker — looking forward to seeing this evolve.

Felderize is a tool to convert SQL programs from other dialects to Feldera SQL. The initial implementation supports Spark SQL only. Felderize is implemented as a CLI tool that requires an Anthropic API key. See python/felderize/README for details. Later we will release a skill that can be used with Claude without a CLI tool or an API key. Signed-off-by: wilma <wilmaontherun@gmail.com>

mihaibudiu · 2026-05-25T18:16:08Z

+
+#### ⚠️ Behavioral differences (Spark vs Feldera)
+
+**[GBD-SAFE-CAST]** `SAFE_CAST` is not a perfect equivalent of Spark's `TRY_CAST` for all types. For REAL/DOUBLE/BOOLEAN targets, `SAFE_CAST` returns the type default (`0` for numeric, `false` for boolean) instead of NULL on failure — there is no exact NULL-on-failure equivalent for those types in Feldera.


SAFE_CAST and CAST should have the exact same semantics except that SAFE_CAST never panics, it returns NULL where CAST panics.

Now, the CAST semantics may differ between Spark and Feldera, but the SAFE_CAST should never differ from CAST

With the addition of felderize, setuptools now sees both feldera and felderize as top-level packages while building the python docs. In this commit, we tell pyproject.toml to only look at packages "feldera" and "feldera.*". This fixes the docs build that previously failed in CI. Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>

wilmaontherun requested review from gz, mihaibudiu and ryzhyk March 16, 2026 17:50

wilmaontherun force-pushed the felderize branch 4 times, most recently from 4d43839 to 046f795 Compare March 16, 2026 18:56

mihaibudiu reviewed Mar 16, 2026

View reviewed changes

wilmaontherun force-pushed the felderize branch from 6bd2d52 to 14e7cc6 Compare March 20, 2026 03:07

mihaibudiu requested a review from anandbraman March 25, 2026 17:04

mihaibudiu reviewed Mar 25, 2026

View reviewed changes

anandbraman reviewed Mar 27, 2026

View reviewed changes

wilmaontherun force-pushed the felderize branch 13 times, most recently from 3087437 to d272110 Compare May 19, 2026 05:38

mihaibudiu reviewed May 21, 2026

View reviewed changes

mythical-fred reviewed May 29, 2026

View reviewed changes

wilmaontherun force-pushed the felderize branch from d767244 to ca7edcc Compare June 1, 2026 20:26

wilmaontherun force-pushed the felderize branch from 846cea1 to 55f69b1 Compare June 8, 2026 05:35

wilmaontherun marked this pull request as ready for review June 8, 2026 05:36

wilmaontherun enabled auto-merge June 8, 2026 05:37

mihaibudiu approved these changes Jun 8, 2026

View reviewed changes

wilmaontherun added this pull request to the merge queue Jun 8, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 8, 2026

abhizer enabled auto-merge June 9, 2026 19:35

abhizer added this pull request to the merge queue Jun 9, 2026

Merged via the queue into main with commit 40ce5cb Jun 10, 2026
1 check passed

abhizer deleted the felderize branch June 10, 2026 00:57


		### Spark colon-syntax (semi-structured field access)

		Databricks/Spark SQL allows `col:field` as shorthand for extracting a JSON sub-field from a VARCHAR/STRING column containing JSON. Translate it as PARSE_JSON bracket access:


		#### ⚠️ Behavioral differences (Spark vs Feldera)

		[GBD-SAFE-CAST] `SAFE_CAST` is not a perfect equivalent of Spark's `TRY_CAST` for all types. For REAL/DOUBLE/BOOLEAN targets, `SAFE_CAST` returns the type default (`0` for numeric, `false` for boolean) instead of NULL on failure — there is no exact NULL-on-failure equivalent for those types in Feldera.

Conversation

wilmaontherun commented Mar 16, 2026

Describe Manual Test Plan

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mihaibudiu commented Mar 19, 2026

Uh oh!

mihaibudiu commented Mar 25, 2026

Uh oh!

mihaibudiu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

anandbraman Mar 27, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mihaibudiu May 21, 2026

Choose a reason for hiding this comment

Uh oh!

wilmaontherun Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mihaibudiu May 21, 2026

Choose a reason for hiding this comment

Uh oh!

wilmaontherun Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

mihaibudiu May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants