Skip to content

[migrations] Spark-to-Feldera migration tool PoC.#5837

Merged
abhizer merged 2 commits into
mainfrom
felderize
Jun 10, 2026
Merged

[migrations] Spark-to-Feldera migration tool PoC.#5837
abhizer merged 2 commits into
mainfrom
felderize

Conversation

@wilmaontherun

Copy link
Copy Markdown
Collaborator

CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL.

Requires Anthropic API key in felderize/.env

Describe Manual Test Plan

No automated tests yet. Tested manually using examples in the demo folder.

@wilmaontherun wilmaontherun force-pushed the felderize branch 4 times, most recently from 4d43839 to 046f795 Compare March 16, 2026 18:56
Comment thread python/felderize/spark/data/docs/sql/array.md Outdated
Comment thread python/felderize/spark/data/samples/array_lambda.md Outdated
Comment thread python/felderize/spark/data/samples/null_safe_equality.md Outdated
Comment thread python/felderize/spark/data/skills/function-reference/SKILL.md Outdated
Comment thread python/felderize/spark/data/skills/function-reference/SKILL.md Outdated
Comment thread python/felderize/spark/data/skills/query-rewrite/SKILL.md Outdated
Comment thread python/felderize/spark/data/skills/query-rewrite/SKILL.md Outdated
Comment thread python/felderize/spark/data/skills/query-rewrite/SKILL.md Outdated
Comment thread python/felderize/spark/data/skills/query-rewrite/SKILL.md Outdated
Comment thread python/felderize/felderize/translator.py
@mihaibudiu

Copy link
Copy Markdown
Contributor

We should build a library with compatibility functions that people can just reuse, especially if they can be written in SQL.

@mihaibudiu

Copy link
Copy Markdown
Contributor

@anandbraman I think we should both review this PR

@mihaibudiu mihaibudiu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe @anandbraman and I should make the fixes I recommend and take over this project.

Comment thread python/felderize/README.md Outdated
Comment thread python/felderize/spark/data/samples/aggregation_date_trunc.md Outdated
Comment thread python/felderize/spark/data/samples/array_map_functions.md Outdated
Comment thread python/felderize/spark/data/samples/json_extraction.md Outdated
Comment thread python/felderize/spark/data/samples/json_extraction.md Outdated
Comment thread python/felderize/spark/data/skills/spark_skills.md Outdated
Comment thread python/felderize/spark/data/skills/spark_skills.md Outdated
Comment thread python/felderize/felderize/cli.py
Comment thread python/felderize/spark/cli.py Outdated
Comment thread python/felderize/spark/docs.py Outdated

| Function | Notes |
|----------|-------|
| `split_part(str, delim, n)` | Feldera's `SPLIT_PART` treats the delimiter as a regex — special chars like `.` match any character and produce wrong results. Negative indices not supported. Always mark unsupported. |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor correction - feldera split_part delimiter is not a regex.

Image

@wilmaontherun wilmaontherun force-pushed the felderize branch 13 times, most recently from 3087437 to d272110 Compare May 19, 2026 05:38
Comment thread python/felderize/spark/demo_sql/arithmetic_combined.sql Outdated
Comment thread python/felderize/felderize/prompts/force_query_prompt.md
Comment thread python/felderize/spark/prompts/retry_preamble.md Outdated
Comment thread python/felderize/felderize/spark_sql/windows_schema.sql
Comment thread python/felderize/README.md
Comment thread python/felderize/spark/skills/spark_skills.md Outdated
| `a \| b` | Bitwise OR — `\|` is a parse error in Feldera |
| `a & b` | Bitwise AND — parses but "Not yet implemented" |
| `a ^ b` | Bitwise XOR — parses but "Not yet implemented" |
| `shiftleft(a, n)` / `shiftright(a, n)` / `shiftrightunsigned(a, n)` | Bitwise shift — no equivalent |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These can be implemented using multiplication/division and concatenation for BINARY...

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed: for now, unsupported

Comment thread python/felderize/spark/skills/spark_skills.md Outdated

### Spark colon-syntax (semi-structured field access)

Databricks/Spark SQL allows `col:field` as shorthand for extracting a JSON sub-field from a VARCHAR/STRING column containing JSON. Translate it as PARSE_JSON bracket access:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can now be supported, someone added it to Calcite, I will check whether we support it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed. We keep it as unsupported for now

Comment thread python/felderize/spark/skills/spark_skills.md Outdated

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High-level pass on the draft. Saving the full review for when this leaves draft; here are the architectural questions worth answering before then.

1. Package layout is going to confuse people. pyproject.toml says [tool.setuptools.package-dir] felderize = "spark", so the on-disk directory is python/felderize/spark/ but the import name is felderize and every internal import is from felderize.X. That mapping works but is not discoverable — anyone landing in python/felderize/spark/translator.py will wonder why a file inside spark/ imports from felderize.constants rather than from .constants. Either:

  • rename the dir to python/felderize/felderize/ and make spark/ an actual felderize.spark subpackage (and leave room for felderize.trino, etc.), or
  • drop the multi-dialect ambition and pick one layout cleanly.

The README line "Spark SQL is currently the only supported dialect. Support for additional dialects is planned." is the tell — if a second dialect is coming, the package layout should already reflect it.

2. Docs are fetched live from main. DEFAULT_DOCS_BASE_URL = raw.githubusercontent.com/feldera/feldera/main/docs.feldera.com/docs/sql. The user has just downloaded a pinned-version compiler JAR via felderize download-compiler, and the LLM gets HEAD docs. When HEAD adds (or removes) a feature, the prompt and the validator disagree and the repair loop burns LLM tokens chasing a phantom. Pin the docs URL to the same release tag the JAR came from (or to the felderize release), and let FELDERA_DOCS_BASE_URL override for development.

3. AnthropicClient is the only impl behind the LLMClient ABC. Either add a second backend now (the project title is dialect-agnostic; the model should not be tied to a single vendor) or drop the ABC and stop pretending. Right now create_client(config) is return AnthropicClient(config) — pure ceremony.

4. max_tokens = 64000 default. Most Anthropic models reject this without the appropriate output-128k beta header; users who point FELDERIZE_MODEL at anything other than your internal endpoint will get a 400. Either default to 16k and let power users raise it, or send the beta header explicitly when above 32k. Document the gotcha.

5. Cost / token budget. Worst-case path is: pass 1 = 1 + max_retries LLM calls; pass 2 (with_docs=True) = another 1 + max_retries. With defaults that is 8 calls, each up to 64k output tokens, each carrying the full ~1275-line skills doc plus relevant Feldera docs in the prompt. On a cold cache that is real money per query. At minimum, expose a --max-cost-usd or --max-llm-calls cap and log running token totals to stderr (the per-call log is already there — sum them).

6. Zero tests. Hard rule, even on a PoC living next to python/feldera/ and python/dbt-feldera/ which both ship tests. A lot of this surface is pure-Python and trivially unit-testable with fake LLM responses:

  • translator.trim_schema_to_referenced (regex schema pruner) — straightforward in/out cases.
  • translator.split_combined_sql — round-trip with several fixtures.
  • translator._parse_response — markdown-fenced JSON, raw JSON, malformed.
  • translator._build_result — Status logic on empty query / unsupported list.
  • prompts/* builders — assert templates render with known inputs.
  • skills.load_skills and _load_user_rules — frontmatter parsing edge cases (good and bad YAML), priority ordering.
  • docs._SPARK_ALIASES / _EXTRA_PATTERNS — assert the right doc categories are selected for sample SQL.
  • feldera_client.validate_sql — fake compiler binary returning canned stderr; assert errors are parsed correctly.

The whole validate→repair loop can be exercised with a stub LLMClient that returns scripted responses; that is the highest-value test target because it is where regressions will silently degrade output quality.

7. Validation loop has a redundant tail. _translate_with_repair runs for attempt in range(max_retries): validate; if errors, repair, then outside the loop runs another validate_sql(full_sql). That tail only matters when the last iteration's repair produced fresh SQL that was never validated — which is precisely the path it covers, so the logic is correct, but the structure makes it hard to see. Refactor into a while attempts <= max_retries shape and you can collapse the duplicate. Tests (see 6) would have caught the mental load here.

8. python/felderize/.env is the only .env location. Config.from_env loads Path(__file__).resolve().parent.parent / ".env". That hardwires config to the source checkout. If anyone ever pip install felderize from PyPI, that path lands inside site-packages/. Also try Path.cwd() / ".env" (and prefer it) so per-project config works and the package is relocatable.

9. Skills doc is 1275 lines. That file is the product — the repair loop's quality ceiling is set by it. Worth saying out loud in the PR description: how was it curated, who maintains it as Feldera SQL evolves, what is the plan for keeping it in sync with the compiler? If the answer is "we eyeball it", that is fine for a PoC but should be a known limitation in the README.

10. Manual-only test plan is acceptable for PoC. No behavior-change-without-test blocker fires because nothing else in the repo depends on this code. But the PoC will not graduate without (6).

Nothing here is a draft blocker — looking forward to seeing this evolve.

Felderize is a tool to convert SQL programs from other
dialects to Feldera SQL. The initial implementation supports
Spark SQL only. Felderize is implemented as a CLI tool that
requires an Anthropic API key. See python/felderize/README
for details.

Later we will release a skill that can be used with Claude
without a CLI tool or an API key.

Signed-off-by: wilma <wilmaontherun@gmail.com>
@wilmaontherun wilmaontherun marked this pull request as ready for review June 8, 2026 05:36
@wilmaontherun wilmaontherun enabled auto-merge June 8, 2026 05:37
Comment thread python/felderize/spark/skills/spark_skills.md Outdated

#### ⚠️ Behavioral differences (Spark vs Feldera)

**[GBD-SAFE-CAST]** `SAFE_CAST` is not a perfect equivalent of Spark's `TRY_CAST` for all types. For REAL/DOUBLE/BOOLEAN targets, `SAFE_CAST` returns the type default (`0` for numeric, `false` for boolean) instead of NULL on failure — there is no exact NULL-on-failure equivalent for those types in Feldera.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAFE_CAST and CAST should have the exact same semantics except that SAFE_CAST never panics, it returns NULL where CAST panics.

Now, the CAST semantics may differ between Spark and Feldera, but the SAFE_CAST should never differ from CAST

Comment thread python/felderize/spark/skills/spark_skills.md Outdated
@wilmaontherun wilmaontherun added this pull request to the merge queue Jun 8, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 8, 2026
With the addition of felderize, setuptools now sees both feldera and
felderize as top-level packages while building the python docs.

In this commit, we tell pyproject.toml to only look at packages
"feldera" and "feldera.*".

This fixes the docs build that previously failed in CI.

Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>
@abhizer abhizer enabled auto-merge June 9, 2026 19:35
@abhizer abhizer added this pull request to the merge queue Jun 9, 2026
Merged via the queue into main with commit 40ce5cb Jun 10, 2026
1 check passed
@abhizer abhizer deleted the felderize branch June 10, 2026 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants