Conversation
4d43839 to
046f795
Compare
|
We should build a library with compatibility functions that people can just reuse, especially if they can be written in SQL. |
6bd2d52 to
14e7cc6
Compare
|
@anandbraman I think we should both review this PR |
mihaibudiu
left a comment
There was a problem hiding this comment.
Maybe @anandbraman and I should make the fixes I recommend and take over this project.
|
|
||
| | Function | Notes | | ||
| |----------|-------| | ||
| | `split_part(str, delim, n)` | Feldera's `SPLIT_PART` treats the delimiter as a regex — special chars like `.` match any character and produce wrong results. Negative indices not supported. Always mark unsupported. | |
3087437 to
d272110
Compare
| | `a \| b` | Bitwise OR — `\|` is a parse error in Feldera | | ||
| | `a & b` | Bitwise AND — parses but "Not yet implemented" | | ||
| | `a ^ b` | Bitwise XOR — parses but "Not yet implemented" | | ||
| | `shiftleft(a, n)` / `shiftright(a, n)` / `shiftrightunsigned(a, n)` | Bitwise shift — no equivalent | |
There was a problem hiding this comment.
These can be implemented using multiplication/division and concatenation for BINARY...
There was a problem hiding this comment.
discussed: for now, unsupported
|
|
||
| ### Spark colon-syntax (semi-structured field access) | ||
|
|
||
| Databricks/Spark SQL allows `col:field` as shorthand for extracting a JSON sub-field from a VARCHAR/STRING column containing JSON. Translate it as PARSE_JSON bracket access: |
There was a problem hiding this comment.
I think this can now be supported, someone added it to Calcite, I will check whether we support it.
There was a problem hiding this comment.
Discussed. We keep it as unsupported for now
mythical-fred
left a comment
There was a problem hiding this comment.
High-level pass on the draft. Saving the full review for when this leaves draft; here are the architectural questions worth answering before then.
1. Package layout is going to confuse people. pyproject.toml says [tool.setuptools.package-dir] felderize = "spark", so the on-disk directory is python/felderize/spark/ but the import name is felderize and every internal import is from felderize.X. That mapping works but is not discoverable — anyone landing in python/felderize/spark/translator.py will wonder why a file inside spark/ imports from felderize.constants rather than from .constants. Either:
- rename the dir to
python/felderize/felderize/and makespark/an actualfelderize.sparksubpackage (and leave room forfelderize.trino, etc.), or - drop the multi-dialect ambition and pick one layout cleanly.
The README line "Spark SQL is currently the only supported dialect. Support for additional dialects is planned." is the tell — if a second dialect is coming, the package layout should already reflect it.
2. Docs are fetched live from main. DEFAULT_DOCS_BASE_URL = raw.githubusercontent.com/feldera/feldera/main/docs.feldera.com/docs/sql. The user has just downloaded a pinned-version compiler JAR via felderize download-compiler, and the LLM gets HEAD docs. When HEAD adds (or removes) a feature, the prompt and the validator disagree and the repair loop burns LLM tokens chasing a phantom. Pin the docs URL to the same release tag the JAR came from (or to the felderize release), and let FELDERA_DOCS_BASE_URL override for development.
3. AnthropicClient is the only impl behind the LLMClient ABC. Either add a second backend now (the project title is dialect-agnostic; the model should not be tied to a single vendor) or drop the ABC and stop pretending. Right now create_client(config) is return AnthropicClient(config) — pure ceremony.
4. max_tokens = 64000 default. Most Anthropic models reject this without the appropriate output-128k beta header; users who point FELDERIZE_MODEL at anything other than your internal endpoint will get a 400. Either default to 16k and let power users raise it, or send the beta header explicitly when above 32k. Document the gotcha.
5. Cost / token budget. Worst-case path is: pass 1 = 1 + max_retries LLM calls; pass 2 (with_docs=True) = another 1 + max_retries. With defaults that is 8 calls, each up to 64k output tokens, each carrying the full ~1275-line skills doc plus relevant Feldera docs in the prompt. On a cold cache that is real money per query. At minimum, expose a --max-cost-usd or --max-llm-calls cap and log running token totals to stderr (the per-call log is already there — sum them).
6. Zero tests. Hard rule, even on a PoC living next to python/feldera/ and python/dbt-feldera/ which both ship tests. A lot of this surface is pure-Python and trivially unit-testable with fake LLM responses:
translator.trim_schema_to_referenced(regex schema pruner) — straightforward in/out cases.translator.split_combined_sql— round-trip with several fixtures.translator._parse_response— markdown-fenced JSON, raw JSON, malformed.translator._build_result— Status logic on empty query / unsupported list.prompts/*builders — assert templates render with known inputs.skills.load_skillsand_load_user_rules— frontmatter parsing edge cases (good and bad YAML), priority ordering.docs._SPARK_ALIASES/_EXTRA_PATTERNS— assert the right doc categories are selected for sample SQL.feldera_client.validate_sql— fake compiler binary returning canned stderr; assert errors are parsed correctly.
The whole validate→repair loop can be exercised with a stub LLMClient that returns scripted responses; that is the highest-value test target because it is where regressions will silently degrade output quality.
7. Validation loop has a redundant tail. _translate_with_repair runs for attempt in range(max_retries): validate; if errors, repair, then outside the loop runs another validate_sql(full_sql). That tail only matters when the last iteration's repair produced fresh SQL that was never validated — which is precisely the path it covers, so the logic is correct, but the structure makes it hard to see. Refactor into a while attempts <= max_retries shape and you can collapse the duplicate. Tests (see 6) would have caught the mental load here.
8. python/felderize/.env is the only .env location. Config.from_env loads Path(__file__).resolve().parent.parent / ".env". That hardwires config to the source checkout. If anyone ever pip install felderize from PyPI, that path lands inside site-packages/. Also try Path.cwd() / ".env" (and prefer it) so per-project config works and the package is relocatable.
9. Skills doc is 1275 lines. That file is the product — the repair loop's quality ceiling is set by it. Worth saying out loud in the PR description: how was it curated, who maintains it as Feldera SQL evolves, what is the plan for keeping it in sync with the compiler? If the answer is "we eyeball it", that is fine for a PoC but should be a known limitation in the README.
10. Manual-only test plan is acceptable for PoC. No behavior-change-without-test blocker fires because nothing else in the repo depends on this code. But the PoC will not graduate without (6).
Nothing here is a draft blocker — looking forward to seeing this evolve.
Felderize is a tool to convert SQL programs from other dialects to Feldera SQL. The initial implementation supports Spark SQL only. Felderize is implemented as a CLI tool that requires an Anthropic API key. See python/felderize/README for details. Later we will release a skill that can be used with Claude without a CLI tool or an API key. Signed-off-by: wilma <wilmaontherun@gmail.com>
|
|
||
| #### ⚠️ Behavioral differences (Spark vs Feldera) | ||
|
|
||
| **[GBD-SAFE-CAST]** `SAFE_CAST` is not a perfect equivalent of Spark's `TRY_CAST` for all types. For REAL/DOUBLE/BOOLEAN targets, `SAFE_CAST` returns the type default (`0` for numeric, `false` for boolean) instead of NULL on failure — there is no exact NULL-on-failure equivalent for those types in Feldera. |
There was a problem hiding this comment.
SAFE_CAST and CAST should have the exact same semantics except that SAFE_CAST never panics, it returns NULL where CAST panics.
Now, the CAST semantics may differ between Spark and Feldera, but the SAFE_CAST should never differ from CAST
With the addition of felderize, setuptools now sees both feldera and felderize as top-level packages while building the python docs. In this commit, we tell pyproject.toml to only look at packages "feldera" and "feldera.*". This fixes the docs build that previously failed in CI. Signed-off-by: Abhinav Gyawali <22275402+abhizer@users.noreply.github.com>

CLI tool using LLM to translate and syntactically validate Spark SQL programs to Feldera SQL.
Requires Anthropic API key in felderize/.env
Describe Manual Test Plan
No automated tests yet. Tested manually using examples in the demo folder.