python: add support for adhoc query as pyarrow table by monochromatti · Pull Request #5814 · feldera/feldera

monochromatti · 2026-03-13T06:51:09Z

Ran tests locally against a running Feldera API.

From python/:

Full Python SDK suite (excluding tests/runtime_aggtest):
- uv run python -m pytest tests/ --ignore=tests/runtime_aggtest -ra
- Local result: 122 passed, 45 skipped
Targeted reruns:
- uv run python -m pytest tests/platform/test_shared_pipeline.py::TestPipeline::test_adhoc_query_arrow -q
- uv run python -m pytest tests/unit/test_query_as_arrow.py -q

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

Breaking Changes?

Mark if you think the answer is yes for any of these components:

OpenAPI / REST HTTP API / feldera-types / manager (What is a breaking change?)
Feldera SQL (Syntax, Semantics)
feldera-sqllib (incl. dependencies fxp, etc.) (What is a breaking change?)
Python SDK (What is a breaking change?)
fda (CLI arguments)
Adapters (including configuration)
Storage Format / Checkpoints
Others (specify)

Describe Incompatible Changes

None.

Summary

This PR adds Arrow IPC query support to the Python SDK so ad-hoc query results can be consumed as streamed PyArrow record batches.

What changed

Added a new client API:
- FelderaClient.query_as_arrow(pipeline_name, query) -> Generator[pyarrow.RecordBatch, None, None]
Added a pipeline convenience method:
- Pipeline.query_arrow(query) -> Generator[pyarrow.RecordBatch, None, None]
Added optional Arrow dependency extra:
- pip install "feldera[arrow]"
Updated Python README with Arrow installation guidance
Added unit and platform tests for Arrow IPC query behavior

Notes

The Arrow response is consumed from an HTTP stream (stream=True) and yielded batch-by-batch.
Users can materialize a pyarrow.Table when desired via pyarrow.Table.from_batches(...).
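
As an illustration of the streaming note above, one common way to hand chunked HTTP bytes to a streaming reader is a small file-like adapter. The sketch below uses only the standard library. `IterStream` and `open_chunked` are hypothetical names, not part of the Feldera SDK, and the SDK's actual implementation may differ.

```python
import io
from typing import Iterable, Iterator


class IterStream(io.RawIOBase):
    """Expose an iterator of byte chunks as a readable, file-like stream.

    Hypothetical helper for illustration only.
    """

    def __init__(self, chunks: Iterable[bytes]) -> None:
        super().__init__()
        self._chunks: Iterator[bytes] = iter(chunks)
        self._leftover = b""

    def readable(self) -> bool:
        return True

    def readinto(self, buffer) -> int:
        # Pull chunks until there is data to hand out or the source is exhausted.
        while not self._leftover:
            try:
                self._leftover = next(self._chunks)
            except StopIteration:
                return 0  # signals EOF to the buffered reader
        n = min(len(buffer), len(self._leftover))
        buffer[:n] = self._leftover[:n]
        self._leftover = self._leftover[n:]
        return n


def open_chunked(chunks: Iterable[bytes]) -> io.BufferedReader:
    """Wrap an iterator of byte chunks in a buffered, file-like reader."""
    return io.BufferedReader(IterStream(chunks))
```

Readers that accept a file-like object, such as `pyarrow.ipc.open_stream`, could consume an adapter of this shape and decode record batches as bytes arrive, rather than after the whole response has been downloaded.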

mythical-fred

LGTM — but see inline: there is an existing open PR covering the same feature.

gz · 2026-03-13T16:39:29Z

hi @monochromatti this looks good thanks a lot for your contribution. @abhizer can you review this

monochromatti · 2026-03-13T16:43:58Z

I'd like input on whether to return Generator[pyarrow.RecordBatch, ...] or a pyarrow.Table directly. The latter is the current state of the PR, but after some thinking it feels like generating batches is more in style with similar existing functionality and better suited for big payloads.

abhizer

Thank you!

As a heads up, the reason we didn't merge the prior PR is because the server intermittently sent bad data and we were unable to figure out why.

abhizer · 2026-03-13T17:07:47Z

I'd like input on whether to return Generator[pyarrow.RecordBatch, ...] or a pyarrow.Table directly

We normally return a generator, and it might be a good idea to keep this behavior consistent.

mihaibudiu · 2026-03-18T18:33:50Z

@monochromatti please re-request a review from @abhizer when this is ready again

abhizer

Thank you!

monochromatti · 2026-04-04T10:26:43Z

Rebased on main to solve a uv.lock conflict

mythical-fred

LGTM

monochromatti · 2026-04-04T14:20:37Z

Sorry I might be missing something, but the PR still requires an approval to run CI?

abhizer · 2026-04-04T14:28:07Z

Done!

mythical-fred · 2026-04-05T07:23:53Z

The "Pre Merge Queue Tasks" CI failure looks transient — the failing step is the Rust build check, but this PR has no Rust changes. The same step failed and then passed for other PRs around the same time. Could someone re-trigger CI?

mythical-fred · 2026-04-06T07:38:48Z

CI is still showing a failure on "Pre Merge Queue Tasks" from Apr 4 — looks like nobody re-triggered it yet. Could someone queue a fresh run? This is a Python-only PR and that step has been transiently failing for unrelated Rust check reasons.

abhizer · 2026-04-06T15:21:48Z

You might have to run "ruff format" for it to pass the pre merge queue.

mythical-fred

LGTM

Signed-off-by: Mattias Matthiesen <mattias.matthiesen@eviny.no>

monochromatti · 2026-04-08T05:44:24Z

Updated PR body and solved uv.lock conflict (exclude-newer timestamp). @abhizer

abhizer · 2026-04-08T14:18:31Z

Thank you!

gz · 2026-04-10T17:02:52Z

@abhizer can we merge this?

gz · 2026-04-12T04:58:40Z

@monochromatti there is unfortunately some issue in the arrow streaming that caused some CI tests to fail non-deterministically. I reverted it for now, but we should be able to bring it back once we fix this.

gz · 2026-04-12T04:59:43Z

#4287 tracking issue

revert)

stream_arrow_query used a synchronous Arrow IPC StreamWriter to an async mpsc by spawning one tokio task per std::io::Write::write call:

    let handle = TOKIO.spawn(async move { tx.send(bytes).await });
    self.handles.push(handle);

Each StreamWriter::write(&batch) makes ~6 sequential write_all calls. The spawned tasks have no ordering relation; on a multi-thread tokio runtime they race to send into the receiver, so bytes arrive in arbitrary order and the resulting Arrow IPC stream gets corrupted.

The fix is to not call sync Write from inside an async future at all. stream_arrow_query now hands StreamWriter a Vec<u8> and drains the buffer between batches via std::mem::take(writer.get_mut()), then yields a single ordered Bytes chunk per batch. Memory cost is bounded by one record batch; behaviour matches stream_json_query, which has always used this shape.

ChannelWriter retains its AsyncFileWriter impl for the parquet path (AsyncArrowWriter awaits each write future before issuing the next, so ordering there is already safe); the racy std::io::Write impl, the handles vec, and the cfg(test) reordering shim are all removed.

Refs: #3923 #3792 #4287 #4226 #5814 #4240
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
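
The buffer-then-drain shape described in this commit message can be sketched in Python. This is a toy model under stated assumptions, not the Rust code: the length-prefixed framing and the names `write_batch`, `stream_batches`, and `read_batches` are invented for illustration and are not the Arrow IPC format. The point is that the several small writes for one batch land in a local buffer, and only one ordered chunk per batch leaves the producer.

```python
import struct
from typing import Iterable, Iterator, List


def write_batch(buf: bytearray, payload: bytes) -> None:
    """Toy framing: several small sequential writes per batch."""
    buf += struct.pack("<I", len(payload))  # write 1: length prefix
    buf += payload                          # write 2: body


def stream_batches(payloads: Iterable[bytes]) -> Iterator[bytes]:
    """Yield exactly one ordered chunk per batch."""
    buf = bytearray()
    for payload in payloads:
        write_batch(buf, payload)
        chunk = bytes(buf)  # take the buffered bytes for this batch
        buf.clear()         # leave an empty buffer for the next batch
        yield chunk


def read_batches(stream: bytes) -> List[bytes]:
    """Parse the toy framing back into payloads."""
    out, pos = [], 0
    while pos < len(stream):
        (n,) = struct.unpack_from("<I", stream, pos)
        pos += 4
        out.append(stream[pos:pos + n])
        pos += n
    return out
```

Because each chunk already contains the writes for its batch in order, the consumer only needs chunks to arrive in the order they were yielded; no ordering between individual small writes has to be preserved across tasks.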

PR #5814 introduced pyarrow as an optional extra (feldera[arrow]), gated behind a lazy import that surfaced a 'pip install feldera[arrow]' hint. We make this a non-optional import here because this is supposed to become the default format anyway going forward.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
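
For context, the lazy-import gating that this commit removes usually looks something like the sketch below. It is a generic, stdlib-only illustration; `require_optional` is a hypothetical helper name and the error wording is an assumption, not the SDK's actual code.

```python
import importlib
from types import ModuleType


def require_optional(module: str, extra: str, package: str = "feldera") -> ModuleType:
    """Import an optional dependency, or fail with an install hint.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.import_module(module)
    except ImportError as exc:
        raise ImportError(
            f"{module} is required for this feature; "
            f'install it with: pip install "{package}[{extra}]"'
        ) from exc
```

With a pattern like this, the import cost and the dependency are only paid by callers who use the Arrow code path. Making the import non-optional trades that flexibility for a simpler, always-available API.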

…tically

Adds a unit test that drives StreamWriter through the existing ChannelWriter (the sync std::io::Write adapter that spawns one tokio task per write call) and verifies that the receiver-side StreamReader can parse the byte stream back. With four record batches and a four-worker tokio runtime, the test fails reliably with the same errors reported in the issues:

    ParseError("Unable to get root as message: RangeOutOfBounds { range: 655360..655364, .. }")
    IpcError("Expected schema message, found empty stream.")

Refs: #3923 #3792 #4287 #4226 #5814

monochromatti force-pushed the arrow-ipc-sdk branch 2 times, most recently from 4065f37 to edcaa7e Compare March 13, 2026 06:54

mythical-fred approved these changes Mar 13, 2026

Comment thread python/feldera/rest/feldera_client.py

monochromatti mentioned this pull request Mar 13, 2026

py: support arrow_ipc format for adhoc queries #4226

Closed

gz requested a review from abhizer March 13, 2026 16:39

abhizer approved these changes Mar 13, 2026

monochromatti force-pushed the arrow-ipc-sdk branch from edcaa7e to dd5c74e Compare March 18, 2026 13:06

monochromatti force-pushed the arrow-ipc-sdk branch 2 times, most recently from 379bfe8 to 5f06e6a Compare March 24, 2026 12:16

monochromatti requested a review from abhizer March 24, 2026 12:18

abhizer approved these changes Apr 2, 2026

abhizer changed the title ~~arrow ipc sdk~~ python: add support for adhoc query as pyarrow table Apr 2, 2026

monochromatti force-pushed the arrow-ipc-sdk branch from 5f06e6a to 2cc02ae Compare April 4, 2026 10:15

monochromatti requested a review from mythical-fred April 4, 2026 10:17

mythical-fred approved these changes Apr 4, 2026

monochromatti force-pushed the arrow-ipc-sdk branch from 2cc02ae to d0a2187 Compare April 7, 2026 05:40

mythical-fred approved these changes Apr 7, 2026

monochromatti added 3 commits April 8, 2026 07:43

[python] Add optional arrow dependency and installation docs

95d8faa

Signed-off-by: Mattias Matthiesen <mattias.matthiesen@eviny.no>

[python] Add Arrow IPC query API to client and pipeline

8d285cf

Signed-off-by: Mattias Matthiesen <mattias.matthiesen@eviny.no>

[python] Add tests for Arrow IPC query results

541b6b7

Signed-off-by: Mattias Matthiesen <mattias.matthiesen@eviny.no>

monochromatti force-pushed the arrow-ipc-sdk branch from d0a2187 to 541b6b7 Compare April 8, 2026 05:43

monochromatti requested a review from abhizer April 8, 2026 05:45

abhizer approved these changes Apr 8, 2026

abhizer added this pull request to the merge queue Apr 10, 2026

Merged via the queue into feldera:main with commit 0be9804 Apr 10, 2026
1 check passed
