[dbsp] Better handle errors encountered listing storage files. by blp · Pull Request #6296 · feldera/feldera

blp · 2026-05-21T21:31:24Z

The storage interface StorageBackend::list did not distinguish between errors that meant that some or all of the files in a directory were not listed and errors that meant that an individual file's metadata could not be obtained; it reported them all the same way, through its return value.

This commit changes that. Now, errors that could lose files are still reported through the return value, and errors that mean that the metadata for an individual file cannot be obtained are reported through the callback. This allows the caller to more intelligently interpret the errors.

This commit also improves error logging: the POSIX backend now logs warnings for errors obtaining metadata (up to a limit), and the checkpointer code logs the number of directory entry errors encountered during GC.

Describe Manual Test Plan

Just ran the unit tests.

Checklist

Unit tests added/updated
Integration tests added/updated
Documentation updated
Changelog updated

blp · 2026-05-21T21:31:53Z

@gz This is a followup to your suggestion to make sure we're logging storage errors.

mythical-fred

Clean separation between "could-lose-files" errors (return value) and "per-entry metadata" errors (callback) — the API shape is the right one and the GC log now actually surfaces the count. Trait doc on list is good. Three things worth addressing before this lands, plus two nits inline.

Test gap (the main thing). The new behavior — per-entry file_type: Err(_), the rate-limited warn ladder, and the loop continuing after a readdir entry error to return the last one — has no test. The existing gc_startup test only exercises the all-Ok happy path. Suggest a thin wrapper around MemoryBackend (similar to the PosixWrapper you already added in the checkpointer test module) that lets a test inject Err(_) for chosen entries, then assert: (a) n_errors ≥ 1 appears in the GC log, (b) Class::Other fallback path is taken, (c) on entries with Err(NotFound) cb is not called.

Posix readdir-entry errors are now returned, not skipped. Previously a readdir entry Err set result = Err(e) and continued the loop — same as today — but the loop now returns the last such error rather than the first. Worth a sentence in the trait doc ("returns the last one") which you already added — good — but please also note that errors are accumulated (cb continues for subsequent OK entries) so callers can't rely on Err meaning "listing stopped". The current doc reads correctly to me; flagging only because the changelog/commit message doesn't mention this.

Posix NotFound race for metadata is silently swallowed. That matches prior behavior (the status.json.mut → status.json rename case), but now a metadata NotFound results in neither a warn nor a callback nor a bump to any counter. Fine for the documented race, but if a real file vanishes between readdir and stat you'll never know. Consider counting NotFound-skipped entries separately and including the count in the same GC log line — it's free and would catch unexpected churn.

mythical-fred · 2026-05-21T22:05:43Z

+                                _ => debug!("I/O error listing {parent}: {e}"),
+                            }
+                        }
+                        cb(entry);


Style nit, non-blocking: this checks &entry.file_type twice (once for the NotFound short-circuit, once for the warn). Reads cleaner as a single match:

match &entry.file_type { Err(e) if e.kind() == ErrorKind::NotFound => continue, Err(e) => match warnings.next().unwrap() { /* … */ }, Ok(_) => {} } cb(entry);

mythical-fred · 2026-05-21T22:05:44Z

+        let entries = path.read_dir().map_err(|e| {
+            StorageError::stdio(e.kind(), "readdir", self.fs_path(parent).display())
+        })?;
+        let mut warnings = 0usize..;


Minor: using 0usize.. as a counter so you can do warnings.next().unwrap() is a bit cute — the unwrap is technically unreachable (you'd need usize::MAX entries first) but it reads as fallible. A plain let mut n_warnings = 0usize; then n_warnings += 1; match n_warnings { 1..20 => warn!(…), 20 => warn!(…bridge…), _ => debug!(…) } says the same thing without the never-fires .unwrap(). Holzmann would prefer no unwrap on a hot path.

mythical-fred · 2026-05-21T22:05:44Z

-            .list(&StoragePath::default(), &mut |path, file_type| {
+            .list(&StoragePath::default(), &mut |DirEntry { name, file_type}| {
+                let file_type = file_type.unwrap_or_else(|_| {
+                    n_errors += 1;


An Err(_) here collapses any per-entry metadata failure into Class::Other + Disposition::KeepUnexpected. That's reasonable — refuse to delete what you can't classify — but the GC log line will show those entries as Keeping unexpected Other … even though the real reason is "could not stat". Worth a one-line debug log at the point of the unwrap_or_else (debug!("Cannot determine type of {name}: {e}, treating as Other")) so operators can correlate the n_errors count with concrete paths.

mythical-fred · 2026-05-21T22:05:44Z

            })?;
        info!(
-            "GC kept {}/{}/{} expected files/directories/other, kept {}/{}/{} unexpected, and deleted {}/{}/{} unused",
+            "GC kept {}/{}/{} expected files/directories/other, kept {}/{}/{} unexpected, and deleted {}/{}/{} unused; {n_errors} error(s) reading directory entries",


Nit: the GC log line is now wide enough that I'd flip it to fielded logging (info!(expected_files = …, n_errors, "GC summary")) so log-aggregators can index on n_errors and alert on > 0. Format-string version stringifies a number that operators will want to filter on.

gz · 2026-05-22T06:04:44Z

+                                20 => warn!(
+                                    "I/O error listing {parent} (further warnings will be at debug level): {e}"
+                                ),
+                                _ => debug!("I/O error listing {parent}: {e}"),


that's neat

mythical-fred

Clean follow-up to #6289 — splitting listing-level vs per-entry-metadata errors is exactly the right shape. Two small notes inline. Approving.

mythical-fred · 2026-05-22T10:01:47Z

+                    } else {
+                        if let Err(e) = &entry.file_type {
+                            match warnings.next().unwrap() {
+                                0..20 => warn!("I/O error listing {parent}: {e}"),


Nit: the 0..20 / 20 warn-cap is a magic number duplicated across two match arms — worth extracting a const MAX_LIST_ERROR_WARNINGS: usize = 20; at module scope so the threshold isn't buried inside a range pattern. The warnings.next().unwrap() on an open-ended usize range is fine but a plain usize counter would read more naturally here (no unwrap, no iterator misdirection).

mythical-fred · 2026-05-22T10:01:47Z

+            cb(DirEntry {
+                name,
+                file_type: Ok(StorageFileType::File { size }),
+            });


The new contract — that cb may receive a DirEntry whose file_type is Err(..) — is not exercised by any unit test. The memory backend always returns Ok(File { size }), and the posix path that produces an Err is hard to drive deterministically. Consider adding a tiny test-only backend (or a knob on MemoryBackend) that returns an Err file_type for a planted entry, so the checkpointer's n_errors/Class::Other accounting and per-call-site matches!(.., Ok(..)) checks have at least one regression test. Non-blocking, but the whole point of this PR is the new error path — it should be covered.

mythical-fred

Re-APPROVE on ed18822d (rebase + warn-sampling refactor).

Net PR diff vs prior approved tip 77f30f8944:

checkpointer.rs: ADHOC_TEMP_DIR → DATAFUSION_TEMP_DIR rename — pure pickup of the constant rename merged in main.
posixio_impl.rs: warn-sampling rewritten from open-ended 0usize.. + .next().unwrap() to bounded 0usize..20 + .next_back(). Semantics preserved exactly — first 20 entries log at WARN, the transition message fires on the 20th (Some(0)), further entries log at DEBUG (None). Cleaner: no .unwrap(), no 0..20 match guard, no risk of panicking if the range were ever changed.

Not blocking but worth noting: the bound is now duplicated between the range literal (..20) and the message text. If someone ever needs to tune the threshold, they'll have to change two places. A named const (MAX_LIST_WARN_LOGS = 20) used in both would be tidier.

No other changes in PR-only diff (storage/lib.rs, memory_impl.rs, journal.rs are byte-identical to prior tip). Prior approve stands.

blp · 2026-05-28T02:40:56Z

Not blocking but worth noting: the bound is now duplicated between the range literal (..20) and the message text.

The message doesn't mention the bound.

The storage interface `StorageBackend::list` did not distinguish between errors that meant that some or all of the files in a directory were not listed and errors that meant that an individual file's metadata could not be obtained; it reported them all the same way, through its return value. This commit changes that. Now, errors that could lose files are still reported through the return value, and errors that mean that the metadata for an individual file cannot be obtained are reported through the callback. This allows the caller to more intelligently interpret the errors. This commit also improves error logging: the POSIX backend now logs warnings for errors obtaining metadata (up to a limit), and the checkpointer code logs the number of directory entry errors encountered during GC. Signed-off-by: Ben Pfaff <blp@feldera.com>

blp requested a review from gz May 21, 2026 21:31

blp self-assigned this May 21, 2026

blp added QA Testing and quality assurance rust Pull requests that update Rust code labels May 21, 2026

blp added the storage Persistence for internal state in DBSP operators label May 21, 2026

blp force-pushed the log-gc branch from 841f092 to 77f30f8 Compare May 21, 2026 21:47

mythical-fred reviewed May 21, 2026

View reviewed changes

gz approved these changes May 22, 2026

View reviewed changes

mythical-fred approved these changes May 22, 2026

View reviewed changes

gz added this pull request to the merge queue May 22, 2026

gz removed this pull request from the merge queue due to a manual request May 22, 2026

blp force-pushed the log-gc branch from 77f30f8 to ed18822 Compare May 26, 2026 21:22

mythical-fred approved these changes May 27, 2026

View reviewed changes

blp added this pull request to the merge queue May 28, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026

blp force-pushed the log-gc branch from ed18822 to 3c417d1 Compare May 28, 2026 17:31

blp added this pull request to the merge queue May 29, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 29, 2026

blp added this pull request to the merge queue May 29, 2026

Merged via the queue into main with commit 87a7f4f May 29, 2026
1 check passed

blp deleted the log-gc branch May 29, 2026 20:13

Conversation

blp commented May 21, 2026

Describe Manual Test Plan

Checklist

Uh oh!

blp commented May 21, 2026

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 21, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 21, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 21, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 21, 2026

Choose a reason for hiding this comment

Uh oh!

gz May 22, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 22, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred May 22, 2026

Choose a reason for hiding this comment

Uh oh!

mythical-fred left a comment

Choose a reason for hiding this comment

Uh oh!

blp commented May 28, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants