Skip to content

fix profiler regression#6331

Merged
gz merged 5 commits into
mainfrom
webconsole-format
May 28, 2026
Merged

fix profiler regression#6331
gz merged 5 commits into
mainfrom
webconsole-format

Conversation

@gz

@gz gz commented May 27, 2026

Copy link
Copy Markdown
Contributor

Describe Manual Test Plan

Loaded new and old bundles in profiler view manually.

@Karakatiza666 Karakatiza666 self-requested a review May 27, 2026 16:40
@gz gz changed the title Webconsole format fix profiler regression May 27, 2026

@Karakatiza666 Karakatiza666 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The fix is clean, backward-compatible

@gz gz enabled auto-merge May 27, 2026 16:43
@gz gz added this pull request to the merge queue May 27, 2026

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The two-regex split (legacy <ISO>_ prefix vs new <ISO>/ directory) plus the (?:^|\/) anchor cleanly covers both flat and nested bundle layouts, and collectionDate correctly rewrites the new --separated time back to : for the JS Date parser. Inline doc comments are great.

Non-blocking observations:

  • This whole file is now pure-function regex + date parsing — exactly the shape that wants unit tests. profiler-layout has no test runner today, so this PR can't add them without scaffolding vitest first. Worth a tracking issue (feat: add vitest to profiler-layout) so the next regex tweak isn't another manual roundtrip. The new bundle format is locked to a literal trailing Z (no +HH:MM offset); if support-bundle ever emits a non-Z UTC marker the regex silently drops the collection on the floor — a 3-line vitest would catch that.
  • collectionDirRegex accepts (?:\.\d+)?Z(?:-\d+)?. The collision suffix -N is unbounded; not a real concern, just noting that -9999999999 would parse. Capping at \d{1,3} would match the support-bundle producer's expected output.
  • The legacy regex [^/_]+ will happily match a timestamp like 2026-01-19T12-55-54Z (the new format!) as a legacy key. Order matters: collectionDirRegex is tried first, so this is benign — worth a one-line comment in collectionKey noting the order is load-bearing.

Manual test plan covers the user-facing path; I'm not blocking on the missing unit tests because the infra isn't there.

@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@gz gz added this pull request to the merge queue May 27, 2026
@gz gz removed this pull request from the merge queue due to a manual request May 27, 2026
@gz gz added this pull request to the merge queue May 27, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 27, 2026
@ryzhyk ryzhyk added this pull request to the merge queue May 27, 2026
gz and others added 2 commits May 27, 2026 17:45
1414c0f changed the support bundle format, putting things into folders.
The console relied on the format to display a profile.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 28, 2026
RuntimeHandle::join shut down the tokio merger runtime with:

```
if let Some(rt) = self.runtime.inner().tokio_merger_runtime.lock().unwrap().take() {
    drop(rt);
}
```

The MutexGuard returned by lock() lives until the end of the
body, so the mutex stayed locked across drop.

Dropping the tokio runtime blocks until every blocking-pool task
yields. Those tasks are merger workers, and MergeJob::run (in
trace::spine_async) calls Runtime::tokio_merger_runtime() to
spawn a child task for a newly-used spine level. That accessor locks
the same mutex. A merger task entering this call during the drop
window blocked on the mutex.

The race became reachable with commit a886ecf "[dbsp] Lazily start
merger tasks in spine" which switched dbsp from spawning every level
up-front on foreground worker threads to spawning on demand from
inside merger tasks.

Reproduced on nexmark Q19 by holding the mutex for 10s after take()
and forcing a tokio_merger_runtime() call from inside the merger
loop.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz force-pushed the webconsole-format branch from dcd95e5 to 387166a Compare May 28, 2026 01:53

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two real fixes, both look correct. Blocking on testing per our hard rule (behavior change → must land with a test). Detail inline.

Comment thread crates/dbsp/src/circuit/runtime.rs
Comment thread crates/dbsp/src/trace/spine_async.rs
Comment thread crates/dbsp/src/circuit/runtime.rs
gz added 2 commits May 27, 2026 19:05
Runtime::tokio_merger_runtime used to .expect() on the slot, so a
caller that ran after RuntimeHandle::join had taken the runtime out
would panic with "tokio merger runtime has been shut down".

Since commit a886ecf "[dbsp] Lazily start merger tasks in spine",
a merger task can call this accessor at any time during its lifetime,
when this happens after the shutdown logic dropped the tokio merger
runtime it creates a unnecessary panic during shutdown which needs
to be handled gracefully.

Before that commit every level was spawned up-front from foreground
worker threads, so the only caller of this accessor was the
foreground worker, never the merger runtime itself, and the race
could not occur.

Return Option<Handle> instead. MergeJob::run, returns early when the
slot has already been taken; workers exiting set `request_exit`,
so the merger has nothing to do anyway.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Covers both fixes from PR #6331.

Verified by reverting each fix in turn:
- Lock-scope reverted: test times out at 30s.
- Lock-scope reverted plus 10s amplifier sleep in `join`: same
  30s timeout.

Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
@gz gz force-pushed the webconsole-format branch from 387166a to d6ca088 Compare May 28, 2026 02:37
@gz gz enabled auto-merge May 28, 2026 02:38
@gz gz added this pull request to the merge queue May 28, 2026

@mythical-fred mythical-fred left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-APPROVE on d6ca088b. Blocker addressed: test_drop_does_not_deadlock_with_active_merger_lookup (commit d6ca088) covers both fixes deterministically — no sleep, no time-based assertions, only recv_timeout for deadlock detection (the 30s/5s budgets are upper bounds, not waits). The accessor-hammer + side-thread drop(handle) shape reliably loses the race on the buggy code: the worker holds the mutex across the runtime drop, the merger task spinning inside tokio_merger_runtime() blocks on the same mutex, and the blocking-pool teardown deadlocks against it. With both fixes in place, take() releases the slot, the accessor returns None, and the task exits via the clean-exit channel. Good test.

Spine_async tracing::trace! on the torn-down path is fine — trace is the right level for a benign race observation.

Prior APPROVE on profiler-layout dcd95e5da7 still stands; the two dbsp commits stack cleanly on top.

Merged via the queue into main with commit da71240 May 28, 2026
1 check passed
@gz gz deleted the webconsole-format branch May 28, 2026 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants