fix profiler regression#6331
Conversation
Karakatiza666
left a comment
There was a problem hiding this comment.
LGTM. The fix is clean, backward-compatible
mythical-fred
left a comment
There was a problem hiding this comment.
LGTM. The two-regex split (legacy <ISO>_ prefix vs new <ISO>/ directory) plus the (?:^|\/) anchor cleanly covers both flat and nested bundle layouts, and collectionDate correctly rewrites the new --separated time back to : for the JS Date parser. Inline doc comments are great.
Non-blocking observations:
- This whole file is now pure-function regex + date parsing — exactly the shape that wants unit tests.
profiler-layouthas no test runner today, so this PR can't add them without scaffolding vitest first. Worth a tracking issue (feat: add vitest to profiler-layout) so the next regex tweak isn't another manual roundtrip. The new bundle format is locked to a literal trailingZ(no+HH:MMoffset); if support-bundle ever emits a non-Z UTC marker the regex silently drops the collection on the floor — a 3-line vitest would catch that. collectionDirRegexaccepts(?:\.\d+)?Z(?:-\d+)?. The collision suffix-Nis unbounded; not a real concern, just noting that-9999999999would parse. Capping at\d{1,3}would match the support-bundle producer's expected output.- The legacy regex
[^/_]+will happily match a timestamp like2026-01-19T12-55-54Z(the new format!) as a legacy key. Order matters:collectionDirRegexis tried first, so this is benign — worth a one-line comment incollectionKeynoting the order is load-bearing.
Manual test plan covers the user-facing path; I'm not blocking on the missing unit tests because the infra isn't there.
1414c0f changed the support bundle format, putting things into folders. The console relied on the format to display a profile. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Signed-off-by: Karakatiza666 <bulakh.96@gmail.com>
RuntimeHandle::join shut down the tokio merger runtime with:
```
if let Some(rt) = self.runtime.inner().tokio_merger_runtime.lock().unwrap().take() {
drop(rt);
}
```
The MutexGuard returned by lock() lives until the end of the
body, so the mutex stayed locked across drop.
Dropping the tokio runtime blocks until every blocking-pool task
yields. Those tasks are merger workers, and MergeJob::run (in
trace::spine_async) calls Runtime::tokio_merger_runtime() to
spawn a child task for a newly-used spine level. That accessor locks
the same mutex. A merger task entering this call during the drop
window blocked on the mutex.
The race became reachable with commit a886ecf "[dbsp] Lazily start
merger tasks in spine" which switched dbsp from spawning every level
up-front on foreground worker threads to spawning on demand from
inside merger tasks.
Reproduced on nexmark Q19 by holding the mutex for 10s after take()
and forcing a tokio_merger_runtime() call from inside the merger
loop.
Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
mythical-fred
left a comment
There was a problem hiding this comment.
Two real fixes, both look correct. Blocking on testing per our hard rule (behavior change → must land with a test). Detail inline.
Runtime::tokio_merger_runtime used to .expect() on the slot, so a caller that ran after RuntimeHandle::join had taken the runtime out would panic with "tokio merger runtime has been shut down". Since commit a886ecf "[dbsp] Lazily start merger tasks in spine", a merger task can call this accessor at any time during its lifetime, when this happens after the shutdown logic dropped the tokio merger runtime it creates a unnecessary panic during shutdown which needs to be handled gracefully. Before that commit every level was spawned up-front from foreground worker threads, so the only caller of this accessor was the foreground worker, never the merger runtime itself, and the race could not occur. Return Option<Handle> instead. MergeJob::run, returns early when the slot has already been taken; workers exiting set `request_exit`, so the merger has nothing to do anyway. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
Covers both fixes from PR #6331. Verified by reverting each fix in turn: - Lock-scope reverted: test times out at 30s. - Lock-scope reverted plus 10s amplifier sleep in `join`: same 30s timeout. Signed-off-by: Gerd Zellweger <mail@gerdzellweger.com>
mythical-fred
left a comment
There was a problem hiding this comment.
Re-APPROVE on d6ca088b. Blocker addressed: test_drop_does_not_deadlock_with_active_merger_lookup (commit d6ca088) covers both fixes deterministically — no sleep, no time-based assertions, only recv_timeout for deadlock detection (the 30s/5s budgets are upper bounds, not waits). The accessor-hammer + side-thread drop(handle) shape reliably loses the race on the buggy code: the worker holds the mutex across the runtime drop, the merger task spinning inside tokio_merger_runtime() blocks on the same mutex, and the blocking-pool teardown deadlocks against it. With both fixes in place, take() releases the slot, the accessor returns None, and the task exits via the clean-exit channel. Good test.
Spine_async tracing::trace! on the torn-down path is fine — trace is the right level for a benign race observation.
Prior APPROVE on profiler-layout dcd95e5da7 still stands; the two dbsp commits stack cleanly on top.
Describe Manual Test Plan
Loaded new and old bundles in profiler view manually.