Skip to content

Fix concurrent transition double-execution and ungraceful locking#13

Merged
dereuromark merged 1 commit into
masterfrom
fix-concurrency-locking
May 22, 2026
Merged

Fix concurrent transition double-execution and ungraceful locking#13
dereuromark merged 1 commit into
masterfrom
fix-concurrency-locking

Conversation

@dereuromark

Copy link
Copy Markdown
Owner

Fixes findings 1 & 2 of #12.

Problem

Under concurrent access, transitions were not safe:

  • Double execution: 100 parallel transition() calls on the same entity ran the transition's command 3–13 times (scaling with contention) instead of once. StateMachineEngine::getCurrentState() reads the in-memory state and the engine never re-reads/row-locks the DB; the behaviour lock is released after each transition, so any caller that loaded the row in the original from state re-applies the transition once it takes its turn on the lock.
  • Ungraceful locking: ~85–95% of concurrent callers died with an uncaught QueryException: Duplicate entry ... workflow_locks_unique instead of TransitionResult::locked(), because LockManager::acquire() did check-then-save() and never caught the unique collision. The unconditional deleteAll([workflow, table, id]) between the check and the insert (no expiry filter) could also delete a concurrent fresh lock.

Fix

LockManager:

  • The unique index (workflow_name, entity_table, entity_id) is treated as the real mutex; the INSERT is the atomic acquire. A unique-violation collision now returns null (→ locked) instead of throwing.
  • The global expired-lock sweep is removed from the acquire hot path (a range DELETE on the lock table under high insert concurrency causes InnoDB gap-lock deadlocks). This entity's own expired lock is reclaimed on demand via a scoped point delete; cluster-wide GC remains available via deleteExpired().
  • Transient deadlocks (SQLSTATE 40001) during acquire are also mapped to null (try again later).

WorkflowBehavior:

  • After acquiring the lock, the persisted state field is re-read onto the entity (refreshWorkflowState()), so the engine evaluates the transition against the authoritative current state. A stale caller is then correctly blocked. Only the state field is touched and its dirty flag cleared so beforeSave does not treat the refresh as a direct state change; no-op for unpersisted entities.

Verification

Reproduced and verified with a multi-process harness on MySQL 8 / InnoDB (100 and 200 parallel workers transitioning the same entity through a guarded, side-effecting transition):

before after
destructive command executions (must be 1) 3–13 1
uncaught exceptions 85–142 0
losers exception graceful locked / blocked

Added WorkflowBehaviorConcurrencyTest as a single-process regression (a stale handle is blocked after a concurrent update); it fails without the behaviour change and passes with it. Full suite green (297 tests); cs-check clean; stan clean for the changed files (one unrelated pre-existing TimeoutScheduler notice on PHP 8.4 remains, addressed separately).

Note on the 5.4 Lock component

cakephp/cakephp#19370 (5.4) adds Lock::synchronized(), a clean atomic distributed mutex that would be a good future backend for LockManager (and resolves the lock-acquire concerns natively). It does not by itself address the stale in-memory state (the refreshWorkflowState part is still required regardless of lock backend). This PR is a 5.2/5.3-compatible fix using the existing workflow_locks table; migrating the backend to the Lock component can follow once 5.4 is the baseline.

…xecution

LockManager::acquire() now treats the unique index as the atomic acquire: a
concurrent INSERT collision (unique violation) or a transient deadlock maps to
a graceful null (locked) instead of an uncaught exception. The unconditional
global expired-lock sweep is removed from the hot path (it caused InnoDB
gap-lock deadlocks under load); this entity's own expired lock is reclaimed on
demand via a scoped point delete.

WorkflowBehavior re-reads the persisted state field while holding the lock, so a
caller acting on a stale in-memory entity can no longer re-apply a transition
another caller already applied (lost update / double execution of commands).

Adds a regression test proving a stale entity is blocked after a concurrent
update.
@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 81.35593% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.41%. Comparing base (b52d8e7) to head (3d62135).

Files with missing lines Patch % Lines
src/Service/LockManager.php 76.31% 9 Missing ⚠️
src/Model/Behavior/WorkflowBehavior.php 90.47% 2 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@             Coverage Diff              @@
##             master      #13      +/-   ##
============================================
+ Coverage     70.34%   70.41%   +0.06%     
- Complexity     1123     1144      +21     
============================================
  Files            62       62              
  Lines          4020     4076      +56     
============================================
+ Hits           2828     2870      +42     
- Misses         1192     1206      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dereuromark dereuromark merged commit 5bdf560 into master May 22, 2026
8 checks passed
@dereuromark dereuromark deleted the fix-concurrency-locking branch May 22, 2026 04:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants