Tags: sqlrush/pgrac
v0.87.0-stage4.12: spec-4.12 cooperative write-fence (split-brain recovery guard)
Stage 4 feature #13. After reconfig declares a node dead, every write to shared storage by a fenced node (stale / lease-expired / self-fenced) fails closed (53R51 ERROR, or PANIC inside a critical section), which rules out split-brain double writes. This is a cooperative fence, not a hardware fence (SCSI-3 PR / STONITH = AD-013 / forward to 4.12b).
- D1 durable fence marker (rides on voting-slot _reserved1, CRC-protected, no ABI change)
- D2 qvotec poll selects the authoritative marker (quorum-majority P0a + epoch-order P0b) + per-disk preserve (R13 anti-amplification) + token refresh (lease-published-last)
- D3 pure judge + token region (L110 fail-closed)
- D4 reconfig→qvotec marker-before-publish (LMON latch-wake synchronous handshake + fdatasync on ≥ majority + ack; the core 8.A ordering; without the ack there is no publish and no recovery)
- D5 a single helper shared by the 6 cluster_smgr write entry points (L240; PANIC in a CritSection, otherwise 53R51)
- D6 recovery/rejoin/startup direct durable check (Option B, Oracle-aligned: apply all redo + reject post-fence WAL at authority time; rejoin self-fence is non-serving and NON-FATAL)
- D7 region + 2 GUCs (enforcement default off / lease 6000) + 2 wait events + dump category + L254 baseline sweep
- D8 t/269 surface TAP + multi-node fence-firing honestly forwarded
Enforcement defaults to OFF (opt-in; a safe default-ON needs a steady-state baseline-marker subsystem, deferred to 4.12b). heapam.c / xlog.c are unchanged. errcode 53R51. PR-level fast-gate all green: run 27585326483 (5/5).
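The marker selection (D2) and the shared write guard (D5) described above can be modeled with a small sketch. This is an illustration under stated assumptions, not the pgrac implementation: `FenceMarker`, `select_marker`, `guard_write`, and both exception classes are hypothetical names, a disk with no fence recorded is assumed to report an epoch with an empty fenced set, and "epoch-order" is assumed to mean that the highest epoch among readable markers wins once a majority of voting disks is readable.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Sequence


@dataclass(frozen=True)
class FenceMarker:
    """Hypothetical durable marker as read from one voting disk."""
    epoch: int                              # reconfig epoch that wrote it
    fenced: FrozenSet[int] = frozenset()    # node ids declared dead


class FencedWrite(Exception):
    """Stand-in for the 53R51 ERROR."""


class FencedPanic(Exception):
    """Stand-in for a critical-section PANIC."""


def select_marker(polls: Sequence[Optional[FenceMarker]]) -> Optional[FenceMarker]:
    """Pick the authoritative marker from one poll result per voting disk.

    None in `polls` means the disk was unreadable or failed its CRC check.
    Returns None when fewer than a majority of disks could be read.
    """
    majority = len(polls) // 2 + 1
    readable = [m for m in polls if m is not None]
    if len(readable) < majority:
        return None
    # Assumed epoch-order rule: the newest readable marker is authoritative.
    return max(readable, key=lambda m: m.epoch)


def guard_write(node_id: int,
                polls: Sequence[Optional[FenceMarker]],
                in_critical_section: bool = False) -> None:
    """Single guard for every shared-storage write path; fail closed."""
    marker = select_marker(polls)
    if marker is not None and node_id not in marker.fenced:
        return  # quorum visible and this node is not fenced
    if in_critical_section:
        raise FencedPanic(f"node {node_id}: fenced inside critical section")
    raise FencedWrite(f"node {node_id}: write rejected (53R51)")
```

Note that the guard treats "cannot determine the marker" the same as "fenced": a node that loses sight of the quorum stops writing rather than guessing.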
v0.86.1-stage4.11: spec-4.11 review hardening (P1 torn-tail + P2 counter/disarm)
External review (3 findings, all verified) of the v0.86.0-stage4.11 online thread recovery.
- P1 (real defect): the replay engine re-read the dead thread's legitimate crash-point torn tail past the validated boundary and failed closed, so a clean validated window was BLOCKED unless a CHECKPOINT straddle sat just past it. Fix: stop at the validated boundary. With the fix, the real 2-node e2e (t/267) RECOVERS to DONE (the happy path previously believed unreachable; this corrects the earlier root-cause analysis that blamed validated_end).
- P2a: count window-derivation BLOCKED in the failclosed counter.
- P2b: disarm the worker exit callback after a clean return.
GUC cluster.online_thread_recovery stays off (opt-in).
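The P1 fix can be pictured with a toy replay loop. This is a sketch under assumptions, not the actual replay engine: `Rec`, `ReplayBlocked`, and `replay_window` are hypothetical names, and WAL records are reduced to a start LSN, an end LSN, and a validity flag.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Rec:
    """Toy WAL record: [lsn, end) byte range plus a CRC-valid flag."""
    lsn: int
    end: int
    valid: bool = True


class ReplayBlocked(Exception):
    """Stand-in for the fail-closed BLOCKED outcome."""


def replay_window(records: Iterable[Rec], validated_end: int) -> List[int]:
    """Apply records that lie inside the validated window, in order.

    The loop stops as soon as a record starts at or past `validated_end`,
    so a torn tail written at the crash point is never read. Damage
    inside the window still fails closed.
    """
    applied: List[int] = []
    for rec in records:
        if rec.lsn >= validated_end:
            break  # validated boundary reached: do not inspect the tail
        if not rec.valid or rec.end > validated_end:
            raise ReplayBlocked(f"bad record at LSN {rec.lsn} inside window")
        applied.append(rec.lsn)  # a real engine would redo the record here
    return applied
```

With records `[Rec(0, 10), Rec(10, 20), Rec(20, 25, valid=False)]` and `validated_end=20`, the loop applies the first two records and never touches the torn third one.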
v0.86.0-stage4.11: spec-4.11 online thread recovery (Stage 4 #12)
The survivor online-replays a dead WAL thread's data + visibility to shared storage in the reconfig freeze window (Q10-B per-rmgr apply-through matrix, fail-closed 8.A), then GRD/GCS/PCM unfreeze. 2-node scope; >2-node returns FEATURE_NOT_SUPPORTED. D0-D7 across increments 1/2/3a/3b-1/3b-2/3b-3/3b-4a/3b-4b/3b-4c. errcode 53RA4; wait event ClusterThreadRecovery; catversion 202606152.
v0.85.0-stage4.10: Online Single-Block Recovery (Stage 4 #11)
On a corrupt / lost-write block read, rebuild that block from WAL (latest own-thread FPI + own-thread deltas) on a detached page and durably install it, instead of a full-database PANIC; fail-closed when the block cannot be rebuilt exactly (8.A) -- never a silent wrong-block install.
D0-D7: backend single-block redo-apply infrastructure (FPI restore + RM_GENERIC + heap delta matrix, proven byte-for-byte by a crash-recovery differential in t/256), read-path bufmgr hook + two GUCs (cluster.online_block_recovery, cluster.block_recovery_on_unrecoverable), crash-safe durable install (own-thread pd_lsn <= flush -> no new WAL), own-thread / single-node / permanent-relation gates, and observability counters (blocks_recovered, recovery_failclosed). heapam.c unmodified.
fast-gate run 27498109225 (5/5 green). Spec: spec-4.10-online-block-recovery.md
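The "latest FPI + deltas on a detached page" idea can be sketched as follows. This is a simplified model, not the backend redo code: `WalRec`, `BlockUnrecoverable`, and `rebuild_block` are hypothetical names, and a delta is assumed to be a plain byte overwrite at an offset, which real heap and generic redo routines are not.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class WalRec:
    """Toy WAL record for one block: a full-page image or a byte delta."""
    lsn: int
    kind: str        # "fpi" or "delta"; anything else is unsupported
    data: bytes
    offset: int = 0  # only meaningful for deltas


class BlockUnrecoverable(Exception):
    """Stand-in for the fail-closed outcome (no install happens)."""


def rebuild_block(wal: Iterable[WalRec], page_size: int) -> bytes:
    """Rebuild one block on a detached buffer, or raise without installing."""
    records = list(wal)
    fpis = [r for r in records if r.kind == "fpi"]
    if not fpis:
        raise BlockUnrecoverable("no full-page image to start from")
    base = max(fpis, key=lambda r: r.lsn)
    if len(base.data) != page_size:
        raise BlockUnrecoverable("full-page image has the wrong size")
    page = bytearray(base.data)  # detached copy; shared buffers untouched
    later = sorted((r for r in records if r.lsn > base.lsn),
                   key=lambda r: r.lsn)
    for rec in later:
        if rec.kind != "delta" or rec.offset + len(rec.data) > page_size:
            raise BlockUnrecoverable(f"cannot apply record at LSN {rec.lsn}")
        page[rec.offset:rec.offset + len(rec.data)] = rec.data
    return bytes(page)
```

The caller would install the returned bytes only after the function returns normally; any exception leaves the original block as it was, which mirrors the "never a silent wrong-block install" rule.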
v0.84.0-stage4.9: Post-Recovery PI/CR Correctness Acceptance (Stage 4 #10)
Acceptance / composition hard-gate spec (mirrors spec-3.17), pure acceptance, zero product code: D0 measure-first found all 5 safety gates already live + sound in the serve path (no composition gap). t/254 (16 tests) proves the composition across crash/restart/remaster; code review findings (P0 false-pass + 2 P1) fixed -- honest gate framing (cluster mechanisms unit-proven, e2e PG-native baseline, real 2-node shared-storage survivor read 8.A-safe).
fast-gate run 27486112131 (5/5 green). Spec: spec-4.9-pi-cr-recovery-acceptance.md
v0.83.0-stage4.8: Undo/TT recovery (Stage 4 #9)
D0-D7 + D7-A index-aware physical rollback of 2PC-aborted DELETE writes, plus D7-A review P1 fixes (durable-head hygiene + full TT-identity revert gate) and the D5 over-fail-closed liveness leg corrected to a diag (rule 8.A safe).
fast-gate run 27484431772 (5/5 green). Spec: spec-4.8-undo-tt-recovery.md
v0.82.0-stage4.7: GCS/PCM warm recovery (spec-4.7 D0-D7)
Stage 4 functional milestone: block-protocol coordination state is rebuilt after a node death / restart and remastered blocks become serviceable again via a live survivor, fail-closed throughout (no stale-page serve, no double grant).
D0 measure-first + D1 RECOVERING/53R9L gate + D2 GCS_BLOCK_REDECLARE wire + survivor scan + master rebuild + D3 not-double-X + D5 redo-before-unfreeze LSN gate + D6 gcs_recovery observability + D7 recovery-aware GCS routing remaster (dead-master recovery routing only).
Two-round code review (agent + manual), P0/P1 closed. fast-gate green: run 27466950814 (5/5). Spec: spec-4.7-gcs-pcm-warm-recovery.md (D0-D7 + Impl note v0.1-v0.4)
v0.81.0-stage4.7a: cross-node GCS/PCM block state coherence (spec-4.7a)
Healthy-state Cache Fusion data-plane PCM coherence fix (surfaced by spec-4.7 D0 measure-first: cross-node bulk DML deadlocked 53R90).
- D2 hold-until-revoked acquire gate + kill-switch GUC cluster.gcs_block_local_cache.
- D3 master self-holder convergence (idempotent re-grant + 3 self-forward/invalidate guards + S->X-from-self grant).
- D4 cross-node X contention bounded fail-closed (FEATURE_NOT_SUPPORTED 0A000; writer-transfer deferred to spec-2.36/4.7/Stage 6) -- both the remote-dispatch gate AND the local-master acquire path (B), so HG7 "no hang" holds on every round-trip.
B (this re-tag): the first tag's nightly went red -- D2 hold-until-revoked hung t/247's merged-recovery window builder (two nodes write the same relation pre-crash, the holder never released). Fix: local-master acquire bounded-fail-closes on a live remote holder; t/247 sets the GUC off (merge-window builder, not a writer-transfer test). Adds the Rule 8.A trigger test test_pcm_b_local_master_remote_x_holder_fail_closed.
Verified: cluster_unit 91/91 (test_cluster_pcm_lock 35); t/247+248+249+250+252 (139); CF18 clean; Opus review SHIP, 0 P0 / 0 P1. fast-gate 27461770853 (5/5) @ f9e197e. Unblocks spec-4.7 (warm recovery).
spec-4.6: recovery-aware GRD/GES remaster closure
Dead-master shards become survivable: failure-driven remaster of the GRD master map + cooperative holder rebind under the reconfig epoch, with an episode-epoch-coherent rebuild barrier and a cluster-wide REDECLARE_DONE gate before the post-barrier stale sweep.
- D0 gap-pin t/249 (measure-first XFAIL -> flipped)
- D2 deterministic failure-driven remaster + per-shard generation (Q3-C: wire token untouched)
- D1 P0-P7 recovery ordering (LMON tick driver, shmem cursor)
- D3 backend-cooperative holder rebind (PROCSIG_CLUSTER_GRD_REDECLARE; old->new holder rebind, ack only after master ack; cluster-wide scope per P0#3)
- D4 request-path fail-closed: requester freeze gate + S4 real reject mapping + 53R9I/53R9J/53R9K + GCS dead-master block guard
- D5 grd_recovery dump category (13 counters) + ClusterGrdShardRemaster wait event (Reconfig class)
- P0#3 cluster gate: GES_REQ_OPCODE_REDECLARE_DONE(13), P6 sweeps only after EVERY survivor announces its barrier
- Fable review P0-1: episode-epoch-coherent barrier (mid-episode epoch bump aborts to IDLE; prevents cross-node double grant)
- user review P0-2: S4 master-side reject DEFAULT-DENY (no fall-through to S5 promote)
Acceptance: t/249 2-node 36 legs + t/250 3-node 36 legs (L7 no-double-grant 3-node mandatory) = 72 assertions, no flake.
Retag (L48): the nightly on the first tag @ 56aef99 caught 20 wait-event/category count ripples living only in nightly-run TAPs; exhaustive by-pattern sweep at a87c693.
PR-level: fast-gate run 27448143551 all jobs green.
Spec: spec-4.6-recovery-aware-grd-ges-remaster.md (FROZEN + Hardening v1.0.1)
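The cluster-wide REDECLARE_DONE gate and the epoch-coherent barrier listed above can be modeled as a small state holder. This is a sketch of the rule as stated in the notes, not the LMON code: `RedeclareBarrier` and its methods are hypothetical names, and messages are reduced to a node id plus the epoch the sender believes is current.

```python
from typing import Iterable, Set


class RedeclareBarrier:
    """Tracks which survivors have announced REDECLARE_DONE for one episode."""

    def __init__(self, epoch: int, survivors: Iterable[int]) -> None:
        self.epoch = epoch                  # epoch the episode started under
        self.survivors: Set[int] = set(survivors)
        self.announced: Set[int] = set()
        self.aborted = False                # True once the episode is void

    def on_epoch_bump(self, new_epoch: int) -> None:
        """A mid-episode epoch change voids the episode (abort to IDLE)."""
        if new_epoch != self.epoch:
            self.aborted = True
            self.announced.clear()

    def announce(self, node_id: int, epoch: int) -> None:
        """Record a survivor's barrier announcement for this episode only."""
        if self.aborted or epoch != self.epoch:
            return  # stale or cross-episode announcement is ignored
        if node_id in self.survivors:
            self.announced.add(node_id)

    def may_sweep(self) -> bool:
        """The stale sweep may run only after EVERY survivor has announced."""
        return not self.aborted and self.announced == self.survivors
```

The point of the all-survivors condition is that a sweep run after only some announcements could discard a holder that a slower survivor is still about to redeclare, which is the double-grant risk the notes call out.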
spec-4.5a: shared-storage data backend + cross-instance CR/TT closed in one pass (G1-G6)
cluster_fs shared-storage backend (id 3) + sentinel + merged k-way recovery with real apply-through + cross-instance visibility-authorization closure. See pgrac CHANGELOG v0.79.0 for details.
G1-G6 + P1 hardening (materialized foreign ITL slot pin + wrap-qualified authority + marker-gated STALE).
Acceptance: t/248 57 assertions L0-L16 + t/247 9/9 + pins regression + cluster_unit 91 + PG 219.
Two-round code review (Sonnet + Fable 5; Fable caught the P1 STALE marker gate ship-blocker).
L48 retag: the nightly on the first tag f9c8ff0 caught a standby-escape P1 false positive (a physical standby replays the primary's full state, so its local CLOG is authoritative); fixed with a !RecoveryInProgress() standdown.
fast-gate run 27410881285: all 6 jobs green.
Spec: spec-4.5a-shared-storage-data-backend.md (FROZEN v1.0)