feat(utils): add tool to recover unapplied WAL data as Parquet files #7161

Open
javier wants to merge 8 commits into master from jv/wal_to_parquet

@javier javier commented May 26, 2026

Summary

Adds io.questdb.cliutil.WalToParquet, a strict read-only forensic utility that walks every WAL table under a QuestDB data root and exports each un-purged segment as a Parquet file. Designed for the case where a table is suspended or its committed partitions are corrupt and the operator wants to extract whatever still lives in the WAL before scrubbing.

No CairoEngine is booted, no writes touch the source tree, no exclusive locks. Safe to run against a live, running QuestDB instance under snapshot semantics: anything visible in _txnlog at the start of the walk is in scope, anything appended after is ignored.

Output

Each WAL segment becomes one Parquet file alongside three JSON sidecars per table:

  • <tableName>__manifest.json -- every segment the tool considered (written, skipped, partial), every structural-change transaction in seqTxn order, the txnlog format version, maxTxn and appliedSeqTxn watermarks, per-segment reasons, and each written segment's structureVersion.
  • <tableName>__sql_log.json -- every non-DATA transaction (UPDATE, ALTER TABLE, TRUNCATE, view/mat-view events) with (walId, segmentId, segmentTxn, seqTxn, commitTimestamp, type, sql, error). Cross-reference seqTxn against the _txnSeq_ shoulder column in the Parquet files.
  • <tableName>__schemas.json -- column list at each distinct structureVersion observed across the table's WAL segments, sorted ascending. Operators recreate the table at the right version via Parquet -> manifest segment -> structureVersion -> schemas[version] before loading rows.
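The Parquet -> manifest segment -> structureVersion -> schemas[version] chain can be sketched as a simple lookup. This is a minimal illustration that assumes the two sidecars have already been parsed into memory; the `ManifestSegment` record and the map shape are hypothetical stand-ins, not the tool's actual JSON layout.

```java
import java.util.List;
import java.util.Map;

// Hypothetical in-memory mirror of the manifest and schemas sidecars.
final class SchemaLookup {
    // One manifest segment: the Parquet file it produced (null when nothing
    // was emitted) and the structureVersion it was written under.
    record ManifestSegment(String outputFile, long structureVersion) {}

    // Returns the column names in force for the given Parquet file, or an
    // empty list if the file or its version is not present in the sidecars.
    static List<String> columnsFor(String parquetFile,
                                   List<ManifestSegment> segments,
                                   Map<Long, List<String>> schemasByVersion) {
        for (ManifestSegment s : segments) {
            if (parquetFile.equals(s.outputFile())) {
                return schemasByVersion.getOrDefault(s.structureVersion(), List.of());
            }
        }
        return List.of();
    }
}
```

With the column list in hand, the operator can recreate the table at that version before loading the file's rows.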

Per-row provenance (shoulder columns)

Default-on (--no-shoulder to opt out): _wal_id, _segment_id, _segment_txn, _txnSeq_ (= seqTxn), _commit_ts, _recovery_status_.

_recovery_status_ is unapplied / applied_unpurged / unknown, derived from each row's _txnSeq_ against the table's appliedSeqTxn watermark read directly from _txn at the table root (TxReader semantics: version-parity record selection + seqTxn + abs(lagTxnCount)).
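A minimal sketch of the classification, under two assumptions that are not confirmed by the PR text: a row counts as applied when its seqTxn is at or below the watermark, and a negative value stands for "could not be determined". The class and method names are illustrative only.

```java
// Illustrative only; not the tool's actual code.
final class RecoveryStatus {
    static final String UNAPPLIED = "unapplied";
    static final String APPLIED_UNPURGED = "applied_unpurged";
    static final String UNKNOWN = "unknown";

    // Watermark as described above: the active _txn record's seqTxn plus the
    // absolute value of its lag txn count.
    static long appliedSeqTxn(long seqTxn, int lagTxnCount) {
        return seqTxn + Math.abs(lagTxnCount);
    }

    // Assumption: negative inputs mean the value is unavailable.
    static String classify(long rowSeqTxn, long appliedSeqTxn) {
        if (rowSeqTxn < 0 || appliedSeqTxn < 0) {
            return UNKNOWN;
        }
        return rowSeqTxn <= appliedSeqTxn ? APPLIED_UNPURGED : UNAPPLIED;
    }
}
```

Because the status is computed per row, a segment that straddles the watermark can carry both applied_unpurged and unapplied rows.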

Tiered fallback

| Suffix | Trigger |
| --- | --- |
| (none) | Tier 1 happy path: _txnlog, _event, _meta, all column files intact. |
| __tier3.parquet | Segment _event missing/corrupt. Row count comes from the txnlog's per-txn row counts (V2 sequencer format only). New-in-WAL symbols are clamped to NULL (only the base symbol-table snapshot recovers). |
| __tier2.parquet | Whole-table _txnlog missing/corrupt. Filesystem scan of wal*/N/; cross-segment seqTxn ordering is unknown; manifest.structuralChanges and manifest.sqlStatements are empty in tier-2 because both lists come from the txnlog walk. |
| __tier2__tier3.parquet | Both. |

If _event is gone AND the txnlog is V1 (QuestDB default), the segment is marked skipped_event_unreadable_no_row_count and no Parquet is emitted, refusing to fabricate row counts from mmap-preallocated column file lengths.

Tier 4 (per-segment _meta corrupt) borrows schema from a peer segment; the manifest flags substitution.
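The suffix rules can be sketched as a small filename builder. The base pattern `<tableName>__wal<walId>__seg<segId>__seqTxn<lo>-<hi>.parquet` and the tier-2 form without a seqTxn range come from the commit messages in this PR; keeping the range on tier-3-only files is an assumption, and the class and method names are hypothetical.

```java
// Illustrative filename builder; not the tool's actual code.
final class ParquetNames {
    static String fileName(String table, int walId, int segId,
                           long seqTxnLo, long seqTxnHi,
                           boolean tier2, boolean tier3) {
        StringBuilder sb = new StringBuilder(table)
                .append("__wal").append(walId)
                .append("__seg").append(segId);
        if (!tier2) {
            // The seqTxn range is only known when _txnlog was readable.
            sb.append("__seqTxn").append(seqTxnLo).append('-').append(seqTxnHi);
        }
        if (tier2) {
            sb.append("__tier2");
        }
        if (tier3) {
            sb.append("__tier3");
        }
        return sb.append(".parquet").toString();
    }
}
```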

Per-txn SYMBOL resolution

QuestDB's WAL writer can reassign the same numeric symbol code to different strings across transactions inside one segment (especially after suspend/resume). The tool walks _event to build per-txn snapshots of each SYMBOL column's dictionary, resolves every row through its own transaction's snapshot, then deduplicates strings into a single global dictionary with newly assigned dense codes for the Parquet output. Rows are correctly attributed regardless of code reuse.
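The remapping step can be illustrated with plain collections. This is a sketch, not the tool's native-memory implementation: it assumes the per-txn snapshots have already been built from _event, uses an illustrative `NULL_CODE` of -1, and treats any negative or unresolvable source code as NULL.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; names and the NULL encoding are assumptions.
final class SymbolRemap {
    static final int NULL_CODE = -1;

    record Result(List<String> dictionary, int[] codes) {}

    // rowTxn[i] is the segmentTxn that wrote row i, rowCode[i] its on-disk
    // symbol code, snapshotByTxn.get(t) the code -> string map valid for txn t.
    static Result remap(int[] rowTxn, int[] rowCode,
                        List<Map<Integer, String>> snapshotByTxn) {
        Map<String, Integer> dense = new LinkedHashMap<>();
        int[] out = new int[rowCode.length];
        for (int i = 0; i < rowCode.length; i++) {
            String value = rowCode[i] < 0
                    ? null
                    : snapshotByTxn.get(rowTxn[i]).get(rowCode[i]);
            if (value == null) {
                out[i] = NULL_CODE;
                continue;
            }
            // Assign the next dense code the first time a string is seen.
            Integer code = dense.get(value);
            if (code == null) {
                code = dense.size();
                dense.put(value, code);
            }
            out[i] = code;
        }
        return new Result(new ArrayList<>(dense.keySet()), out);
    }
}
```

In this sketch, two rows that share on-disk code 0 but belong to different transactions end up with different output codes whenever their snapshots map code 0 to different strings.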

Event-walk failure handling

Per-event, not per-segment. Three failure shapes:

  • Body parse failure with valid header: entry preserves all coordinates plus an error string; iteration continues.
  • hasNext()/getType()/getTxn() failure (typically an unknown OSS-invisible type byte, or framing corruption): one UNKNOWN_EVENT_UNREADABLE entry recorded with (walId, segmentId) and segmentTxn=-1; event collection stops (cursor state undefined past this point).
  • _event file unopenable: one UNKNOWN_EVENT_UNREADABLE scoped to the segment.
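The three failure shapes map onto three separate try/catch scopes. The sketch below uses a hypothetical `Cursor` interface and `Entry` record in place of QuestDB's event reader and the manifest types, so it shows only the control flow, not the real API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Illustrative control flow; Cursor and Entry are hypothetical stand-ins.
final class EventWalk {
    interface Cursor {
        boolean hasNext();   // may throw on framing corruption or unknown type
        long txn();
        String readSql();    // may throw on a body parse failure
    }

    record Entry(int walId, int segmentId, long segmentTxn, String sql, String error) {}

    static List<Entry> collect(int walId, int segmentId, Supplier<Cursor> opener) {
        List<Entry> out = new ArrayList<>();
        Cursor cursor;
        try {
            cursor = opener.get();
        } catch (RuntimeException e) {
            // Shape 3: the file cannot be opened; one segment-scoped entry.
            out.add(new Entry(walId, segmentId, -1, null, "UNKNOWN_EVENT_UNREADABLE: " + e.getMessage()));
            return out;
        }
        while (true) {
            long txn;
            try {
                if (!cursor.hasNext()) {
                    break;
                }
                txn = cursor.txn();
            } catch (RuntimeException e) {
                // Shape 2: header unreadable; record once and stop the walk.
                out.add(new Entry(walId, segmentId, -1, null, "UNKNOWN_EVENT_UNREADABLE: " + e.getMessage()));
                break;
            }
            try {
                out.add(new Entry(walId, segmentId, txn, cursor.readSql(), null));
            } catch (RuntimeException e) {
                // Shape 1: body unreadable; keep coordinates, continue.
                out.add(new Entry(walId, segmentId, txn, null, e.getMessage()));
            }
        }
        return out;
    }
}
```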

Known limitations

  • ALTER SQL rendering: structural-change transactions (ADD/DROP/RENAME COLUMN) are recorded in manifest.structuralChanges as {seqTxn, commitTimestamp} markers only. Extracting the original ALTER SQL text would require booting CairoEngine or reimplementing AlterOperation's binary serialisation; the resulting schemas in __schemas.json are the authoritative source instead.
  • Enterprise-only events: if Enterprise stores GRANT/CREATE USER as ordinary WalTxnType.SQL events with the standard OSS framing, the SQL text is captured verbatim. Genuinely new event type bytes throw inside WalEventCursor and become UNKNOWN_EVENT_UNREADABLE (no per-txn detail).

Test plan

  • mvn -pl utils -am test -Dtest='WalToParquet*' -Dsurefire.failIfNoSpecifiedTests=false -- 30 tests, all pass (10 unit + 14 integration + 3 V2 integration + 1 tier-2 integration + 1 tier-4 integration + 1 partial-file integration).
  • Integration tests build real WAL tables via CairoEngine in a temp data root, deliberately never run ApplyWal2TableJob so data stays unapplied (except where applied-watermark behavior is being tested), invoke WalToParquet.main(), then read the recovered Parquet back through parquet_scan.
  • Coverage:
    • testHappyPathRecovery -- tier-1, shoulder columns enabled.
    • testAllDataTypesRoundtrip -- full type matrix: TIMESTAMP, TIMESTAMP_NS, SYMBOL, DECIMAL(20,4), DOUBLE, DOUBLE[][], UUID, FLOAT, LONG, BINARY, VARCHAR.
    • testSqlLogCapturesUpdates -- UPDATE statement captured with full SQL text.
    • testRecoveryStatusColumn -- never-applied rows tagged unapplied.
    • testAppliedSeqTxnReadFromTxn -- after ApplyWal2TableJob.drain(0), the _txn watermark is read correctly and surviving rows tag as applied_unpurged.
    • testNullValuePreservation -- explicit NULL values in LONG, DOUBLE, SYMBOL, VARCHAR survive the round-trip alongside non-NULL values.
    • testSchemasSidecar -- basic schemas emission.
    • testSchemaEvolutionMapping -- ADD COLUMN bumps structureVersion 0 to 1; both versions in schemas.json; manifest segments reference correct versions.
    • testDropColumnSchemaEvolution -- DROP COLUMN counterpart; verifies the dropped column is absent from both the post-drop schema AND the post-drop recovered Parquet.
    • testRenameColumnSchemaEvolution -- RENAME COLUMN counterpart; verifies the pre- and post-rename names land in their respective __schemas.json entries, and that manifest segments reference the correct version each.
    • testMultiSegmentRecovery -- two ALTERs roll the WAL into three segments; verifies three Parquet files are emitted with correct per-segment row counts and the total matches the inserted row count via parquet_scan.
    • testUnreadableEventRecordedAsPlaceholder -- truncated _event produces UNKNOWN_EVENT_UNREADABLE entry, manifest segment marked unrecoverable, no Parquet emitted (regression test for the now-fixed bogus-row fabrication).
    • testUnderscorePrefixedUserTable -- discoverTables surfaces user tables whose directory name starts with _ (QuestDB's isValidTableName permits the prefix).
    • testNoShoulderFlag -- --no-shoulder omits the shoulder columns.
    • V2 integration class (separate engine with V2 sequencer): testV2TxnlogFormatVersionRecorded (V2 selection sanity), testV2Tier3RecoversFromCorruptEvent (truncate _event, assert __tier3.parquet carries the correct row count from txnlog), and testV2Tier3PerRowRecoveryStatusAcrossTxns (apply watermark falls inside a multi-txn segment; per-row _recovery_status_ must reflect each row's own seqTxn, not a segment-wide constant; asserts specific row counts per (seqTxn, status) bucket).
    • Tier-2 integration class (isolated engine): testTier2CorruptTxnlog -- truncate _txnlog, assert filesystem-scan fallback produces __tier2.parquet with correct row count and the manifest's txnLog.status is error.
    • Tier-4 integration class (isolated engine, V2 sequencer): testTier4PeerMetaFallback -- corrupt both _event and _meta for one segment, verify the peer-meta fallback inside the tier-3 path borrows the schema from an intact peer segment and emits Parquet with a tier-4 note in skippedColumns.
    • Partial-file integration class (isolated engine): testPartialColumnFileLoss -- delete one column's .d file; assert manifest surfaces the specific missing column AND the segment is marked skipped_reader_open_failed AND no Parquet is emitted.
  • Selected high-leak-risk integration tests (testHappyPathRecovery, testPartialColumnFileLoss, testV2Tier3PerRowRecoveryStatusAcrossTxns) wrap their bodies in a local assertMemoryLeak helper that tracks MemoryTag.NATIVE_DEFAULT against an 8 KB engine-warmup slack. The harness pins the partial-malloc-leak fixes in synthesizeSymbolBuffersPerTxn and makeRecoveryStatusColumn against future regression.

javier added 3 commits May 26, 2026 17:55
A forensic recovery tool that exports un-applied WAL data to Parquet
files. Designed for the case where a table is suspended or its base
storage is corrupt: the operator wants to extract whatever survives in
the WAL before scrubbing the table.

Lives in the utils/ module alongside RebuildIndex/RecoverVarIndex.
Strict read-only - no CairoEngine boot, no writes, no exclusive locks.
Safe to run against a live instance under snapshot semantics: anything
visible in _txnlog at the time of the walk is in scope, anything
appended after is ignored. The _txnlog header is read through a
transient RO file handle, then TableTransactionLogV1/V2 is constructed
without calling open() (which would mmap _txnlog read-write); the
cursor opens its own RO handles internally for the record walk.

Discovery scans the db root, picks up every directory that has a
txn_seq/_txnlog file, filters out hidden entries, and by default skips
sys.* tables (override with --include-system). Tables whose WAL is
fully purged are skipped unless --include-empty is set. Single-table
mode via --table-dir is still available for targeted runs.
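The discovery rule can be sketched with java.nio. This is an illustration under assumptions: "hidden" is taken to mean a dot-prefixed directory name, the `--include-empty` check is left out, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Illustrative only; not the tool's actual discovery code.
final class WalTableDiscovery {
    // Returns the names of directories under dbRoot that look like WAL tables.
    static List<String> discover(Path dbRoot, boolean includeSystem) {
        List<String> tables = new ArrayList<>();
        try (Stream<Path> entries = Files.list(dbRoot)) {
            entries.forEach(dir -> {
                String name = dir.getFileName().toString();
                if (!Files.isDirectory(dir) || name.startsWith(".")) {
                    return; // not a directory, or hidden
                }
                if (!includeSystem && name.startsWith("sys.")) {
                    return; // system tables are skipped by default
                }
                if (Files.isRegularFile(dir.resolve("txn_seq").resolve("_txnlog"))) {
                    tables.add(name);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Collections.sort(tables); // directory listing order is not guaranteed
        return tables;
    }
}
```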

Per (walId, segmentId) the tool emits one Parquet file named
<tableName>__wal<walId>__seg<segId>__seqTxn<lo>-<hi>.parquet under
--output-dir. The committed row count comes from the segment's _event
file (canonical source, works on both V1 and V2 sequencer logs).
Segments referenced by _txnlog but missing on disk - the WAL-purge
scenario - degrade gracefully with a one-line manifest entry and the
walk continues.

For each column the PartitionDescriptor is built directly from
WalReader's mmap'd memory: VARCHAR/BINARY/ARRAY and fixed-width types
are zero-copy. Two special cases:

The designated timestamp column on WAL disk is 16 bytes per row
(timestamp followed by rowID for O3 handling), not 8. The encoder
expects 8 bytes per row, so we allocate a compact buffer and
stride-copy only the timestamp halves. The detection key is
column == reader.getTimestampIndex(), which fires for both
TIMESTAMP and TIMESTAMP_NS designated timestamps.
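The stride-copy can be illustrated on heap buffers. The tool itself works on mmap'd native memory, so this is only a sketch of the layout transformation; little-endian byte order and the class name are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: compacts 16-byte (timestamp, rowID) pairs to 8-byte timestamps.
final class TimestampCompactor {
    static final int WAL_STRIDE = 16; // bytes per row in the WAL timestamp column

    static ByteBuffer compact(ByteBuffer walColumn, int rowCount) {
        ByteBuffer src = walColumn.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        ByteBuffer dst = ByteBuffer
                .allocate(Math.multiplyExact(rowCount, Long.BYTES))
                .order(ByteOrder.LITTLE_ENDIAN);
        for (int row = 0; row < rowCount; row++) {
            // Copy the first 8 bytes of each 16-byte slot, skip the rowID half.
            dst.putLong(src.getLong(Math.multiplyExact(row, WAL_STRIDE)));
        }
        dst.flip();
        return dst;
    }
}
```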

SYMBOL columns are trickier: the WAL's local <column>.c/.o/.k files
at wal<N>/ only hold the cleanSymbolCount snapshot from the base
table. New symbols added during this WAL's commits live in _event
and aren't on disk in the same format. WalReader.getSymbolValue
already merges both sources into an in-memory map. The tool walks
each column's .d file for the maximum referenced key, then for every
key from 0..maxKey resolves the value via WalReader.getSymbolValue
and synthesises a (values, offsets) buffer pair in native memory in
the layout PartitionEncoder's native code expects. Validated by
round-tripping the live demo_trades_today table's wal4/seg0 (16 rows)
and wal5/seg0 (10,015 rows) via parquet_scan.

The 15-line WalDirectoryPolicy stub doesn't warrant its own file. Move
it to a private static nested class inside WalToParquet so the utility
lives in a single source file, matching the spirit of the other
single-file utilities in the cliutil package.

Builds on the initial single-file utility with the items the original
forensic-recovery plan called for:

Manifest: per-table JSON manifest sits next to the Parquet files and
records every segment the tool saw (written, skipped, or partial), every
structural-change transaction in seqTxn order, the txnlog format/maxTxn
header, and per-segment reasons. The operator now has a machine-readable
trail of what was recovered vs lost.

Per-row shoulder columns: _wal_id, _segment_id, _segment_txn, _seq_txn,
_commit_ts are emitted by default so downstream consumers can dedupe
recovered rows against whatever survives in the base table.
_segment_txn is derived per-row by replaying the segment's _event;
_seq_txn and _commit_ts are mapped from the txnlog records. Opt out
with --no-shoulder.

Tier 2 (txnlog corrupt or missing): falls back to a filesystem scan of
wal*/N/ directories. Cross-segment seqTxn ordering is unknown so files
are written as <table>__wal<N>__seg<N>__tier2.parquet and the manifest
reflects it.

Tier 3 (segment _event corrupt or missing): row count is derived from
the timestamp .d file size (16 B/row in WAL), and a direct-mmap
emission path bypasses WalReader (which itself requires _event). SYMBOL
columns are resolved from the WAL's on-disk symbol files - the base
table snapshot at WAL open time - so codes referencing new-in-WAL
symbols (which lived only in _event) are clamped to NULL. The manifest
notes how many rows were affected per column. Files are suffixed
__tier3.parquet.

Tier 4 (segment _meta corrupt): peer-segment schema substitution. The
tool scans other wal*/N/ directories for the same table, finds the
first segment with a readable _meta, and uses that as a schema source.
The schema may not match exactly if columns were added or dropped
between segments; the manifest flags this. Composes with tier 3 so a
segment missing both _meta and _event still recovers.

Partial column file loss: pre-checks every column's .d (and .i for
var-size) before opening WalReader and records each missing file in the
manifest's skippedColumns list. Even if WalReader subsequently fails to
construct because of the loss, the manifest still tells the operator
exactly which column files are gone.

Tests: 10 unit tests for the Args parser, the parquet filename builder
across all tier combinations, and TableInfo.fromDirName parsing.
Integration tests with programmatic WAL construction would need
CairoEngine setup and are left as follow-up.

README: build instructions and a Running section with the Java 17+
module-access flags moved to the top of utils/README.md. A new section
documents WalToParquet with options, output layout, tier suffix
matrix and worked examples.
@javier javier added New feature Feature requests WAL labels May 26, 2026
coderabbitai Bot commented May 26, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


Per-txn SYMBOL resolution: QuestDB's WAL writer can reassign the same
numeric symbol code to different strings across transactions inside one
segment (most visibly after a suspend/resume cycle, where the WAL
writer's local symbol space resets and code 0 in batch 2 means a
different string than code 0 in batch 1). The previous synthesis used
WalReader.getSymbolValue, which exposes a single merged map and
silently lets later diffs overwrite earlier ones. The new
synthesizeSymbolBuffersPerTxn walks _event, snapshots each
segmentTxn's effective dictionary for the column, resolves every row
through its own transaction's snapshot, then deduplicates strings into
a single global dictionary with newly assigned dense codes. The
remapped .d buffer plus the new chars/offsets buffers are passed to
PartitionEncoder. Validated against a live table where the bug was
reproducible: batch 1 rows now correctly show BTC/ETH/XRP/SOL and
batch 2 rows correctly show NEW-AAA/BBB/CCC, with seven distinct
symbols across the recovered Parquet instead of four.

SQL log sidecar: a new <tableName>__sql_log.json captures every
non-DATA transaction observed in any of the table's WAL segments
(UPDATE, TRUNCATE, view definitions, mat-view invalidation). For each
it records (walId, segmentId, segmentTxn, seqTxn, commitTimestamp,
type, sql). These transactions do not materialise as rows in the
Parquet output - their effect would require replay against a live
table - so the sidecar is the only record of what work the WAL
contained beyond raw inserts. Operators reconcile the sidecar against
the data files via seqTxn.

Per-row shoulder column rename: _seq_txn was renamed to _txnSeq_ for
naming consistency with the sidecar's seqTxn field. The column carries
the global QuestDB sequencer transaction id that originally wrote each
row, so an UPDATE statement recorded in the sidecar at seqTxn=N can be
located inside the Parquet files via _txnSeq_=N.

Integration tests under utils/src/test: a real WAL is built via
CairoEngine and SqlExecutionContextImpl in a temp data root,
ApplyWal2TableJob is deliberately not started so rows stay unapplied,
WalToParquet.main is invoked against the temp root, and the recovered
Parquet is read back through parquet_scan to assert content and
schema. Coverage:

- testHappyPathRecovery exercises ts/long/symbol/double with shoulder
  columns enabled.
- testAllDataTypesRoundtrip covers the full type surface:
  TIMESTAMP, TIMESTAMP_NS, SYMBOL, DECIMAL(20,4), DOUBLE,
  DOUBLE[][], UUID, FLOAT, LONG, BINARY, VARCHAR; asserts
  aggregates plus a row-level length check on BINARY.
- testSqlLogCapturesUpdates runs an INSERT followed by an UPDATE,
  both stay in WAL, and verifies __sql_log.json captures the UPDATE
  statement with type=SQL and the original SQL text.
- testNoShoulderFlag confirms --no-shoulder omits the provenance
  columns.

README: WalToParquet section moved to the last entry under utils/
(it is the longest section, ends with worked examples). Updated to
document the new _txnSeq_ column name, the __sql_log.json sidecar,
and the per-txn SYMBOL handling note. Build and Running instructions
were already at the top of the file.
javier changed the title from "feat(util): add tool to recover unapplied WAL data as Parquet files" to "feat(utils): add tool to recover unapplied WAL data as Parquet files" on May 27, 2026
javier added 4 commits May 27, 2026 14:48
Recovery status column: a sixth shoulder column _recovery_status_ flags
each row as "unapplied", "applied_unpurged" or "unknown" by comparing
its _txnSeq_ against the table's appliedSeqTxn watermark read from
_txn at the table root. The watermark is read with TxReader semantics
- pick the active record via version parity, then read seqTxn plus
absolute lag txn count from that base, matching TableWriter.
getAppliedSeqTxn(). Manifest also surfaces appliedSeqTxn directly.
Operators can filter "applied_unpurged" rows after recovery if they
trust the committed partitions, or keep them all when the committed
file is the thing being scrubbed.

Schemas sidecar: <tableName>__schemas.json captures the column list
at each distinct structureVersion observed across the table's WAL
segments. Each entry records name, type, writerIndex, and
isDesignatedTimestamp. Versions are emitted in ascending numeric
order so the file is deterministic across runs regardless of File.
listFiles() order. The original ALTER SQL statements that produced
each transition are not extracted - that would need CairoEngine or
a full reimplementation of AlterOperation's binary format - so this
sidecar is the authoritative schema-evolution record.

ManifestSegment.structureVersion: each segment in the manifest now
carries the structureVersion it was written under. Set during the
existing recordMissingColumnFiles pass for happy-path segments, and
again right after tier-3 selects either own or peer-fallback _meta.
This closes the operator workflow chain: Parquet file -> manifest
segment entry -> structureVersion -> schemas.json[version] -> column
list.

Native memory accounting: happy-path symbol synthesis now tracks
each offsets buffer's size so Unsafe.free can deduct the real amount
from MemoryTag.NATIVE_DEFAULT counters. Tier-3 path was leaking all
but the last SYMBOL column's clamped-codes buffer because tracking
used scalar fields rather than a LongList. Replaced with parallel
LongLists so every allocation is freed.

JSON readability: disable Gson's HTML-escaping so sidecar files print
literal '<', '>', '=' and quote characters instead of \u003c-style
escapes. Operators read sql_log.json by hand.

Tests: three new integration tests bring the count to 17.
testRecoveryStatusColumn verifies unapplied rows get tagged when
ApplyWal2TableJob never runs. testSchemasSidecar covers the basic
schemas file emission. testSchemaEvolutionMapping inserts rows at
structureVersion 0, runs ALTER TABLE ADD COLUMN, inserts more rows
at version 1, then asserts both versions appear in __schemas.json
with correct column counts AND each written manifest segment
references the correct structureVersion. testHappyPathRecovery's
column-count assertion bumped from 10 to 11 to account for the new
_recovery_status_ shoulder column.

README: documents _recovery_status_, the appliedSeqTxn watermark,
the __schemas.json sidecar with deterministic version ordering, and
the operator workflow for mapping a Parquet file back to its schema
via manifest -> structureVersion -> schemas.

Tier-3 fallback used to derive a segment's rowCount from the timestamp
column's .d file size on disk. WAL column files are mmap-preallocated,
so the file length reports capacity not committed appends - in the
unreadable-event test setup that meant a 1-row table was being
"recovered" as 65,536 rows of fabricated data. The new behaviour:

- Each txnlog cursor record's getTxnRowCount() is captured into the
  per-segment txnRowCounts list during enumerateSegments. V1
  sequencer format throws UnsupportedOperationException for that
  call, so V1 slots stay at -1.
- The tier-3 fallback now sums those per-txn counts via
  sumTxnRowCounts(). If any referenced segmentTxn lacks a trustworthy
  row count (i.e., V1 format, or tier-2 mode where the txnlog itself
  is unreadable) the helper returns -1 and the segment is marked
  unrecoverable with a new manifest status
  skipped_event_unreadable_no_row_count. No Parquet is written for
  that segment - silent fabrication of capacity-byte rows is worse
  than refusing to emit.
- A zero-row sum gets its own status
  skipped_event_unreadable_zero_rows so operators can tell "no data
  to recover" from "couldn't determine row count".
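The row-count rule above can be sketched as follows. The method name `sumTxnRowCounts` and the two skipped statuses come from this commit message; treating an empty list as "no trustworthy count" and the `written` label are assumptions made for the illustration.

```java
// Illustrative only; not the tool's actual code.
final class TxnRowCounts {
    // Returns the total row count, or -1 if any transaction lacks a
    // trustworthy count (a slot of -1, or no counts at all).
    static long sumTxnRowCounts(long[] txnRowCounts) {
        if (txnRowCounts == null || txnRowCounts.length == 0) {
            return -1;
        }
        long sum = 0;
        for (long count : txnRowCounts) {
            if (count < 0) {
                return -1; // e.g. V1 sequencer format: count unavailable
            }
            sum += count;
        }
        return sum;
    }

    // Maps the summed count to a segment outcome; "written" is a placeholder label.
    static String tier3Status(long rowCount) {
        if (rowCount < 0) {
            return "skipped_event_unreadable_no_row_count";
        }
        if (rowCount == 0) {
            return "skipped_event_unreadable_zero_rows";
        }
        return "written";
    }
}
```

A single unknown slot poisons the whole sum on purpose: emitting a partial count would reintroduce the silent-fabrication problem in a smaller form.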

Event-walk hardening (per-event try/catch). collectNonDataEvents now
uses three nested catches:
- _event file unopenable (er.of() throws): one
  UNKNOWN_EVENT_UNREADABLE entry scoped to the segment is emitted and
  we move on. The WalEventReader is also closed via try-with-resources
  whether construction or of() fails (previously er.of() could leak
  the reader).
- hasNext()/getType()/getTxn() throws: WalEventCursor reads the full
  record - including dispatch on the type byte - inside hasNext(), so
  an unknown OSS-invisible type byte surfaces here as an exception
  with no segmentTxn known. One UNKNOWN_EVENT_UNREADABLE entry is
  recorded with the (walId, segmentId) and error string, then we
  stop - cursor state is undefined past this point.
- SQL body parse failure with valid header: the entry preserves
  type/walId/segmentId/segmentTxn/seqTxn/commitTimestamp and carries
  the parse error in a new "error" field; iteration continues.

New "error" field on ManifestSqlStatement keeps error context
separate from the "sql" text so downstream consumers can tell "we
have SQL" from "we couldn't read it".

Schemas mapping per segment. ManifestSegment.structureVersion was
populated only in the early recordMissingColumnFiles() pass. When
that pass failed but tier-3 later recovered the metadata via
peer-segment fallback, the manifest still carried -1 for the
segment's structureVersion, making __schemas.json unable to map
back to the file. Now writeSegmentToParquetTier3 sets
entry.structureVersion = meta.getMetadataVersion() right after the
metadata it will actually use is selected, covering the peer-fallback
path.

Schemas sorted by numeric structureVersion. The byVersion map was
populated by File.listFiles() which has no guaranteed order, so the
schemas file's emission order was filesystem-dependent. Now keys are
collected, sorted with Collections.sort, and re-inserted into the
output LinkedHashMap so the file is deterministic across runs.
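A minimal sketch of that ordering step, assuming the versions are held as `Long` keys; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: re-inserts entries in ascending numeric key order.
final class SchemaOrdering {
    static <V> LinkedHashMap<Long, V> sortedByVersion(Map<Long, V> byVersion) {
        List<Long> keys = new ArrayList<>(byVersion.keySet());
        Collections.sort(keys); // numeric order, so 2 sorts before 10
        LinkedHashMap<Long, V> out = new LinkedHashMap<>();
        for (Long key : keys) {
            out.put(key, byVersion.get(key));
        }
        return out;
    }
}
```

Sorting on the numeric key rather than its string form matters once versions reach two digits, since "10" would otherwise sort before "2".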

Gson HTML-escaping disabled across all three sidecars. SQL log
previously printed \u003e\u003d for ">=" and \u0027 for "'", making
the file barely readable by hand. New output writes the literal
characters.

Tests bring the count to 18 (10 unit + 8 integration). The
unreadable-event test now asserts no Parquet is emitted, the
manifest's segment carries the unrecoverable status, rowsWritten=0
and outputFile=null - the silent 65k fabrication is gone.

README updated to: clarify the tier-3 row in the suffix matrix
(V2-only recovery via txnlog row counts, V1 marked unrecoverable),
spell out the UNKNOWN_EVENT_UNREADABLE failure shapes accurately,
describe the schemas-version-to-segment join, and list
_recovery_status_ as the sixth shoulder column.

Closes the partial-malloc-leak windows around the new
trackNativeAllocation helper by wrapping the three-call
registration sequences with flag-tracked orphan cleanup so a
LongList resize-OOM mid-sequence cannot orphan the tail buffers.

Tightens tier-3 fixes from prior rounds:
- int clampedSize widened to long to avoid overflow at rowCount
  near Integer.MAX_VALUE
- SymbolMapReaderImpl is registered into the cleanup pool before
  sr.of() runs so a corrupt-symbol-files throw cannot leak the
  reader
- columnMemories.add for mem and aux moved inside the inner try
  so a resize-OOM during registration cannot leak the mmap
- per-row recovery status reconstructed from txnlog row counts
  (previously a single segment-wide constant) so multi-txn
  segments get correct per-row applied/unapplied attribution
- reconciliation note in the manifest if the txnlog row counts
  cover fewer rows than rowCount
- Path try/finally around Vm.getCMRInstance calls
- maxTxn watermark refreshed from observed seqTxn after the walk
- discoverTables filter loosened so user tables starting with
  underscore are not silently dropped
- recordMissingColumnFiles deferred to the WalReader-failure
  path, with a cheap recordSegmentStructureVersion called on the
  happy path
- RECOVERY_STATUS dictionary pinned to its index constants by a
  runtime check that fires even without -ea

Adds defensive infrastructure:
- TestUtils.assertMemoryLeak harness with named 8 KB slack for
  pool warmup; 8 of 14 main integration tests wrapped
- trackNativeAllocation helper that frees its pair's buffer on a
  resize-OOM and rolls back the addr list

Adds coverage for paths previously uncovered:
- testV2Tier3PerRowRecoveryStatusAcrossTxns regression for the
  multi-txn per-row recovery status
- testV2Tier3RecoversFromCorruptEvent for the V2 tier-3 happy
  path
- testTier2CorruptTxnlog in its own class to isolate
  engine.clear() blast radius
- testTier4PeerMetaFallback for the corrupt-_meta peer fallback
  via V2 tier-3
- testPartialColumnFileLoss in its own class for missing .d
  files
- testNullValuePreservation across LONG, DOUBLE, SYMBOL, VARCHAR
- testDropColumnSchemaEvolution and testRenameColumnSchemaEvolution
  verifying recovered Parquet content per structureVersion
- testMultiSegmentRecovery exercising ALTER-driven segment rolls
- testUnderscorePrefixedUserTable regression for discoverTables

Fixes a pre-existing UUID/LONG128 comparison bug in
TestUtils.assertColumnValues (lr.getLong128Hi swapped to
getLong128Lo for the low-half comparison).

Adjusts utils/pom.xml to keep Java compiler target at 17
matching the rest of the project. README documents the
manifest's tier-2 sqlStatements/structuralChanges gap and the
Gson rationale.

All 30 tests pass.
javier commented May 30, 2026

Level-3 review result (pass 5)

Summary

Verdict: approve. No Critical. One real Moderate (M1: class-init ordering fragility, which only triggers if a future contributor reorders the static fields alphabetically). Seven Minor items, all cosmetic or coverage gaps.

Findings tally: 0 Critical, 1 Moderate, 7 Minor verified. ~6 raw agent claims downgraded after source verification (Agent 10's seqTxn + abs(lag) claim — same false positive as pass #1; F6 NPE; F4 mmap; PR body wording).

In-diff vs out-of-diff: all 8 findings are in-diff. Zero out-of-diff regressions. Agent 9 reconfirmed zero production callers and clean upstream contracts.

Comparative trajectory across five passes:

  • Pass 1: 3 Critical + 8 Moderate
  • Pass 2: 3 Critical + 6 Moderate
  • Pass 3: 0 Critical + 5 Moderate
  • Pass 4: 0 Critical + 5 Moderate
  • Pass 5 (this): 0 Critical + 1 Moderate

The PR has reached the asymptote of what level-3 will find. Remaining items are cosmetic or coverage gaps that exist in many comparable PRs across QuestDB. Ready for human approval.

@questdb-butler

⚠️ Enterprise CI Failed

The enterprise test suite failed for this PR.

Build: View Details
Tested Commit: 4e6ef9276707914288e050390be2a2ced87af0b7

Please investigate the failure before merging.
