Skip to content

Add creation of col stats from row_range or date_range#3154

Open
markovskipetar wants to merge 1 commit into
masterfrom
col_stats_from_date_or_row_range
Open

Add creation of col stats from row_range or date_range#3154
markovskipetar wants to merge 1 commit into
masterfrom
col_stats_from_date_or_row_range

Conversation

@markovskipetar

@markovskipetar markovskipetar commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds optional date_range and row_range. When provided, column stats are only (re)computed for row-slices overlapping the given range — existing stats for out-of-range slices are kept, in-range slice stats are replaced. The two parameters are mutually exclusive.

@github-actions

github-actions Bot commented Jun 5, 2026

Copy link
Copy Markdown

Label error. Requires exactly 1 of: patch, minor, major. Found:

if not column_stats:
return

column_stats = self._convert_to_native_column_stats(column_stats)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._convert_to_native_column_stats(...) does not exist anywhere in the codebase — the only definition is _get_column_stats (line 2439). Every call to create_column_stats_experimental will raise AttributeError before doing any work, so the feature is completely non-functional. This also confirms there is no test exercising this path (a single test would have caught it). Use the existing helper:

Suggested change
column_stats = self._convert_to_native_column_stats(column_stats)
column_stats = self._get_column_stats(column_stats)


def create_column_stats(
self, symbol: str, column_stats: Dict[str, Set[str]], as_of: Optional[VersionQueryInput] = None
def create_column_stats_experimental(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renaming the public NativeVersionStore.create_column_stats to create_column_stats_experimental is a breaking API change with no deprecation cycle. Existing callers break immediately — e.g. python/benchmarks/column_stats.py still calls lib._nvs.create_column_stats(...) in 4 places. Per the deprecation protocol (PR_REVIEW_GUIDELINES §2), keep create_column_stats as a working alias (or thin wrapper) and update all in-repo callers, rather than removing the name outright.

Comment on lines +93 to +135
SegmentInMemory merge_column_stats_with_range_replacement(
const SegmentInMemory& old_segment, SegmentInMemory new_segment, const entity::TimestampRange& range_replaced
) {
const size_t old_row_count = old_segment.row_count();
std::vector<SegmentInMemory> to_merge;
to_merge.reserve(3);

size_t first_in_range = old_row_count;
size_t last_in_range = 0;
bool found_overlap = false;

for (size_t row = 0; row < old_row_count; ++row) {
auto start_opt = old_segment.scalar_at<entity::timestamp>(row, start_index_column_offset);
auto end_opt = old_segment.scalar_at<entity::timestamp>(row, end_index_column_offset);

util::check(
start_opt.has_value() && end_opt.has_value(),
"Missing start/end index in old column stats segment at row {}",
row
);

entity::TimestampRange row_range{*start_opt, *end_opt};
if (entity::intersects(row_range, range_replaced)) {
if (!found_overlap) {
first_in_range = row;
found_overlap = true;
}
last_in_range = row;
}
}

if (!found_overlap) {
to_merge.emplace_back(old_segment.clone());
} else {
if (first_in_range > 0) {
to_merge.emplace_back(old_segment.truncate(0, first_in_range, false));
}
if (last_in_range + 1 < old_row_count) {
to_merge.emplace_back(old_segment.truncate(last_in_range + 1, old_row_count, false));
}
}
to_merge.emplace_back(std::move(new_segment));
return merge_column_stats_segments(to_merge);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two correctness concerns with the range-restricted RMW merge:

  1. Schema mismatch silently corrupts stats. In the is_filtered path the existing-stats drop optimization is skipped (version_core.cpp:2006), so new_segment holds stats for whatever column_stats was requested. If that set differs from the old segment's stat columns (e.g. user passes an explicit column_stats subset, or the eligible-column set changed between calls), merge_column_stats_segments unions columns by name: in-range rows will lack the old-only stat columns and out-of-range rows will lack the new-only columns, producing a sparse/inconsistent stats segment. The doc comment states "Both segments must share the same stat-column schema" but nothing validates or enforces it. Either assert the schemas match (like the non-filtered branch does at version_core.cpp:2081) or handle the differing-column case explicitly.

  2. Closed-interval intersects vs exclusive end_index. entity::intersects is closed on both ends (left.first <= right.second && left.second >= right.first). If the column-stats end_index is stored one-past-the-end (exclusive, as index keys are), an old row whose end_index equals range_replaced.first will be falsely treated as overlapping and dropped. Please confirm the end-index convention here and adjust the comparison if it is exclusive.

@claude

claude Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

ArcticDB Code Review Summary

This PR adds the ability to restrict column-stats creation to a row_range/date_range and to default to all eligible columns. The core feature does not currently work and ships with no tests.

Correctness

  • create_column_stats_experimental is dead on arrival — it calls self._convert_to_native_column_stats(...), which is undefined anywhere in the repo (the real helper is _get_column_stats). Every invocation raises AttributeError before doing any work. See inline comment on _store.py:1323.
  • Range-restricted RMW merge can silently corrupt stats — when the requested stat-column set differs from the existing one, merge_column_stats_with_range_replacement produces a sparse/inconsistent segment (in-range rows missing old columns, out-of-range rows missing new columns). No schema validation enforces the documented same-stat-column-schema precondition. See inline on column_stats.cpp:93-135.
  • ⚠️ Closed-interval intersects vs exclusive end_index — if column-stats end_index is stored one-past-the-end, the closed-interval overlap test will drop adjacent non-overlapping slices (off-by-one). Please confirm the convention. See inline on column_stats.cpp.

API & Compatibility

  • Breaking rename without deprecationNativeVersionStore.create_column_stats is removed and replaced by create_column_stats_experimental. This violates the deprecation protocol (section 2) and breaks existing callers, including python/benchmarks/column_stats.py (4 call sites still use the old name). Keep create_column_stats as a working alias and update in-repo callers. See inline on _store.py:1269.

Testing

  • No tests added. This is a new public-API feature with new C++ merge logic (merge_column_stats_with_range_replacement) and zero accompanying tests. Per CLAUDE.md's TDD requirement and section 16, every behavioural change needs a failing-then-passing test. A single test would have caught the _convert_to_native_column_stats bug above. Add Python tests covering: column_stats=None default, date_range, row_range, the mutually-exclusive error, the no-overlap no-op, and RMW correctness (in-range replaced, out-of-range preserved).

PR Title & Description

  • ⚠️ Title is reasonable, but the PR has no description and no labels. The public-API rename and the new _experimental surface are user-facing and must be called out; the PR should be labelled (enhancement) and is not eligible for no-release-notes.

Documentation

  • ⚠️ The new feature lives only on NativeVersionStore with a docstring; no docs/claude/ technical doc or tutorial update accompanies the new range-restricted/RMW behaviour. Confirm whether the update-docs checklist applies given the _experimental status.

// [start_index, end_index] lies fully outside `range_replaced`, and inserts the rows from
// `new_segment` (covering the in-range row-slices). Both segments must share the same
// stat-column schema. Used by the range-restricted RMW path in create_column_stats_impl.
SegmentInMemory merge_column_stats_with_range_replacement(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think let's just leave this merging with the old stats stuff out, at least for now, as it's a bit complicated and I'm not certain that we will need it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants