Add creation of col stats from row_range or date_range#3154
Add creation of col stats from row_range or date_range#3154markovskipetar wants to merge 1 commit into
Conversation
|
Label error. Requires exactly 1 of: patch, minor, major. Found: |
| if not column_stats: | ||
| return | ||
|
|
||
| column_stats = self._convert_to_native_column_stats(column_stats) |
There was a problem hiding this comment.
self._convert_to_native_column_stats(...) does not exist anywhere in the codebase — the only definition is _get_column_stats (line 2439). Every call to create_column_stats_experimental will raise AttributeError before doing any work, so the feature is completely non-functional. This also confirms there is no test exercising this path (a single test would have caught it). Use the existing helper:
| column_stats = self._convert_to_native_column_stats(column_stats) | |
| column_stats = self._get_column_stats(column_stats) |
|
|
||
| def create_column_stats( | ||
| self, symbol: str, column_stats: Dict[str, Set[str]], as_of: Optional[VersionQueryInput] = None | ||
| def create_column_stats_experimental( |
There was a problem hiding this comment.
Renaming the public NativeVersionStore.create_column_stats to create_column_stats_experimental is a breaking API change with no deprecation cycle. Existing callers break immediately — e.g. python/benchmarks/column_stats.py still calls lib._nvs.create_column_stats(...) in 4 places. Per the deprecation protocol (PR_REVIEW_GUIDELINES §2), keep create_column_stats as a working alias (or thin wrapper) and update all in-repo callers, rather than removing the name outright.
| SegmentInMemory merge_column_stats_with_range_replacement( | ||
| const SegmentInMemory& old_segment, SegmentInMemory new_segment, const entity::TimestampRange& range_replaced | ||
| ) { | ||
| const size_t old_row_count = old_segment.row_count(); | ||
| std::vector<SegmentInMemory> to_merge; | ||
| to_merge.reserve(3); | ||
|
|
||
| size_t first_in_range = old_row_count; | ||
| size_t last_in_range = 0; | ||
| bool found_overlap = false; | ||
|
|
||
| for (size_t row = 0; row < old_row_count; ++row) { | ||
| auto start_opt = old_segment.scalar_at<entity::timestamp>(row, start_index_column_offset); | ||
| auto end_opt = old_segment.scalar_at<entity::timestamp>(row, end_index_column_offset); | ||
|
|
||
| util::check( | ||
| start_opt.has_value() && end_opt.has_value(), | ||
| "Missing start/end index in old column stats segment at row {}", | ||
| row | ||
| ); | ||
|
|
||
| entity::TimestampRange row_range{*start_opt, *end_opt}; | ||
| if (entity::intersects(row_range, range_replaced)) { | ||
| if (!found_overlap) { | ||
| first_in_range = row; | ||
| found_overlap = true; | ||
| } | ||
| last_in_range = row; | ||
| } | ||
| } | ||
|
|
||
| if (!found_overlap) { | ||
| to_merge.emplace_back(old_segment.clone()); | ||
| } else { | ||
| if (first_in_range > 0) { | ||
| to_merge.emplace_back(old_segment.truncate(0, first_in_range, false)); | ||
| } | ||
| if (last_in_range + 1 < old_row_count) { | ||
| to_merge.emplace_back(old_segment.truncate(last_in_range + 1, old_row_count, false)); | ||
| } | ||
| } | ||
| to_merge.emplace_back(std::move(new_segment)); | ||
| return merge_column_stats_segments(to_merge); |
There was a problem hiding this comment.
Two correctness concerns with the range-restricted RMW merge:
-
Schema mismatch silently corrupts stats. In the
is_filteredpath the existing-stats drop optimization is skipped (version_core.cpp:2006), sonew_segmentholds stats for whatevercolumn_statswas requested. If that set differs from the old segment's stat columns (e.g. user passes an explicitcolumn_statssubset, or the eligible-column set changed between calls),merge_column_stats_segmentsunions columns by name: in-range rows will lack the old-only stat columns and out-of-range rows will lack the new-only columns, producing a sparse/inconsistent stats segment. The doc comment states "Both segments must share the same stat-column schema" but nothing validates or enforces it. Either assert the schemas match (like the non-filtered branch does atversion_core.cpp:2081) or handle the differing-column case explicitly. -
Closed-interval
intersectsvs exclusiveend_index.entity::intersectsis closed on both ends (left.first <= right.second && left.second >= right.first). If the column-statsend_indexis stored one-past-the-end (exclusive, as index keys are), an old row whoseend_indexequalsrange_replaced.firstwill be falsely treated as overlapping and dropped. Please confirm the end-index convention here and adjust the comparison if it is exclusive.
ArcticDB Code Review SummaryThis PR adds the ability to restrict column-stats creation to a Correctness
API & Compatibility
Testing
PR Title & Description
Documentation
|
| // [start_index, end_index] lies fully outside `range_replaced`, and inserts the rows from | ||
| // `new_segment` (covering the in-range row-slices). Both segments must share the same | ||
| // stat-column schema. Used by the range-restricted RMW path in create_column_stats_impl. | ||
| SegmentInMemory merge_column_stats_with_range_replacement( |
There was a problem hiding this comment.
I think let's just leave this merging with the old stats stuff out, at least for now, as it's a bit complicated and I'm not certain that we will need it.
Summary
Adds optional
date_rangeandrow_range. When provided, column stats are only (re)computed for row-slices overlapping the given range — existing stats for out-of-range slices are kept, in-range slice stats are replaced. The two parameters are mutually exclusive.