mruby-regexp: Fix String#gsub block handling for zero-width matches#6893
Conversation
String#gsub {} を切り出した rest ではなく、元の文字列 self と検索位置 pos で進める実装に変更した。
There was a problem hiding this comment.
Code Review
This pull request refactors the block-based implementation of String#gsub in mruby-regexp to use index-based matching (pos) instead of repeatedly slicing the remaining string, and adds tests for zero-width matches. However, the reviewer identified a critical issue where character offsets and byte offsets are mixed up when UTF-8 is enabled, which will break gsub on multi-byte UTF-8 strings. The reviewer provided a code suggestion to consistently use bytesize and byteslice to resolve this mismatch.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| pos = 0 | ||
| while pos <= self.length | ||
| md = pattern.match(self, pos) | ||
| break unless md | ||
| parts << md.pre_match | ||
| match_start = md.begin(0) | ||
| match_end = md.end(0) | ||
| parts << self[pos...match_start] | ||
| parts << block.call(md[0]).to_s | ||
| matched_len = md[0].length | ||
| if matched_len == 0 | ||
| parts << rest[0] if rest.length > 0 | ||
| rest = rest[1..-1] || "" | ||
| if match_start == match_end | ||
| parts << self[match_end] if match_end < self.length | ||
| pos = match_end + 1 | ||
| else | ||
| rest = md.post_match | ||
| pos = match_end | ||
| end | ||
| end | ||
| parts << rest | ||
| parts << self[pos..-1] |
There was a problem hiding this comment.
There is a mismatch between character offsets and byte offsets when handling UTF-8 strings.
In mruby-regexp, MatchData#begin and MatchData#end return byte offsets, and Regexp#match expects a byte offset for its second argument. However, self.length and self[...] operate on character offsets when UTF-8 is enabled.
This mismatch causes incorrect slicing and breaks gsub on multi-byte UTF-8 strings. To fix this, we should use self.bytesize and self.byteslice to consistently work with byte offsets. When advancing after a zero-width match, we can safely slice the remainder of the string to find the byte size of the next character, ensuring we do not split multi-byte UTF-8 characters.
pos = 0
len = self.bytesize
while pos <= len
md = pattern.match(self, pos)
break unless md
match_start = md.begin(0)
match_end = md.end(0)
parts << self.byteslice(pos, match_start - pos)
parts << block.call(md[0]).to_s
if match_start == match_end
rest = self.byteslice(match_end..-1)
if rest && rest.bytesize > 0
char = rest[0]
parts << char
pos = match_end + char.bytesize
else
pos = match_end + 1
end
else
pos = match_end
end
end
parts << self.byteslice(pos..-1)There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c784755bdc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| while pos <= self.length | ||
| md = pattern.match(self, pos) |
There was a problem hiding this comment.
Avoid matching multiline ^ after the final newline
With the new pos <= self.length loop, block gsub asks the regexp engine for another zero-width match after it has copied a trailing newline. For multiline anchors this adds an extra replacement at EOF, e.g. "\n".gsub(/^/m) { "!" } now returns "!\n!" whereas Ruby behavior (and the previous block implementation for this input) is "!\n"; the same affects strings ending in \n such as log/text lines.
Useful? React with 👍 / 👎.
Add UTF-8 zero-width match coverage to ensure gsub advances by a whole character without splitting multibyte strings.
This fixes
String#gsubwith a block when the regexp matches a zero-width position.Previously, the block form of
String#gsubinmruby-regexpsearched against a sliced remainder stringafter each zero-width match. That made anchors such as
^match the beginning of each sliced remainder, so:returned "!a!b!c" instead of Ruby's "!abc".
The implementation now keeps searching against the original string and advances an explicit position,
preserving anchor semantics while still making progress after zero-width matches.