Skip to content

mruby-regexp: Fix String#gsub block handling for zero-width matches#6893

Merged
matz merged 3 commits into
mruby:masterfrom
nattzn:fix/gsub-bol-zero-width
Jun 15, 2026
Merged

mruby-regexp: Fix String#gsub block handling for zero-width matches#6893
matz merged 3 commits into
mruby:masterfrom
nattzn:fix/gsub-bol-zero-width

Conversation

@nattzn

@nattzn nattzn commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

This fixes String#gsub with a block when the regexp matches a zero-width position.

Previously, the block form of String#gsub in mruby-regexp searched against a sliced remainder string
after each zero-width match. That made anchors such as ^ match the beginning of each sliced remainder, so:

"abc".gsub(/^/) { "!" }

returned "!a!b!c" instead of Ruby's "!abc".

The implementation now keeps searching against the original string and advances an explicit position,
preserving anchor semantics while still making progress after zero-width matches.

String#gsub {} を切り出した rest ではなく、元の文字列 self と検索位置 pos で進める実装に変更した。
@nattzn nattzn requested a review from matz as a code owner June 15, 2026 07:30

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the block-based implementation of String#gsub in mruby-regexp to use index-based matching (pos) instead of repeatedly slicing the remaining string, and adds tests for zero-width matches. However, the reviewer identified a critical issue where character offsets and byte offsets are mixed up when UTF-8 is enabled, which will break gsub on multi-byte UTF-8 strings. The reviewer provided a code suggestion to consistently use bytesize and byteslice to resolve this mismatch.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +38 to +53
pos = 0
while pos <= self.length
md = pattern.match(self, pos)
break unless md
parts << md.pre_match
match_start = md.begin(0)
match_end = md.end(0)
parts << self[pos...match_start]
parts << block.call(md[0]).to_s
matched_len = md[0].length
if matched_len == 0
parts << rest[0] if rest.length > 0
rest = rest[1..-1] || ""
if match_start == match_end
parts << self[match_end] if match_end < self.length
pos = match_end + 1
else
rest = md.post_match
pos = match_end
end
end
parts << rest
parts << self[pos..-1]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a mismatch between character offsets and byte offsets when handling UTF-8 strings.

In mruby-regexp, MatchData#begin and MatchData#end return byte offsets, and Regexp#match expects a byte offset for its second argument. However, self.length and self[...] operate on character offsets when UTF-8 is enabled.

This mismatch causes incorrect slicing and breaks gsub on multi-byte UTF-8 strings. To fix this, we should use self.bytesize and self.byteslice to consistently work with byte offsets. When advancing after a zero-width match, we can safely slice the remainder of the string to find the byte size of the next character, ensuring we do not split multi-byte UTF-8 characters.

    pos = 0
    len = self.bytesize
    while pos <= len
      md = pattern.match(self, pos)
      break unless md
      match_start = md.begin(0)
      match_end = md.end(0)
      parts << self.byteslice(pos, match_start - pos)
      parts << block.call(md[0]).to_s
      if match_start == match_end
        rest = self.byteslice(match_end..-1)
        if rest && rest.bytesize > 0
          char = rest[0]
          parts << char
          pos = match_end + char.bytesize
        else
          pos = match_end + 1
        end
      else
        pos = match_end
      end
    end
    parts << self.byteslice(pos..-1)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c784755bdc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +39 to +40
while pos <= self.length
md = pattern.match(self, pos)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid matching multiline ^ after the final newline

With the new pos <= self.length loop, block gsub asks the regexp engine for another zero-width match after it has copied a trailing newline. For multiline anchors this adds an extra replacement at EOF, e.g. "\n".gsub(/^/m) { "!" } now returns "!\n!" whereas Ruby behavior (and the previous block implementation for this input) is "!\n"; the same affects strings ending in \n such as log/text lines.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

nattzn added 2 commits June 15, 2026 17:25
Add UTF-8 zero-width match coverage to ensure gsub advances by a whole character without splitting multibyte strings.
@matz matz merged commit 9ffddd4 into mruby:master Jun 15, 2026
19 checks passed
@nattzn nattzn deleted the fix/gsub-bol-zero-width branch June 15, 2026 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants