mruby-regexp: Fix String#gsub block handling for zero-width matches by nattzn · Pull Request #6893 · mruby/mruby

nattzn · 2026-06-15T07:30:05Z

This fixes String#gsub with a block when the regexp matches a zero-width position.

Previously, the block form of String#gsub in mruby-regexp searched against a sliced remainder string
after each zero-width match. That made anchors such as ^ match the beginning of each sliced remainder, so:

"abc".gsub(/^/) { "!" }

returned "!a!b!c" instead of Ruby's "!abc".

The implementation now keeps searching against the original string and advances an explicit position,
preserving anchor semantics while still making progress after zero-width matches.

Changed the String#gsub {} implementation to advance through the original string self using a search position pos, rather than through a sliced-off rest.
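The difference between the two strategies can be sketched in plain Ruby. This is an illustration only, not the mruby-regexp source: the method names gsub_block_sliced and gsub_block_pos are hypothetical, the `until rest.empty?` loop condition in the old-style version is an assumption (the original loop header is not shown in the diff), and the sketch runs on CRuby, where Regexp#match and MatchData#begin/#end use character offsets.

```ruby
# Hypothetical helpers for illustration; not the mruby-regexp implementation.

# Old strategy: search the sliced remainder. Each slice looks like a fresh
# string to the regexp engine, so ^ matches at the start of every slice.
def gsub_block_sliced(str, pattern)
  parts = []
  rest = str
  until rest.empty? # assumed loop condition; not shown in the diff
    md = pattern.match(rest)
    break unless md
    parts << md.pre_match
    parts << yield(md[0]).to_s
    if md[0].empty?
      # Zero-width match: copy one character and move on to make progress.
      parts << rest[0]
      rest = rest[1..-1] || ""
    else
      rest = md.post_match
    end
  end
  parts << rest
  parts.join
end

# New strategy: always search the original string from an explicit position,
# so anchors are evaluated against the whole string.
def gsub_block_pos(str, pattern)
  parts = []
  pos = 0
  while pos <= str.length
    md = pattern.match(str, pos)
    break unless md
    match_start = md.begin(0)
    match_end = md.end(0)
    parts << str[pos...match_start]
    parts << yield(md[0]).to_s
    if match_start == match_end
      # Zero-width match: copy one character and step past it.
      parts << str[match_end] if match_end < str.length
      pos = match_end + 1
    else
      pos = match_end
    end
  end
  parts << str[pos..-1].to_s # pos may be one past the end; nil becomes ""
  parts.join
end

sliced = gsub_block_sliced("abc", /^/) { "!" }
by_pos = gsub_block_pos("abc", /^/) { "!" }
```

With these definitions, sliced reproduces the reported "!a!b!c", while by_pos agrees with the built-in "abc".gsub(/^/) { "!" }.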

gemini-code-assist

Code Review

This pull request refactors the block-based implementation of String#gsub in mruby-regexp to use index-based matching (pos) instead of repeatedly slicing the remaining string, and adds tests for zero-width matches. However, the reviewer identified a critical issue where character offsets and byte offsets are mixed up when UTF-8 is enabled, which will break gsub on multi-byte UTF-8 strings. The reviewer provided a code suggestion to consistently use bytesize and byteslice to resolve this mismatch.


gemini-code-assist · 2026-06-15T07:32:03Z

+    pos = 0
+    while pos <= self.length
+      md = pattern.match(self, pos)
       break unless md
-      parts << md.pre_match
+      match_start = md.begin(0)
+      match_end = md.end(0)
+      parts << self[pos...match_start]
       parts << block.call(md[0]).to_s
-      matched_len = md[0].length
-      if matched_len == 0
-        parts << rest[0] if rest.length > 0
-        rest = rest[1..-1] || ""
+      if match_start == match_end
+        parts << self[match_end] if match_end < self.length
+        pos = match_end + 1
       else
-        rest = md.post_match
+        pos = match_end
       end
     end
-    parts << rest
+    parts << self[pos..-1]


There is a mismatch between character offsets and byte offsets when handling UTF-8 strings.

In mruby-regexp, MatchData#begin and MatchData#end return byte offsets, and Regexp#match expects a byte offset for its second argument. However, self.length and self[...] operate on character offsets when UTF-8 is enabled.

This mismatch causes incorrect slicing and breaks gsub on multi-byte UTF-8 strings. To fix this, we should use self.bytesize and self.byteslice to consistently work with byte offsets. When advancing after a zero-width match, we can safely slice the remainder of the string to find the byte size of the next character, ensuring we do not split multi-byte UTF-8 characters.

    pos = 0
    len = self.bytesize
    while pos <= len
      md = pattern.match(self, pos)
      break unless md
      match_start = md.begin(0)
      match_end = md.end(0)
      parts << self.byteslice(pos, match_start - pos)
      parts << block.call(md[0]).to_s
      if match_start == match_end
        rest = self.byteslice(match_end..-1)
        if rest && rest.bytesize > 0
          char = rest[0]
          parts << char
          pos = match_end + char.bytesize
        else
          pos = match_end + 1
        end
      else
        pos = match_end
      end
    end
    parts << self.byteslice(pos..-1)
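The character/byte mismatch described in this review can be seen with core String methods alone. The following is a small sketch rather than code from the PR: next_char_bytesize is a hypothetical helper that mirrors the "slice the remainder and measure its first character" step from the suggestion, and the example assumes a UTF-8 string.

```ruby
# "\u3042" is a 3-byte UTF-8 character, followed by the 1-byte "b".
str = "\u3042b"

char_count = str.length   # 2 (characters)
byte_count = str.bytesize # 4 (bytes)

# A byte offset of 3 points at "b", but character indexing disagrees.
by_char_index = str[3]              # nil: there is no 4th character
by_byte_index = str.byteslice(3, 1) # "b"

# Hypothetical helper: how many bytes to advance from byte_offset so that
# the next position lands on a character boundary.
def next_char_bytesize(str, byte_offset)
  rest = str.byteslice(byte_offset..-1)
  return 1 if rest.nil? || rest.empty? # at or past the end: step one byte
  rest[0].bytesize
end

step_over_multibyte = next_char_bytesize(str, 0) # 3
step_over_ascii = next_char_bytesize(str, 3)     # 1
```

Advancing by the byte size of the next character, rather than by 1, is what keeps a zero-width match from splitting a multibyte character.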

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c784755bdc


chatgpt-codex-connector · 2026-06-15T07:42:06Z

+    while pos <= self.length
+      md = pattern.match(self, pos)


Avoid matching multiline ^ after the final newline

With the new pos <= self.length loop, block gsub asks the regexp engine for another zero-width match after it has copied a trailing newline. For multiline anchors this adds an extra replacement at EOF, e.g. "\n".gsub(/^/m) { "!" } now returns "!\n!" whereas Ruby behavior (and the previous block implementation for this input) is "!\n"; the same affects strings ending in \n such as log/text lines.
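For reference, the Ruby behavior this comment compares against can be checked directly in CRuby. This is only a sketch of the expected results, not of the fix itself (how commit 6d99e8d avoids the extra match is not shown on this page). Note that in CRuby ^ is always a line anchor, so the example omits the m flag used in the comment.

```ruby
# In CRuby, ^ matches at the start of each line, but not at the very end of
# the string after a trailing newline.
only_newline = "\n".gsub(/^/) { "!" }
quoted_lines = "a\nb\n".gsub(/^/) { "> " }
```

Here only_newline is "!\n" and quoted_lines is "> a\n> b\n": no replacement is added after the final newline.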



mruby-regexp: Fix String#gsub to handle zero-width matches correctly

c784755

Changed the String#gsub {} implementation to advance through the original string self using a search position pos, rather than through a sliced-off rest.

nattzn requested a review from matz as a code owner June 15, 2026 07:30

github-actions Bot added the mrbgems label Jun 15, 2026

gemini-code-assist Bot reviewed Jun 15, 2026


chatgpt-codex-connector Bot reviewed Jun 15, 2026


nattzn added 2 commits June 15, 2026 17:25

mruby-regexp: Fix String#gsub block offsets for UTF-8 strings

9e211c5

Add UTF-8 zero-width match coverage to ensure gsub advances by a whole character without splitting multibyte strings.

mruby-regexp: Avoid multiline ^ match after final newline

6d99e8d

matz merged commit 9ffddd4 into mruby:master Jun 15, 2026
19 checks passed

nattzn deleted the fix/gsub-bol-zero-width branch June 15, 2026 17:49

matz merged 3 commits into mruby:master from nattzn:fix/gsub-bol-zero-width
