Skip to content

fix: cast to unsigned char in ASCII case conversion (1/2)#748

Open
goel-skd wants to merge 2 commits into
apache:mainfrom
goel-skd:fix-tolower-ub
Open

fix: cast to unsigned char in ASCII case conversion (1/2)#748
goel-skd wants to merge 2 commits into
apache:mainfrom
goel-skd:fix-tolower-ub

Conversation

@goel-skd

Copy link
Copy Markdown
Contributor

fix: cast to unsigned char in ASCII case conversion

std::tolower/toupper are UB when passed a negative char, which is what you get for any non-ASCII byte in a signed char. StringUtils::ToLower, ToUpper, and EqualsIgnoreCase
all did this. Cast through unsigned char.

Behavior for ASCII is unchanged; this doesn't add Unicode case folding, just fixes the UB.

Related: #613

@goel-skd goel-skd changed the title fix: cast to unsigned char in ASCII case conversion fix: cast to unsigned char in ASCII case conversion (1/2) Jun 16, 2026
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
private:
// std::tolower/std::toupper require their argument to be representable as
// unsigned char (or EOF); passing a raw char with a non-ASCII (negative) value is
// undefined behavior, so cast through unsigned char first.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, Quoting cppreference:

Like all other functions from <cctype>, the behavior of std::tolower is undefined if the argument's value is neither representable as unsigned char nor equal to EOF. To use these functions safely with plain chars (or signed chars), the argument should first be converted to unsigned char:

char my_tolower(char ch)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
}

@zhjwpku zhjwpku added the ready to merge This PR has been approved and it is ready to merge. label Jun 16, 2026
@wgtmac wgtmac requested a review from Copilot June 16, 2026 05:08

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes undefined behavior in ASCII case conversion utilities by ensuring std::tolower/std::toupper aren’t invoked with potentially-negative char values, and adds tests to cover non-ASCII byte behavior relevant to case-insensitive comparisons (Issue #613).

Changes:

  • Add <cctype> and route ToLower, ToUpper, and EqualsIgnoreCase through helper functions that cast via unsigned char before calling std::tolower/std::toupper.
  • Document ASCII-only intent and non-ASCII (UTF-8 multibyte) pass-through behavior.
  • Add tests for non-ASCII byte pass-through and EqualsIgnoreCase behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
src/iceberg/util/string_util.h Adds safe ASCII case conversion helpers and updates case-insensitive utilities to use them.
src/iceberg/test/string_util_test.cc Adds regression tests covering non-ASCII byte handling and EqualsIgnoreCase.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +140 to +142
static char ToLowerAscii(char c) {
return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}
Comment on lines +144 to +146
static char ToUpperAscii(char c) {
return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}
Comment on lines +49 to +53
ASSERT_EQ(StringUtils::ToLower("Naïve"), "naïve");
ASSERT_EQ(StringUtils::ToUpper("café"), "CAFé");
// Pure non-ASCII input is returned verbatim.
ASSERT_EQ(StringUtils::ToLower("日本語"), "日本語");
ASSERT_EQ(StringUtils::ToUpper("日本語"), "日本語");
Comment on lines +62 to +63
ASSERT_TRUE(StringUtils::EqualsIgnoreCase("Café", "café"));
ASSERT_FALSE(StringUtils::EqualsIgnoreCase("café", "cafe"));
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready to merge This PR has been approved and it is ready to merge.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants