fix: cast to unsigned char in ASCII case conversion (1/2)#748
Conversation
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| private: | ||
| // std::tolower/std::toupper require their argument to be representable as | ||
| // unsigned char (or EOF); passing a raw char with a non-ASCII (negative) value is | ||
| // undefined behavior, so cast through unsigned char first. |
There was a problem hiding this comment.
+1, Quoting cppreference:
Like all other functions from <cctype>, the behavior of std::tolower is undefined if the argument's value is neither representable as unsigned char nor equal to EOF. To use these functions safely with plain chars (or signed chars), the argument should first be converted to unsigned char:
char my_tolower(char ch)
{
return static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
}
There was a problem hiding this comment.
Pull request overview
Fixes undefined behavior in ASCII case conversion utilities by ensuring std::tolower/std::toupper aren’t invoked with potentially-negative char values, and adds tests to cover non-ASCII byte behavior relevant to case-insensitive comparisons (Issue #613).
Changes:
- Add
<cctype>and routeToLower,ToUpper, andEqualsIgnoreCasethrough helper functions that cast viaunsigned charbefore callingstd::tolower/std::toupper. - Document ASCII-only intent and non-ASCII (UTF-8 multibyte) pass-through behavior.
- Add tests for non-ASCII byte pass-through and
EqualsIgnoreCasebehavior.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/iceberg/util/string_util.h | Adds safe ASCII case conversion helpers and updates case-insensitive utilities to use them. |
| src/iceberg/test/string_util_test.cc | Adds regression tests covering non-ASCII byte handling and EqualsIgnoreCase. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| static char ToLowerAscii(char c) { | ||
| return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); | ||
| } |
| static char ToUpperAscii(char c) { | ||
| return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); | ||
| } |
| ASSERT_EQ(StringUtils::ToLower("Naïve"), "naïve"); | ||
| ASSERT_EQ(StringUtils::ToUpper("café"), "CAFé"); | ||
| // Pure non-ASCII input is returned verbatim. | ||
| ASSERT_EQ(StringUtils::ToLower("日本語"), "日本語"); | ||
| ASSERT_EQ(StringUtils::ToUpper("日本語"), "日本語"); |
| ASSERT_TRUE(StringUtils::EqualsIgnoreCase("Café", "café")); | ||
| ASSERT_FALSE(StringUtils::EqualsIgnoreCase("café", "cafe")); |
fix: cast to unsigned char in ASCII case conversion
std::tolower/toupperare UB when passed a negativechar, which is what you get for any non-ASCII byte in a signedchar.StringUtils::ToLower,ToUpper, andEqualsIgnoreCaseall did this. Cast through
unsigned char.Behavior for ASCII is unchanged; this doesn't add Unicode case folding, just fixes the UB.
Related: #613