ISO/IEC JTC1 SC22 WG21 P2314R4
Author: Jens Maurer
Target audience: CWG, LWG
2021-09-21
This changes the specified behavior of the stringizing preprocessor operator [cpp.stringize] as follows:
| C++20 | this paper |
|---|---|
| #define S(x) # x | #define S(x) # x |
| const char * s1 = S(Köppe); // "K\\u00f6ppe" | const char * s1 = S(Köppe); // "Köppe" |
| const char * s2 = S(K\u00f6ppe); // "K\\u00f6ppe" | const char * s2 = S(K\u00f6ppe); // "Köppe" |
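As a minimal sketch of the operator involved (not of the changed behavior itself), the following shows that `#` stringizes the spelling of its argument without macro-expanding it first. Only ASCII spellings are used, because the result for `Köppe` depends on whether an implementation follows C++20 or this paper. The names `XS`, `NAME`, and `stringize_ascii_ok` are hypothetical and introduced only for illustration.

```cpp
#include <cassert>
#include <cstring>

// Stringizing helper, spelled as in the comparison table.
#define S(x) # x
// Hypothetical two-level helper: its argument is macro-expanded before S sees it.
#define XS(x) S(x)
// Hypothetical object-like macro used to show the difference between S and XS.
#define NAME Koeppe

// Returns true if stringizing behaves as expected for pure-ASCII spellings.
inline bool stringize_ascii_ok() {
    return std::strcmp(S(Koeppe), "Koeppe") == 0   // spelling is copied
        && std::strcmp(S(NAME), "NAME") == 0       // operand of # is not expanded
        && std::strcmp(XS(NAME), "Koeppe") == 0;   // expanded first, then stringized
}
```

Calling `stringize_ascii_ok()` from any function is enough to exercise the three cases.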
| Context | Destination |
|---|---|
| asm-declaration | build environment |
| #include "fn" or #include <fn> | file name |
| language linkage | translation |
| operator "" [over.literal] | translation |
| #line directive | diagnostic |
| argument for [[nodiscard]] and [[deprecated]] | diagnostic |
| #error, static_assert | diagnostic |
| __FILE__, __func__ | literal encoding |
| std::type_info::name() | literal encoding |
| character-literal or string-literal appearing elsewhere | literal encoding |
| user-defined-literal | literal encoding |
The destinations have the following meaning:
The paper P2297R0 "Wording improvements for encodings and character sets" by Corentin Jabot has overlap with this paper. The main differences are:
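| SF | F | N | A | SA |
|---|---|---|---|---|
| 2 | 4 | 1 | 0 | 1 |
Present: 9
| SF | F | N | A | SA |
|---|---|---|---|---|
| 3 | 5 | 0 | 0 | 0 |
Present: 9
| SF | F | N | A | SA |
|---|---|---|---|---|
| 5 | 6 | 0 | 0 | 0 |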
sequence of one or more bytes representing
<del>a member of the extended character set of either the source or the execution environment</del>
<ins>the code unit sequence for an encoded character of the execution character set</ins>
...
3. The source file is decomposed into preprocessing tokens (5.4 [lex.pptoken]) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. [ Footnote: ... ] Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is unspecified. As characters from the source file are consumed to form the next preprocessing token (i.e., not being consumed as part of a comment or other forms of whitespace), except when matching a c-char-sequence, s-char-sequence, r-char-sequence, h-char-sequence, or q-char-sequence, universal-character-names are recognized and replaced by the designated element of the translation character set. The process of dividing a source file’s characters into preprocessing tokens is context-dependent. [Example: See the handling of < within a #include preprocessing directive. — end example]
4. Preprocessing directives are executed, macro invocations are
expanded, and _Pragma unary operator expressions are executed. If a
character sequence that matches the syntax of a
universal-character-name is produced by token concatenation
(15.6.3 [cpp.concat]), the behavior is undefined. A #include
preprocessing directive causes the named header or source file to be
processed from phase 1 through phase 4, recursively. All
preprocessing directives are then deleted.
5. For a sequence of two or more
adjacent string-literal tokens, a
common encoding-prefix is determined as specified in 5.13.5
[lex.string]. Each such string-literal token is then
considered to have that common encoding-prefix.
Each basic-c-char, basic-s-char, and r-char
in a character-literal or a string-literal, as well
as each escape-sequence and universal-character-name
in a character-literal or a non-raw string literal, is
encoded in the literal’s associated character encoding as specified in
5.13.3 [lex.ccon] and 5.13.5 [lex.string].
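The common encoding-prefix rule for adjacent string-literal tokens can be observed through the type of the concatenated literal. The sketch below assumes a C++11 or later compiler; `literal_length` is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// One literal carries the L prefix and the other has none, so the common
// encoding-prefix is L and the concatenation is a wide string literal.
static_assert(std::is_same<decltype(L"ab" "cd"), const wchar_t (&)[5]>::value,
              "four wide characters plus the terminator");

// The same holds when only the second literal carries a prefix.
static_assert(std::is_same<decltype("ab" u"cd"), const char16_t (&)[5]>::value,
              "four char16_t code units plus the terminator");

// Hypothetical helper: number of array elements in a string literal,
// not counting the terminating null.
template <class CharT, std::size_t N>
constexpr std::size_t literal_length(const CharT (&)[N]) {
    return N - 1;
}
```

For example, `literal_length(L"ab" "cd")` yields 4 regardless of which of the two literals carries the prefix.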
2 The basic character set is a subset of the translation character set, consisting of 96 characters as specified in table X. [ Note: Unicode short names are given only as a means to identifying the character; the numerical value has no other meaning in this context. -- end note ]
| U+0009 | CHARACTER TABULATION | |
| U+000B | LINE TABULATION | |
| U+000C | FORM FEED (FF) | |
| U+0020 | SPACE | |
| U+000A | LINE FEED (LF) | new-line |
| U+0021 | EXCLAMATION MARK | ! |
| U+0022 | QUOTATION MARK | " |
| U+0023 | NUMBER SIGN | # |
| U+0025 | PERCENT SIGN | % |
| U+0026 | AMPERSAND | & |
| U+0027 | APOSTROPHE | ' |
| U+0028 | LEFT PARENTHESIS | ( |
| U+0029 | RIGHT PARENTHESIS | ) |
| U+002A | ASTERISK | * |
| U+002B | PLUS SIGN | + |
| U+002C | COMMA | , |
| U+002D | HYPHEN-MINUS | - |
| U+002E | FULL STOP | . |
| U+002F | SOLIDUS | / |
| U+0030 .. U+0039 | DIGIT ZERO .. NINE | 0 1 2 3 4 5 6 7 8 9 |
| U+003A | COLON | : |
| U+003B | SEMICOLON | ; |
| U+003C | LESS-THAN SIGN | < |
| U+003D | EQUALS SIGN | = |
| U+003E | GREATER-THAN SIGN | > |
| U+003F | QUESTION MARK | ? |
| U+0041 .. U+005A | LATIN CAPITAL LETTER A .. Z | A B C D E F G H I J K L M N O P Q R S T U V W X Y Z |
| U+005B | LEFT SQUARE BRACKET | [ |
| U+005C | REVERSE SOLIDUS | \ |
| U+005D | RIGHT SQUARE BRACKET | ] |
| U+005E | CIRCUMFLEX ACCENT | ^ |
| U+005F | LOW LINE | _ |
| U+0061 .. U+007A | LATIN SMALL LETTER A .. Z | a b c d e f g h i j k l m n o p q r s t u v w x y z |
| U+007B | LEFT CURLY BRACKET | { |
| U+007C | VERTICAL LINE | \| |
| U+007D | RIGHT CURLY BRACKET | } |
| U+007E | TILDE | ~ |
The universal-character-name construct provides a way to name other characters.
hex-quad:
hexadecimal-digit hexadecimal-digit hexadecimal-digit hexadecimal-digit
universal-character-name:
\u hex-quad
\U hex-quad hex-quad
A universal-character-name designates the character in <del>ISO/IEC 10646 (if any)</del> <ins>the translation character set</ins> whose <del>Unicode code point</del> <ins>UCS scalar value</ins> is the hexadecimal number represented by the sequence of hexadecimal-digits in the universal-character-name. The program is ill-formed if that number is not a <del>Unicode code point or if it is a surrogate code point</del> <ins>UCS scalar value</ins>. <del>Noncharacter code points and reserved code points are considered to designate separate characters distinct from any ISO/IEC 10646 character.</del> If a universal-character-name outside the c-char-sequence, s-char-sequence, or r-char-sequence of a character-literal or string-literal (in either case, including within a user-defined-literal) corresponds to a control character or to a character in the basic <del>source</del> character set, the program is ill-formed. [ <del>Footnote:</del> <ins>Note:</ins> A sequence of characters resembling a universal-character-name in an r-char-sequence (5.13.5) does not form a universal-character-name. ] <del>[Note: ISO/IEC 10646 code points are integers in the range [0, 10FFFF] (hexadecimal). A surrogate code point is a value in the range [D800, DFFF] (hexadecimal). A control character is a character whose code point is in either of the ranges [0, 1F] or [7F, 9F] (hexadecimal). — end note]</del>
The basic literal character set consists of all characters of the basic character set, plus the control characters specified in table Y.
| U+0000 | NULL |
| U+0007 | BELL |
| U+0008 | BACKSPACE |
| U+000D | CARRIAGE RETURN (CR) |
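The rule that a universal-character-name designates the character with the given UCS scalar value can be sketched with UTF-16 and UTF-32 character literals, whose values are fixed by the encoding form. The predicate `is_ucs_scalar_value` and the constant `o_umlaut` are hypothetical names introduced for illustration; the ranges come from the note on code points and surrogates in this section.

```cpp
#include <cassert>

// Hypothetical predicate: a UCS scalar value is any code point in
// [0, 10FFFF] that is not a surrogate code point in [D800, DFFF].
constexpr bool is_ucs_scalar_value(unsigned long v) {
    return v <= 0x10FFFFul && !(v >= 0xD800ul && v <= 0xDFFFul);
}

// \u takes one hex-quad, \U takes two; both name the scalar value directly.
static_assert(U'\u00f6' == 0xF6, "U+00F6 via \\u hex-quad");
static_assert(U'\U0001F600' == 0x1F600, "U+1F600 via \\U hex-quad hex-quad");
static_assert(u'\u00f6' == 0xF6, "same character as a UTF-16 code unit");

// Hypothetical constant used by the checks in the usage note.
constexpr char32_t o_umlaut = U'\u00f6';
```

With these definitions, `is_ucs_scalar_value(o_umlaut)` is true, while `is_ucs_scalar_value(0xD800)` and `is_ucs_scalar_value(0x110000)` are false.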
A code unit is an integer value of character type (6.8.1 [basic.fundamental]). Characters in a character-literal other than a multicharacter or non-encodable character literal or in a string-literal are encoded as a sequence of one or more code units, as determined by the encoding-prefix ([lex.ccon], [lex.string]); this is termed the respective literal encoding. The ordinary literal encoding is the encoding applied to an ordinary character or string literal. The wide literal encoding is the encoding applied to a wide character or string literal.
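To illustrate that the encoding-prefix determines how many code units a character occupies, the sketch below counts code units for UTF-8, UTF-16, and UTF-32 literals, whose encodings are fixed by ISO/IEC 10646. The ordinary and wide literal encodings are implementation-defined and are deliberately left out. `code_units` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null code unit.
template <class CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N]) {
    return N - 1;
}

// U+00F6 needs two code units in UTF-8 but one in UTF-16 and UTF-32.
static_assert(code_units(u8"\u00f6") == 2, "UTF-8: two code units");
static_assert(code_units(u"\u00f6") == 1, "UTF-16: one code unit");
static_assert(code_units(U"\u00f6") == 1, "UTF-32: one code unit");

// U+1F600 lies outside the BMP: a surrogate pair in UTF-16.
static_assert(code_units(u"\U0001F600") == 2, "UTF-16: two code units");
static_assert(code_units(U"\U0001F600") == 1, "UTF-32: one code unit");
```

The helper works for both `char` and `char8_t` based `u8` literals, so the sketch should compile unchanged before and after C++20.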
A literal encoding or a locale-specific encoding of one of the execution character sets (16.3.3.3.5.1 [character.seq]) encodes each element of the basic literal character set as a single code unit with non-negative value, distinct from the code unit for any other such element. [ Note: A character not in the basic literal character set can be encoded with more than one code unit; the value of such a code unit can be the same as that of a code unit for an element of the basic literal character set. -- end note ]. The U+0000 NULL character is encoded as the value 0. No other element of the translation character set is encoded with a code unit of value 0. The code unit value of each decimal digit character after the digit 0 (U+0030) shall be one greater than the value of the previous. The ordinary and wide literal encodings are otherwise implementation-defined. For a UTF-8, UTF-16, or UTF-32 literal, the UCS scalar value corresponding to each character of the translation character set is encoded as specified in ISO/IEC 10646 for the respective UCS encoding form.
<del>The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose value is 0. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are implementation-defined supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets and the sets of additional members are locale-specific.</del>
Change the grammar in 5.4 [lex.pptoken] paragraph 1:
preprocessing-token:
header-name
import-keyword
module-keyword
export-keyword
identifier
pp-number
character-literal
user-defined-character-literal
string-literal
user-defined-string-literal
preprocessing-op-or-punc
<del>each universal-character-name that cannot be one of the above</del>
each non-whitespace character that cannot be one of the above
Change in 5.4 [lex.pptoken] paragraph 2:
A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6. In this document, glyphs are used to identify elements of the basic character set ([lex.charset]). The categories of preprocessing token are: header names, placeholder tokens produced by preprocessing import and module directives (import-keyword, module-keyword, and export-keyword), identifiers, preprocessing numbers, character literals (including user-defined character literals), string literals (including user-defined string literals), preprocessing operators and punctuators, and single <del>universal-character-names and</del> non-whitespace characters that do not lexically match the other preprocessing token categories. <del>If a single universal-character-name does not match any of the other preprocessing token categories, the program is ill-formed.</del> If a <del>' or a "</del> <ins>U+0027 APOSTROPHE or a U+0022 QUOTATION MARK</ins> character matches the last category, the behavior is undefined. If any character not in the basic character set matches the last category, the program is ill-formed. Preprocessing tokens can be separated by whitespace; this consists of comments (5.7), or whitespace characters (<del>space, horizontal tab</del> <ins>U+0020 SPACE, U+0009 CHARACTER TABULATION</ins>, new-line, <del>vertical tab, and form-feed</del> <ins>U+000B LINE TABULATION, and U+000C FORM FEED</ins>), or both. ...
Change in 5.4 [lex.pptoken] paragraph 3 bullet 1:
If the next character begins a sequence of characters that could be the prefix and initial double quote of a raw string literal, such as R", the next preprocessing token shall be a raw string literal. Between the initial and final double quote characters of the raw string, any transformations performed in
h-char:
any member of the <del>source</del> <ins>translation</ins> character set except new-line and <del>></del> <ins>U+003E GREATER-THAN SIGN</ins>
...
q-char:
any member of the <del>source</del> <ins>translation</ins> character set except new-line and <del>"</del> <ins>U+0022 QUOTATION MARK</ins>
Change in 5.10 [lex.name]:
identifier-start:
nondigit
universal-character-name
an element of the translation character set of class XID_Start
identifier-continue:
digit
nondigit
universal-character-name
an element of the translation character set of class XID_Continue
Change in 5.13.3 [lex.ccon] before paragraph 1:
basic-c-char:
any member of the <del>basic source</del> <ins>translation</ins> character set
except the <del>single-quote ’, backslash \</del> <ins>U+0027 APOSTROPHE, U+005C REVERSE SOLIDUS</ins>, or new-line character
...
conditional-escape-sequence-char:
any member of the basic <del>source</del> character set that is not an octal-digit, a simple-escape-sequence-char, or
the characters u, U, or x
Change in 5.13.3 [lex.ccon] paragraph 2:
[Note 1 : The associated character encoding for ordinary and wide character literals determines encodability, but does not determine the value of non-encodable ordinary or wide character literals or ordinary or wide multicharacter literals. The examples in Table 9 for non-encodable ordinary and wide character literals assume that the specified character lacks representation in the <del>execution character set</del> <ins>ordinary literal encoding</ins> or <del>execution wide-character set</del> <ins>wide literal encoding</ins>, respectively, or that encoding it would require more than one code unit. — end note]
Change in 5.13.3 [lex.ccon] table tab:lex.ccon.literal:
| Encoding prefix | ... | Associated character encoding |
|---|---|---|
| none | ... | <del>encoding of the execution character set</del> <ins>ordinary literal encoding</ins> |
| L | ... | <del>encoding of the execution wide-character set</del> <ins>wide literal encoding</ins> |
Replace 5.13.3 [lex.ccon] table tab:lex.ccon.esc:
The character specified by a simple-escape-sequence is specified in Table 10.
| character | | simple-escape-sequence |
|---|---|---|
| U+000A | LINE FEED (LF) | \n |
| U+0009 | CHARACTER TABULATION | \t |
| U+000B | LINE TABULATION | \v |
| U+0008 | BACKSPACE | \b |
| U+000D | CARRIAGE RETURN (CR) | \r |
| U+000C | FORM FEED (FF) | \f |
| U+0007 | BELL | \a |
| U+005C | REVERSE SOLIDUS | \\ |
| U+003F | QUESTION MARK | \? |
| U+0027 | APOSTROPHE | \' |
| U+0022 | QUOTATION MARK | \" |
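The table can be checked mechanically with UTF-32 character literals, because a UTF-32 literal's value is the UCS scalar value of the designated character. This is a sketch only; `EscapeRow`, `kSimpleEscapes`, and `escape_table_matches` are hypothetical names, and the ordinary literal encoding is intentionally not tested since it is implementation-defined.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical row type: the code point from the table and the value of the
// corresponding simple-escape-sequence in a UTF-32 character literal.
struct EscapeRow {
    char32_t code_point;
    char32_t escaped;
};

// One entry per table row, in table order.
constexpr EscapeRow kSimpleEscapes[] = {
    {0x000A, U'\n'}, {0x0009, U'\t'}, {0x000B, U'\v'}, {0x0008, U'\b'},
    {0x000D, U'\r'}, {0x000C, U'\f'}, {0x0007, U'\a'}, {0x005C, U'\\'},
    {0x003F, U'\?'}, {0x0027, U'\''}, {0x0022, U'\"'},
};

// Returns true if every escape sequence designates the listed code point.
inline bool escape_table_matches() {
    for (const EscapeRow& row : kSimpleEscapes) {
        if (row.code_point != row.escaped) return false;
    }
    return true;
}
```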
basic-s-char:
any member of the <del>basic source</del> <ins>translation</ins> character set
except the <del>double-quote ", backslash \</del> <ins>U+0022 QUOTATION MARK, U+005C REVERSE SOLIDUS</ins>, or new-line character
...
r-char:
any member of the <del>source</del> <ins>translation</ins> character set, except a <del>right parenthesis )</del> <ins>U+0029 RIGHT PARENTHESIS</ins> followed by
the initial d-char-sequence (which may be empty) followed by a <del>double quote "</del> <ins>U+0022 QUOTATION MARK</ins>.
...
d-char:
any member of the basic <del>source</del> character set except:
<del>space, the left parenthesis (, the right parenthesis ), the backslash \, and the control characters
representing horizontal tab, vertical tab, form feed</del>
<ins>U+0020 SPACE, U+0028 LEFT PARENTHESIS, U+0029 RIGHT PARENTHESIS, U+005C REVERSE SOLIDUS,
U+0009 CHARACTER TABULATION, U+000B LINE TABULATION, U+000C FORM FEED (FF)</ins>, and new-line
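The r-char and d-char rules above can be illustrated with two raw string literals: one with a one-character d-char-sequence and one with an empty d-char-sequence. This is a minimal sketch; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstring>

// The d-char-sequence is "x", so the literal ends only at )x" and the
// right parenthesis followed by a quotation mark inside the body is kept.
inline const char* raw_with_delimiter() {
    return R"x(a)"b)x";
}

// With an empty d-char-sequence the body is still taken verbatim:
// the backslash and the letter n are two separate characters.
inline const char* raw_without_delimiter() {
    return R"(a\nb)";
}
```

The first function yields the four characters `a`, `)`, `"`, `b`; the second yields `a`, `\`, `n`, `b`.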
Change in 5.13.5 [lex.string] table tab:lex.string.literal:
| Encoding prefix | ... | Associated character encoding |
|---|---|---|
| none | ... | <del>encoding of the execution character set</del> <ins>ordinary literal encoding</ins> |
| L | ... | <del>encoding of the execution wide-character set</del> <ins>wide literal encoding</ins> |
Change in 5.13.5 [lex.string] paragraphs 7 and 8:
- 7 -
Table 13 has some examples of valid concatenations.
- 8 -
In translation phase 6 (5.2 [lex.phases]),
adjacent string-literals are concatenated.
The lexical structure and grouping of the contents of the individual
string-literals is retained.
Characters in concatenated strings are kept distinct.
[Example:
"\xA" "B"
R"(\u00)" "41" represents six characters, starting with a backslash and ending with the digit
1 (and not the single character "A" specified by
a universal-character-name).
Table 13 has some examples of valid concatenations. — end example]
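Both concatenations from the example can be checked at compile time. The sketch below only measures array sizes and inspects individual characters, so it does not depend on the ordinary literal encoding beyond the basic literal character set; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// "\xA" and "B" are kept distinct: two characters, not one '\xAB'.
static_assert(sizeof("\xA" "B") == 3, "two characters plus the terminating null");

// The raw part contributes \ u 0 0 and the ordinary part contributes 4 1.
static_assert(sizeof(R"(\u00)" "41") == 7, "six characters plus the terminating null");

// Hypothetical accessors so that individual characters can be inspected.
inline const char* hex_then_letter() { return "\xA" "B"; }
inline const char* raw_then_digits() { return R"(\u00)" "41"; }
```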
L,When encoding a stateful character encoding, ...
[ Note: The sequence c1 c2 ...ck can only contain characters from the basic <del>source</del> character set. — end note]
Change in 5.13.8 [lex.ext] paragraph 4:
[ Note: The sequence c1 c2 ...ck can only contain characters from the basic <del>source</del> character set. — end note]
Change in 6.7.1 [intro.memory] paragraph 1:
The fundamental storage unit in the memory model is the byte. A byte is at least large enough to contain <del>any member</del> <ins>the ordinary literal encoding of any element</ins> of the basic <del>execution</del> <ins>literal</ins> character set (5.3) and the eight-bit code units of the Unicode UTF-8 encoding form and is composed of a contiguous sequence of bits, [ Footnote: ... ] the number of which is implementation-defined.
Change in 6.8.2 [basic.fundamental] paragraph 7:
Type char is a distinct type that has an implementation-defined choice of “signed char” or “unsigned char” as its underlying type. <del>The values of type char can represent distinct codes for all members of the implementation’s basic character set.</del> ...
Editing note: The strike-out above is already stated in the definition of "byte", above. If desired, we can add a note that a char takes exactly one byte.
No change in 6.8.2 [basic.fundamental] paragraph 8:
Type wchar_t is a distinct type that has an implementation-defined signed or unsigned integer type as its underlying type. The values of type wchar_t can represent distinct codes for all members of the largest extended character set specified among the supported locales (28.3.1).
Change in 6.8.2 [basic.fundamental] paragraph 11:
<ins>The types char, wchar_t, char8_t, char16_t, char32_t are collectively called character types. The character types,</ins> <del>Types</del> bool, <del>char, wchar_t, char8_t, char16_t, char32_t,</del> and the signed and unsigned integer types are collectively called integral types. A synonym for integral type is integer type. [Note: Enumerations (9.7.1) are not integral; however, unscoped enumerations can be promoted to integral types as specified in 7.3.6. — end note]
Change in 7.5.1 [expr.prim.literal] paragraph 1:
A literal is a primary expression. The type of a literal is determined based on its form as specified in 5.13 [lex.literal]. A string-literal is an lvalue designating a corresponding string literal object ([lex.string]), a user-defined-literal has the same value category as the corresponding operator call expression described in 5.13.8 [lex.ext], and any other literal is a prvalue.
Change in 15.2 [cpp.cond] paragraph 12:
The resulting tokens comprise the controlling constant expression which is evaluated according to the rules of 7.7 using arithmetic that has at least the ranges specified in 17.3. For the purposes of this token conversion and evaluation all signed and unsigned integer types act as if they have the same representation as, respectively, intmax_t or uintmax_t (17.4). [Note: ... -- end note] This includes interpreting character-literals, which may involve <del>converting escape sequences into execution character set members</del> <ins>interpreting escape-sequences and universal-character-names (5.13.3 [lex.ccon])</ins>. Whether the numeric value for these character-literals matches the value obtained when an identical character-literal occurs in an expression (other than within a #if or #elif directive) is implementation-defined. [Note: ... -- end note] Also, whether a single-character character-literal may have a negative value is implementation-defined. Each subexpression with type bool is subjected to integral promotion before processing continues.
Change in 15.6.3 [cpp.concat] paragraph 3:
For both object-like and function-like macro invocations, before the replacement list is reexamined for more macro names to replace, each instance of a ## preprocessing token in the replacement list (not from an argument) is deleted and the preceding preprocessing token is concatenated with the following preprocessing token. Placemarker preprocessing tokens are handled specially: concatenation of two placemarkers results in a single placemarker preprocessing token, and concatenation of a placemarker with a non-placemarker preprocessing token results in the non-placemarker preprocessing token. If the result begins with a sequence matching the syntax of universal-character-name, the behavior is undefined. [ Note: This determination does not consider the replacement of universal-character-names in translation phase 3 ([lex.phases]). ] If the result is not a valid preprocessing token, the behavior is undefined. The resulting token is available for further macro replacement. The order of evaluation of ## operators is unspecified.
Change in 16.3.3.3.5.1 [character.seq] paragraph 1:
The C standard library makes widespread use of characters and character sequences that follow a few uniform conventions:
- Properties specified as locale-specific may change during program execution by a call to setlocale(int, const char*) (28.5.1 [clocale.syn]), or by a change to a locale object, as described in 28.3 [locales] and Clause 29 [input.output].
- The execution character set and the execution wide-character set are supersets of the basic literal character set (5.3 [lex.charset]). The encodings of the execution character sets and the sets of additional elements (if any) are locale-specific. [ Note: The encodings of the execution character sets can be unrelated to any literal encoding. -- end note ]
- A letter is any of the 26 lowercase or 26 uppercase letters in the basic <del>execution</del> character set.
- The decimal-point character is the locale-specific (single-byte) character used by functions that convert between a (single-byte) character sequence and a value of one of the floating-point types. It is used in the character sequence to denote the beginning of a fractional part. It is represented in Clause 17 through Clause 32 and Annex D by a period, ’.’, which is also its value in the "C" locale<del>, but may change during program execution by a call to setlocale(int, const char*), [ Footnote: ... ] or by a change to a locale object, as described in 28.3 and Clause 29</del>.
Change in 16.3.3.3.5.2 [multibyte.strings] paragraph 1:
A null-terminated multibyte string, or ntmbs, is an ntbs that constitutes a sequence of valid multibyte characters, beginning and ending in the initial shift state. [ Footnote: An NTBS that contains characters only from the basic <del>execution</del> <ins>literal</ins> character set is also an NTMBS. Each multibyte character then consists of a single byte. ]
Change in 27.13 [time.parse] table [tab:time.parse.spec]:
%Z The time zone abbreviation or name. A single word is parsed. This word can only contain characters from the basic <del>source</del> character set (5.3 [lex.charset]) that are alphanumeric, or one of ’_’, ’/’, ’-’, or ’+’.
Change in 28.4.2.2.3 [locale.ctype.virtuals] paragraphs 11 and 13:
The only characters for which unique transformations are required are those in the basic
[...]
For any character c in the basic do_widen(do_narrow(c, 0)) == c
Change in C.2.3 [diff.cpp14.lex]:
Affected subclause: 5.2
Change: Removal of trigraph support as a required feature.
Rationale: Prevents accidental uses of trigraphs in non-raw string literals and comments.
Effect on original feature: Valid C++ 2014 code that uses trigraphs may not be valid or may have different semantics in this revision of C++. Implementations may choose to translate trigraphs as specified in C++ 2014 if they appear outside of a raw string literal, as part of the implementation-defined mapping from physical source file characters to the basic <del>source</del> character set.