Save strings internally as WTF-8#184
Open
drizt wants to merge 1 commit into
Contributor
Author
|
This commit allows such strings to be parsed:
{
"valid surrogate pair (😀 U+1F600)": "\uD83D\uDE00",
"lone high surrogate": "\uD800",
"lone low surrogate": "\uDC00",
"high surrogate not followed by low surrogate": "\uD834\u0061",
"low surrogate not preceded by high surrogate": "\u0061\uDD1E",
"reversed surrogate order (low then high)": "\uDC00\uD800",
"two high surrogates in a row": "\uD800\uD801",
"two low surrogates in a row": "\uDC00\uDC01",
"surrogate pair split by space": "\uD83D\u0020\uDE00",
"surrogate halves separated by text": "\uD83Dtest\uDE00",
"high surrogate followed by another escape": "\uD83D\u000A",
"high surrogate at end of string": "ABC\uD800"
}
|
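As a rough cross-check of the cases above, here is a minimal Python sketch (not the code from this PR). It relies on two behaviors of the standard library: `json.loads` combines an escaped high/low pair into one code point and keeps unpaired surrogates as-is, and the `surrogatepass` error handler lets the UTF-8 codec emit the 3-byte form for a lone surrogate. The key names and the helpers `to_wtf8` and `is_valid_utf8_text` are hypothetical and exist only for this illustration.

```python
import json

# A few of the cases from the PR description, with shortened (hypothetical) keys.
# The raw string keeps the \uXXXX escapes intact so the JSON parser sees them.
doc = r'''{
  "pair": "\uD83D\uDE00",
  "lone_high": "\uD800",
  "reversed": "\uDC00\uD800"
}'''

parsed = json.loads(doc)


def to_wtf8(s: str) -> bytes:
    """Encode a str that may hold lone surrogates.

    Assumption: s contains no adjacent high+low surrogate code points
    (json.loads has already merged escaped pairs), so the result is the
    same byte sequence WTF-8 would produce.
    """
    return s.encode("utf-8", "surrogatepass")


def is_valid_utf8_text(s: str) -> bool:
    """Return True if s can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for key, value in parsed.items():
    print(key, is_valid_utf8_text(value), to_wtf8(value).hex(" "))
```

Only the `pair` entry survives a strict UTF-8 encode; the other two need the relaxed handler, which is the kind of data a UTF-8-only test harness cannot round-trip.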
Contributor
Author
|
My commit also fixes #58. |
02431a9 to
2165f27
Compare
RFC 8259 doesn't require strings to be valid Unicode strings. In practice, its grammar allows any \uXXXX escape, including unpaired surrogates, so JSON strings can carry arbitrary sequences of 16-bit code units. This commit removes the limitation that strings must be valid UTF-8.
WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.
WTF-8 strings are not compatible with the current tests. The tests use some Python code that works only with valid UTF-8 strings. The test system needs to be upgraded or replaced with one that has full JSON support.
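The encoding rule described in the commit message can be sketched in a few lines. This is an illustrative Python version, not the implementation from this PR: the function name `wtf8_encode` and its input format (a list of 16-bit code units, as produced by decoding `\uXXXX` escapes) are assumptions made for the example. A high surrogate immediately followed by a low surrogate is merged into one supplementary code point and encoded as 4 bytes; any other surrogate is encoded on its own using the ordinary 3-byte UTF-8 pattern.

```python
def wtf8_encode(units: list[int]) -> bytes:
    """Encode potentially ill-formed UTF-16 code units as WTF-8 bytes."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        cp = units[i]
        i += 1
        # Merge a high surrogate with the low surrogate right after it.
        if 0xD800 <= cp <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            # Lone surrogates (U+D800..U+DFFF) land here, like any BMP code point.
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


# "\uD83D\uDE00" is a valid pair and becomes the 4-byte UTF-8 form of U+1F600.
print(wtf8_encode([0xD83D, 0xDE00]).hex(" "))  # f0 9f 98 80
# "\uD834\u0061" keeps the lone high surrogate as 3 bytes, followed by "a".
print(wtf8_encode([0xD834, 0x0061]).hex(" "))  # ed a0 b4 61
```

Because paired surrogates are always merged before encoding, well-formed input yields plain UTF-8, which is why WTF-8 can be treated as a superset.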
Member
|
Did you use any form of generative AI while authoring these changes or PRs? |
Contributor
Author
|
The code was written with the help of ChatGPT, then edited and tested manually (in my own project). I studied the JSON RFC and the WTF-8 document before applying these changes to my own code. The test was written with ChatGPT. |