Save strings internally as WTF-8 by drizt · Pull Request #184 · json-parser/json-parser

drizt · 2026-01-05T14:36:51Z

RFC 8259 doesn't require strings to be valid Unicode strings. In practice it allows a string to contain any \uXXXX escape, including unpaired surrogates, so a JSON string can carry arbitrary 16-bit code units. This commit removes the limitation that strings must be valid UTF-8.

WTF-8 (Wobbly Transformation Format − 8-bit) is a superset of UTF-8 that encodes surrogate code points if they are not in a pair. It represents, in a way compatible with UTF-8, text from systems such as JavaScript and Windows that use UTF-16 internally but don’t enforce the well-formedness invariant that surrogates must be paired.
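
To make the encoding rule concrete, here is a minimal sketch in Python of turning UTF-16 code units into WTF-8 bytes. It is an illustration only, not the code in this PR; the function name `wtf8_encode` is hypothetical.

```python
def wtf8_encode(units):
    """Encode a sequence of UTF-16 code units (ints 0..0xFFFF) as WTF-8 bytes."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        cp = units[i]
        i += 1
        # A high surrogate immediately followed by a low surrogate is combined
        # into one supplementary code point, as in ordinary UTF-16 decoding.
        if 0xD800 <= cp <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            # Unpaired surrogates end up here and use the plain 3-byte form,
            # which strict UTF-8 would reject.
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

For example, `wtf8_encode([0xD83D, 0xDE00])` gives the ordinary 4-byte UTF-8 for U+1F600, while `wtf8_encode([0xD800])` gives the 3-byte sequence `ED A0 80`.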

WTF-8 strings are not compatible with the current tests. The tests use some Python code that works only with valid UTF-8 strings. The test system needs to be upgraded or replaced with something that has full JSON support.
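
As an illustration of the kind of failure meant here (I am not quoting the project's actual test code; the helper name `encode_parsed_string` is hypothetical), Python's own `json` module accepts a lone surrogate escape, but the resulting string cannot be encoded as strict UTF-8. The `surrogatepass` error handler produces the same bytes WTF-8 would use for an unpaired surrogate.

```python
import json

def encode_parsed_string(json_text):
    """Parse one JSON string and return (strict_utf8_ok, encoded_bytes)."""
    value = json.loads(json_text)
    try:
        # Strict UTF-8 refuses surrogate code points.
        return True, value.encode("utf-8")
    except UnicodeEncodeError:
        # "surrogatepass" writes the surrogate in the generalized 3-byte form.
        return False, value.encode("utf-8", "surrogatepass")
```

A test harness that compares output as strict UTF-8 would hit the `UnicodeEncodeError` branch for inputs such as `"\uD800"`.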

drizt · 2026-01-05T14:47:25Z

This commit allows strings such as these to be parsed:

{
  "valid surrogate pair (😀 U+1F600)": "\uD83D\uDE00",
  "lone high surrogate": "\uD800",
  "lone low surrogate": "\uDC00",
  "high surrogate not followed by low surrogate": "\uD834\u0061",
  "low surrogate not preceded by high surrogate": "\u0061\uDD1E",
  "reversed surrogate order (low then high)": "\uDC00\uD800",
  "two high surrogates in a row": "\uD800\uD801",
  "two low surrogates in a row": "\uDC00\uDC01",
  "surrogate pair split by space": "\uD83D\u0020\uDE00",
  "surrogate halves separated by text": "\uD83Dtest\uDE00",
  "high surrogate followed by another escape": "\uD83D\u000A",
  "high surrogate at end of string": "ABC\uD800"
}
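
A quick way to see which of these cases leave an unpaired surrogate behind after parsing is sketched below in Python, using a subset of the cases with shortened keys. This relies on Python's standard `json` module, not on json-parser, and the names `CASES`, `has_unpaired_surrogate` and `classify` are hypothetical.

```python
import json

# Raw string, so the \u escapes reach the JSON parser untouched.
CASES = r'''
{
  "valid surrogate pair": "\uD83D\uDE00",
  "lone high surrogate": "\uD800",
  "reversed surrogate order": "\uDC00\uD800",
  "surrogate pair split by space": "\uD83D\u0020\uDE00"
}
'''

def has_unpaired_surrogate(text):
    # After parsing, a correctly paired escape is a single non-surrogate
    # code point, so any surrogate still present is unpaired.
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

def classify(json_text):
    """Map each key to True if its value contains an unpaired surrogate."""
    parsed = json.loads(json_text)
    return {key: has_unpaired_surrogate(value) for key, value in parsed.items()}
```

Only the valid pair parses to well-formed Unicode; every other case needs WTF-8 (or an equivalent) to be stored byte-wise.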

drizt · 2026-01-05T14:59:28Z

My commit also fixes #58.

LB-- · 2026-01-05T21:18:17Z

Did you use any form of generative AI while authoring these changes or PRs?

drizt · 2026-01-05T21:55:53Z

The code was written with the help of ChatGPT, then edited and tested manually (in my own project). I studied the JSON RFC and the WTF-8 document before applying these changes to my own code. The test was written with ChatGPT.

drizt force-pushed the wtf8 branch from 252264b to 4b970f1 Compare January 5, 2026 14:40

drizt force-pushed the wtf8 branch 2 times, most recently from 02431a9 to 2165f27 Compare January 5, 2026 15:12

drizt force-pushed the wtf8 branch from 2165f27 to 995efda Compare January 5, 2026 15:41

drizt mentioned this pull request Jan 11, 2026

Serialization of unicode characters #179

Open

drizt wants to merge 1 commit into json-parser:master from drizt:wtf8

2 participants
