Very Rough Algorithm Outline
Normalization Algorithm
Transform a string into a normalized string. Leading and trailing whitespace (\s+) should be stripped by the caller beforehand. Errors may be thrown at several points during the computation.
Example Input: 💩Raffy.eth
-
Tokenize the string into a list of labels, where each label is a list of Emoji and Text tokens and each token contains codepoints. The mapped character R (52) becomes r (72):
[ [Emoji(1F4A9), Text(72, 61, 66, 66, 79)], [Text(65, 74, 68)] ]
-
Decode and Validate each label (Token[] → Token[]):
[Emoji(1F4A9), Text(72, 61, 66, 66, 79)] → no change
[Text(65, 74, 68)] → no change
-
Flatten each label into a list of codepoints, join the labels with $PRIMARY_STOP, and interpret the result as a string:
[1F4A9 72 61 66 66 79] + 2E + [65 74 68]
-
[CheckBidi] If the resulting string is a Bidi domain name (any token contains a codepoint of Bidi class R, AL, or AN) and the Textual Content of any label fails RFC5893, throw an error. (Specification, Code)
Example Output: 💩raffy.eth
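The flatten step and the Bidi trigger from the outline above can be sketched as follows. This is a minimal illustration, not the library's code: the function names `flatten` and `is_bidi_domain` are hypothetical, labels are assumed to be already-validated lists of codepoints, and only the "is this a Bidi domain name" test is shown, not the RFC5893 rules themselves.

```python
import unicodedata

PRIMARY_STOP = 0x2E  # "." as given by $PRIMARY_STOP


def flatten(labels: list[list[int]]) -> str:
    """Join per-label codepoint lists with PRIMARY_STOP and build a string."""
    out: list[int] = []
    for index, label in enumerate(labels):
        if index > 0:
            out.append(PRIMARY_STOP)
        out.extend(label)
    return "".join(chr(cp) for cp in out)


def is_bidi_domain(name: str) -> bool:
    """True if any codepoint has Bidi class R, AL, or AN."""
    return any(unicodedata.bidirectional(ch) in {"R", "AL", "AN"} for ch in name)


# Codepoints of the example output: [1F4A9 72 61 66 66 79] + 2E + [65 74 68]
labels = [[0x1F4A9, 0x72, 0x61, 0x66, 0x66, 0x79], [0x65, 0x74, 0x68]]
print(flatten(labels))                  # 💩raffy.eth
print(is_bidi_domain(flatten(labels)))  # False
```

Only names for which `is_bidi_domain` returns true would need the additional per-label RFC5893 check.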
Tokenize
Return a list of labels by repeatedly processing and consuming a prefix of the string until the string is empty. Emitted tokens are appended to the last label. The list of labels initially contains a single empty label: [[]]. Labels can be empty.
Essentially, UTS-51 parsing is done first, followed by UTS-46 with IDNA 2008 (std3=true, transitional=false, +NV8, +XV8, and the exceptions listed below).
Apply the following cases, starting from the top after each match:
-
If the prefix is an exact match of a whitelisted emoji sequence, pop the match, and emit a verbatim Emoji token (one that contains the matching codepoints).
-
If the prefix is an emoji flag sequence, pop the match, and emit a verbatim Emoji token.
-
Determine if the prefix is an emoji keycap sequence:
Caution: Under EIP-137 (commonly implemented as UTS-46 with IDNA 2003), # FE0F 20E3 normalizes to # 20E3 (as FE0F is ignored), but normalizing # 20E3 again is an error. A REQ/OPT distinction preserves idempotency and makes keycap handling future-proof. The malformed 0-9 keycaps are effectively grandfathered.
$KEYCAP_OPT = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
$KEYCAP_REQ = [#, *]
  - If the prefix is $KEYCAP_REQ FE0F 20E3, where $KEYCAP_REQ is a codepoint, pop the match, and emit a verbatim Emoji token.
  - If the prefix is $KEYCAP_OPT FE0F? 20E3, where FE0F is optional and $KEYCAP_OPT is a codepoint, pop the match, and emit Emoji($KEYCAP_OPT, 20E3).
-
Determine if the prefix is a ZWJ sequence or a single ModPre. ZWJ sequences are whitelisted as sequences of ModPre joined by $ZWJ. Consume and emit the longest match as a single Emoji token.
  - Determine if the prefix is a ModPre:
    An emoji sequence is emoji_zwj_sequence | emoji_tag_sequence | emoji_core_sequence.
    An emoji core sequence is emoji_presentation_sequence | emoji_keycap_sequence | emoji_modifier_sequence | emoji_flag_sequence.
    Since flags and keycaps have already been processed, only modifier and presentation sequences remain.
    - If the prefix is an emoji_modifier_sequence, pop the match, and emit a verbatim Emoji token.
    - Determine if the prefix is an emoji_presentation_sequence:
      Caution: Backwards compatibility with EIP-137 is maintained by permitting some emoji to optionally match FE0F but tokenize without it (DROP). ZWJ and future emoji require (and always include) FE0F (REQ).
      - If the prefix is $STYLE_REQ FE0F and $STYLE_REQ is a codepoint, pop the match, and emit a verbatim Emoji token.
      - If the prefix is $STYLE_DROP FE0F? where FE0F is optional and $STYLE_DROP is a codepoint, pop the match, and emit Emoji($STYLE_DROP).
-
If the prefix is $STOP, pop the match, append a new label, and continue.
-
If the prefix is $IGNORED, pop the match, and continue.
-
If the prefix is $VALID, pop the match, and emit a verbatim Text token.
-
If the prefix is $MAPPED, pop the match, and emit Text(map[$MAPPED]) where map :: codepoint → codepoint[].
-
Otherwise, the prefix character is invalid. Throw an error.
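The case list above can be sketched as a small prefix-consuming loop. This is a toy illustration under stated assumptions: the tables `VALID`, `MAPPED`, `IGNORED`, and `EMOJI` are tiny hypothetical stand-ins for the real derived data, tokens are represented as `(kind, codepoints)` tuples, adjacent Text output is merged into one token for readability, and the flag, ZWJ, modifier, and presentation cases are omitted.

```python
STOP = {0x2E}
IGNORED = {0xAD}                                        # toy subset: soft hyphen
VALID = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D}
MAPPED = {cp: [cp + 0x20] for cp in range(0x41, 0x5B)}  # toy subset: A-Z -> a-z
EMOJI = [(0x1F4A9,)]                                    # toy whitelist
KEYCAP_REQ = {0x23, 0x2A}                               # '#', '*'
KEYCAP_OPT = set(range(0x30, 0x3A))                     # '0'..'9'
FE0F, KEYCAP = 0xFE0F, 0x20E3


def match_emoji(cps, i):
    """Return (emitted codepoints, consumed length) for an emoji prefix, else None."""
    for seq in EMOJI:
        if tuple(cps[i:i + len(seq)]) == seq:
            return list(seq), len(seq)
    head = cps[i]
    if head in KEYCAP_REQ and cps[i + 1:i + 3] == [FE0F, KEYCAP]:
        return [head, FE0F, KEYCAP], 3       # FE0F required and kept
    if head in KEYCAP_OPT:
        if cps[i + 1:i + 3] == [FE0F, KEYCAP]:
            return [head, KEYCAP], 3         # FE0F optional and dropped
        if cps[i + 1:i + 2] == [KEYCAP]:
            return [head, KEYCAP], 2
    return None


def tokenize(s):
    cps = [ord(ch) for ch in s]
    labels = [[]]                            # starts with a single empty label
    i = 0
    while i < len(cps):
        match = match_emoji(cps, i)
        if match:
            labels[-1].append(("emoji", match[0]))
            i += match[1]
            continue
        cp = cps[i]
        i += 1
        if cp in STOP:
            labels.append([])
        elif cp in IGNORED:
            continue
        elif cp in VALID or cp in MAPPED:
            out = [cp] if cp in VALID else MAPPED[cp]
            label = labels[-1]
            if label and label[-1][0] == "text":
                label[-1][1].extend(out)     # merge into the previous Text token
            else:
                label.append(("text", list(out)))
        else:
            raise ValueError(f"invalid codepoint {cp:X}")
    return labels
```

With these toy tables, `tokenize("💩Raffy.eth")` yields an Emoji token followed by the lowercased Text token in the first label and a single Text token in the second, while `tokenize("#\u20E3")` raises because the REQ keycap is missing FE0F.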
Decode and Validate
Transform a label (list of Emoji/Text tokens) into a valid label.
- Flatten the tokens into a list of codepoints, where Emoji token codepoints are used verbatim and Text token codepoints are converted using Normalization Form C.
- If the resulting sequence starts with 78 6E $HYPHEN $HYPHEN ("xn--"):
  - Decode the remainder of the sequence as Punycode (RFC3492). (Specification, Code)
  - If decoding failed, throw an error.
  - Tokenize the decoded sequence where $STOP, $IGNORED, and $MAPPED are empty sets.
  - If tokenization failed, throw an error.
  - Flatten the tokens to a list of codepoints as above.
  - If the flattened sequence doesn't exactly match the decoded sequence, throw an error.
  - Replace the initial tokens and flattened sequence with the decoded values.
- If the sequence is empty, return an empty list.
- If the sequence has $HYPHEN at the 3rd and 4th codepoints, throw an error.
- If the sequence starts with $HYPHEN, throw an error.
- If the sequence ends with $HYPHEN, throw an error.
- If the sequence starts with $COMBINING_MARK, throw an error.
- [ContextJ/ContextO] If the Textual Content fails RFC5892, throw an error. (Specification, Code)
- Return the tokens.
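The Punycode decode and the structural checks from the list above might look like the following sketch. It assumes the label has already been flattened to a list of codepoints; the function names `decode_punycode` and `check_label` are hypothetical, Python's built-in `punycode` codec stands in for an RFC3492 decoder, and the NFC conversion, re-tokenization round trip, and ContextJ/ContextO checks are left out.

```python
import unicodedata

HYPHEN = 0x2D


def decode_punycode(cps):
    """Decode the codepoints that follow "xn--" using the built-in punycode codec."""
    try:
        return [ord(ch) for ch in bytes(cps).decode("punycode")]
    except ValueError as err:  # non-ASCII input or malformed Punycode
        raise ValueError("punycode decoding failed") from err


def check_label(cps):
    """Raise ValueError if a flattened label violates the hyphen or combining-mark rules."""
    if not cps:
        return  # an empty label has nothing to check
    if len(cps) >= 4 and cps[2] == HYPHEN and cps[3] == HYPHEN:
        raise ValueError("hyphen at 3rd and 4th codepoints")
    if cps[0] == HYPHEN:
        raise ValueError("leading hyphen")
    if cps[-1] == HYPHEN:
        raise ValueError("trailing hyphen")
    if unicodedata.category(chr(cps[0])).startswith("M"):
        raise ValueError("leading combining mark")


decoded = decode_punycode([ord(ch) for ch in "bcher-kva"])
check_label(decoded)
print("".join(chr(cp) for cp in decoded))  # bücher
```

Running the hyphen checks on the decoded sequence, as the list orders them, means an "xn--" label is judged by what it decodes to rather than by its ASCII form.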
Textual Content of a Label
Flatten the tokens into a list of codepoints, where Text token codepoints are converted using Normalization Form C. Emoji tokens before the first Text token are ignored; each Emoji token after the first Text token is replaced by FE0F.
This prevents emoji from interacting incorrectly with Bidi and Context rules.
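As a rough illustration of this definition, the sketch below computes the Textual Content of one label. The `(kind, codepoints)` tuple representation and the name `textual_content` are assumptions made for the example, and NFC is applied per Text token.

```python
import unicodedata

FE0F = 0xFE0F


def textual_content(tokens):
    """Flatten a label, dropping leading emoji and replacing later emoji with FE0F."""
    out = []
    seen_text = False
    for kind, cps in tokens:
        if kind == "text":
            seen_text = True
            nfc = unicodedata.normalize("NFC", "".join(chr(cp) for cp in cps))
            out.extend(ord(ch) for ch in nfc)
        elif seen_text:
            out.append(FE0F)  # emoji after the first Text token
        # emoji before the first Text token contribute nothing
    return out


label = [("emoji", [0x1F4A9]), ("text", [0x61]), ("emoji", [0x1F4A9]), ("text", [0x62])]
print([f"{cp:X}" for cp in textual_content(label)])  # ['61', 'FE0F', '62']
```

The Bidi and Context checks would then run on this list instead of the raw emoji codepoints.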
Defined Variables
- $PRIMARY_STOP = 2E
- $HYPHEN = 2D
- $ZWJ = 200D
- $STOP = [2E]
- $VALID, $MAPPED, $IGNORED → IdnaMappingTable
- $COMBINING_MARK → DerivedGeneralCategory matching M*
- Custom ens-normalize Rules
- Derived Rules (UTS46 + UTS51)