Very Rough Algorithm Outline
Normalization Algorithm
Transform a string into a normalized string. String padding (\s+) should be removed by the user. Various errors may be thrown during computation.
Example Input: 💩Raffy.eth
- Tokenize the string into a list of labels, where each label is a list of Emoji and Text tokens and each token contains codepoints: [ [Emoji(1F4A9), Text(72, 61, 66, 66, 79)], [Text(65, 74, 68)] ] (the uppercase R, 52, is mapped to 72 during tokenization).
- Decode and Validate each label: Token[] → Token[]
  - [Emoji(1F4A9), Text(72, 61, 66, 66, 79)] → no change
  - [Text(65, 74, 68)] → no change
- Flatten each label into a list of codepoints, join the labels with $PRIMARY_STOP, and interpret the result as a string: [1F4A9 72 61 66 66 79] + 2E + [65 74 68]
- [CheckBidi] If the resulting string is a Bidi domain name (any token contains a codepoint of Bidi class R, AL, or AN) and the Textual Content of any label fails RFC5893, throw an error. (Specification, Code)
Example Output: 💩raffy.eth
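The flatten step and the trigger condition for CheckBidi can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function names `flatten_labels` and `is_bidi_domain_name` are hypothetical, each label is assumed to be already reduced to a plain list of codepoints, and the RFC5893 rule check itself is not shown.

```python
import unicodedata

PRIMARY_STOP = 0x2E  # $PRIMARY_STOP

def flatten_labels(labels):
    """Join per-label codepoint lists with $PRIMARY_STOP and return a string."""
    out = []
    for i, label in enumerate(labels):
        if i > 0:
            out.append(PRIMARY_STOP)
        out.extend(label)
    return "".join(map(chr, out))

def is_bidi_domain_name(name):
    """True if any codepoint has Bidi class R, AL, or AN (hypothetical helper)."""
    return any(unicodedata.bidirectional(ch) in {"R", "AL", "AN"} for ch in name)

# Codepoints matching the Example Output above.
labels = [[0x1F4A9, 0x72, 0x61, 0x66, 0x66, 0x79], [0x65, 0x74, 0x68]]
name = flatten_labels(labels)
```

For the example, `name` is the Example Output string and `is_bidi_domain_name(name)` is false, so the RFC5893 checks would be skipped.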
Tokenize
Return a list of labels by repeatedly processing and consuming the prefix of the string until it is empty. Emitted tokens are appended to the last label. The list of labels initially contains a single empty label: [[]]. Labels can be empty.
Essentially, UTS-51 parsing is done first, followed by UTS-46 with IDNA 2008 (std3=true, transitional=false, +NV8, +XV8, and the exceptions listed below).
Apply the following cases, starting from the top after each match:
- If the prefix is an exact match of a whitelisted emoji sequence, pop the match, and emit a verbatim Emoji token (one that contains the matching codepoints).
- If the prefix is an emoji flag sequence, pop the match, and emit a verbatim Emoji token.
- Determine if the prefix is an emoji keycap sequence:
  Caution: Under EIP-137 (commonly implemented as UTS-46 w/IDNA 2003), # FE0F 20E3 normalizes to # 20E3 (as FE0F is ignored) but normalizing # 20E3 again is an error. A REQ/OPT distinction preserves idempotency and makes keycap handling future-proof. The malformed 0-9 keycaps are effectively grandfathered.
  $KEYCAP_OPT = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
  $KEYCAP_REQ = [#, *]
  - If the prefix is $KEYCAP_REQ FE0F 20E3, where $KEYCAP_REQ is a codepoint, pop the match, and emit a verbatim Emoji token.
  - If the prefix is $KEYCAP_OPT FE0F? 20E3, where FE0F is optional and $KEYCAP_OPT is a codepoint, pop the match, and emit Emoji($KEYCAP_OPT, 20E3).
- Determine if the prefix is a ZWJ sequence or a single ModPre (a modifier or presentation sequence, as described below). ZWJ sequences are whitelisted as sequences of ModPre joined by $ZWJ. Consume and emit the longest match as a single Emoji token.
  - Determine if the prefix is a ModPre:
    An emoji sequence is emoji_zwj_sequence | emoji_tag_sequence | emoji_core_sequence.
    An emoji core sequence is emoji_presentation_sequence | emoji_keycap_sequence | emoji_modifier_sequence | emoji_flag_sequence.
    Since flags and keycaps have already been processed, only modifier and presentation sequences remain.
    - If the prefix is an emoji_modifier_sequence, pop the match, and emit a verbatim Emoji token.
    - Determine if the prefix is an emoji_presentation_sequence:
      Caution: Backwards compatibility with EIP-137 is maintained by permitting some emoji to optionally match FE0F but tokenize without it (DROP). ZWJ and future emoji require (and always include) FE0F (REQ).
      - If the prefix is $STYLE_REQ FE0F, where $STYLE_REQ is a codepoint, pop the match, and emit a verbatim Emoji token.
      - If the prefix is $STYLE_DROP FE0F?, where FE0F is optional and $STYLE_DROP is a codepoint, pop the match, and emit Emoji($STYLE_DROP).
- If the prefix is $STOP, pop the match, append a new label, and continue.
- If the prefix is $IGNORED, pop the match, and continue.
- If the prefix is $VALID, pop the match, and emit a verbatim Text token.
- If the prefix is $MAPPED, pop the match, and emit Text(map[$MAPPED]), where map :: codepoint → codepoint[].
- Otherwise, the prefix character is invalid. Throw an error.
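The case ordering above can be sketched as a small Python loop. This is a simplified illustration under several assumptions: the sets `IGNORED`, `VALID`, `MAPPED`, and `STYLE_DROP` are toy stand-ins for the real data tables, the whitelist, flag, modifier, ZWJ, and $STYLE_REQ cases are omitted, and merging adjacent Text codepoints into one token is inferred from the example output rather than stated in the rules. The names `Token`, `match_emoji`, `emit_text`, and `tokenize` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Token:
    kind: str   # "Emoji" or "Text"
    cps: list   # codepoints carried by the token

FE0F, KEYCAP = 0xFE0F, 0x20E3
KEYCAP_OPT = set(range(0x30, 0x3A))   # 0-9
KEYCAP_REQ = {0x23, 0x2A}             # '#' and '*'
STOP = {0x2E}
# Toy stand-ins for the real tables (assumptions for illustration only).
IGNORED = {0xAD}                                          # soft hyphen
VALID = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D}
MAPPED = {cp: [cp + 0x20] for cp in range(0x41, 0x5B)}    # A-Z -> a-z
STYLE_DROP = {0x1F4A9}                                    # FE0F optional, dropped

def match_emoji(cps, i):
    """Return (token codepoints, consumed length) or None."""
    cp = cps[i]
    nxt = cps[i + 1] if i + 1 < len(cps) else None
    nxt2 = cps[i + 2] if i + 2 < len(cps) else None
    if cp in KEYCAP_REQ and nxt == FE0F and nxt2 == KEYCAP:
        return [cp, FE0F, KEYCAP], 3          # verbatim, FE0F required
    if cp in KEYCAP_OPT and nxt == FE0F and nxt2 == KEYCAP:
        return [cp, KEYCAP], 3                # FE0F present but dropped
    if cp in KEYCAP_OPT and nxt == KEYCAP:
        return [cp, KEYCAP], 2                # FE0F absent
    if cp in STYLE_DROP:
        return [cp], (2 if nxt == FE0F else 1)
    return None

def emit_text(label, cps):
    """Append codepoints to the trailing Text token, or start a new one."""
    if label and label[-1].kind == "Text":
        label[-1].cps.extend(cps)
    else:
        label.append(Token("Text", list(cps)))

def tokenize(name):
    cps = [ord(c) for c in name]
    labels = [[]]                             # starts as a single empty label
    i = 0
    while i < len(cps):
        cp = cps[i]
        emoji = match_emoji(cps, i)           # emoji cases are tried first
        if emoji:
            labels[-1].append(Token("Emoji", emoji[0]))
            i += emoji[1]
            continue
        if cp in STOP:
            labels.append([])
        elif cp in IGNORED:
            pass
        elif cp in VALID:
            emit_text(labels[-1], [cp])
        elif cp in MAPPED:
            emit_text(labels[-1], MAPPED[cp])
        else:
            raise ValueError(f"invalid codepoint: {cp:04X}")
        i += 1
    return labels
```

With these toy tables, `tokenize("💩Raffy.eth")` yields two labels, the first holding an Emoji token followed by a lowercased Text token. A bare `#` followed by 20E3 is rejected, which mirrors the REQ behaviour described in the keycap caution.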
Decode and Validate
Transform a label (list of Emoji/Text tokens) into a valid label.
- Flatten the tokens into a list of codepoints, where Emoji token codepoints are used verbatim and Text token codepoints are converted using Normalization Form C.
- If the resulting sequence starts with 78 6E $HYPHEN $HYPHEN ("xn--"):
  - Decode the remainder of the sequence as Punycode (RFC3492). (Specification, Code)
  - If decoding failed, throw an error.
  - Tokenize the decoded sequence where $STOP, $IGNORED, and $MAPPED are empty sets.
  - If tokenization failed, throw an error.
  - Flatten the tokens to a list of codepoints as above.
  - If the flattened sequence doesn’t exactly match the decoded sequence, throw an error.
  - Replace the initial tokens and flattened sequence with the decoded values.
- If the sequence is empty, return an empty list.
- If the sequence has $HYPHEN at the 3rd and 4th codepoints, throw an error.
- If the sequence starts with $HYPHEN, throw an error.
- If the sequence ends with $HYPHEN, throw an error.
- If the sequence starts with $COMBINING_MARK, throw an error.
- [ContextJ/ContextO] If the Textual Content fails RFC5892, throw an error. (Specification, Code)
- Return the tokens.
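A partial sketch of these checks in Python, operating on an already flattened list of codepoints. It assumes Python's built-in `punycode` codec is an acceptable stand-in for an RFC3492 decoder and that `unicodedata.category` is a reasonable proxy for $COMBINING_MARK. The re-tokenization of decoded labels and the ContextJ/ContextO checks are not shown, and the function names are hypothetical.

```python
import unicodedata

HYPHEN = 0x2D
XN_PREFIX = [0x78, 0x6E, HYPHEN, HYPHEN]  # "xn--"

def decode_punycode_label(cps):
    """If the label starts with "xn--", decode the remainder as Punycode."""
    if cps[:4] != XN_PREFIX:
        return cps
    try:
        # bytes() rejects codepoints above 255; the codec rejects bad input.
        decoded = bytes(cps[4:]).decode("punycode")
    except ValueError as exc:  # UnicodeError is a subclass of ValueError
        raise ValueError("invalid punycode") from exc
    return [ord(c) for c in decoded]

def check_label(cps):
    """Apply the hyphen and combining-mark rules to a flattened label."""
    if not cps:
        return cps
    if len(cps) >= 4 and cps[2] == HYPHEN and cps[3] == HYPHEN:
        raise ValueError("hyphen at 3rd and 4th codepoints")
    if cps[0] == HYPHEN:
        raise ValueError("leading hyphen")
    if cps[-1] == HYPHEN:
        raise ValueError("trailing hyphen")
    if unicodedata.category(chr(cps[0])).startswith("M"):
        raise ValueError("leading combining mark")
    return cps
```

Calling `check_label(decode_punycode_label(cps))` follows the order of the list above: the "xn--" prefix is consumed by decoding first, so the 3rd/4th hyphen rule applies to the decoded sequence.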
Textual Content of a Label
Flatten the tokens into a list of codepoints, where Text token codepoints are converted using Normalization Form C, Emoji tokens before the first Text token are ignored, and later Emoji tokens are replaced by FE0F.
This prevents emoji from interacting incorrectly with Bidi and Context rules.
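One possible reading of this definition, sketched in Python. Two points are assumptions rather than statements from the text: each Emoji token after the first Text token contributes exactly one FE0F, and Normalization Form C is applied per Text token. Tokens are modelled as hypothetical `(kind, codepoints)` pairs.

```python
import unicodedata

FE0F = 0xFE0F

def textual_content(tokens):
    """tokens: list of (kind, codepoints) pairs, kind is "Emoji" or "Text"."""
    out = []
    seen_text = False
    for kind, cps in tokens:
        if kind == "Text":
            seen_text = True
            nfc = unicodedata.normalize("NFC", "".join(map(chr, cps)))
            out.extend(ord(c) for c in nfc)
        elif seen_text:
            out.append(FE0F)  # placeholder for an emoji after the first Text token
        # Emoji tokens before the first Text token are skipped entirely.
    return out
```

Under this reading, a label made of an emoji, the letter a, another emoji, and the letter b has the Textual Content 61 FE0F 62.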
Defined Variables
$PRIMARY_STOP = 2E
$HYPHEN = 2D
$ZWJ = 200D
$STOP = [2E]
- $VALID, $MAPPED, $IGNORED → IdnaMappingTable
- $COMBINING_MARK → DerivedGeneralCategory matching M*
- Custom ens-normalize Rules
- Derived Rules (UTS46 + UTS51)
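As a rough illustration of how $VALID, $MAPPED, and $IGNORED could be derived, the sketch below parses lines in the semicolon-separated layout used by IdnaMappingTable. The sample lines are written for illustration in that layout and are not copied from the file. The function name `parse_idna_mapping` is hypothetical, and the other statuses (such as disallowed or deviation), the NV8/XV8 flags, and the custom ens-normalize rules are ignored here.

```python
def parse_idna_mapping(lines):
    """Build (valid, ignored, mapped) from IdnaMappingTable-style lines."""
    valid, ignored, mapped = set(), set(), {}
    for line in lines:
        line = line.split("#", 1)[0].strip()      # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        lo, _, hi = fields[0].partition("..")     # "XXXX" or "XXXX..YYYY"
        start = int(lo, 16)
        end = int(hi, 16) if hi else start
        status = fields[1]
        for cp in range(start, end + 1):
            if status == "valid":
                valid.add(cp)
            elif status == "ignored":
                ignored.add(cp)
            elif status == "mapped":
                mapped[cp] = [int(x, 16) for x in fields[2].split()]
    return valid, ignored, mapped

# Illustrative lines only (layout assumed, not verbatim from the data file).
SAMPLE = [
    "0041          ; mapped   ; 0061   # LATIN CAPITAL LETTER A",
    "0061..007A    ; valid             # LATIN SMALL LETTER A..Z",
    "00AD          ; ignored           # SOFT HYPHEN",
]
valid, ignored, mapped = parse_idna_mapping(SAMPLE)
```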