ENS Name Normalization

raffy · April 20, 2022, 8:37am

Very Rough Algorithm Outline

Normalization Algorithm

Transform a string into a normalized string. Whitespace padding (\s+) around the string should be removed by the user beforehand. Errors may be thrown at several steps during computation.

Example Input: 💩Raffy.eth

Tokenize the string into a list of labels, where each label is a list of Emoji and Text tokens, and each token contains codepoints. Uppercase R (52) is in $MAPPED and becomes r (72):
- [ [Emoji(1F4A9), Text(72, 61, 66, 66, 79)], [Text(65, 74, 68)] ]
Decode and Validate each label: Token[] → Token[]
- [Emoji(1F4A9), Text(72, 61, 66, 66, 79)] → no change
- [Text(65, 74, 68)] → no change
Flatten each label into a list of codepoints joined together with $PRIMARY_STOP and interpret as a string.
- [1F4A9 72 61 66 66 79] + 2E + [65 74 68]
[CheckBidi] If the resulting string is a Bidi domain name (any token contains a codepoint of Bidi class R, AL, or AN) and the Textual Content of any label fails RFC5893, throw an error. (Specification, Code)

Example Output: 💩raffy.eth
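
The flatten step and the "is this a Bidi domain name" test above can be sketched in Python as follows. This is only an illustration: the function names are made up for this example, the check scans the final string rather than individual tokens, and the full RFC5893 rule check is not implemented.

```python
import unicodedata

PRIMARY_STOP = 0x2E  # $PRIMARY_STOP from the Defined Variables list


def flatten_labels(labels):
    """Join per-label codepoint lists with $PRIMARY_STOP and return a string."""
    out = []
    for i, label in enumerate(labels):
        if i > 0:
            out.append(PRIMARY_STOP)
        out.extend(label)
    return "".join(chr(cp) for cp in out)


def is_bidi_name(name):
    """True if any character has Bidi class R, AL, or AN.

    Assumption: Python's unicodedata tables stand in for the Unicode
    version the real implementation targets.
    """
    return any(unicodedata.bidirectional(ch) in {"R", "AL", "AN"} for ch in name)


# Codepoints of the normalized example: poop emoji + "raffy", then "eth".
name = flatten_labels([[0x1F4A9, 0x72, 0x61, 0x66, 0x66, 0x79], [0x65, 0x74, 0x68]])
print(name, is_bidi_name(name))
```

Only names for which is_bidi_name returns True would need the per-label RFC5893 check.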

Tokenize

Return a list of labels by repeatedly processing and consuming the prefix of string until empty. Emitted tokens are appended to the last label. The list of labels initially contains a single empty label: [[]]. Labels can be empty.

Essentially, UTS-51 emoji parsing is done first, followed by UTS-46 with IDNA 2008 (std3=true, transitional=false, +NV8, +XV8, and the exceptions listed below).

Apply the first matching case below, then start again from the top:

If the prefix is an exact match of a whitelisted emoji sequence, pop the match, and emit a verbatim Emoji token (one that contains the matching codepoints).
If the prefix is an emoji flag sequence, pop the match, and emit a verbatim Emoji token.
Determine if the prefix is an emoji keycap sequence:
Caution: Under EIP-137 (commonly implemented as UTS-46 w/IDNA 2003), # FE0F 20E3 normalizes to # 20E3 (as FE0F is ignored) but normalizing # 20E3 again is an error. A REQ/OPT distinction preserves idempotency and makes keycap handling future-proof. The malformed 0-9 keycaps are effectively grandfathered.
$KEYCAP_OPT = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
$KEYCAP_REQ = [#, *]
1. If the prefix is $KEYCAP_REQ FE0F 20E3, where $KEYCAP_REQ is a codepoint, pop the match, and emit a verbatim Emoji token.
2. If the prefix is $KEYCAP_OPT FE0F? 20E3, where FE0F is optional and $KEYCAP_OPT is a codepoint, pop the match, and emit Emoji($KEYCAP_OPT, 20E3).
Determine if the prefix is a ZWJ sequence or a single ModPre. ZWJ sequences are whitelisted as sequences of ModPre joined by $ZWJ. Consume and emit the longest match as a single Emoji token.
- Determine if the prefix is a ModPre:
  An emoji sequence is emoji_zwj_sequence | emoji_tag_sequence | emoji_core_sequence.
  An emoji core sequence is emoji_presentation_sequence | emoji_keycap_sequence | emoji_modifier_sequence | emoji_flag_sequence.
  Since flags and keycaps have already been processed, only modifier and presentation sequences remain.
1. If the prefix is an emoji_modifier_sequence, pop the match, and emit a verbatim Emoji token.
2. Determine if the prefix is an emoji_presentation_sequence:
  Caution: Backwards compatibility with EIP-137 is maintained by permitting some emoji to optionally match FE0F but tokenize without it (DROP). ZWJ and future emoji require (and always include) FE0F (REQ).
  - If the prefix is $STYLE_REQ FE0F, where $STYLE_REQ is a codepoint, pop the match, and emit a verbatim Emoji token.
  - If the prefix is $STYLE_DROP FE0F?, where FE0F is optional and $STYLE_DROP is a codepoint, pop the match, and emit Emoji($STYLE_DROP).
If the prefix is $STOP, pop the match, append a new label, and continue.
If the prefix is $IGNORED, pop the match, and continue.
If the prefix is $VALID, pop the match, and emit a verbatim Text token.
If the prefix is $MAPPED, pop the match, and emit Text(map[$MAPPED]) where map :: codepoint → codepoint[].
Otherwise, the prefix character is invalid. Throw an error.
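
A heavily reduced sketch of the tokenizer loop, in Python, may help make the case ordering concrete. It covers only the keycap cases and the $STOP / $IGNORED / $VALID / $MAPPED cases. The whitelisted, flag, ZWJ, modifier, and presentation cases are left out, and the VALID / MAPPED / IGNORED sets are toy ASCII stand-ins (an assumption for this example), not the real IdnaMappingTable. Token, match_keycap, and tokenize are hypothetical names.

```python
from dataclasses import dataclass

FE0F, KEYCAP = 0xFE0F, 0x20E3
KEYCAP_OPT = set(range(0x30, 0x3A))  # 0-9: FE0F optional, dropped on output
KEYCAP_REQ = {0x23, 0x2A}            # # and *: FE0F required, kept verbatim
STOP = {0x2E}
# Toy stand-ins for the IdnaMappingTable sets (assumption: ASCII only).
IGNORED = {0xAD}                     # soft hyphen
VALID = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {0x2D}
MAPPED = {cp: [cp + 0x20] for cp in range(0x41, 0x5B)}  # A-Z -> a-z


@dataclass
class Token:
    kind: str   # "Emoji" or "Text"
    cps: list   # codepoints carried by the token


def match_keycap(cps, i):
    """Return (token, consumed) if a keycap sequence starts at cps[i], else None."""
    head = cps[i]
    if head in KEYCAP_REQ and cps[i + 1:i + 3] == [FE0F, KEYCAP]:
        return Token("Emoji", cps[i:i + 3]), 3           # verbatim
    if head in KEYCAP_OPT:
        if cps[i + 1:i + 3] == [FE0F, KEYCAP]:
            return Token("Emoji", [head, KEYCAP]), 3     # FE0F dropped
        if cps[i + 1:i + 2] == [KEYCAP]:
            return Token("Emoji", [head, KEYCAP]), 2
    return None


def tokenize(s):
    cps = [ord(c) for c in s]
    labels = [[]]                    # starts with a single empty label
    i = 0
    while i < len(cps):
        hit = match_keycap(cps, i)
        if hit:
            labels[-1].append(hit[0])
            i += hit[1]
            continue
        cp = cps[i]
        i += 1
        if cp in STOP:
            labels.append([])        # start a new label
        elif cp in IGNORED:
            pass                     # consumed, nothing emitted
        elif cp in VALID:
            labels[-1].append(Token("Text", [cp]))
        elif cp in MAPPED:
            labels[-1].append(Token("Text", list(MAPPED[cp])))
        else:
            raise ValueError(f"invalid codepoint {cp:X}")
    return labels
```

Because digit keycaps always come out as digit + 20E3, tokenizing the flattened output a second time yields the same tokens, which is the idempotency property the keycap caution describes.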

Decode and Validate

Transform a label (list of Emoji/Text tokens) into a valid label.

Flatten the tokens into a list of codepoints, where Emoji token codepoints are used verbatim and Text token codepoints are converted using Normalization Form C.
If the resulting sequence starts with 78 6E $HYPHEN $HYPHEN ("xn--"):
- Decode the remainder of the sequence as Punycode (RFC3492). (Specification, Code)
- If decoding failed, throw an error.
- Tokenize the decoded sequence, where $STOP, $IGNORED, and $MAPPED are empty sets.
- If tokenization failed, throw an error.
- Flatten the tokens to a list of codepoints like above.
- If the flattened sequence doesn’t exactly match the decoded sequence, throw an error.
- Replace the initial tokens and flattened sequence with the decoded values.
If the sequence is empty, return an empty list.
If the sequence has $HYPHEN at both the 3rd and 4th codepoints, throw an error.
If the sequence starts with $HYPHEN, throw an error.
If the sequence ends with $HYPHEN, throw an error.
If the sequence starts with $COMBINING_MARK, throw an error.
[ContextJ/ContextO] If the Textual Content fails RFC5892, throw an error. (Specification, Code)
Return the tokens.
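
The Punycode branch and the simple positional checks above could look roughly like this in Python. This is a sketch under assumptions: it works on an already flattened codepoint list, uses Python's built-in "punycode" codec (which may be more lenient than a strict RFC3492 decoder), and skips the re-tokenization round trip and the ContextJ/ContextO checks. The function names are hypothetical.

```python
import unicodedata

HYPHEN = 0x2D
XN_PREFIX = [0x78, 0x6E, HYPHEN, HYPHEN]  # "xn--"


def decode_if_punycode(cps):
    """If cps starts with "xn--", return the Punycode-decoded remainder."""
    if cps[:4] != XN_PREFIX:
        return cps
    try:
        encoded = bytes(cps[4:])  # raises ValueError for codepoints > 255
        return [ord(c) for c in encoded.decode("punycode")]
    except ValueError as err:     # UnicodeError is a ValueError subclass
        raise ValueError("punycode decoding failed") from err


def check_label(cps):
    """Raise ValueError if the flattened label breaks a positional rule."""
    if not cps:
        return  # empty labels are allowed at this stage
    if len(cps) >= 4 and cps[2] == HYPHEN and cps[3] == HYPHEN:
        raise ValueError("hyphen at 3rd and 4th codepoints")
    if cps[0] == HYPHEN:
        raise ValueError("leading hyphen")
    if cps[-1] == HYPHEN:
        raise ValueError("trailing hyphen")
    # General category M* (Mn, Mc, Me) marks a combining mark.
    if unicodedata.category(chr(cps[0])).startswith("M"):
        raise ValueError("leading combining mark")


label = decode_if_punycode([ord(c) for c in "xn--bcher-kva"])
check_label(label)
print("".join(map(chr, label)))
```

The hyphen checks run on the decoded sequence, so an "xn--" label is judged by what it decodes to rather than by its encoded form.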

Textual Content of a Label

Flatten the tokens into a list of codepoints, where Text token codepoints are converted using Normalization Form C. Emoji tokens before the first Text token are ignored, and Emoji tokens after it are replaced by FE0F.

This prevents emoji from interacting incorrectly with Bidi and Context rules.
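
One possible reading of the Textual Content rule, sketched in Python. Two assumptions are made here: tokens are plain (kind, codepoints) pairs, and each Emoji token after the first Text token contributes a single FE0F. The function name textual_content is made up for this example.

```python
import unicodedata

FE0F = 0xFE0F


def textual_content(tokens):
    """tokens: list of (kind, codepoints) pairs, kind is "Emoji" or "Text"."""
    out = []
    seen_text = False
    for kind, cps in tokens:
        if kind == "Text":
            seen_text = True
            nfc = unicodedata.normalize("NFC", "".join(map(chr, cps)))
            out.extend(ord(c) for c in nfc)
        elif seen_text:
            out.append(FE0F)  # emoji after text is reduced to a placeholder
        # emoji before the first Text token contributes nothing
    return out


tokens = [
    ("Emoji", [0x1F4A9]),
    ("Text", [0x61]),
    ("Emoji", [0x1F4A9]),
    ("Text", [0x65, 0x301]),  # e + combining acute, composes to E9 under NFC
]
print([f"{cp:X}" for cp in textual_content(tokens)])
```

The result is what the Bidi and Context checks would see in place of the raw label, so emoji codepoints never take part in those rules directly.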

Defined Variables

$PRIMARY_STOP = 2E
$HYPHEN = 2D
$ZWJ = 200D
$STOP = [2E]
$VALID, $MAPPED, $IGNORED → IdnaMappingTable
$COMBINING_MARK → DerivedGeneralCategory matching M*
Custom ens-normalize Rules
Derived Rules (UTS46 + UTS51)