ENS Name Normalization

I’ve got a few remaining questions regarding normalization.

  1. The UTS-46 spec makes a distinction between “registration” and “lookup”, where stricter rules apply during registration. Since ENS is decentralized, aren’t these the same? I feel like the goal here should be to have a single procedure that takes user input and makes it standard across all applications and platforms.

  2. One example of this difference is ContextO (Rules #3-#9). Should ENS name normalization follow ContextO? For reference, ContextJ should definitely be used since it prevents ZWJ abuse. However, the spec says ContextO is for registration, not lookup.

  3. Does anyone have enough experience with bidirectional characters to form an opinion regarding the check_bidi flag in UTS-46? Essentially, when enabled, labels (the characters between stops in a domain name) can only contain strictly LTR or RTL characters. I have a few bidi examples in my resolver demo for reference. Mixed-direction text is user-hostile IMO, but maybe there’s a use-case for mixing something like English and Hebrew inside of a single label?

  4. UTS-46 operates on unescaped Unicode sequences. Should the normalization process be aware of, and expand, HTML, Unicode, and URI escapes? (See UTS #46: Unicode IDNA Compatibility Processing.) I suggest that every input field that accepts ENS names should automatically translate these escapes.
    Edit 3: I now understand that escapes can be dangerous because they can be nested, as you can escape the escape characters. UTS-46 poorly addresses this problem because punycode can expand to more punycode if CheckHyphens is false.

  5. My library already passes 100% of IDNATestV2, yet it appears that this test is insufficient, as I’ve just discovered that IDNA2008 disallows more characters (see NV8 and XV8) than I was aware of. Should we follow IDNA2008? From my understanding, UTS-46 w/transitional=false should follow IDNA2008.
    Edit: it appears NV8 can’t be used exactly as the spec suggests, because it contains emoji. XV8 only contains one character, 0x19DA = ᧚.
    Edit 2: NV8 contains 8759 characters. Except for emoji, I think most of these should be disallowed. Example: ˃˃˃.eth

  6. Lastly, please take a look at this post in the emoji thread regarding treating the presence or absence of ZWJ inside emoji sequences as separate entities.
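
To make question 3 concrete, here is a rough sketch of a "strictly LTR or RTL per label" check. It is much simpler than the real Bidi rule that check_bidi refers to: it only looks at strongly directional characters via Python's `unicodedata.bidirectional`, and the function names are made up for illustration.

```python
import unicodedata

def label_directions(label):
    """Return the set of strong directions ("LTR", "RTL") found in a label."""
    directions = set()
    for ch in label:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            directions.add("LTR")
        elif bidi_class in ("R", "AL"):
            directions.add("RTL")
    return directions

def has_mixed_direction(name):
    """True if any label mixes strong LTR and strong RTL characters."""
    return any(len(label_directions(label)) > 1 for label in name.split("."))
```

With this sketch, a Hebrew label followed by `.eth` is fine (each label has one direction), while a single label containing both Latin and Hebrew letters is flagged.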

Any thoughts would be greatly appreciated.

A few prior threads for reference:


Let me break this problem up a bit more:

Why do we need normalization in the first place?

  • Casefolding: A vs a
  • Normalized forms: Å (212B) vs Å (00C5) or . (002E) vs ． (FF0E)
  • Potential Junk: obviously illegal characters \[{(%&, characters that lack display, etc.
  • Names need to be identifiable from their context. Consider "A B.eth" if spaces were allowed.
  • Names also exist in a hierarchy, so they need to decompose to a series of labels with a separator (.).
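
The first two bullets can be reproduced with nothing but Python's standard library. This is only an illustration of case folding and Unicode normalization forms, not the UTS-46 mapping itself:

```python
import unicodedata

angstrom_sign = "\u212B"   # ANGSTROM SIGN
a_with_ring = "\u00C5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
fullwidth_stop = "\uFF0E"  # FULLWIDTH FULL STOP

# Case folding: "A" and "a" become the same string.
folded = "A".casefold()

# NFC collapses the two visually identical A-with-ring characters.
nfc_angstrom = unicodedata.normalize("NFC", angstrom_sign)

# NFC leaves the full-width stop alone; the compatibility form NFKC maps it to ".".
nfc_stop = unicodedata.normalize("NFC", fullwidth_stop)
nfkc_stop = unicodedata.normalize("NFKC", fullwidth_stop)
```

Without a step like this, `angstrom_sign` and `a_with_ring` would hash to different names even though they look identical.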

Why does this get complicated?

  • Alphabet overload: Latin, Greek, Subscript, Circled, Squared, Math, Full-width, Bold, Italic, Script, etc. There are more than 20 characters that map to the letter "a".
  • Emoji Appearance: ⚠ (26A0) vs ⚠️ (26A0 FE0F)
  • Emoji ZWJ Sequences: 🏴‍☠️ (1F3F4 200D 2620 FE0F) vs 🏴☠️ (1F3F4 2620 FE0F) vs 🏴☠ (1F3F4 2620)
  • Unicode evolution: there will obviously be more emoji in Unicode 15 in 2022.
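
Since these sequences are hard to tell apart by eye, a small helper that prints the code points is handy when comparing names. The helper name `code_points` is just for illustration:

```python
def code_points(text):
    """Render a string as space-separated uppercase hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

warning_text = "\u26A0"
warning_emoji = "\u26A0\uFE0F"
pirate_flag = "\U0001F3F4\u200D\u2620\uFE0F"      # flag + ZWJ + skull + FE0F
without_joiner = pirate_flag.replace("\u200D", "")  # renders as two glyphs

print(code_points(warning_text))
print(code_points(warning_emoji))
print(code_points(pirate_flag))
print(code_points(without_joiner))
```

Each of these strings would produce a different hash unless normalization maps them together.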

What are the actual use-cases:

  • Registration: I want to rent raffy.eth, is it available?
  • Identity: I want to send 1E to raffy.eth

How can we improve on this?

I’ve mostly been thinking about the user-side, so I’m just going to discuss the identity aspect.


Identity

Where do names come from?

  • Direct input
  • Copy/paste

What encoding problems naturally happen?

  • DNS uses punycode to represent Unicode
  • Some channels don’t support Unicode
  • Some channels mutilate Unicode (this forum outside of code blocks)

What can go wrong?

  • Direct input: complex emoji are hard to write/edit, especially with joiners.
  • Direct input: mixed directional names are hard to write/edit
  • Direct input: input mistakes
  • Direct input: transcription confusion
  • Copy/paste: homograph attacks

Imagine you’re given a name, either as a placard to transcribe or text to copy/paste:

  • If it’s Base36 [a-z0-9], that’s a great sign (trivial case)
  • If it’s already normalized, that’s a good sign. Note: this is false for many emoji.
  • My suggestion: If it matches the display name, that’s a great sign.
  • If it’s Base62, that’s a good sign.
  • If it only requires case-folding to reach the normalized name, that’s a good sign.
  • If it’s IDNA, that’s an okay sign.
  • If it’s a single script per label, that’s an okay sign.
  • If it contains ignored characters, that’s a bad sign.
  • If it contains disallowed characters, that’s a bad sign.
  • If it has punycode, that’s a bad sign.
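
A few of these signals are cheap to compute per label. The sketch below only covers the trivial cases (Base36, Base62, punycode prefix); the function name and the returned strings are made up, and anything else would need the full normalization procedure to classify:

```python
import re

BASE36 = re.compile(r"[a-z0-9]+")
BASE62 = re.compile(r"[A-Za-z0-9]+")

def label_signal(label):
    """Classify a single label using only the cheapest checks from the list."""
    if label.lower().startswith("xn--"):
        return "bad: punycode"
    if BASE36.fullmatch(label):
        return "great: base36"
    if BASE62.fullmatch(label):
        return "good: base62, only needs case folding"
    return "unknown: needs full normalization"
```

For example, `label_signal("raffy")` reports the trivial case, while `label_signal("Raffy")` only needs case folding to get there.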

If you’re sending crypto to someone, why would you ever want to interact with a name that isn’t normalized? It’s basically a red flag.

However, 🏳️‍🌈.eth is forever unnormalized because the FE0F is stripped by IDNA rules, so it will always trigger this kind of warning. But, if we implement the display name conventions, then this would work!


@royalfork in a previous thread talked about the issues with emoji. He points to UTS-51 as the spec for emoji presentation.

Using just UTS-46 leaves us in this weird scenario because the correct application of UTS-46 + IDNA2008 kills almost all emoji. Emoji only got partially through because IDNA2003 was used. Only some gTLDs allow emoji via punycode but they violate UTS-46.

I think we should respect UTS-51 so we can take full advantage of current and future emoji. This would enable many more emoji, like country flags (RGI_Emoji_Tag_Sequence), a bunch of missing complex sequences (RGI_Emoji_ZWJ_Sequence), and many standard emoji that are disallowed by IDNA without emoji styling.

I believe it’s also possible to do this without breaking the existing non-standard names by grandfathering those emoji sequences. All you need is a finite list of sequences that had FE0F removed (as of today) that are disallowed without it. Everything else going forward could be handled correctly. (The ideal solution would be migrating them, but that’s complex and $$$, so ignore that.)

For example, this needs to be fixed: 👩‍🦱.eth vs 👩‍‍🦱.eth, as you can’t even tell there’s an extra ZWJ in there. This is different from ZWJ as exterior padding.

Edit: this renders differently on Mac vs PC.
[Image: EmojiDoubleJoiner]


The algorithm that I’m currently imagining works like this:

  • Consume characters left to right. If the character is an emoji starter, scan ahead and check if the emoji is valid. This process follows UTS-51 but allows grandfathered emoji to have a missing FE0F, and removes the FE0F if it was there (to preserve the namehash).

      • If it is, add an emoji token Emoji(💩) and skip ahead.

      • If not, apply IDNA2008 rules:

          • If it’s a stop, add a Stop token.

          • Otherwise, add it as a Char token, where adjacent Char tokens merge together: Char("a") + Char("b") = Char("ab")

  • This produces a sequence of tokens like: Abc💩.eth → [Char("abc"), Emoji(💩), Stop, Char("eth")]

  • For more information, you could also tokenize the actual IDNA rules: Disallowed("\"), Ignored(" "), and separate Char into Valid and Mapped("A","a") to retain full contextual information.
    This produces a sequence of tokens like: Abc💩.eth → [Mapped("A","a"), Valid("bc"), Emoji(💩), Stop, Valid("eth")], which is great for debugging or producing intelligent errors.

  • The main objective is that the emoji are lifted out of textual processing.

  • Merge any adjacent Mapped/Valid tokens into a Char token:
    [Mapped("A","a"), Valid("bc")] → [Char("abc")]

  • Apply NFC to each Char token.
    From my tests, emoji are unaffected by NFC, but I think this is cleaner than flattening the string, applying NFC once, and then retokenizing.

  • Split the token sequence on every Stop token:
    [Char("abc"), Emoji(💩), Stop, Char("eth")] → [[Char("abc"), Emoji(💩)], [Char("eth")]]

  • If any label, when flattened, starts with xn--, decode the Punycode and require that the result was already normalized, no exceptions.

  • Determine if the domain name is bidi.

  • For each label:
    1.) Apply UTS-46 validation: CheckHyphens and CheckJoiners where the label is flattened.
    2.) Apply ContextJ and ContextO rules where Emoji tokens are treated like a singular character.
    3.) Unknown: if bidi domain, apply Bidi rules where Emoji tokens are treated like a singular character.

  • Flatten the tokens to a single sequence of code points where Stop is replaced by "."
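
Here is a heavily simplified sketch of the tokenizing part of the steps above, to show the shape of the data. Everything in it is a stand-in: `EMOJI` is a two-entry placeholder for a UTS-51-derived table, `str.lower()` substitutes for the IDNA2008 mapping table, and the validation steps (punycode, CheckHyphens, ContextJ/ContextO, Bidi) are left out entirely.

```python
import unicodedata

# Hypothetical, tiny stand-in for a UTS-51-derived emoji table.
EMOJI = {"\U0001F4A9", "\u231A"}
FE0F = "\uFE0F"

def tokenize(name):
    """Split a name into ("Emoji", s), ("Stop", "") and ("Char", s) tokens."""
    tokens = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch in EMOJI:
            i += 1
            if i < len(name) and name[i] == FE0F:
                i += 1  # grandfathered emoji: drop FE0F to preserve the namehash
            tokens.append(("Emoji", ch))
        elif ch == ".":
            tokens.append(("Stop", ""))
            i += 1
        else:
            mapped = ch.lower()  # stand-in for the IDNA2008 mapping table
            if tokens and tokens[-1][0] == "Char":
                tokens[-1] = ("Char", tokens[-1][1] + mapped)  # merge adjacent Char
            else:
                tokens.append(("Char", mapped))
            i += 1
    # Apply NFC per Char token; emoji tokens are left untouched.
    return [(kind, unicodedata.normalize("NFC", text) if kind == "Char" else text)
            for kind, text in tokens]

def split_labels(tokens):
    """Split the token sequence on every Stop token."""
    labels = [[]]
    for token in tokens:
        if token[0] == "Stop":
            labels.append([])
        else:
            labels[-1].append(token)
    return labels

def flatten(tokens):
    """Join tokens back into a string, replacing Stop with "."."""
    return "".join("." if kind == "Stop" else text for kind, text in tokens)
```

The point of the token list is that per-label validation can treat each Emoji token as a single opaque character instead of inspecting the code points inside it.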


Agreed. We need to pick a single standard for everything.

Definitely not.

Following IDNA2008 was my original intention.

This is an excellent summary, thank you!

This is also an issue for just displaying names - for example, reverse record in Etherscan.

Not a big fan of this, but it’s workable if we can demonstrate the win is big enough. How does the UTS-51 solution differ from what you’ve already tested out previously using UTS-46 and updated Unicode tables?

The process you describe seems reasonable, although I’m not entirely clear on how it deviates from UTS-46 et al except for Emoji processing.

It also seems like it’d be simpler if you first split on ‘.’ and then applied the algorithm for each label - no need for ‘stop’ tokens etc then.


The difference is that you have to first identify and separate the emoji from IDNA processing. The text parts go through IDNA and the emoji parts are as-is (or could be passed through a new algorithm for Emoji.)

This lets you enable IDNA2008, which turns on NV8 and XV8 and disallows many more characters.

Additionally for validation, you don’t look inside emoji sequences to apply context rules, because we already deemed them valid according to UTS-51.

Two examples:

  • #️ [0023 FE0F] is from Emoji version 1.1 yet will never be valid under UTS-46 (IDNA 2003 or 2008) because FE0F gets removed and # [0023] is invalid. However, it’s a correct emoji according to UTS-51 and easily identifiable. When we hash a string containing this emoji, the FE0F should remain. The text version is # [0023 FE0E] which becomes # [0023] which is disallowed. TLDR: only #️ works.

  • ⌚ [231A FE0F] (also from Emoji version 1.1) presents correctly when stripped ⌚ [231A] and is legal in UTS-46 using IDNA2003 but disallowed in IDNA2008 because it’s an emoji. When we hash a string containing this emoji, we should use the grandfathered version and strip the FE0F to preserve the existing namehash (effectively mapping [231A FE0F] to [231A]). For the text version, ⌚︎ [231A FE0E], we strip FE0E and see that it matches [231A]. TLDR: the 3 watch emojis are equivalent.

I think this works in all cases and lets the standard integrate new emoji w/o issue, with the penalty that we forever remember the list of grandfathered emoji to account for previously registered names, which seems pretty mild compared to breaking thousands of registered names.
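
The two examples boil down to two small lookup sets. A minimal sketch, where `GRANDFATHERED` and `REQUIRES_FE0F` each hold one sample entry (the real lists would be derived from Unicode data and registered names) and the function name is invented:

```python
FE0F = "\uFE0F"  # emoji presentation selector
FE0E = "\uFE0E"  # text presentation selector

# Hypothetical sample entries only.
GRANDFATHERED = {"\u231A"}  # WATCH: valid under IDNA2003 with FE0F stripped
REQUIRES_FE0F = {"#"}       # only valid as an emoji when FE0F is kept

def normalize_single_emoji(seq):
    """Normalize one base character plus an optional variation selector."""
    if not seq:
        raise ValueError("empty sequence")
    base, selector = seq[0], seq[1:]
    if selector not in ("", FE0F, FE0E):
        raise ValueError("not a single-character emoji sequence")
    if base in GRANDFATHERED:
        return base          # all three forms collapse to the bare character
    if base in REQUIRES_FE0F and selector == FE0F:
        return base + FE0F   # keep FE0F so the emoji stays an emoji
    raise ValueError("disallowed")
```

So the three watch forms give the same output, while for `#` only the FE0F form survives.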

You technically could, but UTS-46 permits multiple stops which come from IDNA mapping, and I don’t know if any stops could occur inside of an emoji (e.g. the TAG stop is E002E, which occurs inside emoji but is not mapped to a stop by either IDNA version). It might be smart to deviate from this and say that only (.) [002E] is a stop and these other stops are bad:

  • 3002 IDEOGRAPHIC FULL STOP
  • FF0E FULLWIDTH FULL STOP
  • FF61 HALFWIDTH IDEOGRAPHIC FULL STOP
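
If only 002E counts as a stop, splitting first becomes trivial. A sketch of that deviation, with an invented function name, which rejects the three alternate stops instead of mapping them:

```python
# The three stops that UTS-46 would otherwise map to "."
ALTERNATE_STOPS = {
    "\u3002": "IDEOGRAPHIC FULL STOP",
    "\uFF0E": "FULLWIDTH FULL STOP",
    "\uFF61": "HALFWIDTH IDEOGRAPHIC FULL STOP",
}

def split_labels_strict(name):
    """Split on U+002E only; raise if an alternate stop appears anywhere."""
    for ch in name:
        if ch in ALTERNATE_STOPS:
            raise ValueError(f"disallowed stop: {ALTERNATE_STOPS[ch]}")
    return name.split(".")
```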

Thanks for the summary!

I thought we were already relying on IDNA2008. Have you checked if there are existing valid names that this would invalidate?

It sounds like this will require manually evaluating every existing emoji to determine how to handle it. How big a job is this?

This seems reasonable to me.


Thousands. Emoji are not allowed in IDNA2008 😦. But they’ll work using the algorithm above.

I have the grandfathered list, I’m just trying to codify UTS-51 and put as much of the logic as possible into the payload that is derived automatically from the Unicode data.

There is a question about which emoji sets should ultimately be disallowed, but they’re all presentation/opinion based. I’m not sure if that discussion has happened anywhere yet.

  • I think @royalfork mentioned the keycaps emoji 1️⃣ [0031 FE0F 20E3] vs digit 1 w/ emoji style 1️ [0031 FE0F], where the visual difference depends on the platform (they differ on mobile, but desktop appears to be lagging) but that’s true for a wide set of emoji that are already registered.

  • Skincolor (emoji modifiers) are already very subtle: ✍🏼 vs ✍🏼 but are already registered.

  • Emoji tags let you embed arbitrary data inside of an emoji sequence, but since they’ll hash differently, I think they’re fine or they could just be disallowed.

  • I think there’s a question about emoji ordering when joined by ZWJ, e.g. “family: man, woman” vs “family: woman, man”. I think these are separate. We could either allow any joined sequence of any valid emoji or follow only the suggested sequences (which will grow with every Unicode update).


Sorry, I mean names that are currently registered but would be invalid with your entire algorithm, not just the IDNA 2008 bit.

Wow, I had no idea. That’s simultaneously cool and terrifying.

Do any existing renderers show both versions the same way?


I’m pretty sure by doing emoji processing first, you only get the benefits of a wider set of emoji. However, IDNA2008 removes 9K additional characters, so there might be some non-emoji text that someone registered that’s no longer valid. This will be the first thing I compute once it’s working.

I’m looking into that. I think it’s supposed to be positional. The spec does define an order for modifiers, because you could have MAN+WHITE-SKIN+RED-HAIR and MAN+RED-HAIR+WHITE-SKIN, whereas you can currently reach those as separate names. I haven’t checked if they render correctly though (emoji quirks mode?)

I think the special case list for emoji is pretty small so far:

  • ["#","23"],["*","2a"],["‼","203c"],["⁉","2049"] these 4 emoji are disallowed by IDNA2003 but when followed with a FE0F, they’re valid. So they could be enabled by retaining the FE0F w/o any ambiguity. Or they could remain disallowed.

  • ["™","2122"],["ℹ","2139"],["Ⓜ","24c2"],["㊗","3297"],["㊙","3299"],["🈁","1f201"],["🈂","1f202"],["🈚","1f21a"],["🈯","1f22f"],["🈲","1f232"],["🈳","1f233"],["🈴","1f234"],["🈵","1f235"],["🈶","1f236"],["🈷","1f237"],["🈸","1f238"],["🈹","1f239"],["🈺","1f23a"],["🉐","1f250"],["🉑","1f251"] these 20 characters are mapped by IDNA2003. I don’t think you can safely enable these as emoji because we don’t know what the owner’s intent was. If no names use them, they could be enabled as two separate characters as the spec intended: the mapped version and the emoji version.
    eg. ™️™️™️.eth (emoji) vs ™™™.eth → ("tm") → tmtmtm.eth

  • Every other emoji was untouched by IDNA2003.

At this point, you could say any emoji that’s actively in a name gets grandfathered and can be used in either form, like the digit “1” and the emoji-styled digit 1️ [0031 FE0F], which are interchangeable currently, even though they’re clearly different characters with different input mechanisms. And any remaining emoji gets FE0F applied if it’s in the emoji form, so it stays an emoji, or is IDNA-transformed if it’s not.

Or, we could say FE0F is assumed and any future emoji that is assigned to a character that IDNA mapped can also use that transformation.

1️1️1️.eth === 111.eth but 👨🏾.eth =!= 👨🏿.eth.

Edit: It appears only 5 registered names contain the disallowed characters above. I can’t compute the mapped ones because ("tm") is ambiguous.

// all currently unreachable
0023 => [ '{23}{23}{23}' ] // 1 
002A => [ '{2A}{2A}{2A}', '{2A}{FE0F}{20E3}' ] // 2, 3
203C => [ '{203C}{FE0F}{203C}{FE0F}{203C}{FE0F}' ] // 4
2049 => [ '{2049}{FE0F}{2049}{FE0F}{2049}{FE0F}' ] // 5

// would be valid with FE0F
// but no longer a short name
1 => `0023 FE0F 0023 FE0F 0023 FE0F`
2 => `002A FE0F 002A FE0F 002A FE0F`

// valid under algorithm above:
3 = *️⃣ (Emoji Keycap Sequence *)
4 = ‼️‼️‼️
5 = ⁉️⁉️⁉️

I’d prefer that we continue to normalise any characters that are confusable with regular letters to their canonical form. Emoji are trickier - you’ve pointed out how subtle some skin color changes are, particularly in small sizes - but we should still aim to avoid having identical-appearing names with different encodings.


I think I got it working but I haven’t run a damage report yet. The latest version can be played with on my ENS Resolver demo or on my GitHub.

It’s basically running UTS-51 in front of IDNA2008, peeling off emoji into tokens, so they aren’t transformed. The implementation is pretty close to what I described above. It is also using ContextO.

So far I’ve noticed that:

6️⃣9️⃣.eth = 36 fe0f 20e3 39 fe0f 20e3
norm(6️⃣9️⃣.eth) = 36 20e3 39 20e3
norm(norm(6️⃣9️⃣.eth)) => disallowed

This is because 20E3 is not allowed in IDNA2008, and the loss of the FE0F means the UTS-51 emoji criteria fail, but this could be fixed.

One advantage of being able to detect emoji sequences is that I can represent them as proper sequences:
[Image: ENSTokenized]


Can you elaborate on why this is the case? Shouldn’t this be treated as two emoji sequences and thus not transformed at all?

  • $DIGIT FE0F 20E3 is the emoji sequence for a keycap, where $DIGIT = /[#*0-9]/.
  • $DIGIT 20E3 is unqualified.
  • IDNA2003($DIGIT 20E3) = $DIGIT 20E3
  • IDNA2008($DIGIT 20E3) => disallowed 20E3

I fixed this by just making the FE0F optional.
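
In regex terms, the optional FE0F looks roughly like this. It is a sketch, not the library code: `KEYCAP` and `strip_keycap_fe0f` are illustrative names, and emitting the unqualified form for every keycap is an assumption made here to match the already-registered digit keycaps.

```python
import re

# Keycap base, optional FE0F, then COMBINING ENCLOSING KEYCAP (20E3).
KEYCAP = re.compile("[#*0-9]\uFE0F?\u20E3")

def strip_keycap_fe0f(text):
    """Rewrite every keycap sequence to its unqualified form (no FE0F)."""
    return KEYCAP.sub(lambda m: m.group(0).replace("\uFE0F", ""), text)

qualified = "6\uFE0F\u20E39\uFE0F\u20E3"  # 36 FE0F 20E3 39 FE0F 20E3
once = strip_keycap_fe0f(qualified)
twice = strip_keycap_fe0f(once)
```

Because the pattern still matches after the FE0F is gone, normalizing a second time leaves the name unchanged instead of failing on a bare 20E3.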


The damage report for UTS-51 + IDNA2008 is actually pretty mild.

One big gray area is the set of characters that are allowed by IDNA2003, disallowed by IDNA2008, but not emoji. I think they mostly correspond to the pictographs. I’m currently looking through them.

'1F150..1F169' // NEGATIVE CIRCLED
'1F16D..1F18F' // NEGATIVE SQUARED 
'1F191..1F1AC' // SQUARED
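
For reviewing ranges written in this `START..END` hex notation, a small helper can expand them into a set of code points. The helper name is made up, and single values without `..` are treated as one-element ranges:

```python
def expand_ranges(ranges):
    """Expand hex range strings like '1F150..1F169' into a set of code points."""
    code_points = set()
    for item in ranges:
        start, _, end = item.partition("..")
        first = int(start, 16)
        last = int(end or start, 16)
        code_points.update(range(first, last + 1))
    return code_points

gray_area = expand_ranges(["1F150..1F169", "1F16D..1F18F", "1F191..1F1AC"])
```

A membership test such as `ord(ch) in gray_area` then tells you whether a character from a registered name falls into this gray area.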

Edit: Some of these seem really dangerous and should be disallowed:

  • LEFT PARENTHESIS EXTENSION: "⎜" 239C
  • DIGRAM FOR GREATER YANG: "⚌" 268C

But why does the former normalise to the latter? I thought this new procedure identified emoji sequences and preserved them?

Edit: I guess the goal was to have a framework that does it the correct way, and then try to relax it so it fits as many of the registered names as possible.

I thought the keycaps could be fixed using my method, but since some are already registered, it has to use the unqualified form. Both the * and # keycaps can use the FE0F.


Edit 2: Let me explain that a bit more and summarize:

If we use UTS-51 + IDNA2008, my suggestion would be that anytime ENS wants to enable new emoji (i.e. Unicode updates that add more emoji), the new characters should all normalize with FE0F attached when applicable, both to preserve the intention (this is an emoji) and to avoid unqualified representations. (According to the spec, all emoji keyboards produce fully-qualified emoji.)

If there’s no FE0F, it goes through IDNA2008 and is mapped or destroyed.

Since names have already been registered under IDNA2003 rules, emoji that were not mapped by IDNA2003 had their FE0F removed because it was ignored. Keycaps were lucky because they’re 3 characters and 20E3 was not removed, so they can still be “detected”.

This means FE0F is optional for some inputs, and results in names that can freely mix between emoji and text. This is mostly an historical accident, and maybe it’s good w/r/t homographic attacks, but most of the characters didn’t get this treatment.

  • tmtmtm === ™™™ === ™️™️™️ === tm™️™
  • 111 === 1️1️1️ but =!= 1️⃣1️⃣1️⃣
  • mmm === ⓂⓂⓂ === Ⓜ️Ⓜ️Ⓜ️ === mⓂⓂ️ but =!= 🅜🅜🅜 =!= 🅼🅼🅼

Some emoji like ⁉️ are disallowed in both versions of IDNA but are valid emoji. These can safely be enabled like new emoji, where FE0F is used.

My grandfather suggestion was that any emoji that’s not in a registered name, should also use FE0F going forward, treating it effectively like a new emoji. Implementation-wise, it’s simple: there’s 2 lists, the single character emoji set where FE0F is optional and everything else.

Edit 3: ❶ =!= ➊ (serif vs sans-serif), ֍ =!= ֎ (orientation)


That makes sense. Thanks for clarifying.

It might be helpful to divide this problem into 2 separate components:

  1. Normalization: Technical details for converting a domain into a labelhash.
  2. User experience: What a UI shows the user during ENS interactions.

I consider this distinction similar to the one between UTS#46 (protocol level normalization) and Internationalized Domain Names (IDN) in Google Chrome (user experience guidelines). The normalization piece is largely fixed and immutable, while the user experience can be tweaked depending on the application and/or context.

What do you think about breaking this out into 2 separate goals?

  1. Definition of a strict and precisely defined protocol which converts a unicode string into a labelhash. The protocol should be as simple as possible, and any changes should be backwards compatible.
  2. Recommend standard usability guidelines across platforms (these guidelines can exist on a spectrum depending on the application, and would be amenable to change in the future).

Additionally, I think the “holy grail” for normalization would be an on-chain normalization implementation. Even if it’s only economical with eth_call, and even if it doesn’t completely implement UTS46/IDNA2008; it would allow for an unambiguous version of ENS normalization. If we’re taking the time to firm up normalization requirements, I think we should consider whether this is actually feasible.

From EIP137:

<domain> ::= <label> | <domain> "." <label>
<label> ::= any valid string label per [UTS46](https://unicode.org/reports/tr46/)

I interpret this to mean that . shouldn’t be part of any string that is UTS46 normalized, so UTS46 stop rules would not apply.

Careful that we don’t fall into a false comparison here. The alternative isn’t to break existing names; it’s also possible to just ignore NV8 (the status quo). In my estimation, the proposed “branched normalization” procedure carries a steep cost (almost impractically steep, IMO):

  • doubles the complexity of the ENS normalization process (emoji rules + everything else rules)
  • doubles the complexity of the emoji normalization/validation process (old + new registrations)
  • requires every client maintain a list of old registrations

With stated benefits of:

  • Normalized emoji are fully-qualified
  • ContextJ doesn’t break emoji
  • Compliance with NV8 (does this convey any practical benefit?)
  • Small handful of previously disallowed emoji are now allowed.

I think these same benefits can be achieved more surgically with a few choice additions/deletions of individual IDNA2008 rules and some additional “user display” logic. I’m not completely against your “branched normalization” approach (I am a self-described ENS/emoji enthusiast, after all 🙂), but I’d be very cautious. The devil is in the details, which I don’t think are fully fleshed out yet.

Agreed. I suggested a display name, which I think helps address the 99% situation: you transcribe or copy/paste a name, and the visual appearance after validation being the same as what you typed is a good test, i.e. Normalize, Lookup, Normalize again, Compare. I also think knowing the user’s intention per name is valuable.

For normalization, I think the IDNA 2003 rules are too random: strict on things that should be separate and transparent on things that are obviously malicious. I only realized NV8 wasn’t being used after working through Bidi and ContextO. I blame the Unicode spec.

I also prefer only period as the label separator, but see UTS #46: Unicode IDNA Compatibility Processing.

I think improvements that benefit the user are the best, so not being misled by spoofed names and giving users confidence that names > addrs should be the focus. NV8 is definitely a huge improvement for textual names. I think emoji are a manageable subset of Unicode. Deciding which emoji and non-emoji pictographs are both valid-and-unique is a one-time deal and feasible, hopefully with some community input.


This is where my library is at currently:

I think the following outputs are useful to look at:


I think it’s worth setting some objectives for any change to the normalisation function. In my mind they would go something like this:

  1. A new function must not result in previously valid labels normalising to a different normalised representation, unless it can be demonstrated that there are a negligible number of names affected, and the benefit from the change outweighs the effect on those names. When considering if impact is negligible, the number of names, whether they resolve to anything, and whether they appear to be in active use should all be considered.
  2. A new function may result in previously valid labels becoming invalid only if it can be demonstrated that the affected names are abusive or deceptive (eg, names containing non-meaningful ZWJs).
  3. Where a new function affects the normalisation of an existing name under (1), ENS should register the new normalisation and configure it as a duplicate of the previous name where possible. Where this is not possible, ENS should refund the user any registration costs, and make best efforts to make the user aware of the upcoming change.
  4. Where a new function makes previously valid labels invalid, and there are affected names that aren’t clearly abusive, ENS should refund those users their registration fees, and make best efforts to make the user aware of the upcoming change.
  5. When choosing between simplicity of the normalisation function and preserving existing registrations, preserving existing registrations should be given priority.
  6. Wherever possible, the normalised representation should visually match the most common or familiar form that users will enter or display the name in.
  7. Any normalisation function should avoid introducing visually identical inputs that resolve to different normalised forms (and thus namehashes). Wherever practical, inputs should either all normalise to the same label, or alternate representations should be made invalid.
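
Objectives 1 and 2 imply a repeatable "damage report" over the registered labels. A minimal sketch, assuming the caller supplies the old and new normalization functions as callables and that the new one raises `ValueError` for invalid labels (both conventions are assumptions, not an existing API):

```python
def damage_report(labels, old_normalize, new_normalize):
    """Bucket registered labels by how a new normalization function treats them."""
    report = {"unchanged": [], "changed": [], "invalidated": []}
    for label in labels:
        old = old_normalize(label)  # registered labels are assumed valid here
        try:
            new = new_normalize(label)
        except ValueError:
            report["invalidated"].append(label)  # objectives 2 and 4
            continue
        bucket = "unchanged" if new == old else "changed"  # "changed": objectives 1 and 3
        report[bucket].append(label)
    return report
```

The sizes of the "changed" and "invalidated" buckets, together with whether those names resolve or look actively used, are what would feed into the "negligible impact" judgment.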