ENS Name Normalization

I’d prefer that we continue to normalise any characters that are confusable with regular letters to their canonical form. Emoji are trickier - you’ve pointed out how subtle some skin color changes are, particularly in small sizes - but we should still aim to avoid having identical-appearing names with different encodings.

3 Likes

I think I got it working but I haven’t run a damage report yet. Latest version can be played on my ENS Resolver demo or on my github.

It’s basically running UTS-51 in front of IDNA2008, peeling off emoji into tokens, so they aren’t transformed. The implementation is pretty close to what I described above. It is also using ContextO.

So far I’ve noticed that 6️⃣9️⃣.eth = 36 fe0f 20e3 39 fe0f 20e3norm(6️⃣9️⃣.eth) = 36 20e3 39 20e3norm(norm(6️⃣9️⃣.eth)) => disallowed. This is because 20E3 is not allowed in IDNA2008 but the loss of the FE0F means the UTS-51 emoji criteria fails, but this could be fixed.

One advantage of being able to detect emoji sequences is that I can represent them as proper sequences:
ENSTokenized

2 Likes

Can you elaborate on why this is the case? Shouldn’t this be treated as two emoji sequences and thus not transformed at all?

  • $DIGIT FE0F 20E3 is the emoji sequence for a keycap, where $DIGIT = /#*0-9/.
  • $DIGIT 20E3 is unqualified.
  • IDNA2003($DIGIT 20E3) = $DIGIT 20E3
  • IDNA2008($DIGIT 20E3) => disallowed 20E3

I fixed this by just making the FE0F optional.


The damage report for UTS-51 + IDNA2008 is actually pretty mild.

One big gray area are the set of characters that are allowed by IDNA2003, disallowed by IDNA2008, but not emoji. I think they mostly correspond to the pictographs. I’m currently looking through them.

'1F150..1F169' // NEGATIVE CIRCLED
'1F16D..1F18F' // NEGATIVE SQUARED 
'1F191..1F1AC' // SQUARED

Edit: Some of these seem really dangerous and should be disallowed:

  • LEFT PARENTHESIS EXTENSION: "⎜" 239C
  • DIGRAM FOR GREATER YANG: "⚌" 268C
1 Like

But why does the former normalise to the latter? I thought this new procedure identified emoji sequences and preserved them?

Edit I guess the goal was to have a framework that does it the correct way, and then try to relax it so it fits as many of the registered names as possible.

I thought the keycaps could be fixed using my method, but since some are already registered, it has to use the unqualified form. Both the * and # keycaps can use the FE0F.


Edit 2: Let me explain that a bit more and summarize:

If we use UTS-51 + IDNA2008, my suggestion would be anytime ENS wants to enable new emoji (ie. Unicode updates that add more emoji), the new characters should all normalize with FE0F attached when applicable. Both to preserve the intention (this an emoji) and avoid unqualified representations. (According to the spec, all emoji keyboards produce fully-qualified emoji.)

If there’s no FE0F, it goes though IDNA2008 is mapped or destroyed.

Since names have already been registered under IDNA2003 rules, emoji that were not mapped by IDNA2003 had their FE0F removed because it was ignored. Keycaps were lucky because they’re 3 characters and 20E3 was not removed, so they can still be “detected”.

This means FE0F is optional for some inputs, and results in names that can freely mix between emoji and text. This is mostly an historical accident, and maybe it’s good w/r/t homographic attacks, but most of the characters didn’t get this treatment.

  • tmtmtm === ™™™ === ™️™️™️ === tm™️™
  • 111 === 1️1️1️ but =!= 1️⃣1️⃣1️⃣
  • mmm === ⓂⓂⓂ === Ⓜ️Ⓜ️Ⓜ️ === mⓂⓂ️ but =!= 🅜🅜🅜 =!= 🅼🅼🅼

Some emoji like ⁉️ are disallowed in both versions of IDNA but are valid emoji. These can safely be enabled like new emoji, where FE0F is used.

My grandfather suggestion was that any emoji that’s not in a registered name, should also use FE0F going forward, treating it effectively like a new emoji. Implementation-wise, it’s simple: there’s 2 lists, the single character emoji set where FE0F is optional and everything else.

Edit 3: ❶ =!= ➊ (serif vs san-serif), ֍ =!= ֎ (orientation)

2 Likes

That makes sense. Thanks for clarifying.

It might be helpful to divide this problem into 2 separate components:

  1. Normalization: Technical details for converting a domain into a labelhash.
  2. User experience: What a UI shows the user during ENS interactions.

I consider this distinction similar to the one between UTS#46 (protocol level normalization) and Internationalized Domain Names (IDN) in Google Chrome (user experience guidelines). The normalization piece is largely fixed and immutable, while the user experience can be tweaked depending on the application and/or context.

What do you think about breaking this out into 2 separate goals?

  1. Definition of a strict and precisely defined protocol which converts a unicode string into a labelhash. The protocol should be as simple as possible, and any changes should be backwards compatible.
  2. Recommend standard usability guidelines across platforms (these guidelines can exist on a spectrum depending on the application, and would be amenable to change in the future).

Additionally, I think the “holy grail” for normalization would be an on-chain normalization implementation. Even if it’s only economical with eth_call, and even if it doesn’t completely implement UTS46/IDNA2008; it would allow for an unambiguous version of ENS normalization. If we’re taking the time to firm up normalization requirements, I think we should consider whether this is actually feasible.

From EIP137:

<domain> ::= <label> | <domain> "." <label>
<label> ::= any valid string label per [UTS46](https://unicode.org/reports/tr46/)

I interpret this to mean that . shouldn’t be part of any string that is UTS46 normalized, so UTS46 stop rules would not apply.

Careful that we don’t fall into a false comparison here. The alternative isn’t to break existing names; it’s also possible to just ignore NV8 (the status quo). In my estimation, the proposed “branched normalization” procedure carries a steep cost (almost impractically steep, IMO):

  • doubles the complexity of the ENS normalization process (emoji rules + everything else rules)
  • doubles the complexity of the emoji normalization/validation process (old + new registrations)
  • requires every client maintain a list of old registrations

With stated benefits of:

  • Normalized emoji are fully-qualified
  • ContextJ doesn’t break emoji
  • Compliance with NV8 (does this convey any practical benefit?)
  • Small handful of previously disallowed emoji are now allowed.

I think these same benefits can be achieved more surgically with a few choice additions/deletions of individual IDNA2008 rules and some additional “user display” logic. I’m not completely against your “branched normalization” approach (I am a self-described ENS/emoji enthusiast, after all :slight_smile: ), but I’d be very cautious. The devil is in the details, which I don’t think are fully fleshed out yet.

3 Likes

Agreed. I suggested a display name which I think helps address the 99% situation— you transcribe or copy/paste a name the visual appearance after validation being the same as what you typed—is a good test. ie. Normalize, Lookup, Normalize again, Compare. I also think knowing the users intention per name is valuable.

For normalization, I think the IDNA 2003 rules are too random: strict on things that should be separate and transparent the things that are obviously malicious. I only realized NV8 wasn’t being used after working through Bidi and ContextO. I blame the Unicode spec.

I also prefer only period as the label separator but UTS #46: Unicode IDNA Compatibility Processing

I think improvements that benefit the user are the best, so not being mislead by spoofed names and giving users confidence that names > addrs, should be the focus. NV8 is definitely a huge improvement for textual names. I think emoji are a manageable subset of Unicode. Deciding which emoji and non-emoji pictographs are both valid-and-unique is a one-time deal and feasible, hopefully with some community input.


This is where my library is at currently:

  • IDNATestV2: a few tests fail now that I’ve disallowed alternative stops. I’m also playing with some of the emoji and pictograph rules.

  • I implemented NFC myself because String.normalize fails some tests: NormalizationTest

  • I’ve moved the live demo to github hosting so I can iterate faster without publishing to npm: ENS Resolver Edit: fixed url

  • I think the core functions are pretty literal but it’s still a WIP: Tokenized IDNAContextJ+OUTS-46 w/BidiUTS-51 Emoji Logic

  • The Bidi, Context, IDNA, NFC, and Emoji units are all independent.

I think the following outputs are useful to look at:

2 Likes

I think it’s worth setting some objectives for any change to the normalisation function. In my mind they would go something like this:

  1. A new function must not result in previously valid labels normalising to a different normalised representation, unless it can be demonstrated that there are a negligible number of names affected, and the benefit from the change outweighs the effect on those names. When considering if impact is negligible, the number of names, whether they resolve to anything, and whether they appear to be in active use should all be considered.
  2. A new function may result in previously valid labels becoming invalid only if it can be demonstrated that the affected names are abusive or deceptive (eg, names containing non-meaningful ZWJs).
  3. Where a new function affects the normalisation of an existing name under (1), ENS should register the new normalisation and configure it as a duplicate of the previous name where possible. Where this is not possible, ENS should refund the user any registration costs, and make best efforts to make the user aware of the upcoming change.
  4. Where a new function makes previously valid labels invalid, and there are affected names that aren’t clearly abusive, ENS should refund those users their registration fees, and make best efforts to make the user aware of the upcoming change.
  5. When choosing between simplicity of the normalisation function and preserving existing registrations, preserving existing registrations should be given priority.
  6. Wherever possible, the normalised representation should visually match the most common or familiar form that users will enter or display the name in.
  7. Any normalisation function should avoid introducing visually identical inputs that resolve to different normalised forms (and thus namehashes). Wherever practical, inputs should either all normalise to the same label, or alternate representations should be made invalid.
6 Likes

I think those objectives seem reasonable. And I agree the decisions should hinge on concrete numbers and the damage caused to existing names.

1 Like

Well, I was just about to post that I had it all figured out, and then I was curious what would happen if I only gave my library UTS-51 rules and no UTS-46 rules (this should be a perfect emoji parser), and I ran into an issue with ZWJ when testing it:

💩💩 => valid
💩{200D}{200D}💩 => invalid
💩{200D}💩 => valid, but different than above
// invisible poo joiner!

// note: these are separate and already registered:
😵‍💫 = {1F635}{200D}{1F4AB}
😵💫 = {1F635}{1F4AB} 

// and even weirder, since "1" is an emoji
// this is also valid:
1{200D}1

I don’t see anyway of differentiating emoji_zwj_sequence (legal ZWJ sequences) from whats called the RGI_Emoji_ZWJ_Sequence (recommended for general interchange) without pinning the allowable sequences to a whitelisted set of permutations: https://adraffy.github.io/ens-normalize.js/build/unicode-raw/emoji-zwj-sequences.txt

Additionally, how should we treat deliberately text-styled emoji like 💩︎ = 💩{FE0E}?

// note: FE0E and FE0F are ignored

// (A) these should be the same
// (text styling without a joiner)
// each is 2 glyphs
😵︎💫 = {1F635}{FE0E}{1F4AB}
😵︎💫️ = {1F635}{FE0E}{1F4AB}{FE0F}
😵💫︎ = {1F635}{1F4AB}{FE0E}
😵︎💫︎ = {1F635}{FE0E}{1F4AB}{FE0E}

// (B) these should be the same (ZWJ sequence)
// but different from above:
// each is 1 glyph
😵‍💫 = {1F635}{200D}{1F4AB}
😵️‍💫 = {1F635}{FE0F}{200D}{1F4AB}

// but with deliberate text styling,
// the ZWJ sequence is supposed to terminate!
// (how these render varies per platform)
(1) 😵︎‍💫 = {1F635}{FE0E}{200D}{1F4AB} => technically invalid
(2) 😵︎‍💫️ = {1F635}{FE0E}{200D}{1F4AB}{FE0F} => technically invalid
// this one is okay, no joiner
(3) 😵︎💫︎ = {1F635}{FE0E}{1F4AB}{FE0F} => valid 

// depending on the Unicode implementation,
// the (1) edits as either:
(Edit1) [text-styled 1F635] + [200D] + [1F4AB] // 3 glyphs
(Edit2) [text-styled 1F635 + 200D + 1F4AB]     // 1 glyph
// (Edit1) and (Edit2) should correspond to (A) or (B) above

If we use the RGI_Emoji_ZWJ_Sequence whitelist, this problem is solved because no text-styled emoji appear in those sequences. Which would imply that text-styled emoji terminate, and are equivalent to their unstyled or emoji-styled versions: 💩︎ == 💩️{FE0F} == 💩(default style).

Edit: I believe my latest version gets everything correct w/ the exception of not knowing which emoji/characters should be enabled/disabled. It correctly handles the examples above.

Characters that should probably change:

  • Currency symbols: $, ¢, £, ¥, €, ₿, ... These are disallowed, they should be allowed.
  • Checkmarks: ❌❌︎🗙🗴🗵🗶🗷✔️✔🗸🗹 These are allowed, they should be consolidated.
  • Negative Circled (Serif vs San-serif): ❻ ➏ these are allowed, they should be equivalent.
  • Double vs Single Circled Digits: ➀ vs ⓵ these are allowed, they should be equivalent.
  • Emoji Tags: 🏴󠁧󠁢󠁥󠁮󠁧󠁿 1F3F4 E0067 E0062 E0065 E006E E0067 E007F: these are allowed, they should be ignored as they hide arbitrary data.

Reasonable but Non-RGI ZWJ Sequences:

Not sure:

  • Variations of this guy: ༼つ◕o◕༽つ, ಠ‿ಠ, └།๑益๑།┘
  • Blocks: ▁▂▃▄▅▆▇█ If these are allowed, why is _ disallowed? And what’s this? 🗕
  • Math Symbols: many seem very cool and unique, but need individually reviewed.
  • Abstract shapes

These lists are incomplete.

Edit: I’ve computed the 75 ZWJ sequences that are non-RGI but show up on emojipedia as JSON. I’ve also included them into the Emoji report.

4 Likes

I respect you! I want to resemble him.

2 Likes

What if we do limit it to the whitelisted set? We can expand the set over time without fear of breaking existing names.

I assume by ‘ignored’ you mean they should normalise to the non-tagged version?

Underscore should not be disallowed, I think. It’s theoretically disallowed in DNS, but in practice is used for a bunch of special-purpose stuff (dmarc, SPF etc).

1 Like

Yes.

After some thought, the presence of an emoji tag sequence should simply terminate the emoji parsing, rather than consuming the tag and ignoring it. The tag sequence would then be processed (and rejected) by IDNA 2008. Ignoring is bad because you need to differentiate Flag from Flag+TagSequence since Flag could combine with something else.

There are only (3) RGI tag sequences in the Unicode set (each following a black flag emoji.) Maybe they should be whitelisted, as 🏴󠁧󠁢󠁥󠁮󠁧󠁿.eth is currently owned (but vulnerable to spoofing on nonsupporting platforms as the tag sequence renders invisibly.)

So the whitelist logic would be:

SEQ = list of allowed complete sequences (3 RGI)
ZWJ = list of allowed ZWJ sequences (1349 RGI + 0-75 non-RGI)

  1. Find the longest SEQ that exactly matches the characters.
  2. If it exists, produce an emoji token and goto 1 (this handles the flag + tag sequences)
  3. Parse the characters according to UTS-51, where ZWJs can join if they form a whitelisted sequence.
  4. If an emoji was found, produce and emoji token and goto 1.
  5. Parse the character according to UTS-46 and goto 1.

With this logic, an unsupported ZWJ sequence will terminate before a ZWJ, which will then go through UTS-46 and get rejected by ContextJ. An unsupported SEQ sequence is just parsed normally by UTS-51 and UTS-46 (and likely rejected.)


I updated my library to support this logic. I also whitelisted the 3 RGI tag sequences and added some non-RGI ZWJ sequences as a test.

The auto-generated ZWJ and SEQ whitelists and my addition.

1 Like

Is this tied into why pfps don’t display right on emoji domains even when they are set with an avatar to display? The avatars don’t display I noticed

1 Like

Can you provide a specific example?

@mdt helped me find the answer in another thread, thank you! @raffy