ENS Name Normalization

I started to format the information like an ENSIP (still a WIP). Instead of explaining my implementation in words, I decided to explain the modifications and gotchas first, and then illustrate the tokenized version in pseudo-code as a separate section.

This was my issue with punycode all throughout this thread: UTS-46 specifically includes punycode decoding. I also agree we should remove it, and I will take this as an opportunity to do so.

However the presence of undecoded punycode will still throw an error because:

  • CheckHyphens disallows hyphens at positions 3 and 4 (i.e. the -- in xn--)
  • In unstructured text, there’s no way to know whether xn--ls8h is punycode or a literal
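A minimal sketch (my own illustration, not any library’s code) of why every undecoded punycode label trips this check:

```javascript
// CheckHyphens forbids a hyphen in both the 3rd and 4th character positions,
// and forbids leading/trailing hyphens. Every "xn--" label fails the first rule.
function violatesCheckHyphens(label) {
  return (label[2] === '-' && label[3] === '-')
      || label.startsWith('-')
      || label.endsWith('-');
}
```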

As an ENS holder, and one especially interested in collecting exotic domains (emojis, etc.), I was wondering when it will be possible to incorporate emojis from the Unicode 14.0 standard into the app. These are already in use in the newest iOS and Android versions.

For example:

https://app.ens.domains/search/%F0%9F%AB%A0%F0%9F%AB%A0%F0%9F%AB%A0.eth

The full suite of such emojis can be found here: Emoji Version 14.0 List

Having read the whole thread so far, I want to applaud raffy for his efforts in this matter. His technical understanding of the Unicode standard and its quirks is remarkable. I believe his draft is a much-needed clarification and update to the older ENSIP standard from 2016, and it brings additions to the protocol that would reduce many of the ambiguities present in the current implementation.

I would be very much in favor of raffy being compensated for his extensive work here, and I am very eager to see this draft implemented into the protocol soon.

2 Likes

We should probably remove that check too, if punycode has no special meaning.

Since we don’t recognise punycode, it’s always literal.

1 Like
  • CheckHyphens should just be false then? I see no issue with arbitrary hyphen placement; underscore, for example, will be able to go anywhere. The DNS comment in ENSIP-1 seems to imply that CheckHyphens was intended to be false, as there’s a warning about starting/ending hyphens for DNS compatibility.

  • Should whitespace be ignored instead of being an error? IDNA (with UseSTD3ASCIIRules=true) disallows almost all whitespace. I was writing a section about preprocessing but noticed you can safely replace all whitespace with the empty string. Example: " raffy.eth " == " r affy.eth" == "raffy.eth".

  • I’m writing my ENSIP in terms of modifications to UTS-51 and UTS-46, which makes the algorithm appear somewhat complicated. If we supply a derived file (similar to this one) which contains all of the character sets (valid, mapped, ignored, stops, combining marks, various emoji classes, whitelisted sequences, zwj sequences, etc.), I believe the implementation is pretty straightforward.
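The whitespace point above can be sketched in one line — a preprocessing step that is only safe under the stated assumption that IDNA would reject the whitespace anyway:

```javascript
// Replace every whitespace run with the empty string before normalization.
// Safe only because UseSTD3ASCIIRules=true disallows (almost all) whitespace,
// so no valid name is changed by this step.
const stripWhitespace = name => name.replace(/\s+/g, '');
```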

2 Likes

Yes, it should be false - though as noted, names that violate it may not work via gateways such as eth.link and eth.limo.

It should result in an error - though clients should probably strip leading and trailing whitespace before normalisation for user convenience.

That sounds like a good idea.

1 Like

I’m still uncertain if only catching confusables in the output/normalized form is sufficient but it greatly simplifies the problem. All you need is a mapping from normalized characters to scripts (to determine if a label is single-script), additional disallowed single characters (which are just extra IDNA rules), and a small list of 2-4 length character sequences (~400) that are also disallowed anywhere in the output label.
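A minimal sketch of the single-script test described above. SCRIPT_OF is a hypothetical lookup I made up for illustration; a real implementation would derive it from the Unicode Scripts.txt data file:

```javascript
// Tiny stand-in for a codepoint → script table (normally derived from
// the Unicode Scripts.txt data file).
const SCRIPT_OF = { 0x61: 'Latin', 0x62: 'Latin', 0x430: 'Cyrillic' };

// A label is single-script if all of its (script-bearing) codepoints
// belong to at most one script.
function isSingleScript(label) {
  const scripts = new Set();
  for (const ch of label) {
    const s = SCRIPT_OF[ch.codePointAt(0)];
    if (s) scripts.add(s); // codepoints missing from the toy table are skipped
  }
  return scripts.size <= 1;
}
```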


Here is an ornate example I found where you have confusability on the input (that isn’t covered by UTS-39, because it’s an emoji+text situation) but the outputs are clearly different (using the proposed normalization, without additional logic).

Text-styled regional indicators potentially render as small-cap ASCII:


// default emoji-styled
1F1FA 1F1F8 = "🇺‌🇸" => "🇺🇸" // 2x regional (valid flag sequence)
1F1FA 1F1F9 = "🇺🇹"         // 2x regional (invalid flag sequence)

// explicit emoji styling
1F1FA FE0F 1F1F8 FE0F = "🇺🇸" // same

// explicit text styling
1F1FA FE0E 1F1F8 FE0E = "🇺︎🇸︎" // small-caps ASCII (no flag)

// confusing example
62 1F1F4 FE0E 1F1F8 FE0E 73 = "b🇴︎🇸︎s" vs "boss" // before normalization 
                            = "b🇴🇸s" // after normalization (FE0E is dropped)
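The normalization step assumed in the examples above can be sketched in isolation (this is only the variation-selector-dropping piece, not the full proposed algorithm):

```javascript
// Sketch only: the proposed normalization does much more, but the step that
// collapses the styled variants above is dropping the FE0E/FE0F variation
// selectors.
function dropVariationSelectors(s) {
  return s.replace(/[\uFE0E\uFE0F]/g, '');
}
// After this step, the text-styled and default regional-indicator sequences
// are identical, so the confusable exists only on the input side.
```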


1 Like

Do you have reason to believe it’s insufficient, or just the impression that you may not have explored the entire design space here?

I think the issue boils down to whether the end-user ever sees the normalized form:

  • If I enter R𝔄𝔉𝔉𝔜.eth into app.ens.domains, I get sent to raffy.eth. In this situation, the end-user directly observes the normalized form.

  • If I type R𝔄𝔉𝔉𝔜.eth into Metamask send, I get my eth address and the letters 𝔄𝔉𝔉𝔜 are highlighted red but I do not see the normalized form. With this UX, there are certainly inputs that are confusable that result in outputs that are not.

  • The UX guidelines don’t comment about this.

The primary issue with showing the normalized form is that it mangles some names, which might be confusing to end-users.

I am also unsure how frequently ENS users use/share a valid but non-normalized variant of their ENS name. Some of these permutations may violate the confusable rules, but I have no way of checking (without generating all permutations of registered labels).

  • 10L.eth vs 10l.eth
  • atmo.eth vs A™O.eth (not confusable but you get the idea)

On-input checking would mean normalization throws on names that are confusable in their raw input, before any processing:

  • Need the full set of confusables (not just the casefolded normalized ones)
  • Are XⅩ𝐗𝑋𝑿𝒳𝓧𝔛𝕏𝖃𝖷𝗫𝘟𝙓𝚇 actually confusing if they all map to x? I’d say no.
  • Need to check the output too, as a name might become confusable only after normalization. E.g. Ɑ.eth isn’t confusable (2C6D Latin Capital Letter Alpha) but its normalized form ɑ.eth is.

Here’s an example where on-output confusables will fail: ⒸⒸⒸ.eth =!= ©©©.eth

  • Ⓒ (24B8) normalizes to "c" (63)
  • © (A9 FE0E?) normalizes to "©" (A9)
  • ©️ (A9 FE0F) normalizes to "©" (A9)
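A sketch of the two-sided check this example motivates. Both pieces are stand-ins: NFKC + lowercase approximates the UTS-46 mapping, and the skeleton set is a toy in place of real UTS-39 confusables data:

```javascript
// Stand-in normalizer: NFKC + lowercase approximates the UTS-46 mapping here.
const fold = s => s.normalize('NFKC').toLowerCase();

// Check a label on BOTH sides: raw input and normalized output.
// `skeletons` is a hypothetical set of disallowed confusable forms.
function isSuspicious(label, skeletons) {
  return skeletons.has(label) || skeletons.has(fold(label));
}

// Ⓒ (24B8) folds to ASCII "c" while © (A9) folds to itself, so an
// output-only check compares "ccc" vs "©©©" and misses the input-side clash.
```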

After writing this, I think I need to check both input and output, and revise the numbers I computed above.

2 Likes

Revisiting the confusable stuff: this unpolished report shows, for each script, which input characters correspond to confusables that map to 2+ output characters. The character to the left of the arrow is the normalized form of the group. Each colored character has a tooltip for name, script, and codepoints.

A grouping is red if the mapping changes the script when normalized. If the input label is single-script, then applying normalization will introduce a new script, so the output label will not be single-script (unless every character in the label is of this type).

The default decision would be to disallow all of these characters. Manual exceptions should be made when there is an obvious choice (like ASCII), but the choice isn’t clear in some cases, and I don’t know how to resolve the non-Latin scripts.

Here are a few examples:

  • Latin Small “c” has 2 possibilities: clearly the ASCII c should be allowed and the small-capital C should be disallowed. Likely, most (all?) of the small capitals should be disallowed.

  • Latin Capital “A” with U-thingy has 2 possibilities: I don’t know which one should be canonical; probably disallow both?


Like the circled-C example above, here is another example that would pass on-output filtering (because of IDNA mapping) but would correctly be handled if you check both input and output:


  • "lll" (6C 6C 6C Latin Small L)
  • "Ⅲ" (2162 Roman Numeral Three) normalizes to "iii"
  • Note: i and l won’t confuse because they’re ASCII.
    I’m assuming this is the right choice?
  • LLL.eth, lll.eth, iii.eth are fine
  • Ⅲ.eth would fail (input: fail, output: pass).
  • Note: I don’t know if the owner of iii.eth uses Ⅲ.eth
    I probably would because it’s a single character.
  • It’s possible this confusable should be ignored.
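The Ⅲ → iii mapping above can be reproduced with NFKC plus lowercasing (a rough stand-in for the UTS-46 mapping step, not the actual proposed algorithm):

```javascript
// NFKC decomposes U+2162 ROMAN NUMERAL THREE to "III"; lowercasing gives "iii".
const norm = s => s.normalize('NFKC').toLowerCase();
// norm("\u2162") collides with "iii" only AFTER normalization, which is why
// an output-only confusable check passes Ⅲ.eth.
```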

The above stuff looks insane, but as I said before, it essentially reduces to just a list of disallowed characters and sequences. The hard part is deriving that list.

3 Likes

This is definitely an issue, and clients should show the user the normalised form. You’re right that we need to update our docs to reflect this.

True; however, I think it’s more important to have a single consistent representation of each name that users will always see. This is also consistent with what browsers do - if you enter GOOGLE.COM, for example, it shows you the normalised google.com.

Wouldn’t on-input confusables also fail here, since they’re different characters? I’m also of the opinion that it’s less of an issue on the output, because the user can see what the normalised name looks like and if it matches their expectations.

How (if at all) are these examples handled at present?

Great, that’s what I was thinking as well.

Input: ⒸⒸⒸ vs ©©© isn’t confusable (but it should be)
Output: ccc vs ©©© isn’t confusable.

With on-output logic:

  • Anything that normalizes to ASCII c would be allowed. Any other c-like Latin character is disallowed.
  • Both A’s are lowercased and then disallowed because I decided ăǎ has no preferred choice in Latin.
  • Ⅲ.eth would be allowed because after normalization it’s iii.eth
1 Like

I found another exotic emoji situation but I don’t think it has any consequence because of the ZWJ whitelist.

According to UTS-51, emoji_modifier_sequence = emoji_modifier_base emoji_modifier, where both of those terms are defined as single emoji characters (no styling).

Examples:

  • 261D 1F3FB → ☝ + 🏻 = ☝🏻
  • 1F469 1F3FD 200D 2695 → (👩 + 🏽) + ZWJ + ⚕ = 👩🏽‍⚕ (woman health worker: medium skin tone)

However, Windows allows you to stick an FE0F (followed by an arbitrary amount of FE0E/FE0F) on the base emoji and then an arbitrary amount of FE0E/FE0F on the modifier emoji (including the mismatched case), and renders the sequences as identical:

  • 1F469 FE0F 1F3FD 200D 2695 → 👩️🏽‍⚕
  • 1F469 FE0F 1F3FD FE0E 200D 2695 → 👩️🏽︎‍⚕ (mismatched)
  • 1F469 FE0F 1F3FD FE0F 200D 2695 → 👩️🏽️‍⚕
  • 1F469 FE0F FE0E 1F3FD FE0F 200D 2695 → 👩️︎🏽️‍⚕

If a ZWJ sequence ever started with a modifier, then you could construct a ZWJ sequence that appears as different sequences on different platforms:

  • (👩 + FE0F) + (🏽 + ZWJ + ⚕️)
  • ((👩 + FE0F + 🏽) + ZWJ + ⚕️)

Instead of allowing non-standard emoji styling on the base and modifier, I assert that no whitelisted sequence may start with a modifier and that every base and modifier can be parsed without an FE0F.
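A sketch of that assertion. The MODIFIERS range and the exactly-two-codepoint shape are my own simplifications; a real parser would also verify the emoji_modifier_base property rather than merely "not a modifier":

```javascript
// Skin-tone modifiers U+1F3FB..U+1F3FF (the full emoji_modifier set).
const MODIFIERS = /[\u{1F3FB}-\u{1F3FF}]/u;

// Strict UTS-51 shape: exactly [base, modifier], no FE0E/FE0F anywhere,
// and the sequence must not START with a modifier.
function isStrictModifierSequence(cps) {
  return cps.length === 2
      && cps[0] !== 0xFE0E && cps[0] !== 0xFE0F
      && cps[1] !== 0xFE0E && cps[1] !== 0xFE0F
      && !MODIFIERS.test(String.fromCodePoint(cps[0])) // no modifier-first
      && MODIFIERS.test(String.fromCodePoint(cps[1]));
}
```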

4 Likes

Here are the most recent files for a release without single-script input/output confusables.

4 Likes

In that latest error report, what does “diff-norm” mean? I see it’s all the punycode literals like xn--.... I’m assuming those names would also be considered invalid and fail normalization with your library?

2 Likes

Also, bringing this up because of ens-metadata-service #71, but I see that ⌐ {2310} is not allowed.

I see examples of that in the adraffy-error section, like:

⌐◨‒◨	{2310}{25E8}{2012}{25E8}
⌐◨−◨	{2310}{25E8}{2212}{25E8}

However, there are at least two other such registered names:

⌐◨‐◨	{2310}{25E8}{2010}{25E8}
⌐◨-◨	{2310}{25E8}{002D}{25E8}

If ⌐ is not allowed, why aren’t they in the error report?

2 Likes

eth-ens-namehash and ens-normalize both normalize without error but produce different outputs.

Nick decided that puny should be ignored and CheckHyphens was/is false so puny literals (Lower ASCII + Hyphen) should pass as-is.
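Under those two decisions, a punycode literal is just an ordinary ASCII label; a trivial sketch of the resulting acceptance rule for such labels (my own illustration, not either library’s code):

```javascript
// With no punycode decoding and CheckHyphens=false, "xn--ls8h" is treated
// like any other lower-ASCII + hyphen label and passes through unchanged.
const isAsciiHyphenLabel = label => /^[a-z0-9-]+$/.test(label);
```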

I will have a section in the ENSIP about DNS compatibility.

I also think there should be a tool that determines whether your name can be shuttled onto DNS (using the latest standards) and whether punycode is required.

They’re both invalid in the demo, so I must not have a record of them. I am using a set derived from a list Nick provided, plus capturing NameRegistered events since then. Can someone provide me the full set of registered names as of today?

3 Likes

I think that as long as we ask clients to display the normalised name this should be okay.

Can ăǎ not normalise to themselves? They’re valid (non-ASCII) characters.

1 Like

Sure, but I don’t know where you’re drawing the line then: ăǎ (103 vs 1CE) look nearly identical to me. (Unicode Confusable)

1 Like

Another weird one (probably not important): I can’t find a single example of ContextJ (allowing ZWNJ/ZWJ in specific circumstances) beyond the rule existing in UTS-46 and RFC 5892. There must exist names that require it in Arabic/Indic/Persian. Supposedly it induces a cursive rendering of the joined characters? It’s unclear to me if it alters the interpretation.

However, if the character has no joining form, does not support cursive shaping, or font support is poor, then the ZWJ renders invisibly. I am now confused why UTS-46/IDNA makes such a big deal about this situation.

The following names are distinct yet look pixel-identical on all devices I own:

  • a्‍.eth vs a्.eth (61 94D 200D) (Latin+Devanagari)
    • a्‍a् (61 94D 200D 61 94D) renders like two different-sized a's.
  • ऄ् vs ऄ्‍ (904 94D 200D) (only Devanagari)
1 Like

Fair enough, I guess we don’t have a principled way to distinguish these.

1 Like