ENS Name Normalization

raffy · December 19, 2021, 1:06am

Let me break this problem up a bit more:

Why do we need normalization in the first place?

Casefolding: A vs a
Normalized forms: Å (212B) vs Å (0xC5) or . (0x2E) vs ．(0xFF0E)
Potential Junk: obviously illegal characters [{(%&, characters that lack display, etc.
Names need to be identifiable from their context. Consider "A B.eth" if spaces were allowed.
Names also exist in a hierarchy, so they need to decompose to a series of labels with a separator (.).
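To make the first two points concrete, here is a minimal Python sketch. The helpers `fold` and `split_labels` are hypothetical, and NFKC plus casefolding is only a stand-in for illustration, not the actual ENS normalization algorithm.

```python
import unicodedata

def fold(label: str) -> str:
    """Illustrative stand-in: NFKC, then casefold. Not the real ENS rules."""
    return unicodedata.normalize("NFKC", label).casefold()

def split_labels(name: str) -> list:
    """Fold the whole name, then decompose it into labels on '.'."""
    return fold(name).split(".")

# Each pair differs as raw text but agrees after folding.
print("A" == "a", fold("A") == fold("a"))                      # False True
print("\u212B" == "\u00C5", fold("\u212B") == fold("\u00C5"))  # False True
print("\uFF0E" == ".", fold("\uFF0E") == fold("."))            # False True

# A full-width separator still yields the expected hierarchy.
print(split_labels("Raffy\uFF0EETH"))  # ['raffy', 'eth']
```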

Why does this get complicated?

Alphabet overload: Latin, Greek, Subscript, Circled, Squared, Math, Full-width, Bold, Italic, Script, etc. There are more than 20 characters that map to the letter "a".
Emoji Appearance: ⚠ (26A0) vs ⚠️ (26A0 FE0F)
Emoji ZWJ Sequences: 🏴‍☠️ (1F3F4 200D 2620 FE0F) vs 🏴☠️ (1F3F4 2620 FE0F) vs 🏴☠ (1F3F4 2620)
Unicode evolution: there will obviously be more emoji in Unicode 15 in 2022.
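A quick way to see both the alphabet overload and the emoji variants is to dump code points. This is only a sketch: `codepoints` is a hypothetical helper, and the handful of look-alikes below is a small sample I picked, not the full set.

```python
import unicodedata

def codepoints(text: str) -> str:
    """Render a string as space-separated uppercase hex code points."""
    return " ".join(f"{ord(ch):X}" for ch in text)

# A few of the many characters that compatibility-normalize to "a":
# full-width, circled, math bold, feminine ordinal, subscript.
lookalikes = ["\uFF41", "\u24D0", "\U0001D41A", "\u00AA", "\u2090"]
for ch in lookalikes:
    print(codepoints(ch), "->", unicodedata.normalize("NFKC", ch))

# Cyrillic "а" (U+0430) looks the same but is NOT mapped to Latin "a".
print(unicodedata.normalize("NFKC", "\u0430") == "a")  # False

# The same visible emoji can be several different strings.
print(codepoints("\u26A0"))                        # 26A0
print(codepoints("\u26A0\uFE0F"))                  # 26A0 FE0F
print(codepoints("\U0001F3F4\u200D\u2620\uFE0F"))  # 1F3F4 200D 2620 FE0F
```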

What are the actual use-cases:

Registration: I want to rent raffy.eth, is it available?
Identity: I want to send 1E to raffy.eth

How can we improve on this?

I’ve mostly been thinking about the user-side, so I’m just going to discuss the identity aspect.

Identity

Where do names come from?

Direct input
Copy/paste

What encoding problems naturally happen?

DNS uses punycode to represent Unicode
Some channels don’t support Unicode
Some channels mutilate Unicode (this forum, outside of code blocks)
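For reference, this is roughly what the punycode round trip looks like using Python's built-in `punycode` codec. The helper names `to_ace` and `from_ace` are made up for this sketch, and it handles a single label only, with none of the IDNA mapping or validation steps.

```python
ACE_PREFIX = "xn--"

def to_ace(label: str) -> str:
    """Encode one Unicode label as an ASCII 'xn--' label (no IDNA mapping)."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")

def from_ace(label: str) -> str:
    """Decode an 'xn--' label back to Unicode; pass other labels through."""
    if label.lower().startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label

print(to_ace("b\u00FCcher"))      # xn--bcher-kva
print(from_ace("xn--bcher-kva"))  # bücher
```

The encoded form survives ASCII-only channels, but a person reading `xn--bcher-kva` has no idea what it will display as.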

What can go wrong?

Direct input: complex emoji are hard to write/edit, especially with joiners.
Direct input: mixed directional names are hard to write/edit
Direct input: input mistakes
Direct input: transcription confusion
Copy/paste: homograph attacks

Imagine you’re given a name, either as a placard to transcribe or text to copy/paste:

If it’s Base36 [a-z0-9], that’s a great sign (trivial case)
If it’s already normalized, that’s a good sign. Note: this is false for many emoji.
My suggestion: If it matches the display name, that’s a great sign.
If it’s Base62, that’s a good sign.
If it only requires case-folding to reach the normalized name, that’s a good sign.
If it’s IDNA, that’s an okay sign.
If it’s a single script per label, that’s an okay sign.
If it contains ignored characters, that’s a bad sign.
If it contains disallowed characters, that’s a bad sign.
If it has punycode, that’s a bad sign.
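A few of these checks are easy to sketch in Python. Everything here is an assumption for illustration: `toy_normalize` (NFKC, casefold, drop format characters) is a crude stand-in for real normalization, Unicode category `Cf` is used as a rough proxy for "ignored characters", and the display-name, IDNA, and single-script checks are left out because they need data tables that the standard library does not have.

```python
import re
import unicodedata

BASE36 = re.compile(r"[a-z0-9]+")
BASE62 = re.compile(r"[A-Za-z0-9]+")

def toy_normalize(label: str) -> str:
    """Stand-in only: NFKC, casefold, then drop format (Cf) characters."""
    folded = unicodedata.normalize("NFKC", label).casefold()
    return "".join(ch for ch in folded if unicodedata.category(ch) != "Cf")

def signals(label: str) -> list:
    """Collect the good and bad signs for one label, in checklist order."""
    found = []
    if BASE36.fullmatch(label):
        found.append("great: base36")
    elif BASE62.fullmatch(label):
        found.append("good: base62")
    if label == toy_normalize(label):
        found.append("good: already normalized")
    elif label.casefold() == toy_normalize(label):
        found.append("good: only needs case-folding")
    if any(unicodedata.category(ch) == "Cf" for ch in label):
        found.append("bad: contains an ignored (format) character")
    if label.lower().startswith("xn--"):
        found.append("bad: punycode")
    return found

print(signals("raffy"))     # ['great: base36', 'good: already normalized']
print(signals("Raffy"))     # ['good: base62', 'good: only needs case-folding']
print(signals("a\u200Db"))  # ['bad: contains an ignored (format) character']
```

Returning every signal instead of a single verdict keeps the policy decision (warn, block, or allow) with the wallet or app.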

If you’re sending crypto to someone, why would you ever want to interact with a name that isn’t normalized? It’s basically a red flag.

However, 🏳️‍🌈.eth is forever unnormalized because the FE0F is stripped by IDNA rules, so it will always trigger this kind of warning. But, if we implement the display name conventions, then this would work!
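The FE0F problem is easy to reproduce. This sketch models only the single rule mentioned above (dropping U+FE0F), not the full IDNA mapping, and the function names are hypothetical.

```python
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation

# Rainbow flag: white flag + VS16 + zero-width joiner + rainbow.
RAINBOW_FLAG = "\U0001F3F3\uFE0F\u200D\U0001F308"

def strip_vs16(text: str) -> str:
    """Remove every U+FE0F, as the stripping rule described above would."""
    return text.replace(VS16, "")

def flagged_as_unnormalized(text: str) -> bool:
    """A naive 'is it already normalized?' check under that one rule."""
    return strip_vs16(text) != text

print(flagged_as_unnormalized(RAINBOW_FLAG))              # True
print(flagged_as_unnormalized(strip_vs16(RAINBOW_FLAG)))  # False
print(len(RAINBOW_FLAG), len(strip_vs16(RAINBOW_FLAG)))   # 4 3
```

So the form people actually type or paste always trips the check, even though it is the legitimate way to write that emoji.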