Let me break this problem up a bit more:
Why do we need normalization in the first place?
- Casefolding: `A` vs `a`
- Normalized forms: `Å (212B)` vs `Å (0xC5)` or `. (0x2E)` vs `． (0xFF0E)`
- Potential Junk: obviously illegal characters `\[{(%&`, characters that lack display, etc.
- Names need to be identifiable from their context. Consider `"A B.eth"` if spaces were allowed.
- Names also exist in a hierarchy, so they need to decompose to a series of labels with a separator (`.`).
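A minimal sketch of the first few points, using only Python's standard `unicodedata` module. This is not the ENS normalization algorithm; NFC/NFKC and `casefold()` are stand-ins to show the kind of equivalences involved, and `split_labels` is a hypothetical helper name.

```python
import unicodedata

# Casefolding: "A" and "a" collapse to the same string.
assert "A".casefold() == "a"

# Canonical normalization: ANGSTROM SIGN (U+212B) becomes
# LATIN CAPITAL LETTER A WITH RING ABOVE (U+00C5) under NFC.
assert unicodedata.normalize("NFC", "\u212B") == "\u00C5"

# Compatibility normalization: FULLWIDTH FULL STOP (U+FF0E)
# becomes the ordinary full stop (U+002E) under NFKC.
assert unicodedata.normalize("NFKC", "\uFF0E") == "\u002E"


def split_labels(name: str) -> list[str]:
    """Decompose a hierarchical name into labels, using "." as the separator."""
    return name.split(".")


labels = split_labels("sub.raffy.eth")
```

Note that if the fullwidth stop were not normalized first, `split_labels` would see one label where a reader sees two.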
Why does this get complicated?
- Alphabet overload: Latin, Greek, Subscript, Circled, Squared, Math, Full-width, Bold, Italic, Script, etc. There are more than 20 characters that map to the letter `"a"`.
- Emoji Appearance: `⚠ (26A0)` vs `⚠️ (26A0 FE0F)`
- Emoji ZWJ Sequences: `🏴‍☠️ (1F3F4 200D 2620 FE0F)` vs `🏴☠️ (1F3F4 2620 FE0F)` vs `🏴☠ (1F3F4 2620)`
- Unicode evolution: there will obviously be more emoji in Unicode 15 in 2022.
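To make the overload concrete, here is a small sketch in Python. The `lookalikes` table is an illustrative sample I picked, not an exhaustive list, and NFKC plus case folding is only a rough proxy for "maps to the letter a". The second half just spells out the code points of the emoji variants listed above.

```python
import unicodedata

# A few characters that look like, or map to, the Latin letter "a".
# Illustrative sample only; the real set is larger.
lookalikes = {
    "fullwidth": "\uFF41",
    "circled": "\u24D0",
    "subscript": "\u2090",
    "math bold": "\U0001D41A",
    "math italic": "\U0001D44E",
    "cyrillic": "\u0430",  # a different letter that merely looks the same
}


def folds_to_a(ch: str) -> bool:
    """True if NFKC + case folding turns the character into plain "a"."""
    return unicodedata.normalize("NFKC", ch).casefold() == "a"


def codepoints(s: str) -> str:
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(c):X}" for c in s)


# The same pictograph, with and without the emoji presentation selector.
warning_text = "\u26A0"
warning_emoji = "\u26A0\uFE0F"

# The three pirate flag spellings: with ZWJ, without ZWJ, and bare.
pirate_zwj = "\U0001F3F4\u200D\u2620\uFE0F"
pirate_no_zwj = "\U0001F3F4\u2620\uFE0F"
pirate_bare = "\U0001F3F4\u2620"
```

The Cyrillic entry is the interesting one: compatibility normalization does nothing for it, so it has to be caught some other way (for example a script or confusables check).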
What are the actual use-cases:
- Registration: I want to rent `raffy.eth`, is it available?
- Identity: I want to send 1E to `raffy.eth`
How can we improve on this?
I’ve mostly been thinking about the user-side, so I’m just going to discuss the identity aspect.
Identity
Where do names come from?
- Direct input
- Copy/paste
What encoding problems naturally happen?
- DNS uses punycode to represent Unicode
- Some channels don’t support Unicode
- Some channels mutilate Unicode (e.g. this forum, outside of backtick code blocks)
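For reference, this is roughly what the punycode round trip looks like, sketched with Python's built-in `punycode` codec. The helper names `to_ascii_label` and `to_unicode_label` are hypothetical, and the sketch deliberately skips the mapping/validation steps that real IDNA processing applies before encoding.

```python
ACE_PREFIX = "xn--"


def to_ascii_label(label: str) -> str:
    """Encode one label as punycode with the "xn--" prefix, if it is not already ASCII."""
    if label.isascii():
        return label
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def to_unicode_label(label: str) -> str:
    """Decode an "xn--" label back to Unicode; leave other labels untouched."""
    if label.startswith(ACE_PREFIX):
        return label[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return label


encoded = to_ascii_label("bücher")
decoded = to_unicode_label(encoded)
```

The encoded form is safe for ASCII-only channels, but it is unreadable to a human, which is part of why punycode shows up as a warning sign further down.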
What can go wrong?
- Direct input: complex emoji are hard to write/edit, especially with joiners.
- Direct input: mixed directional names are hard to write/edit
- Direct input: input mistakes
- Direct input: transcription confusion
- Copy/paste: homograph attacks
Imagine you’re given a name, either as a placard to transcribe or text to copy/paste:
- If it’s Base36 [a-z0-9], that’s a great sign (trivial case)
- If it’s already normalized, that’s a good sign. Note: this is false for many emoji.
- My suggestion: If it matches the display name, that’s a great sign.
- If it’s Base62, that’s a good sign.
- If it only requires case-folding to reach the normalized name, that’s a good sign.
- If it’s IDNA, that’s an okay sign.
- If it’s a single script per label, that’s an okay sign.
- If it contains ignored characters, that’s a bad sign.
- If it contains disallowed characters, that’s a bad sign.
- If it has punycode, that’s a bad sign.
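Some of the checks in this list are mechanical enough to sketch. The following Python is only an illustration, working on a single label: `normalize` is a stand-in (NFC plus case folding), not the real normalization, and `signals` is a hypothetical helper. The display name, IDNA, single-script, ignored and disallowed checks are left out because they need data tables that are not part of this post.

```python
import re
import unicodedata

BASE36 = re.compile(r"[a-z0-9]+")
BASE62 = re.compile(r"[A-Za-z0-9]+")


def normalize(label: str) -> str:
    # ASSUMPTION: stand-in for the real normalization (NFC + case folding only).
    return unicodedata.normalize("NFC", label).casefold()


def signals(label: str) -> list[str]:
    """Collect the good/bad signs from the checklist that apply to one label."""
    found = []
    if BASE36.fullmatch(label):
        found.append("great: base36")
    if label == normalize(label):
        found.append("good: already normalized")
    elif label.casefold() == normalize(label):
        found.append("good: only needs case folding")
    if BASE62.fullmatch(label):
        found.append("good: base62")
    if label.lower().startswith("xn--"):
        found.append("bad: punycode")
    return found


plain = signals("raffy")
mixed_case = signals("Raffy")
puny = signals("xn--bcher-kva")
```

A wallet could surface the worst signal it finds, or refuse to proceed when any "bad" entry is present. How to weigh them is a policy question this sketch does not answer.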
If you’re sending crypto to someone, why would you ever want to interact with a name that isn’t normalized? It’s basically a red flag.
However, `🏳️🌈.eth` is forever unnormalized because the `FE0F` is stripped by IDNA rules, so it will always trigger this kind of warning. But, if we implement the display name conventions, then this would work!
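Here is the rainbow flag case as a small Python sketch. It models only the one effect described above (dropping `FE0F`); `strip_vs16` and `is_normalized` are hypothetical helpers, not part of any library, and the code point sequence used is the standard rainbow flag (`1F3F3 FE0F 200D 1F308`).

```python
VS16 = "\uFE0F"  # VARIATION SELECTOR-16, requests emoji presentation


def strip_vs16(label: str) -> str:
    # ASSUMPTION: models only the FE0F removal, not full IDNA processing.
    return label.replace(VS16, "")


def is_normalized(label: str) -> bool:
    """A label counts as normalized if normalization leaves it unchanged."""
    return label == strip_vs16(label)


def codepoints(s: str) -> str:
    """Render a string as space-separated hex code points."""
    return " ".join(f"{ord(c):X}" for c in s)


rainbow_flag = "\U0001F3F3\uFE0F\u200D\U0001F308"
stripped = strip_vs16(rainbow_flag)
```

Because the fully-qualified emoji always contains `FE0F`, `is_normalized(rainbow_flag)` can never be true, so an "input equals normalized form" check flags every correctly typed rainbow flag.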