ENS Name Normalization 2nd

To avoid that issue of collision with the full-stop, we should enforce:

  • Can’t touch another 201A

  • Can’t start or end a label

  • Can’t touch an emoji

  • Must be between digits (see the sketch below)

Mapping 002C to 201A is ideal.
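Here is a rough sketch of how those placement rules could be checked once 002C has been mapped to 201A. It is only illustrative: the digit and emoji tests below are simplified stand-ins, not the normalizer's actual tokenizer.

```js
// Hypothetical U+201A placement check, applied per label after the 002C -> 201A mapping.
const SEP = '\u201A'; // single low-9 quotation mark

function checkSeparators(label) {
  const chars = [...label];
  chars.forEach((ch, i) => {
    if (ch !== SEP) return;
    const prev = chars[i - 1], next = chars[i + 1];
    if (prev === undefined || next === undefined) throw new Error('cannot start or end a label');
    if (prev === SEP || next === SEP) throw new Error('cannot touch another 201A');
    if (/\p{Emoji_Presentation}/u.test(prev) || /\p{Emoji_Presentation}/u.test(next))
      throw new Error('cannot touch an emoji');
    if (!/[0-9]/.test(prev) || !/[0-9]/.test(next)) throw new Error('must be between digits');
  });
}

checkSeparators('10\u201A000');    // ok
// checkSeparators('\u201A10000'); // throws: cannot start or end a label
```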

I split the Different Norms report into categories based on the type of mapping required (Arabic, Hyphen, etc.). If a name required more than one technique, I put it in the “everything else” bucket. (For example, a name could have both an incorrect hyphen and apostrophe – fortunately, most only have a single issue.) I removed the trivial differences (like ASCII casing).

Quick overview:

  • 42% are Arabic digits
  • 26% are improperly normalized emoji
  • 12% are wrong hyphen
  • 11% are wrong apostrophe
  • 3% are circled/squared ASCII

I’m not particularly fond of the circled/double-circled/negative-circled mappings (I’ve mentioned this before) but I’m unsure what should be done. IDNA only maps the circled letters (Ⓐ) and leaves the rest alone. I decided to map all of the remaining types except for the Negative Squared digits (which overlap with emoji). Maybe they should just be disallowed instead? Having two sets of negative circled digits is clearly wrong. Negative Circled and Negative Squared letters are very similar.

The actual mapping rules can be found here: chars-mapped.js





Regarding #4 in the list of Disallowed Characters: 201A ( ‚ ) SINGLE LOW-9 QUOTATION MARK should be allowed as Valid.

In English, we use commas to organize numbers greater than 999. We use a comma every third digit from the right.

  • More than 50,000 people turned up to protest.

The comma every third digit is sometimes known as a “thousands-separator.” Make sure you don’t include a space on either side of this comma.

Correct:

  • We will walk 10,000 miles.

Incorrect:

  • We will walk 10000 miles.
  • We will walk 10, 000 miles.
  • We will walk 10 , 000 miles.
  • We will walk 10 ,000 miles.

Reference Source
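As a quick sanity check, standard locale formatting produces exactly this grouping (a comma every third digit, no spaces); a one-line illustration assuming the en-US locale:

```js
// Built-in en-US number formatting uses the thousands-separator described above.
console.log((10000).toLocaleString('en-US')); // "10,000"
console.log((78000).toLocaleString('en-US')); // "78,000"
```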

Why 201A for ENS?

  • Mapping 002C (the keyboard comma) to 201A would be the most feasible option

  • We would see additional growth in ENS, following the prime example of the 10K Club & 100K Club

  • The 10K/100K Club .eth domains, to me, are related to real-estate property numbers and zip codes due to my prior property-management experience. Respectfully, there could be a whole topic illustrating a name like 8,345.eth, which creates a need for a more detailed explanation.

  • 8,345.eth would be an ideal digital-identity .eth name in particular because of how the digit comma separator precisely organizes the thousands, millions, and so on.
    8345.eth, to me, is the representative example of an address of a certain real-estate property.

  • 78000.eth is another example of a number sequence that has an established meaning in reality. It has more of a zip-code feel than a digital identity name.
    78,000.eth has a more calm and collected style if I were to personify myself as that identity.

  • All in all, the addition of symbol 201A will be very beneficial to the ENS Ecosystem and the DAO.

1. This would bring in more registration revenue for ENS.
2. Allows long runs of digits to be easily readable, making it easier to send funds.
3. It would give beginners a good access point into the ENS ecosystem and a good opportunity to become part of it. For example, we saw a huge spike in new users participating in the ENS ecosystem when the 10K Club/100K Club came into existence.
4. Allows for a better intrinsic experience when utilizing the digit comma domain as a digital identity.
5. A better overall experience for the user would be granted by adding 201A when digits are used as normalized domains.


Hmm, I generally disagree with allowing comma or comma-likes as I think there could be too many potential conflicts out there. Mapping is also something that should not be taken lightly, as you can’t unmap, not without making breaking changes.

Not every character used in everyday human speech makes sense to use in a global naming system like this. And registration revenue does not matter here, I don’t think that should factor in at all when it comes to creating a robust set of normalization rules.


Totally agree with not allowing it


Once again, I apologize for the slow turnaround.

The spec has been pretty stable and I’m satisfied with how it functions. I’ve had time to review the breakdown reports and I think they look reasonable.

A few months ago, an earlier version of ens-normalize.js (lacking validation/confusables) was incorporated into ethers.js. Marketplaces like ens.vision already use ens-normalize.js. I’ve been monitoring daily registrations and have seen a pretty large reduction in invalid names.

I’ve received many DMs with questions and concerns and appreciate all the feedback.


I’ve updated my eth-ens-namehash branch with the latest ens-normalize.js logic. I decided to only include the minimal code necessary (rather than all of the derivation and validation stuff.) I updated the PR too.

Everything else can be found in my ens-normalize.js repo.

At the moment, my ENSIP document is still unfinished. I’ve been struggling to concisely describe everything. Possibly the formatting style I chose—nested (un)ordered lists instead of paragraphs or pseudo-code—is just too limiting. I just need to complete the section on how I resolve confusables and then I’ll publish it.

I don’t exactly know what the next steps are in this process but I think there’s enough code, tests, and reports to get the ball rolling.


Here are some notes to help pinpoint an area of disagreement:

  • I don’t think any of my emoji choices are controversial. AFAIK ens-normalize.js is the only normalization library which enforces correct emoji sequencing. I’m not aware of any exploits or spoofs beyond emoji that actually look similar.

  • I think my hyphen changes have the correct balance of reducing confusables but providing good UX to users.

  • I think _ and $ are great additions. I allowed all modern currency symbols. The correct Ethereum symbol is Greek Xi. The other triple-bar characters are confusable.

  • I disabled Braille, Sign Writing, Linear A, and Linear B scripts.

  • I disabled all Combining Characters, Box Drawings, Di/Tri/Tetra/Hex-Grams, Small Capitals, Turned Characters, Musical Symbols, and other esoteric characters.

  • I disabled many formatting characters like digraphs and ligatures.

  • I disabled nearly all punctuation and phonetic characters.

  • I heavily curated the allowed combining marks in Latin-like scripts. Not all exemplar recommendations are allowed. I used prior registrations and scripted dictionaries to decide the limit on the number of adjacent combining marks, however native users might have different opinions.

  • I disabled many obsolete, deprecated, and archaic characters.

  • For non-emoji symbols, my convention was always to choose the heavy variant if available, rather than have 5+ differently-weighted variants of the same symbol.

  • I merged upper- and lowercase confusables. For example, scripts with a Capital G-like character (with no lower-case equivalent) confuse with Latin g since Latin G is casefolded to g.

  • I’m aware there are multiple-character confusables but I’m not aware of a reasonably exhaustive list of known cases. For the ones I’m aware of, I don’t think the implementation complexity is worth it.

  • There are still unresolved confusables where I can’t decide which character is the preferred one. The default has been to allow them. You can see many of them by browsing the Han group using my confusable explainer.

  • There are features that ended up being dead code because they aren’t needed (but might be needed in the future.) Instead of including that code, it is commented out and there are checks during the derive process which fail if the code is required. For example, after all of the combining mark + confusable logic is applied, there aren’t any whitelisted multi-character combining mark sequences that don’t collapse (NFC) into a single character.

  • ADDED Non-IDNA Mappings — discussed above. I think my solution is more consistent, but I think disallowing these characters is also valid.

Nearly all of the decisions above can be found in /derive/rules/ and all of the necessary data for implementation can be found in /derive/output. There also is a text log of the derive process which annotates all of the changes relative to IDNA.

Resolver demo and npm package are using the latest code.
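For anyone who wants to try it, here is a minimal usage sketch of the npm package (the exported names follow the repo; treat the exact error messages as illustrative):

```js
// Minimal usage sketch of ens-normalize.js.
import { ens_normalize, ens_beautify } from '@adraffy/ens-normalize';

console.log(ens_normalize('RaFFY.eTh')); // "raffy.eth" (case-folded, NFC'd, validated)
console.log(ens_beautify('raffy.eth'));  // display form (e.g. FE0F restored on emoji)

try {
  ens_normalize('ra ffy.eth');           // space is disallowed, so this throws
} catch (err) {
  console.log(err.message);
}
```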


Hello,

I don’t mean to rush the process but I was wondering if we could get an ETA on normalization being finalized. With the subdomain wrapper releasing soon, I’m assuming the idea is to push normalization prior to it.


A concern related to subdomains that can be addressed by normalization:

  • should "...a...eth" become "a.eth"?

I think null labels (0-length) should collapse. The null label is perfectly valid on a subdomain but seems kinda silly—I can’t think of a use-case.

Maybe that should just throw an error as invalid instead.

Perhaps it’ll catch some fat-finger or copy/paste mistakes, like if someone wants a.name.eth but types in .name.eth. The current behavior on both the manager app and Metamask is to throw an error:

[screenshots: the error shown in the manager app and MetaMask]


I know ethers explicitly disallows null labels and also disallows trailing stop.

UTS-46 VerifyDnsLength is technically false for ENS, since ENS permits arbitrary-length names. When false, it also allows a trailing stop (which is wrong, since the namehash root label is 0x0, not keccak("")).
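To make the trailing-stop point concrete, here is a minimal namehash sketch (assuming the js-sha3 package for keccak-256) showing why the root label is 32 zero bytes and why a trailing stop changes the hash:

```js
import { keccak_256 } from 'js-sha3';

// Minimal namehash: node = keccak256(node ++ keccak256(label)), labels right to left.
function namehash(name) {
  let node = new Uint8Array(32); // the empty name hashes to 32 zero bytes, NOT keccak("")
  if (name.length === 0) return node;
  for (const label of name.split('.').reverse()) {
    const labelHash = keccak_256.array(label);
    node = Uint8Array.from(keccak_256.array([...node, ...labelHash]));
  }
  return node;
}

// "a.eth." splits into ["a", "eth", ""], so the trailing empty label gets hashed into
// the node and namehash("a.eth.") !== namehash("a.eth"); a trailing stop must be
// rejected (or stripped), never passed through.
```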

I think it should either be:

  1. strict: require (1+)-length labels (which would deny leading/trailing/adjacent stops)
  2. polite: normalize away null labels: "...a....eth == a.eth"

Edit: I will change to strict and add example code for how to safely collapse null labels (which may have interleaved ignorables).
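In the meantime, a naive version of that collapse (which does not yet handle interleaved ignorables) is just:

```js
// Naive null-label collapse: "...a....eth" -> "a.eth".
// A real version would have to run after ignorable characters are removed, otherwise
// a label consisting only of ignorables would survive the filter.
function collapseNullLabels(name) {
  return name.split('.').filter(label => label.length > 0).join('.');
}

console.log(collapseNullLabels('...a....eth')); // "a.eth"
```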


Checksummed addresses solve the fatfinger/copy paste problem. I try to never copy paste unchecksummed addresses to a wallet. This includes ENS names. ENS names are great for simplifying addresses many times, but you need to be aware of what you are getting into.

Wallets currently have poor ENS support, as do most UXs. Relevant info like domain-to-address age, transaction count, and balance is important for the times when you do give up security for convenience. It’s ok to type in a name.eth first if you can remember the address, but you would be extra sure if you could see the domain age in the UX.

Null labels should be considered invalid.

Do you have an updated report on how many domains are affected by the new normalisation function?

Hi, I don’t want to mess with the algorithm, but I am surprised that all apostrophes map to ’ (2019) and not to ' (27). The second one is ASCII and is easy to type (just one button on my keyboard).

Various breakdown reports are available; it’s trivial to produce some other format if needed.

For IDNA, 27 (') APOSTROPHE is disallowed and 2019 (’) RIGHT SINGLE QUOTATION MARK is valid.


I had my first bug report in ens-normalize for a misnamed variable in the combining-mark counting code. Fortunately, it impacts no registered names, but it indicated a missing test case: a string with both decomposable characters and excess combining marks near the end of the string. It was found by Carbon225 while developing a Python normalization port.

Related: there are only a few names that fail due to excess combining marks (most true abuses fail earlier for different reasons, like illegal use or invalid mixtures). If anyone with relevant experience could comment on the validity of these remaining names, it would be greatly appreciated.

The notation used below is that the name matches the group Bengali but a character was found that is followed by 3 CM, where the maximum allowable was 2 (e.g. 3/2). I’m contemplating changing the CM limit to 3 for all non-CM-whitelisted groups. I’m currently using a value of 1 or 2 (see: cm:#). Edit: the Unicode recommendation is a max of 4 NSM.

[screenshot: list of flagged names annotated with group and CM counts]


I had a request to enable a different check character, likely due to the popularity of the Checks NFT. Upon further inspection, it does land in an unfortunate grey area.

2714 (✔︎ ✔️) HEAVY CHECK MARK vs. 2713 (✓) CHECK MARK

My convention for deciding amongst characters of different weights (very thin, thin, regular, medium, heavy, very heavy, etc.) was to choose the heavy variant if available. For checks, the heavy variant is also the emoji character.

Because normalized emoji have their FE0F stripped and check is default text-presentation (Emoji_Presentation=false), a normalized check emoji looks like a bold textual check. I made a simple demo which shows a few heavy variants with emoji forms and their corresponding most-similar textual character (“alt”). If you view this page on different browsers, operating systems, and devices (desktop/mobile), you’ll notice that it’s visually inconsistent.
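A tiny illustration of why that happens (code points only; the actual rendering depends on the platform):

```js
// U+2714 has Emoji_Presentation=false, so without the FE0F variation selector it
// defaults to text style; stripping FE0F during normalization therefore leaves the
// same code point as the "textual" heavy check mark.
const emojiCheck = '\u2714\uFE0F';                    // ✔️ (emoji presentation)
const normalized = emojiCheck.replace(/\uFE0F/g, ''); // FE0F stripped
console.log(normalized === '\u2714');                 // true: renders as ✔︎ on most platforms
```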

I understand the desire for a textual check but the unpredictability of emoji appearance makes ✓✔︎ too similar to enable in good conscience. For example, if ✓ was ✕, we’d be having the same discussion about xX✕✖︎✖️.


If there’s anything else I can help calculate or do to facilitate DAO adoption, please let me know.

The things that would immediately benefit from the new normalization spec:

  • app.ens.domains: registration input, showing script of labels, beautified name, etc…
  • metadata: showing beautified names in the image/svg and properly assigning ⚠️ in marketplaces
  • etherscan: beautified primary names
  • metamask: their ENS input needs a lot of work

Does this mean you’re ready to lock the spec in and submit it as a standard?


For my final changes, which address two prior concerns:

I’ve changed to NSM counting with a maximum of 4 unique characters for all non-CM-whitelisted script groups, like the Unicode security suggestion. This works much better than I expected. The Breakdown report has changed from cm → nsm.html and there are now only 4 exceptions.
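A rough sketch of that rule as I read it: after NFD, every run of non-spacing marks following a base character may contain at most 4 marks, all distinct (the CM-whitelisted group exception is assumed to be handled elsewhere):

```js
// Hypothetical NSM check for non-CM-whitelisted groups.
function checkNSM(label, max = 4) {
  let run = new Set();
  for (const ch of label.normalize('NFD')) {
    if (/\p{Mn}/u.test(ch)) {            // non-spacing mark
      if (run.has(ch)) throw new Error('duplicate non-spacing mark');
      run.add(ch);
      if (run.size > max) throw new Error('too many non-spacing marks');
    } else {
      run = new Set();                   // reset at the next base character
    }
  }
}
```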

I didn’t get much input on these characters. According to the Breakdown reports, usage is very low. To err on the safe side, I’ve decided to disallow these characters instead, so they may be revisited at a future date (whereas mapping would leave them permanently unusable).

I’m happy with version 1.9.0. I will update my ENSIP with these final changes ASAP.


Just to confirm, do you consider the current state of the ENSIP ready for last call and then finalization?


Yes. It has some URLs that currently link to my repository. The only critical link would be to spec.json.
