I wrote this on the GitHub issue but it’s worth repeating my thoughts here too.
At the end of the day, if the official ENS manager UI allows a name to be registered, then it should be considered valid in my opinion.
As far as normalization goes, I think the actual rules are less important than having every place in ENS that uses normalization be consistent; otherwise we get these issues where someone registers a name and then finds out later that they can't use it. This includes:
- The ENS manager app UI:
  - Searching for a name should yield the normalized/rectified version of that name
  - Going directly to the URL of a non-normalized name should redirect to the normalized version (e.g. ⌐◨‐◨.eth should redirect to ⌐◨-◨.eth)
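The redirect idea above can be sketched roughly as follows. Note this is a toy stand-in: a real implementation would apply the full UTS-46/ENS mapping tables (e.g. ens-normalize.js); the tiny hyphen-folding map here is just an illustrative assumption to show the "normalize, then redirect if the input differs" flow.

```python
import unicodedata

# Toy stand-in for real ENS/UTS-46 normalization -- illustrative only.
# We lowercase, apply NFC, and fold a couple of hyphen look-alikes to
# ASCII "-"; the real mapping tables are far larger.
HYPHEN_LOOKALIKES = {"\u2010": "-", "\u2011": "-"}  # HYPHEN, NON-BREAKING HYPHEN

def toy_normalize(name: str) -> str:
    name = unicodedata.normalize("NFC", name.lower())
    return "".join(HYPHEN_LOOKALIKES.get(ch, ch) for ch in name)

def resolve_url(name: str) -> str:
    """Return the canonical URL path, redirecting if the input wasn't normalized."""
    normalized = toy_normalize(name)
    if normalized != name:
        return f"redirect -> /name/{normalized}"
    return f"/name/{normalized}"

# The U+2010 hyphen in the input gets folded, so this redirects to the
# ASCII-hyphen form.
print(resolve_url("\u2310\u25e8\u2010\u25e8.eth"))
```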
Since it sounds like the ENS metadata service is changing which names are valid/normalized, and some people have already registered names through the manager UI that they thought were valid, those people will simply be stuck with unusable/unsellable names. @mdt, do you think it would be possible to get a report on how many currently registered names would be rendered invalid after your changes?
The thing I learned the most from this adventure is that I’m never interacting with an ENS name unless I can type it myself. The moment I see anything weird, basically any emoji or non-ASCII, I’m entering it into my resolver demo site so I can see exactly how it decomposes.
I definitely agree with this.
This report is relative to eth-ens-namehash 2.0.5. I could run this again for mdt’s fork?
In my opinion “m”, “0”, “1” should not be considered confusables, as flagging these would greatly dilute the signal to users. If marketplaces/integrations are using the metadata service as a source of truth for name validity, that is a powerful lever that should be used to promote uniformity across the ecosystem. Maybe the metadata service could include some of the extended visual parsing that Raffy’s tool provides, accompanied by specific UX guidelines from ENS on how best to handle confusables.
On a related note, I would like to suggest that the default avatar typeface be changed to one that better distinguishes between confusable characters, particularly zero vs. the letter O, and lowercase "l" vs. uppercase "I":
Not suggesting a monospace font, just one with less ambiguity. Many confusables could potentially be addressed by a custom typeface. This makes the avatar overlay more useful. Imagine wallet UIs where the ENS avatar is displayed for quick visual confirmation, in the same way that avatars are used on Venmo, etc.
It’s unfortunate that Metamask/Opensea/etc have built their own ENS “validation” implementations (and I agree with @serenae and @aox.eth that the Metamask “confusables” UI has gone too far). This should speak to a clear need for ENS to develop and maintain improved client packages/libraries, and actively encourage their widespread use.
As a “front-end developer” working on ENS integration, I would want the following API:
```
process(input) -> {
  "normalized": [ // for each label
    {"label": <uts46 normalized label>, "hash": <label hash>}
  ],
  "nameHash": 0x123....,
  "display": <concatenated "normalized" field, but emoji are shown in fully-qualified form>,
  "warnings": [
    "mixed char sets", // maybe ASCII+emoji is ok?
    "extraneous invisible chars",
    "right-to-left chars",
    ...
  ],
  "info": [ // for each Unicode char:
    {"unicode": <U+XXXX>, "charset": <Unicode charset>, "confusable": <bool>},
    ...
  ]
}
```
The normalized field should only be used for internal ENS lookups, and should never be shown to end users. I think it's important that the protocol-level normalization be as simple as possible; it helps that it can be implemented in only a few dozen lines of code.
The display field is designed for display to end users, and could include @raffy's emoji processing logic (among other potentially useful things).
The warnings field signals likely scam attempts, and should always be shown prominently to the end user. We can debate what should and shouldn't go in here, but I think the goal is that "good faith" registrations should never trigger any warnings.
Implementors can decide whether exposing the info field makes sense for their use case. I personally think confusables are more of a typography issue, but if they're a hard Metamask requirement, that's probably reason enough to include them.
I’m probably missing some things, but I think this general approach would solve most of the problems discussed in this thread (including the ZWJ issue, and the emoji issue), without the need for drastic changes to low-level normalization procedures or on-chain storage. It’s also fairly amenable to change without breaking the API.
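As a rough illustration of the shape of this API, here is a minimal Python sketch of just the warning-detection part. Label hashing and real UTS-46 normalization are omitted; the invisible-character set and the simple lowercasing are illustrative assumptions, not the proposed rules.

```python
import unicodedata

# Minimal sketch of the proposed process() API -- not a real implementation.
# A real one would exempt ZWJ inside valid emoji sequences; here we flag
# any invisible character to keep the example short.
INVISIBLES = {"\u200b", "\u200c", "\u200d", "\ufeff"}  # ZWSP, ZWNJ, ZWJ, BOM

def process(name: str) -> dict:
    labels = name.split(".")
    warnings = []
    if any(ch in INVISIBLES for lbl in labels for ch in lbl):
        warnings.append("extraneous invisible chars")
    if any(unicodedata.bidirectional(ch) in ("R", "AL") for ch in name):
        warnings.append("right-to-left chars")
    return {
        "normalized": [{"label": lbl.lower()} for lbl in labels],
        "display": name.lower(),
        "warnings": warnings,
    }

# A hidden zero-width space in "vita​lik" trips the invisible-chars warning.
print(process("vita\u200blik.eth")["warnings"])
```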
I attempted this, but it’s a little more complex than I thought. I’ll need to revise the previous results once I figure out a better solution.
If we only look at post-normalization labels that are single-script (ignoring Common and Inherited), then a label is confusable-free iff every subsequence of that string is canonical (i.e. it has no confusables, or it is the preferred form for that script).
Every example in the confusable database maps from a single character, but the confusable itself might be multiple characters, e.g. "O-" for θ. The Unicode spec (and the official confusable utility) appears to only work with single-character confusables.
Unicode also doesn't make confusables symmetric, so "m" is confusable with "rn" but not vice versa (I guess because only single-character matches are considered?)
If we only do single characters, the above statement can be relaxed to just checking if every character is canonical.
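The relaxed per-character check can be sketched like this, given a hand-curated map from each confusable character to its canonical choice. The two Latin entries below come from the examples later in this post (ɑ → a, ꬵ → f) and are illustrative, not the full table.

```python
# Hand-curated (partial!) map: confusable character -> canonical character.
# A label passes iff every character maps to itself, i.e. every confusable
# in it is already the canonical form.
CANONICAL = {"\u0251": "a", "\uab35": "f", "\ua799": "f"}  # ɑ->a, ꬵ->f, ꞙ->f

def is_canonical(label: str) -> bool:
    return all(CANONICAL.get(ch, ch) == ch for ch in label)

print(is_canonical("fab"))       # plain ASCII: every character is canonical
print(is_canonical("f\u0251b"))  # contains ɑ, whose canonical form is "a"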
However, I’m not sure how to derive the canonical choice (per script) from the confusable database when there are ambiguities (or if it exists). There are 661 examples where a single script has 2+ matches:
Confusable a for Latin: a vs ɑ → Trivial: a should be canonical
Confusable f for Latin: f vs ꬵ vs ꞙ vs ẝ → Trivial: f should be canonical
Confusable l for Hebrew: ו vs ן → ???
Confusable o for Greek: ο vs σ → ???
There are also examples like rn and m that are just dumb (both ASCII), but I’m not sure how to resolve these in general either.
I will manually decide the Latin cases and take the rest as being invalid in any form and then recompute the results.
We want to know how many names are valid if names can only span a single script (but can be mixed with common and inherited scripts.)
I claim we only care if there are confusables in the normalized output, not the input.
- An emoji is never confusable.
- A confusable can span multiple characters.
- A confusable requires an exact match.
- A confusable that gets transformed by normalization isn't a possible match.
When the confusables are grouped by script, many confusables disappear, but there still exist groups of 2+ sequences that are confusable.
For Latin, I’ve resolved all of these conflicts by hand: either the confusable has a canonical result: ["a","ɑ"] -> "a", the confusable should be ignored: ["rn","m"] (both ASCII), or all variations are confusing: ["ɔ","ↄ","ᴐ"].
For all other scripts, I’m assuming any match is confusing.
Given this information, I can take every registered name that’s normalized and spans a single-script and check it for confusables of that script.
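The single-script test (ignoring Common and Inherited) can be sketched as follows. The codepoint ranges below are a toy classifier covering only basic Latin/Greek/Cyrillic; real code should use the Unicode Scripts.txt data (e.g. via the third-party `regex` module's `\p{Script=...}`).

```python
# Toy script classifier -- illustrative ranges only, not real Unicode
# script data.
def script_of(ch: str) -> str:
    cp = ord(ch)
    if 0x41 <= cp <= 0x7A and (cp <= 0x5A or cp >= 0x61):
        return "Latin"
    if 0x370 <= cp <= 0x3FF:
        return "Greek"
    if 0x400 <= cp <= 0x4FF:
        return "Cyrillic"
    return "Common"  # digits, hyphen, etc. (and everything else, in this toy)

def is_single_script(label: str) -> bool:
    # Common characters can mix with any single script.
    scripts = {script_of(ch) for ch in label} - {"Common"}
    return len(scripts) <= 1

print(is_single_script("abc-123"))  # Latin + Common only
print(is_single_script("a\u0441"))  # Latin "a" mixed with Cyrillic "с"
```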
There are 581953 registered names.
1653 are invalid (not normalized) according to ens-normalize.js
Note: as I’ve mentioned earlier in the thread, single-script + confusables doesn’t fix the cross-script confusables like Latin/Cyrillic. For example, someone has registered “apple” in Cyrillic. It only fails because I didn’t canonicalize the Cyrillic confusables.
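To make the cross-script point concrete, here is a sketch of a skeleton-style fold from Cyrillic look-alikes to Latin. The mapping is a tiny illustrative subset of the confusables data, and the Cyrillic "apple" spelling below (using U+04CF palochka for "l") is a hypothetical example, not necessarily the registered name.

```python
# Partial, illustrative Cyrillic -> Latin confusable fold.
CYR_TO_LAT = {"\u0430": "a", "\u0440": "p", "\u0435": "e",
              "\u04cf": "l", "\u043e": "o", "\u0441": "c"}

def skeleton(label: str) -> str:
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in label)

# Hypothetical all-Cyrillic spelling of "apple": а р р ӏ е
cyrillic_apple = "\u0430\u0440\u0440\u04cf\u0435"
print(skeleton(cyrillic_apple) == "apple")  # cross-script confusable with Latin "apple"
```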
Just so I’m not misunderstanding… is that report an exhaustive list of all name validity changes that would occur if we, say, dropped your library into eth-ens-namehash today?
I actually don't know which version is being used across different tools/wallets/apps. It's even inconsistent within ENS itself: the UI appears to use @ensdomains/eth-ens-namehash 2.0.15, and the ENS metadata service uses the up-to-date repo/version too. But ensjs uses eth-ens-namehash 2.0.8?
So as far as “ENS validity” goes, your report is correct with respect to “what people see in Metamask after they register”, but not “what people can search on and register in the ENS manager”. For example, in the “ens-error” section I was puzzled because most of them appear to just be newer emoji.
Like .eth is marked as invalid in the ens-error section, which matches up with Metamask:
Maybe I’m just misunderstanding something, not sure. But would it make sense to change that report so it uses @ensdomains/eth-ens-namehash 2.0.15 instead?
I think the best solution is to make people confirm each Unicode character using a checkbox. The character's description should be displayed along with its number.
Yeah, the report shows 1653 names lost to normalization errors (1608 specific to my library + 45 shared with eth-ens-namehash). If single-script confusables are enforced, an additional 634 names are lost for having 2+ scripts and 728 for confusables (317 Latin + 411 non-Latin).
It sounds like the library itself will need to be updated whenever new versions of Unicode are released, right? I see that the current eth-ens-namehash doesn't support Unicode 14, but your library does.
Or would it make sense to future-proof it somewhat by preemptively allowing characters in designated emoji blocks, even if they haven't been assigned yet? (See the Symbols and Pictographs Extended-A block.)
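The "allow unassigned codepoints in designated emoji blocks" idea amounts to a simple range check. The block below uses the Symbols and Pictographs Extended-A range (U+1FA70..U+1FAFF); treating the whole range as permitted, regardless of assignment status, is the proposal being sketched, not current library behavior.

```python
# Sketch: permit any codepoint in Symbols and Pictographs Extended-A,
# even if it is unassigned in the Unicode data files the library shipped with.
def in_future_emoji_block(ch: str) -> bool:
    return 0x1FA70 <= ord(ch) <= 0x1FAFF

print(in_future_emoji_block("\U0001FAE0"))  # melting face, added in Unicode 14
print(in_future_emoji_block("a"))           # ordinary ASCII is outside the block
```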
Should people be using ethers.js instead? Because ethers.js doesn't appear to use our normalization code at all; they've rolled their own IDNA-compliant nameprep method, which would probably yield a whole different set of valid/invalid names in @raffy's report.
And web3.js appears to use @ensdomains/ens, which in turn uses… dun dun dun… the old eth-ens-namehash 2.0.8.
Got it. Most of the names seem to fall into this category of using unusual letters like “small capital” letters. Personally I’m okay with making these invalid.