ENS Name Normalization

I wrote this on the GitHub issue but it’s worth repeating my thoughts here too.

At the end of the day, if the official ENS manager UI allows a name to be registered, then it should be considered valid in my opinion.

As far as normalization goes, I think it’s less important what the actual rules are, and more important that all the places in ENS that use normalization are consistent, or else it leads to these issues where someone registers a name and then finds out later that they can’t use it. This includes:

Since it sounds like the ENS metadata service is changing which names are valid/normalized and which ones aren’t, and some people have already registered names through the manager UI that they thought were valid, those people will simply be stuck with unusable/unsellable names. @mdt do you think it would be possible to get a report on how many currently registered names would still be rendered invalid after your changes?

Are you able to give some examples of legit-looking but newly-invalid names?

The main thing I learned from this adventure is that I’m never interacting with an ENS name unless I can type it myself. The moment I see anything weird, basically any emoji or non-ASCII, I’m entering it into my resolver demo site so I can see exactly how it decomposes.


I definitely agree with this.

This report is relative to eth-ens-namehash 2.0.5. I could run this again for mdt’s fork?


Unsafe Latin Sample
messo
andymoog
dhoulmagus
fomoclub
snapmail
origamidao
immunologist
weem
championstickets
cryptomilan
metaar
mamadou
somepeoplecallmemaur
momentx
maevagennham
asylumdao
sgrasmann
pytm
metascraper
mavericknifty

Full Report.json

The main issue seems to be that m has multiple confusables: m and rn.

What if you eliminate any confusables where the input and output are both in the same set?

In my opinion “m”, “0”, “1” should not be considered confusables, as flagging these would greatly dilute the signal to users. If marketplaces/integrations are using the metadata service as a source of truth for name validity, that is a powerful lever that should be used to promote uniformity across the ecosystem. Maybe the metadata service could include some of the extended visual parsing that Raffy’s tool provides, accompanied by specific UX guidelines from ENS on how best to handle confusables.

On a related note, I would like to suggest that the default avatar typeface be changed to one that better distinguishes between all confusing characters, but particularly zeroes and lowercase L vs I:

ens-avatar-typeface

Not suggesting a monospace font, just one with less ambiguity. Many confusables could potentially be addressed by a custom typeface. This makes the avatar overlay more useful. Imagine wallet UIs where the ENS avatar is displayed for quick visual confirmation, in the same way that avatars are used on Venmo, etc.

It’s unfortunate that Metamask/Opensea/etc have built their own ENS “validation” implementations (and I agree with @serenae and @aox.eth that the Metamask “confusables” UI has gone too far). This should speak to a clear need for ENS to develop and maintain improved client packages/libraries, and actively encourage their widespread use.

As a “front-end developer” working on ENS integration, I would want the following API:

process(input) -> {
	"normalized": [ // for each label
		{"label": <uts46 normalized label>, "hash": <label hash>}
	],
	"nameHash": 0x123....,
	"display": <concatenated normalized field, but emoji are shown in full-qualified form>,
	"warnings": [
		"mixed char sets", // Maybe ascii+emoji is ok?
		"extraneous invisible chars",
		"right-to-left chars",
		...
	],
	"info": [ // for each unicode char:
		{"unicode": <U+XXX>, "charset": <unicode charset>, "confusable": <bool>},
		...
	]
}
  • Normalized field should only be used for internal ENS lookups, and should never be shown to end-users. I think it’s important that the protocol-level normalization is as simple as possible; it’s helpful that it can be implemented in only a few dozen lines of code.
  • Display field is designed for display to end-users, and could include @raffy’s emoji processing logic (among other potentially useful things).
  • Warnings signal likely scam attempts, and should always be prominently shown to the end-user. We can debate about what should and shouldn’t go in here, but I think the goal is that “good faith” registrations should never show any warnings.
  • Implementors can decide whether exposing “info” makes sense for their use-case. I personally think “confusables” are more of a typography issue, but if they’re a hard Metamask requirement, that’s probably good enough reason to include them.
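
To make the proposed shape a bit more concrete, here is a rough TypeScript sketch of the result type plus a helper that fills only `warnings` and `info`. The names `CharInfo`, `ProcessResult`, `INVISIBLE`, `RTL`, and `inspectName` are assumptions for illustration, not an existing ENS API. Normalization and hashing are left out, the invisible-character list is a small hand-picked set, and Hebrew/Arabic stand in for "right-to-left chars".

```typescript
// Hypothetical result shape for the proposed process() call; field names follow the post.
interface CharInfo {
  unicode: string;     // e.g. "U+0061"
  charset?: string;    // Unicode script name; not computed in this sketch
  confusable: boolean; // would come from a confusables table; always false here
}

interface ProcessResult {
  normalized: { label: string; hash: string }[];
  nameHash: string;
  display: string;
  warnings: string[];
  info: CharInfo[];
}

// Assumption: a small hand-picked set of invisible code points, not a complete list.
// Note that U+200D (ZWJ) also appears inside legitimate emoji sequences, so a real
// implementation would have to parse emoji before raising this warning.
const INVISIBLE = new Set<number>([0x200b, 0x200c, 0x200d, 0x2060, 0xfeff]);

// Assumption: Hebrew and Arabic stand in for "right-to-left chars".
const RTL = /[\p{Script=Hebrew}\p{Script=Arabic}]/u;

// Fills only the `warnings` and `info` fields; normalization and hashing are out of scope.
function inspectName(name: string): Pick<ProcessResult, "warnings" | "info"> {
  const chars = Array.from(name); // splits by code point, not by UTF-16 unit
  const warnings: string[] = [];
  if (chars.some((ch) => INVISIBLE.has(ch.codePointAt(0)!))) {
    warnings.push("extraneous invisible chars");
  }
  if (chars.some((ch) => RTL.test(ch))) {
    warnings.push("right-to-left chars");
  }
  const info: CharInfo[] = chars.map((ch) => ({
    unicode: "U+" + ch.codePointAt(0)!.toString(16).toUpperCase().padStart(4, "0"),
    confusable: false,
  }));
  return { warnings, info };
}
```

Keeping `warnings` as plain strings matches the post; an integration that needs to branch on them would probably want stable identifiers instead.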

I’m probably missing some things, but I think this general approach would solve most of the problems discussed in this thread (including the ZWJ issue, and the emoji issue), without the need for drastic changes to low-level normalization procedures or on-chain storage. It’s also fairly amenable to change without breaking the API.

I attempted this, but it’s a little more complex than I thought. I’ll need to revise the previous results once I figure out a better solution.

If we only look at post-normalization labels that are single-script (ignoring Common and Inherited), then a label is confusable-free iff every substring of it is canonical, meaning it either has no confusables or is the preferred form for that script.

  • Every example in the confusable database is a single character, but the confusable itself might be multiple characters, e.g. O- for θ. The Unicode spec (and the official confusable utility) appears to only work with single-character confusables.

  • Unicode also doesn’t make confusables symmetric, so m is confusable with rn but not vice-versa (I guess because only single-character matches are considered?)

If we only do single characters, the above statement can be relaxed to just checking if every character is canonical.

However, I’m not sure how to derive the canonical choice (per script) from the confusable database when there are ambiguities (or if it exists). There are 661 examples where a single script has 2+ matches:

  • Confusable a for Latin: a vs ɑ → Trivial: a should be canonical
  • Confusable f for Latin: f vs vs vs → Trivial: f should be canonical
  • Confusable l for Hebrew: ו vs ן → ???
  • Confusable o for Greek: ο vs σ → ???

There are also examples like rn and m that are just dumb (both ASCII), but I’m not sure how to resolve these in general either.

I will manually decide the Latin cases and take the rest as being invalid in any form and then recompute the results.

This is an unending rabbit hole. The more you dig, the worse it gets.

We want to know how many names are valid if names can only span a single script (but can be mixed with common and inherited scripts.)

  • I claim we only care if there are confusables in the normalized output, not the input.
  • An emoji is never confusable.
  • A confusable can span multiple characters.
  • A confusable requires an exact match.
  • A confusable that gets transformed by normalization isn’t a possible match.

When the confusables are grouped by script, many confusables disappear, but there still exist groups of 2+ sequences that are confusable.

  • For Latin, I’ve resolved all of these conflicts by hand: either the confusable has a canonical result: ["a","ɑ"] -> "a", the confusable should be ignored: ["rn","m"] (both ASCII), or all variations are confusing: ["ɔ","ↄ","ᴐ"].
  • For all other scripts, I’m assuming any match is confusing.
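
A rough sketch of how those three hand-made outcomes could be encoded and applied with exact substring matching. This is not the code behind the report: the `Resolution` and `Group` types, `LATIN_GROUPS`, and `confusingSequences` are hypothetical names, and the table holds only the three groups quoted above.

```typescript
// Hypothetical encoding of the three hand-made decisions for a confusable group.
type Resolution =
  | { kind: "canonical"; keep: string } // one member is the preferred form
  | { kind: "ignore" }                  // the group is not treated as confusing
  | { kind: "all-confusing" };          // every member is rejected

interface Group {
  members: string[]; // sequences, possibly longer than one character
  resolution: Resolution;
}

// Only the three groups quoted in the post; a real table would be far larger.
const LATIN_GROUPS: Group[] = [
  { members: ["a", "\u0251"], resolution: { kind: "canonical", keep: "a" } }, // a, ɑ
  { members: ["rn", "m"], resolution: { kind: "ignore" } },
  { members: ["\u0254", "\u2184", "\u1d10"], resolution: { kind: "all-confusing" } }, // ɔ, ↄ, ᴐ
];

// Returns the member sequences that make a label confusing, by exact substring match.
function confusingSequences(label: string, groups: Group[]): string[] {
  const hits: string[] = [];
  for (const group of groups) {
    const res = group.resolution;
    if (res.kind === "ignore") continue;
    for (const seq of group.members) {
      // The preferred form of a "canonical" group is always allowed.
      if (res.kind === "canonical" && seq === res.keep) continue;
      if (label.indexOf(seq) !== -1) hits.push(seq);
    }
  }
  return hits;
}
```

With this table, `confusingSequences("modern", LATIN_GROUPS)` is empty because the rn/m group is ignored, while a label containing ɑ or ɔ is reported.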

Given this information, I can take every registered name that’s normalized and spans a single script and check it for confusables of that script.
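
The single-script bucketing can be sketched with Unicode script property escapes in a regular expression. This is an illustration under assumptions, not the code used for the counts below: `SCRIPT_TESTS`, `scriptsOf`, and `classify` are hypothetical names, only four scripts are probed, and anything else is lumped into "Other".

```typescript
// Assumption: only a handful of scripts are probed; a real tool would cover all of them.
const SCRIPT_TESTS: [string, RegExp][] = [
  ["Latin", /\p{Script=Latin}/u],
  ["Greek", /\p{Script=Greek}/u],
  ["Cyrillic", /\p{Script=Cyrillic}/u],
  ["Hebrew", /\p{Script=Hebrew}/u],
];

// Digits, punctuation, most emoji (Common) and combining marks, ZWJ, FE0F (Inherited).
const IGNORED = /[\p{Script=Common}\p{Script=Inherited}]/u;

// Returns the distinct scripts used by a label, skipping Common and Inherited.
function scriptsOf(label: string): string[] {
  const found: string[] = [];
  for (const ch of Array.from(label)) {
    if (IGNORED.test(ch)) continue;
    const match = SCRIPT_TESTS.find(([, re]) => re.test(ch));
    const name = match ? match[0] : "Other";
    if (found.indexOf(name) === -1) found.push(name);
  }
  return found;
}

// Buckets a label along the same lines as the counts in the post.
function classify(label: string): "no primary script" | "single script" | "2+ scripts" {
  const n = scriptsOf(label).length;
  return n === 0 ? "no primary script" : n === 1 ? "single script" : "2+ scripts";
}
```

Only labels that land in the "single script" bucket would then go on to the per-script confusable check.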

  • There are 581953 registered names.
  • 1653 are invalid (not normalized) according to ens-normalize.js
  • 634 span 2+ scripts
  • 9786 have no primary script (eg. only emoji)
  • There are 567770 Latin names: 317 confusing!?
  • There are 1693 non-Latin names: 411 confusing

This is actually much better than I expected.

Confusing Latin Example:
image

All 317 Latin Results:

All 411 Non-Latin Results:

Note: as I’ve mentioned earlier in the thread, single-script + confusables doesn’t fix the cross-script confusables like Latin/Cyrillic. For example, someone has registered “apple” in Cyrillic. It only fails because I didn’t canonicalize the Cyrillic confusables.
image

Edit: All Results as JSON: script-confusable-labels.json

I found 2 very minor bugs while writing documentation.

  1. Regional Indicators (eg. 🇦️) normalized differently when followed by FE0F. ENS Resolver
  2. Context and Bidi checks didn’t use NFC form.
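
As a generic illustration of why the second point matters (not the actual bug or code in the library): the same visible text can give different answers depending on whether a check sees the decomposed or the NFC form. `hasCombiningMark` and `checkOnNFC` are hypothetical helpers.

```typescript
// Hypothetical check: flags labels that contain a combining mark.
function hasCombiningMark(label: string): boolean {
  return /\p{M}/u.test(label);
}

// Runs a check on the NFC form so composed and decomposed input agree.
function checkOnNFC(label: string, check: (s: string) => boolean): boolean {
  return check(label.normalize("NFC"));
}

const decomposed = "cafe\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
const composed = "caf\u00e9";    // precomposed U+00E9

console.log(hasCombiningMark(decomposed));             // true
console.log(hasCombiningMark(composed));               // false
console.log(checkOnNFC(decomposed, hasCombiningMark)); // false
```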

Updated Report: eth-ens-namehash (ens) vs adraffy (1.3.14)
(not using single-confusable logic)

Very awesome work as always.

Just so I’m not misunderstanding… is that report an exhaustive list of all name validity changes that would occur if we, say, dropped your library into eth-ens-namehash today?

Also, this is running against eth-ens-namehash 2.0.8, isn’t that an old version/repository from 4 years ago? Shouldn’t the latest actually be @ensdomains/eth-ens-namehash 2.0.15?

I actually don’t know what version is being used across different tools/wallets/apps. It’s even inconsistent within ENS itself: the UI appears to use @ensdomains/eth-ens-namehash 2.0.15, and the ENS metadata service uses the up-to-date repo/version too. But ensjs uses eth-ens-namehash 2.0.8?

Metamask appears to use the old repository version too: https://github.com/MetaMask/metamask-extension/blob/develop/package.json#L155

And ethers.js appears to roll their own nameprep method (following IDNA, not sure what version)? https://github.com/ethers-io/ethers.js/blob/master/packages/strings/src.ts/idna.ts

So as far as “ENS validity” goes, your report is correct with respect to “what people see in Metamask after they register”, but not “what people can search on and register in the ENS manager”. For example, in the “ens-error” section I was puzzled because most of them appear to just be newer emoji.

Like 🧙‍♀️.eth is marked as invalid in the ens-error section, which matches up with Metamask:
image

But does not match up with the ENS manager app:

Maybe I’m just misunderstanding something, not sure. But would it make sense to change that report so it uses @ensdomains/eth-ens-namehash 2.0.15 instead?

I think the best solution is to make people confirm each Unicode character with a checkbox. A description of the character should be displayed along with the character’s number.

Updated to 2.0.15: https://adraffy.github.io/ens-normalize.js/test/output/ens-2.0.15-adraffy-1.3.14.html

Yeah, the report shows 1653 names lost to normalization errors (1608 specific to my library + 45 shared with eth-ens-namehash). If single-script confusables are enforced, an additional 634 names are lost for having 2+ scripts and 728 to confusables (317 Latin + 411 non-Latin).

Cool, makes sense!

It sounds like the library itself will need to be updated whenever new versions of Unicode are released, right? I see that the current eth-ens-namehash doesn’t support Unicode 14 but your library does 👍

Or would it make sense to future-proof it somewhat by preemptively allowing characters in designated emoji blocks, even if they haven’t been assigned yet? Symbols and Pictographs Extended-A - Wikipedia
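
For illustration, a block-range check along those lines could look like the sketch below. Symbols and Pictographs Extended-A covers U+1FA70 through U+1FAFF; the function name and the idea of accepting every code point in the block are assumptions, not anything the current libraries do.

```typescript
// Symbols and Pictographs Extended-A spans U+1FA70 through U+1FAFF.
const EXTENDED_A_START = 0x1fa70;
const EXTENDED_A_END = 0x1faff;

// Hypothetical policy check: accept any code point in the block, assigned or not.
function inFutureProofEmojiBlock(ch: string): boolean {
  const cp = ch.codePointAt(0);
  return cp !== undefined && cp >= EXTENDED_A_START && cp <= EXTENDED_A_END;
}
```

The trade-off is that an unassigned code point has no defined properties yet, so a later Unicode release could still assign it to something that is not an emoji.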

FYI that version of ensjs is (pretty much) deprecated

Gotcha okay! Should that be removed here? It’s the first library listed: ENS Libraries - ENS Documentation

Should people be using ethers.js instead? Because ethers.js doesn’t appear to use our normalization code at all, they’ve rolled their own IDNA-compliant nameprep method, which would probably have a whole different set of valid/invalid names in @raffy’s report.

And web3.js appears to use @ensdomains/ens, which in turn uses… dun dun dun… the old eth-ens-namehash 2.0.8.

And web3.js’s sub-package web3-eth-ens also uses the old library: https://github.com/ChainSafe/web3.js/blob/1.x/packages/web3-eth-ens/package.json#L18

Not technically deprecated just yet, but it isn’t maintained actively and we are developing a new version to replace it.

I’ll look into this. The general recommendation for basic ENS usage is ethers, so it should be following the standard.

Theoretically this is an easy fix, one of the other TNL devs is currently working on the web3 codebase so they might be able to do this.

Thanks! Here’s the IDNA nameprep implementation in ethers.js: https://github.com/ethers-io/ethers.js/blob/master/packages/strings/src.ts/idna.ts#L157

They also have their own namehash implementation where that nameprep gets used: https://github.com/ethers-io/ethers.js/blob/master/packages/hash/src.ts/namehash.ts#L27

What characters are these? Upper-case Latin (ASCII)? Or something else?

I’m in touch with Ricmoo, who is planning to adopt this library once it’s ready.

ʙʟᴏᴄᴋᴄʜᴀɪɴ = 299 29F 1D0F 1D04 1D0B 1D04 29C 1D00 26A 274 (src)
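
A small helper that reproduces this kind of breakdown for any label (the function names are mine, for illustration only):

```typescript
// Lists the code points of a string as uppercase hex, the format used above.
function codePointsHex(s: string): string {
  return Array.from(s)
    .map((ch) => ch.codePointAt(0)!.toString(16).toUpperCase())
    .join(" ");
}

// True when every character is in the ASCII range.
function isAscii(s: string): boolean {
  return /^[\x00-\x7f]*$/.test(s);
}

// Built from escapes so the example does not depend on file encoding.
const smallCaps = "\u0299\u029f\u1d0f\u1d04\u1d0b\u1d04\u029c\u1d00\u026a\u0274";

console.log(codePointsHex(smallCaps)); // 299 29F 1D0F 1D04 1D0B 1D04 29C 1D00 26A 274
console.log(isAscii(smallCaps));       // false
console.log(isAscii("BLOCKCHAIN"));    // true
```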

Got it. Most of the names seem to fall into this category of using unusual letters like “small capital” letters. Personally I’m okay with making these invalid.
