ENS Name Normalization

Regarding the hypothetical single-script confusables restriction, there are two minor issues.

1.) There are characters that IDNA maps from a single character in a script to multiple characters in multiple scripts.

// these seem tame since Common is just a bag of junk
[0140:Latin]  is mapped to [6C:Latin][B7:Common]
[32C7:Common] is mapped to [0038:Common][6708:Han]

// this is weird
[33A5:Common] is mapped to [03BC:Greek][6D:Latin]

2.) There are characters that NFC maps from one script to another.

[1FEF:Greek] is normalized to [60:Common]

This just means the input scripts can be different from the NFC’d scripts and different from the output (normalized) scripts.

The most likely place you’d implement script-based restrictions is on the output, but there might be situations where you enter a Greek-only name, it gets transformed into Greek+Something during NFC/normalization, and then fails the single-script requirement for a non-obvious reason.
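The script-shifting NFC case above is easy to observe directly; a minimal sketch in plain JavaScript, with no ENS library involved:

```javascript
// U+1FEF (GREEK VARIA, script Greek) has a singleton canonical
// decomposition to U+0060 (GRAVE ACCENT, script Common), so NFC
// silently moves the character from Greek into Common.
const varia = "\u1FEF";
const nfc = varia.normalize("NFC");
console.log(nfc.codePointAt(0).toString(16)); // "60"
```

A Greek-only input containing U+1FEF therefore leaves NFC with a Common-script character where a Greek one went in.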


I can take all registered labels, find the valid normalized names, choose the names that span a single script, and see how many contain characters that have normalizable confusables with other characters in that same script.

There appear to be very few names that aren’t confusable in some way, even when you require that the input is a single script.

Single Script
# pure = number of labels that only use the specified script
# safe = number of pure labels where no character has a confusable
{
  Latin: { pure: 489067, safe: 7},
  Han: { pure: 1063, safe: 266},
  Common: { pure: 8071, safe: 4819},
  Thai: { pure: 25, safe: 4},
  Arabic: { pure: 127, safe: 0},
  Cyrillic: { pure: 277, safe: 89},
  Devanagari: { pure: 18, safe: 10},
  Greek: { pure: 24, safe: 8},
  Hebrew: { pure: 22, safe: 0},
  Canadian_Aboriginal: { pure: 1, safe: 0},
  Tibetan: { pure: 4, safe: 0},
  Tamil: { pure: 2, safe: 0}
}

The following scripts have no registered names with same-script confusables:

{
  Katakana: { pure: 60, safe: 60},
  Egyptian_Hieroglyphs: { pure: 43, safe: 43},
  Hangul: { pure: 211, safe: 211},
  Ethiopic: { pure: 4, safe: 4},
  Hiragana: { pure: 18, safe: 18},
  Runic: { pure: 2, safe: 2},
  Gurmukhi: { pure: 2, safe: 2},
  Georgian: { pure: 1, safe: 1},
  Lisu: { pure: 1, safe: 1},
  Vai: { pure: 1, safe: 1},
  Phoenician: { pure: 1, safe: 1},
  Old_Italic: { pure: 1, safe: 1},
  Lao: { pure: 1, safe: 1}
}
Single Script + Common
{
  Latin: { pure: 538776, safe: 13 },
  Han: { pure: 1105, safe: 299 },
  Common: { pure: 8071, safe: 4922 },
  Thai: { pure: 25, safe: 4 },
  Arabic: { pure: 132, safe: 1 },
  Inherited: { pure: 1870, safe: 1870 },
  Hangul: { pure: 214, safe: 214 },
  Katakana: { pure: 114, safe: 114 },
  Cyrillic: { pure: 283, safe: 90 },
  Devanagari: { pure: 19, safe: 10 },
  Greek: { pure: 26, safe: 9 },
  Egyptian_Hieroglyphs: { pure: 43, safe: 43 },
  Runic: { pure: 3, safe: 3 },
  Ethiopic: { pure: 4, safe: 4 },
  Hebrew: { pure: 22, safe: 0 },
  Hiragana: { pure: 21, safe: 21 },
  Coptic: { pure: 1, safe: 1 },
  Canadian_Aboriginal: { pure: 1, safe: 0 },
  Gurmukhi: { pure: 2, safe: 2 },
  Georgian: { pure: 1, safe: 1 },
  Tibetan: { pure: 4, safe: 0 },
  Lisu: { pure: 1, safe: 1 },
  Tamil: { pure: 2, safe: 2 },
  Vai: { pure: 1, safe: 1 },
  undefined: { pure: 1, safe: 1 },
  Phoenician: { pure: 1, safe: 1 },
  Old_Italic: { pure: 1, safe: 1 },
  Lao: { pure: 1, safe: 1 }
}

I’ll try this again with a more relaxed set of confusables.

1 Like

Another very suspect character (that I wasn’t aware of) that’s still valid in IDNA 2008 is 0332 (COMBINING LOW LINE), a combining character that underlines the preceding character:

X + 0332 = underlined X

  • a = 0061
  • a̲ = 0061 0332
  • a = <a>0061</a>
  • a̲ = <a>0061 0332</a>
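The effect is easy to reproduce; a quick sketch showing that the underlined form is a genuinely different code point sequence, and that NFC does not collapse it (no precomposed form exists):

```javascript
const plain = "a";            // U+0061
const underlined = "a\u0332"; // U+0061 + U+0332 COMBINING LOW LINE
console.log(plain === underlined);                    // false: distinct labels
console.log([...underlined.normalize("NFC")].length); // 2: no precomposed form
```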

Here are the current deviations from the spec that were already discussed:

  1. Emoji (UTS-51) are handled separately from text (UTS-46).

  2. There should only be (1) stop/label-separator character (.) 002E (FULL STOP) rather than (4).
    ENS Name Normalization - #7 by nick.eth

  3. Underscore should be allowed (_) 005F (LOW LINE).
    ENS Name Normalization - #26 by nick.eth

  4. Because UTS-51 is sloppy, all emoji must be whitelisted.
    ENS Name Normalization - #23 by raffy
    The ambiguity of “poop joiner” prevents an algorithmic solution.
    ENS Name Normalization - #39 by raffy

  5. There are some well-supported non-RGI emoji that should be allowed. So far, I’ve only experimentally whitelisted “women wrestling”. This likely requires community review.
    ENS Name Normalization - #24 by raffy

  6. Tag sequences must be whitelisted because invalid tags render invisibly. Luckily, there are only 3. Again, this requires community review.
    ENS Name Normalization - #27 by raffy
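As context for deviation 2: UTS-46 treats four code points as label separators, while the deviation keeps only U+002E. A sketch of the stricter splitting rule (not the actual implementation):

```javascript
// The four separators UTS-46 would map to ".": FULL STOP, IDEOGRAPHIC
// FULL STOP, FULLWIDTH FULL STOP, HALFWIDTH IDEOGRAPHIC FULL STOP.
const UTS46_STOPS = ["\u002E", "\u3002", "\uFF0E", "\uFF61"];
// Under the stricter rule, only U+002E splits labels; the other three
// become plain disallowed characters.
const splitStrict = name => name.split("\u002E");
console.log(splitStrict("nick.eth")); // ["nick", "eth"]
```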

This is the latest report, using 553K registered labels, comparing eth-ens-namehash vs adraffy 1.3.13, which corresponds to UTS-51 + IDNA2008 + CheckHyphen + CheckBidi + ContextJ + ContextO + ChangesAbove.

1 Like

That seems reasonable.

I don’t follow - aren’t most names ASCII-only, and hence non-confusable?

1 Like

I thought so too, but…

1 Like

It’s a question of how things are defined. For example, Latin letter “o” has other Latin (same script) characters that look similar eg. that are still valid in IDNA 2008. I’ll recompute the stats where it’s not considered a confusable when it’s the simplest/canonical form, which would permit ASCII.

2 Likes

The detail of the normalization discussion here is mind-blowing; I appreciate that. I’d rather suggest a relatively simple approach. The biggest pain point is that there is no consistency between utilities, tools, and apps yet, so a name that seems valid in one is not valid in another. I believe starting with this issue might be more beneficial to users in the short term; improvements to normalization can then happen gradually, based on all the work being done here.

Based on my observations, I was trying to ease the problem we currently have. Having read all the previous conversations, I’m pretty sure the issue goes much deeper than what I was able to see. But to achieve some of these goals, I want to share the idea I came up with for the ens-metadata-service, which I believe could be implemented in other ENS tools and utilities as well, if you think it’s good enough as a first step.

To me, ENS normalization has three separate parts:

  • generic normalization based on UTS-46, currently done with eth-ens-namehash;
  • a way to check/rectify confusables;
  • performing both checks while respecting emoji, including compound ones.

While the first part is already done by eth-ens-namehash, if you think there is a better approach we can start using that instead.
For the second part, there was no such check in any implementation except the ens-metadata-service, via the ens-validation library.

The library itself seems to do a great job for ASCII-only characters, but it fails on the broader range of Unicode characters and emoji. I was able to bypass the library’s emoji check with a dirty hack, but non-ASCII symbols were the biggest obstacle: most of the symbol sets are missing from the library. Instead of dealing with that, I thought a better approach might be to detect confusables with the help of the official confusables database.

While looking around for more information, I came across this library, which does almost what I had in mind. The only missing part was compound emoji support, which I later forked and added here.

In the end, the combination of the eth-ens-namehash and unicode-confusables libraries does quite a good job of detecting invalid names. I even realized that it solves some of the issues mentioned in this discussion, though of course it is not perfect; some problems remain (e.g. currency symbols as confusables: ¢, £, ¥, €).
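For readers unfamiliar with how confusables detection works under the hood, here is a self-contained toy of the idea. This is not the unicode-confusables API; the three entries are a hand-picked excerpt of the thousands in Unicode’s confusables.txt:

```javascript
// Map each character to its "prototype", then compare the resulting
// skeletons; equal skeletons for unequal strings means a spoof risk.
const PROTOTYPES = new Map([
  ["\u0430", "a"], // CYRILLIC SMALL LETTER A looks like Latin "a"
  ["\u043E", "o"], // CYRILLIC SMALL LETTER O looks like Latin "o"
  ["\u0440", "p"], // CYRILLIC SMALL LETTER ER looks like Latin "p"
]);
const skeleton = s => [...s].map(c => PROTOTYPES.get(c) ?? c).join("");
const confusable = (a, b) => a !== b && skeleton(a) === skeleton(b);
console.log(confusable("\u0430pple", "apple")); // true: mixed-script spoof
```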

Please let me know what you think about the idea in general, and especially about the unicode-confusables implementation.

2 Likes

My library ens-normalize is also a UTS-46 solution (with options for which specific settings we want: IDNA Strict2003/Transitional/Strict2008, CheckJoiner, CheckBidi, ContextJ, ContextO, plus the differences from my previous post), with UTS-51 emoji parsing in front and a custom NFC implementation. It was designed to be small: 35KB with everything (23KB without CheckBidi and NFC). It also has many tests and generated reports.

It includes a visual report on confusables which shows that just using the confusables as-is will not fix the spoof between a Latin and Greek letter O.


Answering the question from earlier, I’ve recomputed the Single-Script+Common confusable mapping across all registered names, where the canonical character is allowed. This would still eliminate 140K registered Latin names.

Updated Confusable Tally
# pure = number of labels that only use the specified script + Common
# safe = number of pure labels where no character has a non-canonical confusable
{
  Latin: { pure: 538653, safe: 397511 },
  Han: { pure: 1104, safe: 1055 },
  Common: { pure: 7774, safe: 5654 },
  Thai: { pure: 25, safe: 14 },
  Arabic: { pure: 132, safe: 14 },
  Inherited: { pure: 1869, safe: 1869 },
  Hangul: { pure: 214, safe: 214 },
  Katakana: { pure: 114, safe: 114 },
  Cyrillic: { pure: 283, safe: 140 },
  Devanagari: { pure: 19, safe: 19 },
  Greek: { pure: 26, safe: 11 },
  Egyptian_Hieroglyphs: { pure: 43, safe: 43 },
  Runic: { pure: 3, safe: 3 },
  Ethiopic: { pure: 4, safe: 4 },
  Hebrew: { pure: 22, safe: 10 },
  Hiragana: { pure: 21, safe: 21 },
  Coptic: { pure: 1, safe: 1 },
  Canadian_Aboriginal: { pure: 1, safe: 1 },
  Gurmukhi: { pure: 2, safe: 2 },
  Georgian: { pure: 1, safe: 1 },
  Tibetan: { pure: 4, safe: 4 },
  Lisu: { pure: 1, safe: 1 },
  Tamil: { pure: 2, safe: 2 },
  Vai: { pure: 1, safe: 1 },
  undefined: { pure: 1, safe: 1 },
  Phoenician: { pure: 1, safe: 1 },
  Old_Italic: { pure: 1, safe: 1 },
  Lao: { pure: 1, safe: 1 }
}

There are 180 names with 3+ scripts.
There are 803 names with 2+ scripts excluding Common.
There are 52376 names with 2+ scripts.
1 Like

I wrote this on the GitHub issue but it’s worth repeating my thoughts here too.

At the end of the day, if the official ENS manager UI allows a name to be registered, then it should be considered valid in my opinion.

As far as normalization goes, I think it’s less important what the actual rules are, and more important that all the places in ENS that use normalization are consistent, or else it leads to these issues where someone registers a name and then finds out later that they can’t use it. This includes:

Since it sounds like the ENS metadata service is changing which names are valid/normalized and which ones aren’t, and some people have already registered names through the manager UI that they thought were valid, those people will simply be stuck with unusable/unsellable names. @mdt do you think it would be possible to get a report on how many currently registered names would still be rendered invalid after your changes?

3 Likes

Are you able to give some examples of legit-looking but newly-invalid names?

The thing I learned the most from this adventure is that I’m never interacting with an ENS name unless I can type it myself. The moment I see anything weird, basically any emoji or non-ASCII, I’m entering it into my resolver demo site so I can see exactly how it decomposes.


I definitely agree with this.

This report is relative to eth-ens-namehash 2.0.5. I could run this again for mdt’s fork?


Unsafe Latin Sample
messo
andymoog
dhoulmagus
fomoclub
snapmail
origamidao
immunologist
weem
championstickets
cryptomilan
metaar
mamadou
somepeoplecallmemaur
momentx
maevagennham
asylumdao
sgrasmann
pytm
metascraper
mavericknifty

Full Report.json

The main issue seems to be that m has multiple confusables: m itself and rn.

1 Like

What if you eliminate any confusables where the input and output are both in the same set?

In my opinion “m”, “0”, “1” should not be considered confusables, as flagging these would greatly dilute the signal to users. If marketplaces/integrations are using the metadata service as a source of truth for name validity, that is a powerful lever that should be used to promote uniformity across the ecosystem. Maybe the metadata service could include some of the extended visual parsing that Raffy’s tool provides, accompanied by specific UX guidelines from ENS on how best to handle confusables.

On a related note, I would like to suggest that the default avatar typeface be changed to one that better distinguishes between all confusing characters, but particularly zeroes and lowercase L vs I:

ens-avatar-typeface

Not suggesting a monospace font, just one with less ambiguity. Many confusables could potentially be addressed by a custom typeface. This makes the avatar overlay more useful. Imagine wallet UIs where the ENS avatar is displayed for quick visual confirmation, in the same way that avatars are used on Venmo, etc.

1 Like

It’s unfortunate that Metamask/Opensea/etc have built their own ENS “validation” implementations (and I agree with @serenae and @aox.eth that the Metamask “confusables” UI has gone too far). This should speak to a clear need for ENS to develop and maintain improved client packages/libraries, and actively encourage their widespread use.

As a “front-end developer” working on ENS integration, I would want the following API:

process(input) -> {
	"normalized": [ // for each label
		{"label": <uts46 normalized label>, "hash": <label hash>}
	],
	"nameHash": 0x123....,
	"display": <concatenated normalized field, but emoji are shown in fully-qualified form>,
	"warnings": [
		"mixed char sets", // Maybe ascii+emoji is ok?
		"extraneous invisible chars",
		"right-to-left chars",
		...
	],
	"info": [ // for each unicode char:
		{"unicode": <U+XXX>, "charset": <unicode charset>, "confusable": <bool>},
		...
	]
}
  • Normalized field should only be used for internal ENS lookups, and should never be shown to end-users. I think it’s important that the protocol-level normalization is as simple as possible; it’s helpful that it can be implemented in only a few dozen lines of code.
  • Display field is designed for display to end-users, and could include @raffy’s emoji processing logic (among other potentially useful things).
  • Warnings signal likely scam attempts, and should always be prominently shown to the end-user. We can debate about what should and shouldn’t go in here, but I think the goal is that “good faith” registrations should never show any warnings.
  • Implementors can decide whether exposing “info” makes sense for their use-case. I personally think “confusables” are more of a typography issue, but if they’re a hard Metamask requirement, that’s probably good enough reason to include them.

I’m probably missing some things, but I think this general approach would solve most of the problems discussed in this thread (including the ZWJ issue, and the emoji issue), without the need for drastic changes to low-level normalization procedures or on-chain storage. It’s also fairly amenable to change without breaking the API.

3 Likes

I attempted this, but it’s a little more complex than I thought. I’ll need to revise the previous results once I figure out a better solution.

If we’re only looking at labels post-normalization which are single-script (ignoring Common and Inherited), then the label is confusable-free iff every substring of the label is canonical (has no confusables, or is the preferred form for that script).

  • Every source in the confusable database is a single character, but the confusable itself might be multiple characters, eg. O- for θ. The Unicode spec (and the official confusable utility) appears to only work with single-character confusables.

  • Unicode also doesn’t make confusables reflexive, so m is confusable with rn but not vice-versa (I guess because only single character matches are considered?)

If we only do single characters, the above statement can be relaxed to just checking if every character is canonical.

However, I’m not sure how to derive the canonical choice (per script) from the confusable database when there are ambiguities (or if it exists). There are 661 examples where a single script has 2+ matches:

  • Confusable a for Latin: a vs ɑ → Trivial: a should be canonical
  • Confusable f for Latin: f vs. several lookalikes → Trivial: f should be canonical
  • Confusable l for Hebrew: ו vs ן → ???
  • Confusable o for Greek: ο vs σ → ???

There are also examples like rn and m that are just dumb (both ASCII), but I’m not sure how to resolve these in general either.

I will manually decide the Latin cases and take the rest as being invalid in any form and then recompute the results.
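The m/rn asymmetry makes more sense in terms of the skeleton algorithm: the data maps the single character m to the two-character sequence rn, and strings are compared only after mapping. A minimal sketch, assuming just that one entry from confusables.txt:

```javascript
// Per-character prototype map; sources are single characters, but a
// prototype may be several characters long.
const PROTO = new Map([["m", "rn"]]);
const skeleton = s => [...s].map(c => PROTO.get(c) ?? c).join("");
console.log(skeleton("m"));                    // "rn"
console.log(skeleton("m") === skeleton("rn")); // true: the pair collides
```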

2 Likes

This is an unending rabbit hole. The more you dig, the worse it gets.

We want to know how many names are valid if names can only span a single script (but can be mixed with common and inherited scripts.)

  • I claim we only care if there are confusables in the normalized output, not the input.
  • An emoji is never confusable.
  • A confusable can span multiple characters.
  • A confusable requires an exact match.
  • A confusable that gets transformed by normalization isn’t a possible match.
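A sketch of the script-tallying step, using JavaScript’s Unicode property escapes (the script list here is a small assumed subset; real code would enumerate every script):

```javascript
// Collect the scripts a label spans, ignoring Common and Inherited.
const SCRIPTS = ["Latin", "Greek", "Cyrillic", "Han", "Arabic", "Hebrew"];
function scriptsOf(label) {
  const found = new Set();
  for (const ch of label) {
    if (/[\p{Script=Common}\p{Script=Inherited}]/u.test(ch)) continue;
    for (const s of SCRIPTS) {
      if (new RegExp(`\\p{Script=${s}}`, "u").test(ch)) found.add(s);
    }
  }
  return found;
}
console.log(scriptsOf("apple"));      // only Latin
console.log(scriptsOf("\u0430pple")); // Cyrillic + Latin: mixed-script
```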

When the confusables are grouped by script, many confusables disappear, but there still exist groups of 2+ sequences that are confusable.

  • For Latin, I’ve resolved all of these conflicts by hand: either the confusable has a canonical result: ["a","ɑ"] -> "a", the confusable should be ignored: ["rn","m"] (both ASCII), or all variations are confusing: ["ɔ","ↄ","ᴐ"].
  • For all other scripts, I’m assuming any match is confusing.

Given this information, I can take every registered name that’s normalized and spans a single script, and check it for confusables of that script.

  • There are 581953 registered names.
  • 1653 are invalid (not normalized) according to ens-normalize.js
  • 634 span 2+ scripts
  • 9786 have no primary script (eg. only emoji)
  • There are 567770 Latin names: 317 confusing!?
  • There are 1693 non-Latin names: 411 confusing

This is actually much better than I expected.

Confusing Latin Example:
image

All 317 Latin Results:

All 411 Non-Latin Results:

Note: as I’ve mentioned earlier in the thread, single-script + confusables doesn’t fix the cross-script confusables like Latin/Cyrillic. For example, someone has registered “apple” in Cyrillic. It only fails because I didn’t canonicalize the Cyrillic confusables.
image

Edit: All Results as JSON: script-confusable-labels.json

I found 2 very minor bugs while writing documentation.

  1. Regional Indicators (eg. 🇦️) normalized differently when followed by FE0F. ENS Resolver
  2. Context and Bidi checks didn’t use NFC form.

Updated Report: eth-ens-namehash (ens) vs adraffy (1.3.14)
(not using single-confusable logic)

2 Likes

Very awesome work as always.

Just so I’m not misunderstanding… is that report an exhaustive list of all name validity changes that would occur if we, say, dropped your library into eth-ens-namehash today?

Also, this is running against eth-ens-namehash 2.0.8, isn’t that an old version/repository from 4 years ago? Shouldn’t the latest actually be @ensdomains/eth-ens-namehash 2.0.15?

I actually don’t know what version is being used across different tools/wallets/apps. It’s even inconsistent within ENS itself: the UI appears to use @ensdomains/eth-ens-namehash 2.0.15, and the ENS metadata service uses the up-to-date repo/version too. But ensjs uses eth-ens-namehash 2.0.8?

Metamask appears to use the old repository version too: https://github.com/MetaMask/metamask-extension/blob/develop/package.json#L155

And ethers.js appears to roll their own nameprep method (following IDNA, not sure what version)? https://github.com/ethers-io/ethers.js/blob/master/packages/strings/src.ts/idna.ts

So as far as “ENS validity” goes, your report is correct with respect to “what people see in Metamask after they register”, but not “what people can search on and register in the ENS manager”. For example, in the “ens-error” section I was puzzled because most of them appear to just be newer emoji.

Like 🧙‍♀️.eth is marked as invalid in the ens-error section, which matches up with Metamask:
image

But does not match up with the ENS manager app:

Maybe I’m just misunderstanding something, not sure. But would it make sense to change that report so it uses @ensdomains/eth-ens-namehash 2.0.15 instead?

2 Likes

I think the best solution is to make people confirm each Unicode character using a checkbox. The character’s description should be displayed along with its code point.

1 Like

Updated to 2.0.15: https://adraffy.github.io/ens-normalize.js/test/output/ens-2.0.15-adraffy-1.3.14.html

Yeah, the report shows 1653 names lost to normalization errors (1608 specific to my library + 45 shared with eth-ens-namehash). If single-script confusables are enforced, an additional 634 names are lost for having 2+ scripts, and 728 to confusables (317 Latin + 411 non-Latin).

3 Likes

Cool, makes sense!

It sounds like the library itself will need to be updated whenever new versions of Unicode are released, right? I see that the current eth-ens-namehash doesn’t support Unicode 14, but your library does 👍

Or would it make sense to future-proof it somewhat by preemptively allowing characters in designated emoji blocks, even if they haven’t been assigned yet? Symbols and Pictographs Extended-A - Wikipedia
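One possible shape for that future-proofing, assuming we’d simply allowlist the block range rather than individually assigned characters:

```javascript
// Symbols and Pictographs Extended-A occupies U+1FA70..U+1FAFF; treat
// anything in the block as potential emoji, assigned or not.
const inExtendedA = cp => cp >= 0x1FA70 && cp <= 0x1FAFF;
console.log(inExtendedA(0x1FAD6)); // true: TEAPOT (added in Unicode 13)
console.log(inExtendedA(0x1F600)); // false: GRINNING FACE, an older block
```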

2 Likes