ENS Name Normalization

I investigated the NormalizationTest issue a bit more. I compared my library to Python (unicodedata), Mathematica (CharacterNormalize), and various JS engines (Node, Chrome, Safari, Brave, Firefox) and realized it’s a total shitshow. My library and the latest version of Firefox are the only ones that pass the test. Pinning NFC to a specific Unicode spec appears to be the right choice.

I made a simple report that compares the allowed characters (valid or mapped, minus emoji) between ENS0 (current) and IDNA 2008. I overlaid it with my current-but-incomplete whitelist (green = whitelist as-is, purple = whitelist mapped). These are the characters that need review, and any input would be very helpful: IDNA: ENS0 vs 2008. Edit: I added the number of times each character shows up in a registered name in brackets, e.g. § [5] means 5 registered names use this character.

There is also a much larger list (the full disallowed list minus this list) that potentially contains characters that should have been enabled in the first place, e.g. underscore (_). However, this list is too large for an HTML report.

I added an additional 25K registered labels to the comparison reports. Also, this service displays the last 1024 registered labels with normalization applied: Recent ENS Names

2 Likes

UTS-39 and the Chromium documentation discuss some good stuff, but they’re both even more restrictive than IDNA 2008.

UTS-39 references a tool which displays confusables. For example, x vs х is crazy dangerous. It shows up in 14 registered names so far: some surrounded by the appropriate script and a few that are certainly malicious. Applying a confusable-like mapping might be a good idea, but it will brick some names. Some of these are already handled by IDNA 2008.

To reduce implementation complexity, rather than relying on many public Unicode files, ENS could supply a single table that combines UTS-51 (+SEQ/ZWJ), UTS-46, and UTS-39, which would make implementation relatively straightforward.
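For illustration only, such a combined table might look something like this. The shape and field names here are hypothetical, not a proposed format:

```js
// Hypothetical shape for a single combined table; all names are illustrative only.
const table = {
  version: '14.0.0', // pinned Unicode version
  chars: {
    // one entry per codepoint, derived from UTS-46 + IDNA 2008
    0x41: {status: 'mapped', to: [0x61]}, // 'A' → 'a'
    0x5F: {status: 'disallowed'},         // '_' (disallowed in IDNA 2008)
  },
  // whitelisted UTS-51 emoji ZWJ sequences
  emoji: [
    [0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F466], // 👨‍👩‍👦
  ],
  // UTS-39 confusable targets
  confusables: {
    0x3BF: 0x6F, // Greek omicron → Latin 'o'
  },
};
```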

Edit: Here is a first attempt at a visualization of the confusables relative to IDNA 2008.

The way to read this is: “o” is a confusable category. There are 75 characters that are confusable with it. According to IDNA 2008, they correspond to 42 separate entities after normalization is applied. The largest group has 14 characters that map to o. The next largest group has 7 characters that map to ه, etc. Groups of one are just shown as a single element (without the count and black arrow) to save space. The color codes match the rest of the reports: green = valid, purple = mapped, red = disallowed.

The scary thing would be any groups that map to similar yet distinct characters. For the image above, 14 map to o (6F) and 5 map to ο (3BF). There’s very little visual difference between “Latin Small Letter O” and “Greek Small Letter Omicron”.
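To make the chart’s grouping concrete, here’s a minimal sketch of the bucketing step, assuming a stand-in `uts46Map` table (codepoint → {status, to}):

```js
// Sketch: bucket one confusable category's members by where UTS-46/IDNA 2008
// normalization sends them. `uts46Map` is a stand-in for the real table.
function groupByTarget(confusables, uts46Map) {
  const groups = new Map();
  for (const cp of confusables) {
    const entry = uts46Map.get(cp) ?? {status: 'disallowed'};
    const key = entry.status === 'mapped' ? String.fromCodePoint(...entry.to)
              : entry.status === 'valid'  ? String.fromCodePoint(cp)
              : `disallowed:${cp.toString(16)}`; // disallowed chars stay singletons
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(cp);
  }
  // largest group first, matching the report layout
  return [...groups.entries()].sort((a, b) => b[1].length - a[1].length);
}
```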

2 Likes

I made some improvements to the Confusables report. I also added a visual breakdown of the Scripts, with emoji removed, and included the character name when you hover over a character.


I am currently thinking about a script-based approach to address homograph attacks, as my attempts at whitelisting haven’t been very successful with so many characters. First, each script can be independently reduced to a non-confusable set. Then, each script can specify which scripts it can mix with. By reducing the problem from all characters to just the scripts that should be combined together, the script-to-script confusable surface is many orders of magnitude smaller.

A few of the script groupings are too sloppy, so I suggest creating a few artificial categories, like Emoji, Digits (which span all scripts), and Symbols (split from Common), and merging a few that are frequently used together (Han/Katakana/Hiragana). The Common/Latin/Greek/Cyrillic scripts should be collapsed using an extremely aggressive version of the confusables, so there are absolutely zero confusables with ASCII-like characters.

Then, based on the labels registered so far, determine what kind of script-based rule permits the most names, e.g. “Latin|Emoji|Digits|Symbols” is a valid recipe.
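A minimal sketch of checking a label against such a recipe, with `scriptOf` as a stand-in for a codepoint → script/category lookup (folding in the artificial categories above):

```js
// Sketch: does every character of the label fall within the recipe's script set?
function matchesRecipe(label, recipe, scriptOf) {
  const allowed = new Set(recipe.split('|')); // e.g. {Latin, Emoji, Digits, Symbols}
  for (const ch of label) {
    if (!allowed.has(scriptOf(ch.codePointAt(0)))) return false;
  }
  return true;
}
```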

I’ve computed some tallies that show which scripts show up in labels, using the following process: map each character of a label to a script, aaαa → {Latin,Latin,Greek,Latin}, then collect the distinct scripts per label:
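A sketch of that tally (again with a stand-in `scriptOf` lookup):

```js
// Sketch: tally registered labels by their (sorted) set of scripts.
function tallyScriptSets(labels, scriptOf) {
  const tally = new Map();
  for (const label of labels) {
    const scripts = new Set([...label].map(ch => scriptOf(ch.codePointAt(0))));
    const key = [...scripts].sort().join('|'); // sorted: {Greek,Latin} == {Latin,Greek}
    tally.set(key, (tally.get(key) ?? 0) + 1);
  }
  return tally;
}
// e.g. tallyScriptSets(['abc', 'aaαa'], scriptOf) → Map {'Latin' => 1, 'Greek|Latin' => 1}
```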

For the sorted case, you can see most labels aren’t that diverse:

Since 371K labels use only the Latin script, I think a good starting point is Emoji, Digits (0-9), and Latin (A-Z), then building up, using the confusable mapping to allow more characters, until all of those names are accepted. And then grow from there.

1 Like

I have nowhere near the expertise and knowledge that you do with this, but I have a question: would it be useful to disallow any names beginning or ending with a ZWJ, plus disallow consecutive ZWJs within a name? It seems like this would get rid of a lot of scam names, or at least cut them way back.

So the rule would be that a name can’t begin or end with a ZWJ or contain more than one ZWJ in a row (I’m assuming consecutive ZWJs aren’t used for any words in other languages or in so-called ASCII art). Is this thinking in the right direction or helpful?
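In code form, the checks I mean are just this (minimal sketch):

```js
// Sketch of the proposed rules: no leading/trailing ZWJ (U+200D),
// and no two ZWJs in a row.
const ZWJ = '\u200D';
function violatesZwjRules(name) {
  return name.startsWith(ZWJ)
      || name.endsWith(ZWJ)
      || name.includes(ZWJ + ZWJ);
}
```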

1 Like

Thinking about it further, I think this might solve all the issues for emoji domains with the addition of one rule: disallow character-ZWJ-character-ZWJ-character…, which can’t be a pattern for legitimate emoji domains. In other words, every other character can’t be a ZWJ.

So, the rules for emoji are: 1) the name can’t begin or end with a ZWJ, 2) no consecutive ZWJs, and 3) no alternating ZWJs with one character between them. I think those rules restrict emoji domains to only legitimate ones. Unsure if this is useful…

EDIT: Rule 3 is inaccurate; there are many single-glyph emoji that are 3 or 4 emoji combined and do use every other character as a ZWJ. So strike rule 3.

1 Like

In ens-normalize.js, ZWJ is only allowed inside whitelisted UTS-51 emoji sequences and in the ContextJ rules, which permit it between or following a few specific characters.

Technically, UTS-51 allows arbitrary-length ZWJ sequences, but many of those do not correspond to a single glyph and are indistinguishable from the unjoined characters, which is why a whitelist is required.

For example, 💩‍💩.eth = 1F4A9 200D 1F4A9 fits your rules and is valid UTS-51 but invalid in ens-normalize.js

Another example would be 😵‍💫😵‍💫😵‍💫.eth vs 😵💫😵💫😵💫.eth which can only be distinguished by ZWJ placement.
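For a concrete check of the first example, assuming the published ens-normalize package API (which throws on names it considers invalid):

```js
import {ens_normalize} from '@adraffy/ens-normalize'; // assumed package/export name

for (const name of [
  '\u{1F4A9}\u{200D}\u{1F4A9}.eth', // 💩‍💩 = 1F4A9 200D 1F4A9 (ZWJ-joined)
  '\u{1F4A9}\u{1F4A9}.eth',         // 💩💩 (no ZWJ)
]) {
  try {
    console.log(name, '→', ens_normalize(name));
  } catch (err) {
    console.log(name, '→ invalid:', err.message); // non-whitelisted ZWJ sequence
  }
}
```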

I would consider the ZWJ-inside-emoji problem solved, although I’m open to ZWJ sequence recommendations (Unsupported Set from Emojipedia); e.g. Microsoft supports a much wider set of family permutations.

The 3 ContextJ allowances are here, but I never received any input on whether they were worth the extra complexity.

I believe most libraries currently in use allow both joiners (ZWJ and ZWNJ) anywhere, as they are permitted in IDNA 2008 (typically under the assumption that ContextJ is active). They’re actually stripped in IDNA 2003 (they’re deviation characters) but were permitted to enable complex emoji.

The two lists I’m using are:

1 Like

It looks like ENS will have to whitelist new emojis when they come out. I’m wondering if ENS will end up auctioning them off (outside of the normal way), since emoji seem like the only whitelisted class where new characters keep coming out that are widely used and universally recognized (emojis = pictographs).

Thanks for sharing those links

Is there a key to the meaning of the colors here? There are a few green ones I wouldn’t expect to be allowed (such as quotation marks) and many red ones that seem harmless.

This is the approach Chrome takes for IDNA domains, described here: Chromium Docs - Internationalized Domain Names (IDN) in Google Chrome

This seems reasonable, as long as we can ensure we don’t deploy a version that will cause legitimate existing names to stop resolving.

As it stands, we can’t prohibit registration of domains, only make them unresolvable - so we couldn’t auction them off. This would change if raffy succeeds in making an EVM implementation of his algorithm.

1 Like

I wanted to avoid this situation, but there’s no algorithmic solution that prevents malicious use. Personally, I was really disappointed in how Unicode handled emoji. Even the choice of reusing ZWJ, rather than a separate visible character that turns invisible when supported, was an enormous UX mistake.

I decorated it using the IDNA 2008 rules. I’ll add a key; it’s the same coloring I’m using everywhere: green = valid, purple = mapped, red = invalid, gray = ignored, black = special.

1 Like

Are you saying that prohibiting the registration of scam names would become possible, or that ENS could auction off new emoji domains when they come out?

BTW @raffy, I remember seeing emojis used in smart contracts and in token names when looking on Etherscan. Would that make the whitelist easier to build, or simpler to write?

EDIT: fwiw contract development - Can I use an emoji as part of a string in Solidity? - Ethereum Stack Exchange

From the start, ens-normalize was developed with the idea that I would make an on-chain version, as that was the original motivation. I wrote a simple version a while ago, but decided that any development time is wasted until the spec is figured out. The original compression method would have worked in Solidity, but the latest version of the library uses an arithmetic compressor, which would not be gas efficient.

The client-side callable version is relatively straightforward, as the full payload is only 12KB, but the main value of an on-chain algorithm would be being able to call it efficiently during a transaction. For this, you’d need a clean implementation with aggressive short circuiting, so the common cases are efficient.
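As an example of the kind of short circuiting I mean (sketched in JS for brevity; the regex fast path is my assumption, not the library’s actual structure):

```js
// Sketch: most names are plain a-z0-9 (hyphen) labels that normalize to
// themselves, so the expensive Unicode path can usually be skipped.
function normalize(name, fullNormalize) { // fullNormalize: the real routine
  if (/^[a-z0-9-]+(\.[a-z0-9-]+)*$/.test(name)) return name; // common case
  return fullNormalize(name); // rare case: full Unicode processing
}
```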

The dream would be an upgradeable contract, but the Unicode release cycle is every 2 years(?), so I think the smarter goal is just making a good reference implementation that is simple enough to be ported everywhere without ambiguity (unlike most Unicode implementations), and then writing the EVM version.

1 Like

Shouldn’t the quote mark and other ASCII symbols be red, then?

Like ʺ = 02BA and others? I agree it’s nonsense but they’re valid in IDNA 2003 and 2008. I added an ASCII table:

Sorry for the lack of updates.

I haven’t been able to figure out a solution to the confusable problem. I think the script idea above is valid, but breaking Common/Latin/Greek/Cyrillic into smarter groups has proven very difficult. Possibly I was just aiming for something that requires too much investment.

I’ve changed the default export of my library to the compat version. I believe this is faithful to EIP-137 and solves the emoji/ZWJ issue. I’ve discussed this version earlier in the thread.


To recap the problems still unsolved:

  • There are many Unicode symbols that look visually identical, even after UTS-46 is applied. Some of these exist between separate scripts. Some are within the same script.

  • There is a Unicode spec (UTS-39) for grouping these symbols into confusable groups. Everything purple or green in these charts is a valid ENS character. I’ve discussed this earlier in the thread.

  • AFAIK, the best algorithm for dealing with this problem is outlined in the Google Chromium documentation. Unfortunately, many of the techniques involve using the punycode form to indicate that the input name is potentially confusing.

  • If ENS used the punycode solution, it would mean there are 3 possible states for a name post-normalization: (1) an invalid name, (2) a confusable name (shown as punycode), or (3) a valid name. Both (2) and (3) would be reachable. I’m not against this idea, I simply hadn’t considered it, but it’s certainly possible.

  • UTS-39 and the Chrome solution also implement script-based restrictions. DNS has the advantage that traditional TLDs are country-based, which means there’s an implicit script set associated with many TLDs that can help resolve script ambiguities. .eth is international (similar to newer TLDs) and has no implicit script.

  • Even if ENS uses the strictest script-based technique (one script per label; sketched after this list), there exist labels that look visually similar. As a native English user, most examples I’m aware of exist between the Common/Latin/Greek/Cyrillic scripts.

  • My idea was to split Common/Latin/Greek/Cyrillic into better groups, such that script restrictions would work AND prevent the obvious cross-script confusables. I’ve discussed this earlier in the thread. I haven’t been able to satisfactorily construct these groups.
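For reference, a minimal sketch of the strictest rule mentioned above (one script per label), with `scriptOf` again a stand-in codepoint → script lookup:

```js
// Sketch: single-script restriction. Common/Inherited characters (digits,
// combining marks, etc.) may mix with any one script.
function singleScript(label, scriptOf) {
  let script = null;
  for (const ch of label) {
    const s = scriptOf(ch.codePointAt(0));
    if (s === 'Common' || s === 'Inherited') continue;
    if (script === null) script = s;
    else if (script !== s) return false; // mixed-script label
  }
  return true;
}
```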

1 Like

What would be the impact if we simply considered all confusable names invalid?

That’s probably worth exploring. I’ve looked at the statistics for replacements and scripts in the registered labels, but haven’t measured how confusable they are. Computing which character in each confusable group is used most frequently (other than the preferred one) is probably also useful.
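A sketch of that measurement, assuming a `groups` map from each preferred character to its confusable members:

```js
// Sketch: count how often each non-preferred confusable member appears
// in the registered labels.
function tallyConfusableUse(labels, groups) {
  const members = new Set();
  for (const [preferred, chars] of groups) {
    for (const ch of chars) if (ch !== preferred) members.add(ch);
  }
  const counts = new Map(); // member char → occurrences
  for (const label of labels) {
    for (const ch of label) {
      if (members.has(ch)) counts.set(ch, (counts.get(ch) ?? 0) + 1);
    }
  }
  return counts;
}
```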

2 Likes

Is this guy with OpenSea or ENS? I ran across the chat today jw…

Hi Nick,

I’m new to the forum but I’ve been looking at the ENS project for a while and wanted to make a suggestion regarding this issue in particular.

There is a known issue with people being led astray by ENS names with zero-width characters and other shenanigans. It isn’t particularly hidden either, and clients (OpenSea etc.) are signalling it loudly to their users with ugly warning triangles on names that contain non-ASCII characters.

There’s been extensive discussion here about the subject, and I see you’re leaning strongly towards in-client normalisation, a library, etc. I think that approach is going to have problems because you can’t force clients to use your library, and people can interact with the contract directly and find loopholes.

I think you have a complicated problem on your hands and, for now, it might be a good solution to launch a new extension, perhaps .ens, with character restrictions enforced on-chain, permitting only a-z0-9 ASCII characters. That eliminates a whole ton of issues, provides some separation from the perhaps-slightly-tainted .eth name, and makes it abundantly clear that the ZWJ issues won’t occur with the new extension. Your exponentialpriceoracle, when applied to a new extension, should produce strong capital inflows.

Then you can solve the .eth normalisation issue in time.

1 Like

Also, I was thinking about your normalisation issue a lot, and I’m not confident it’s a problem you can solve in code.

You’ll have issues where somebody registers an emoji string of three black-haired women and someone else tries to spoof it by registering two black-haired women with a brown-haired woman sandwiched between them.

There are about three different cat emojis now, and they all render differently on different platforms. Some updated platforms render the three different ones identically, using the ‘primary’ cat emoji. The gun emoji famously renders as a water pistol on iOS.

If you really do want to allow emojis, thinking about it as a developer, and bearing in mind that the purpose of a name is to be uniquely distinguishable, I’d want to go with a manually selected whitelist, and you’re going to get grief about your selections. For example, I’d permit only the generic ‘yellow’ character emoji, because if you allow one ‘skin tone’ emoji you’re going to face pressure to allow all skin tone emojis, which would be great in theory, but if someone tells me ‘my ether name is a brown girl emoji’ I can’t reliably pick that one out of a list without at least something to compare with. And then you have colourblind users…

For that reason I’d go with a strict whitelist, and a short one at that.

That also generates premiums. If you limit the system to, for example, text-only domains or emoji domains which must contain three emojis from a short whitelist… then you’ll see some premium prices. You still have a huge set to sell from, and you eliminate the case of one guy buying three hearts and the next guy buying four hearts… you eliminate so much potential user confusion, and that’s vital.

It should be remembered that the use case for most users revolves around fund transfers. Most people want their fund transfer ecosystem to be strict and regimented and with little room for error. SWIFT allows a very limited set of characters in their MT103s for good reason.

Just my thoughts.

1 Like