Zero-width characters pose a security risk and existential threat to ENS

Isn’t this at the front-end library level, rather than at the smart contract/protocol level?

From what he said here, I understood that relying on the third party is a temporary band-aid, regardless of whether they make it unresolvable or just warn.

Like the showing of emojis on subdomains. If we have a list, we can decode them in our manager, and the Graph can also decode them, so that not every dapp has to decode on its own.

I think this argument heavily outweighs any other point in this discussion, and seeing the increasing flood of “fake” ENS names has completely deflated the value I saw in ENS names at the start.

I own a zwj-emoji domain, but I’ll happily give up ownership to see this fixed.

Almost everywhere I see an ENS domain in use, it’s clickable, so what the domain actually is (its characters) in relation to what it looks like does not matter at all.

Edit: Also, I’m not sure of this, but doesn’t Unicode categorize things into different categories and subsets that could help define which characters to support?


Yes. That’s fine, though - the design of ENS has always been such that you can register invalid names, they just won’t resolve due to the normalisation rules.

A whitelist wouldn’t help here, since names can have multiple characters in them, not just a single emoji.

Someone has registered a domain name with a zero-width connector. It is no longer possible to register domain names with zero-width connectors on Did he/she register it on other ENS-registered websites?

I firmly believe this conversation needs to be restarted.


Was anything more done on this front?

This is discussed in the Unicode specs as CONTEXTJ. From UTS #46: Unicode IDNA Compatibility Processing

Because of the visual confusability introduced by the joiner characters, IDNA2008 provides a special category for them called CONTEXTJ, and only permits CONTEXTJ characters in limited contexts: certain sequences of Arabic or Indic characters. However, applications that perform IDNA2008 lookup are not required to check for these contexts, so overall security is dependent on registries having correct implementations. Moreover, the IDNA2008 context restrictions do not catch most cases where distinct domain names have visually confusable appearances because of ZWJ and ZWNJ.

More specifically in RFC 5892 - The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)

Some code points need to be allowed in exceptional circumstances but
should be excluded in all other cases; these rules are also described
in other documents. The most notable of these are the Join Control
characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
NON-JOINER. Both of them have the derived property value CONTEXTJ.
A character with the derived property value CONTEXTJ or CONTEXTO
(CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate
rule has been established and the context of the character is
consistent with that rule. It is invalid to either register a string
containing these characters or even to look one up unless such a
contextual rule is found and satisfied. Please see Appendix A, “The
Contextual Rules Registry”, for more information.

UTS-46 calls out an implied asymmetry between domain resolution and domain registration. It expects that domain registrars will enforce stricter rules than those imposed by UTS-46, and accepts that some valid normalizations will never resolve because the domain can’t be registered.

Because normalization is not required for ENS domain registration (in the absence of on-chain normalization, anyone can register non-normalized 3+ character names directly on ETHRegistrarController), resolution is the only avenue by which restrictions can be placed on registrations. If ENS ever wished to enforce CONTEXTJ-style exceptions for Arabic/emojis/etc, these exceptions would need to be published and used in all client resolution libraries.

Good analysis, thank you.

We did not end up making changes to our normalisation process. We’d need to be very careful with any changes that restrict previously valid names unless we can be 100% certain it will only catch deceptive ones.

raffy.eth (not on the forum yet, I think), has written this new UTS-46 implementation. I’m hoping he’ll chime in here with some input.

This is a good point about the handling of the (2) zero-width characters, 200C and 200D. The other (2) deviations I believe should be allowed (and thus mapped) as the IDNA 2008 spec suggests: 00DF → C39F and 03C2 → CF82. Certainly leaving the zero-widths unchanged is bad, but dropping them without everyone knowing the situation is also bad.

CONTEXTJ seems to be described here: rfc5892. The only issue I see is that these rules are kinda messy to codify automatically, but they’re very simple to implement, e.g.:
If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C(Joining_Type:T)*(Joining_Type:{R,D})) Then True;

For @adraffy/ens-normalize.js, I let the zero-widths pass in my first version (1.0.2) as I assumed ENS was using IDNA 2008 so all deviations were allowed. I changed my library to support CONTEXTJ (which disallows ZW without context) and pushed a new version (1.0.3) which is reflected at my demo page: ENS Resolver. I’ve included specific examples w/r/t CONTEXTJ.

Kinda related: could the ENS dapp optionally display the namehash and/or a byte/codepoint representation of the name being registered?
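A codepoint view like the one suggested here is cheap to produce. The following is only a minimal Python sketch; the helper name `codepoints` is hypothetical and is not taken from the ENS dapp.

```python
def codepoints(name: str) -> str:
    """Render each code point of `name` as uppercase hex, space-separated."""
    return " ".join(f"{ord(ch):04X}" for ch in name)


# A genie emoji followed by two zero-width joiners renders like a lone emoji,
# but the code point view exposes the hidden characters.
label = "\U0001F9DE\u200D\u200D"
print(codepoints(label))  # 1F9DE 200D 200D
```

Showing this next to the rendered name would let a user spot invisible characters before registering.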

Unfortunately, it looks like CONTEXTJ kills half of the complex emoji: ENS Emoji Test

e.g. 1F9DD 1F3FC 200D 2642
200D => ZWJ => Rule 1 => 1F3FC is not Virama => Disallowed
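The check that rejects this sequence can be sketched in a few lines of Python. This is my reading of the ZERO WIDTH JOINER rule in RFC 5892 Appendix A (a ZWJ is allowed only directly after a character whose Canonical_Combining_Class is Virama), not code from any ENS library. The function name `zwj_allowed` is hypothetical, and the ZWNJ rule is left out because it needs Joining_Type data that Python's standard library does not expose.

```python
import unicodedata

ZWJ = "\u200D"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"


def zwj_allowed(label: str) -> bool:
    """True if every ZWJ in `label` directly follows a Virama-class character."""
    for i, ch in enumerate(label):
        if ch != ZWJ:
            continue
        if i == 0 or unicodedata.combining(label[i - 1]) != VIRAMA_CCC:
            return False
    return True


elf = "\U0001F9DD\U0001F3FC\u200D\u2642"  # 1F9DD 1F3FC 200D 2642
print(zwj_allowed(elf))                   # False: 1F3FC is not a Virama
print(zwj_allowed("\u0915\u094D\u200D"))  # True: U+094D is DEVANAGARI SIGN VIRAMA
```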

Is there a more modern rule that I’m missing?

Edit: looks like Recommended Emoji ZWJ Sequences, v14.0.

Edit 2: I think the ZW’s should be ignored when they’re out of context.

impersonators are always all over the place, for example police catch police impersonators all the time, but you can still buy police badges

no matter how restrictive are the rules, people will always find a way to bend them

I like to think about it like this → the ENS smart contract is “wholesale”: it’s dealing in bulk, large quantities, and as such it must be censorship resistant. But then a .eth name hits a wallet UI, or exchange UI, or some app UI; this is the “retail” level, where the approach can be more granular, so that it’s the UI’s problem to catch bad people, and the UI’s reputation would suffer if it’s not providing robust solutions against impersonation.

on the other hand it is beneficial to have fixed set of rules on smart contract level, and it is a very bad idea to keep changing them, eventually with time all UIs will learn the rules and develop robust strategies in dealing with problems

I think there’s a few issues here:

1.) The official dapp does not remove ZW characters and many have already been registered. For example: 🧞[0x200D][0x200D].eth

  • ENS NFT on Opensea
  • = labelhash(🧞[0x200D][0x200D])
  • = 2bee11594361....9266c44212152
  • = TokenID: 19870081826...51904077701458

2.) Using the encodedLabelHash encoding with the official dapp, you can register ANY namehash.
For example, I attempted to register 💩.eth (1 character) and it went through without any warning:

  • ENS NFT on Opensea
  • != labelhash(💩)
  • = labelhash([ba967c160905ade...9c8cbe452ad7a2])
  • = 6e0abe02c46fd98fe8652e10cf2717b988cfdd12484cd2d150ccf7f34bbaf215
  • = TokenID: 4977339322122.....604644227019285

This results in a bug where the encodedLabelHash gets registered as-is instead of the intended name (which should violate valid() or fail on rentPrice()):

  • [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].eth resolves to me, even though brackets are invalid.

Note the 3rd name on my ENS profile.

3.) parseSearchTerms in the official dapp uses UTF-16 character length instead of code-point count. Fortunately, this doesn’t result in any bugs because both valid() and rentPrice() enforce the 3 code-point minimum and UTF-16 length is always >= code-point length.
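To make the distinction in point 3 concrete, here is a small Python sketch comparing the two lengths. The helper names are hypothetical. JavaScript's `String.length` counts UTF-16 code units, which this emulates by encoding to UTF-16.

```python
def utf16_length(s: str) -> int:
    """Number of UTF-16 code units, which is what JavaScript's String.length reports."""
    return len(s.encode("utf-16-le")) // 2


def codepoint_length(s: str) -> int:
    """Number of Unicode code points; Python's len() already counts these."""
    return len(s)


poop = "\U0001F4A9"
print(utf16_length(poop), codepoint_length(poop))              # 2 1
print(utf16_length(poop + "a"), codepoint_length(poop + "a"))  # 3 2
```

A front-end check of "length >= 3" on UTF-16 units would accept the second string even though it has only two code points, which is why the stricter checks elsewhere matter.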

4A.) Personally, I think ZW should be ignored (removed) outside of CONTEXTJ to match the standard. This would require many registered emoji names to have their namehash (and NFT) changed. I have no idea what you would do regarding collisions.

In this situation, there’s nothing wrong with using the fully-qualified or minimally-qualified or even a mix of emoji – as long as they normalize to the same value, they’re the same.

  • norm("RAFFY.ETH") === norm("raffy.eth")
  • norm("🧟‍♂.eth") === norm("🧟♂.eth")
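A deliberately crude Python sketch of option 4A follows. `toy_norm` is a hypothetical stand-in for `norm()`: it lowercases with `str.lower()` rather than applying the UTS-46 mapping, and it drops every ZWJ/ZWNJ instead of keeping the ones that satisfy CONTEXTJ. It is only meant to show why the equalities above hold and where collisions come from.

```python
ZERO_WIDTH = {"\u200C", "\u200D"}  # ZWNJ, ZWJ


def toy_norm(name: str) -> str:
    """Toy stand-in for norm(): lowercase and drop zero-width (non-)joiners."""
    return "".join(ch for ch in name.lower() if ch not in ZERO_WIDTH)


print(toy_norm("RAFFY.ETH") == toy_norm("raffy.eth"))                            # True
print(toy_norm("\U0001F9DF\u200D\u2642.eth") == toy_norm("\U0001F9DF\u2642.eth"))  # True

# Collision: a name padded with joiners normalizes to the plain emoji name.
print(toy_norm("\U0001F9DE\u200D\u200D.eth") == "\U0001F9DE.eth")                 # True
```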

4B.) Another possibility would be deriving a rule which lets ZW exist inside emoji context, which would leave all of the fully-qualified names untouched. However, you’d need to apply the reverse transformation to minimally-qualified emojis (to make them fully-qualified during normalization) and again deal with the collision issue.

You don’t have 💩.eth - you have [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].eth. The frontend definitely shouldn’t let you register that, though.

It’s not clear to me how you got from 💩.eth to registering this in the UI, though, without manually copying-and-pasting a labelhash. Can you elaborate?

I think this would be ideal, as it’d avoid breaking existing perfectly reasonable emoji names, as well as preventing deceptive uses of ZWJ.

What about disallowing names with ZWJs whenever a ZWJ is adjacent to an alphanumeric or hyphen character?

EDIT: this helps with emojis but will not work with languages that mix Latin letters with special characters.
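A quick Python sketch of this adjacency rule, to show both what it catches and where it overreaches. The function name is hypothetical, and using the Unicode-aware `str.isalnum()` as the definition of "alphanumeric" is my assumption; an ASCII-only test would behave differently.

```python
ZWJ = "\u200D"


def has_suspicious_zwj(label: str) -> bool:
    """True if any ZWJ touches an alphanumeric character or a hyphen."""

    def is_plain(ch: str) -> bool:
        return ch.isalnum() or ch == "-"

    for i, ch in enumerate(label):
        if ch != ZWJ:
            continue
        before = label[i - 1] if i > 0 else ""
        after = label[i + 1] if i + 1 < len(label) else ""
        if (before and is_plain(before)) or (after and is_plain(after)):
            return True
    return False


print(has_suspicious_zwj("pay\u200Dpal"))                # True: hidden joiner between letters
print(has_suspicious_zwj("\U0001F9DF\u200D\u2642"))      # False: joiner sits between emoji
print(has_suspicious_zwj("\u0915\u094D\u200D\u0937"))    # True: also rejects a Virama + ZWJ sequence
```

The last case is a ZWJ that CONTEXTJ would permit, so a rule like this would need script-aware exceptions.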

Correct, my wording was bad. I do not own 💩.eth. I saw that the dapp code supports encodedNameHash, so I computed labelhash(💩), and then registered [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].eth, because I was curious what it would do if it didn’t know the actual name. I was extra curious when I didn’t pay a large fee (but then I read the contracts and saw how that works.)

However, it shows up as “💩.eth” for me, because the dapp memoizes previously attempted labels. decodedNameHash then knows that [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2] corresponds to 💩.
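The display behaviour described here can be modelled roughly as below. This Python sketch is based on my reading of the post, not on the dapp's source: `seen_labels` and `decode_label` are hypothetical names, and the single cache entry uses the hash quoted in this thread for labelhash(💩).

```python
import re

# Hypothetical cache of labelhash -> label, filled whenever a label is searched.
seen_labels = {
    "ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2": "\U0001F4A9",
}

# An encoded label is a 64-digit hex hash wrapped in square brackets.
ENCODED = re.compile(r"^\[([0-9a-f]{64})\]$")


def decode_label(label: str) -> str:
    """Replace an encoded "[<64 hex>]" label with its cached plaintext, if known."""
    m = ENCODED.match(label)
    if m is None:
        return label
    return seen_labels.get(m.group(1), label)


shown = decode_label("[ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2]")
print(shown == "\U0001F4A9")  # True: the UI shows the emoji, not the registered label
```

The registered label is still the literal bracketed string; only the displayed text changes.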

The token I have is 6e0abe02c46fd98fe8652e10cf2717b988cfdd12484cd2d150ccf7f34bbaf215 which is the labelhash of [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].

I added support for Unicode v14.0.0 emoji that contain ZWJ to @adraffy/ens-normalize.js (1.0.6). These changes are reflected in my ENS Resolver app.

My library currently applies IDNA 2008 rules with CONTEXTJ but also retains any emoji from the recommended set and upgrades any combinations that were entered minimally-qualified. This effectively preserves existing namehashes by injecting missing ZWJ during normalization. ZWJ are ignored outside these contexts.

ens_normalize("🧟‍♀️") == "🧟‍♀️" == ens_normalize("🧟♀")
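The "upgrade" step can be illustrated with a toy Python version. This is not the library's algorithm: the two-entry `RECOMMENDED` table is a hypothetical excerpt of the recommended emoji ZWJ sequences, the CONTEXTJ handling for Arabic/Indic scripts is omitted, and all names are made up for the example.

```python
ZWJ, VS16 = "\u200D", "\uFE0F"

# Tiny hypothetical excerpt of the recommended sequences, in fully-qualified form.
RECOMMENDED = [
    "\U0001F9DF\u200D\u2640\uFE0F",                # woman zombie
    "\U0001F468\u200D\U0001F469\u200D\U0001F466",  # family: man, woman, boy
]


def strip_joiners(s: str) -> str:
    """Remove every ZWJ and variation selector-16."""
    return s.replace(ZWJ, "").replace(VS16, "")


# Map each stripped form back to its fully-qualified sequence.
UPGRADE = {strip_joiners(seq): seq for seq in RECOMMENDED}
KEYS = sorted(UPGRADE, key=len, reverse=True)  # try the longest sequence first


def toy_emoji_normalize(label: str) -> str:
    """Drop ZWJ/VS16 everywhere, then re-insert them for known emoji sequences."""
    bare = strip_joiners(label)
    out, i = [], 0
    while i < len(bare):
        for key in KEYS:
            if bare.startswith(key, i):
                out.append(UPGRADE[key])
                i += len(key)
                break
        else:
            out.append(bare[i])
            i += 1
    return "".join(out)


full = "\U0001F9DF\u200D\u2640\uFE0F"
print(toy_emoji_normalize("\U0001F9DF\u2640") == full)  # True: joiner injected
print(toy_emoji_normalize(full) == full)                # True: already qualified
print(toy_emoji_normalize("a\u200Db") == "ab")          # True: stray ZWJ ignored
```

Because the fully-qualified form is the output, names registered with the proper joiners keep their existing hash.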


From an earlier draft of UTS-46, one reviewer brought up the emoji case:

  1. If CheckJoiners, the label must satisfy the ContextJ rules from Appendix A, in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008] , except that if EmojiVersion≠0, ZWJ characters are allowed if they are within Emoji ZWJ Sequences specified for Unicode Emoji Version=EmojiVersion.

Not sure if there were any additional internal Unicode discussions around this, but it seems like it didn’t make it into a final draft.

How would it treat “:family:.eth” vs “:man::woman::boy:.eth”? Are these 2 distinct domains?

ens_normalize(👨‍👩‍👦.eth) == ens_normalize(👨👩👦.eth)

Namehash: 6a6e9485869136355dfca02a926456f0f66316e92a53fe6b9ad732a9f55baa13

ENS Resolver: Joined | Split

This is a very nice piece of work, and I think it could be the foundation for a better way of normalising names for ENS. There’s a couple of things we’d need to make that so:

  1. Clear, explicit documentation describing the normalisation process, such that anyone else can implement it from scratch; it’s not viable for people to rely on a single JS library everywhere. Preferably, pseudocode that starts from the primitive of a compliant UTS-46 implementation.
  2. Tests over all existing ENS names to see which names’ resolution will be affected and how.

If you’re prepared to handle #1, I can take care of #2.

I released an update that has an optional boolean which ignores (rather than throws) on disallowed characters. I also added another layer of compression and got the minified file down to ~25KB (17KB gzip).

I’ve added a few comments and citations regarding the algorithm and sequence of operations.

I also included the start of a bunch of tests:

  • known.js tracks things that I’m specifically aware of and test-known.js makes sure they match.
  • goofy-labels.txt is a complete list of non-trivial registered ENS labels (thanks to @nick.eth) and check-goofy.js generates goofy.html for normalizations that don’t match.
  • opensea.js pulls known (name, token, owner) tuples and can generate opensea-label-hash.json, from which check-opensea.js generates opensea.html for label-hashes that don’t match.
  • compare-ethers.js compares ens_normalize() to ethers nameprep() using known.js and generates compare-ethers.html

Before we can deploy this we’ll need documentation that’s comprehensive enough someone can recreate the algorithm from scratch independently, and test vectors they can use to check their implementation. I’m happy to help with that.

It seems clear that a lot of these names were not normalised - and hence not resolvable - in the first place. Would it be possible to filter the list for names that are normalised according to the current Ethers implementation, and then only show those that have a different normalisation under yours?
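The filter being asked for could look something like this Python sketch. Everything here is an assumption for illustration: `old_norm` and `new_norm` stand in for the Ethers and new implementations, both are assumed to raise `ValueError` on disallowed input, and the demo normalizers are toys.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Normalizer = Callable[[str], str]


def changed_names(
    names: Iterable[str], old_norm: Normalizer, new_norm: Normalizer
) -> List[Tuple[str, Optional[str]]]:
    """Names already normalised under old_norm whose new_norm result differs."""
    changed = []
    for name in names:
        try:
            if old_norm(name) != name:
                continue  # not normalised today, so it never resolved
        except ValueError:
            continue  # rejected by the current implementation
        try:
            new = new_norm(name)
        except ValueError:
            new = None  # newly disallowed
        if new != name:
            changed.append((name, new))
    return changed


# Toy normalizers: the "old" one only lowercases, the "new" one also drops ZWJ.
old = str.lower
new = lambda s: s.lower().replace("\u200D", "")

print(changed_names(["raffy", "RAFFY", "a\u200Db"], old, new))
```

Only the third name is reported: the first is unchanged, and the second was never normalised under the old rules.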