Zero-width characters pose a security risk and existential threat to ENS

matoken.eth · April 20, 2021, 8:44am

I sort of like the idea of maintaining a whitelist of emoji characters, since indexing services like TheGraph could also use it to decode emojis in subdomains (the lack of such a list is why we see so many jumbled characters in the subdomain list).

Can you remind me where you are suggesting to make it unresolvable?
“Making it unresolvable” is a form of censorship, and I think @nick.eth left that mechanism out of the smart contract/protocol level by design.

Restricting new registrations or renewals at the .eth registrar level is probably more feasible, though.

Someone called me out, asking who the hell I am, but neither you nor I actually have the authority to say which names/emojis should be on the whitelist. This is probably where we need some sort of DAO to make the collective decision.

vincentkovacs · April 20, 2021, 12:46pm

Emojis utilizing ZWJ presumably follow formatting rules, so, as @tom said, a regex could also be a solution, at least at registration.

Nobody can predict which new emojis will come out in the future, though, so I don’t actually know how fast the service could react to that, whether with a whitelist or a regexp…

nick.eth · April 20, 2021, 8:13pm

Zero-width joiners are not only used for emoji; they’re used to compose characters in languages other than English, too. We cannot solve this by whitelisting without excluding a huge number of non-English speakers from using names that make sense to them.

I’m not sure what issue with the Graph you’re talking about?

I believe he’s talking about a change to the resolution/normalisation rules as I suggested earlier.

One thing I forgot to bring up in my earlier post is that MyCrypto has a library that identifies domains with deceptive encoding by using Chrome’s algorithm: GitHub - ensdomains/ens-validation

matoken.eth · April 21, 2021, 8:11am

Isn’t this at the front-end library level, rather than at the smart contract/protocol level?

From what he said here, I understood that relying on a third party is a temporary band-aid, regardless of whether they make it unresolvable or just warn.

Like the display of emojis on subdomains: if we have a list, we can decode them in our manager, and the Graph can decode them too, so that every dapp doesn’t have to decode them on its own. https://app.ens.domains/name/ethmojis.eth/subdomains

dude · April 25, 2021, 11:36pm

I think this argument heavily outweighs any other point in this discussion, and seeing the increasing flood of “fake” ens-names has completely deflated the value I saw in ENS names at the start.

I own a zwj-emoji domain, but I’ll happily give up ownership to see this fixed.

Almost everywhere I see an ens-domain in use, it’s clickable, so what the domain actually is (its characters) in relation to what it looks like does not matter at all.

Edit: Also, I’m not sure of this, but doesn’t Unicode categorize things into different categories and subsets that could help define which characters to support?

nick.eth · April 27, 2021, 9:34pm

Yes. That’s fine, though - the design of ENS has always been such that you can register invalid names, they just won’t resolve due to the normalisation rules.

A whitelist wouldn’t help here, since names can have multiple characters in them, not just a single emoji.

himalaya · May 2, 2021, 1:29am

Someone has registered a domain name with a zero-width joiner. It is no longer possible to register domain names with zero-width joiners on app.ens.domains. Did he/she register it on another ENS registration website?

Shiba · August 17, 2021, 8:19pm

I firmly believe this conversation needs to be restarted.

royalfork · November 23, 2021, 4:56am

Was anything more done on this front?

This is discussed in the Unicode specs as CONTEXTJ. From UTS #46: Unicode IDNA Compatibility Processing

Because of the visual confusability introduced by the joiner characters, IDNA2008 provides a special category for them called CONTEXTJ, and only permits CONTEXTJ characters in limited contexts: certain sequences of Arabic or Indic characters. However, applications that perform IDNA2008 lookup are not required to check for these contexts, so overall security is dependent on registries having correct implementations. Moreover, the IDNA2008 context restrictions do not catch most cases where distinct domain names have visually confusable appearances because of ZWJ and ZWNJ.

More specifically in RFC 5892 - The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)

Some code points need to be allowed in exceptional circumstances but
should be excluded in all other cases; these rules are also described
in other documents. The most notable of these are the Join Control
characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
NON-JOINER. Both of them have the derived property value CONTEXTJ.
A character with the derived property value CONTEXTJ or CONTEXTO
(CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate
rule has been established and the context of the character is
consistent with that rule. It is invalid to either register a string
containing these characters or even to look one up unless such a
contextual rule is found and satisfied. Please see Appendix A, “The
Contextual Rules Registry”, for more information.

UTS-46 calls out an implied asymmetry between domain resolution and domain registration. It expects that domain registrars will enforce stricter rules than those imposed by UTS-46, and accepts that some valid normalizations will never resolve because the domain can’t be registered.

Because normalization is not required for ENS domain registration (in the absence of on-chain normalization, anyone can register non-normalized 3+ character names directly on ETHRegistrarController), resolution is the only avenue by which restrictions can be placed on registrations. If ENS ever wished to enforce CONTEXTJ-style exceptions for Arabic/emojis/etc, these exceptions would need to be published and used in all client resolution libraries.
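To illustrate why resolution is the only enforcement point, here is a minimal Python sketch. Everything in it is a stand-in: `stand_in_hash` uses SHA-256 rather than the keccak256 that ENS uses, `stand_in_normalize` just drops U+200C/U+200D and lowercases (it is not UTS-46), and the `registry` dict is a hypothetical model of the on-chain registrar.

```python
import hashlib

registry = {}  # hypothetical model: label hash -> owner


def stand_in_hash(label: str) -> str:
    # Stand-in only: ENS uses keccak256, which is not in Python's stdlib.
    return hashlib.sha256(label.encode("utf-8")).hexdigest()


def stand_in_normalize(label: str) -> str:
    # Stand-in only: drop the two zero-width joiners and lowercase.
    return label.replace("\u200d", "").replace("\u200c", "").lower()


def register(label: str, owner: str) -> None:
    # The registrar hashes whatever it is given; nothing is normalized here.
    registry[stand_in_hash(label)] = owner


def resolve(label: str):
    # Clients normalize first, so only normalized labels are ever looked up.
    return registry.get(stand_in_hash(stand_in_normalize(label)))


register("wallet", "alice")
register("wal\u200dlet", "mallory")  # deceptive look-alike, registered as-is

print(resolve("wal\u200dlet"))  # looks up the normalized label "wallet"
```

Under these assumptions the deceptive registration exists in the registry but can never be reached by a client that normalizes before hashing.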

nick.eth · November 23, 2021, 5:58am

Good analysis, thank you.

We did not end up making changes to our normalisation process. We’d need to be very careful with any changes that restrict previously valid names unless we can be 100% certain it will only catch deceptive ones.

raffy.eth (not on the forum yet, I think), has written this new UTS-46 implementation. I’m hoping he’ll chime in here with some input.

raffy · November 23, 2021, 9:48am

This is a good point about the handling of the (2) zero-width characters, 200C and 200D. The other (2) deviations I believe should be allowed (and thus mapped) as the IDNA 2008 spec suggests: 00DF → C39F and 03C2 → CF82. Certainly leaving the zero-widths unchanged is bad, but dropping them without everyone knowing the situation is also bad.

CONTEXTJ seems to be described here: rfc5892. The only issue I see is that codifying these rules automatically is kinda messy, but they’re very simple to implement, e.g.:
If RegExpMatch((Joining_Type:{L,D})(Joining_Type:T)*\u200C(Joining_Type:T)*(Joining_Type:{R,D})) Then True;
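A rough Python sketch of that regex rule, for the ZWNJ at a given position. The `JOINING_TYPE` table is a tiny hand-picked subset used only for illustration (a real implementation would load the full Joining_Type data published by Unicode), and the function name is hypothetical. RFC 5892 also has a separate Virama check for ZWNJ, which is not shown here.

```python
import re

ZWNJ = "\u200c"

# Illustrative subset only; anything missing is treated as "U" (non-joining).
JOINING_TYPE = {
    "\u0627": "R",  # ARABIC LETTER ALEF (right-joining)
    "\u0628": "D",  # ARABIC LETTER BEH (dual-joining)
    "\u0647": "D",  # ARABIC LETTER HEH (dual-joining)
    "\u064e": "T",  # ARABIC FATHA (transparent)
}


def zwnj_regex_rule(label: str, i: int) -> bool:
    """Apply the quoted regex rule to the ZWNJ at label[i]."""
    if label[i] != ZWNJ:
        raise ValueError("label[i] is not U+200C")
    # Encode the text on each side as one Joining_Type letter per character.
    before = "".join(JOINING_TYPE.get(c, "U") for c in label[:i])
    after = "".join(JOINING_TYPE.get(c, "U") for c in label[i + 1:])
    # (L|D) T* immediately before the ZWNJ, and T* (R|D) immediately after it.
    left_ok = re.search(r"[LD]T*$", before) is not None
    right_ok = re.match(r"T*[RD]", after) is not None
    return left_ok and right_ok


print(zwnj_regex_rule("\u0647\u200c\u0627", 1))  # HEH, ZWNJ, ALEF
print(zwnj_regex_rule("a\u200cb", 1))            # Latin letters on both sides
```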

For @adraffy/ens-normalize.js, I let the zero-widths pass in my first version (1.0.2) as I assumed ENS was using IDNA 2008 so all deviations were allowed. I changed my library to support CONTEXTJ (which disallows ZW without context) and pushed a new version (1.0.3) which is reflected at my demo page: ENS Resolver. I’ve included some specific examples w/r/t CONTEXTJ.

Kinda related: could the ENS dapp optionally display the namehash and/or a byte/codepoint representation of the name being registered?
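As a sketch of what such a display could look like, the Python snippet below prints the code points and UTF-8 bytes of a label. The `describe` helper is hypothetical and is not part of the ENS dapp; the namehash is left out because keccak256 is not in Python's standard library.

```python
def describe(label: str) -> dict:
    """Return code-point and UTF-8 byte views of a label."""
    return {
        "codepoints": " ".join(f"{ord(ch):04X}" for ch in label),
        "utf8": label.encode("utf-8").hex().upper(),
    }


# A label that renders like "ab" but hides a zero-width joiner.
print(describe("a\u200db"))
# The two non-joiner deviation characters mentioned earlier in this post.
print(describe("\u00df"), describe("\u03c2"))
```

With a view like this, the hidden U+200D shows up as `200D` between `0061` and `0062` even though the rendered label looks like two letters.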

raffy · November 23, 2021, 9:58am

Unfortunately, it looks like CONTEXTJ kills half of the complex emoji: ENS Emoji Test

eg. 1F9DD 1F3FC 200D 2642
200D => ZWJ => Rule 1 => 1F3FC is not Virama => Disallowed
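The same check can be reproduced with Python's `unicodedata`, as a minimal sketch of the Virama rule only. The function name is hypothetical; the Devanagari string is added here purely as a contrasting case that the rule accepts.

```python
import unicodedata

ZWJ = "\u200d"
VIRAMA_CCC = 9  # Canonical_Combining_Class value named "Virama"


def zwj_passes_virama_rule(text: str) -> bool:
    """True if every ZWJ directly follows a character whose class is Virama."""
    for i, ch in enumerate(text):
        if ch != ZWJ:
            continue
        if i == 0 or unicodedata.combining(text[i - 1]) != VIRAMA_CCC:
            return False
    return True


elf = "\U0001F9DD\U0001F3FC\u200D\u2642"  # 1F9DD 1F3FC 200D 2642
conjunct = "\u0915\u094D\u200D\u0937"      # KA, VIRAMA, ZWJ, SSA

print(zwj_passes_virama_rule(elf))       # ZWJ follows a skin-tone modifier
print(zwj_passes_virama_rule(conjunct))  # ZWJ follows a virama
```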

Is there a more modern rule that I’m missing?

Edit: looks like Recommended Emoji ZWJ Sequences, v14.0.

Edit 2: I think the ZW’s should be ignored when they’re out of context.

SpikeWatanabe.eth · November 23, 2021, 10:11am

impersonators are always all over the place, for example police catch police impersonators all the time, but you can still buy police badges

no matter how restrictive are the rules, people will always find a way to bend them

I like to think about it like this → the ENS smart contract is “wholesale”: it’s dealing in bulk, and as such it must be censorship resistant. But when a .eth name hits a wallet UI, an exchange UI, or some app UI, that is the “retail” level, where the approach can be more granular, so it’s the UI’s problem to catch bad people, and the UI’s reputation would suffer if it’s not providing robust solutions against impersonation

on the other hand it is beneficial to have fixed set of rules on smart contract level, and it is a very bad idea to keep changing them, eventually with time all UIs will learn the rules and develop robust strategies in dealing with problems

raffy · November 24, 2021, 2:19am

I think there’s a few issues here:

1.) The official dapp does not remove ZW characters and many have already been registered. For example: 🧞[0x200D][0x200D].eth

ENS NFT on Opensea
= labelhash(🧞[0x200D][0x200D])
= 2bee11594361....9266c44212152
= TokenID: 19870081826...51904077701458

2.) Using the encodedLabelHash encoding with the official dapp, you can register ANY namehash.
For example, I ~~got~~ attempted to register 💩.eth (1 character) through app.ens.domains without any warning:

ENS NFT on Opensea
!= labelhash(💩)
= labelhash([ba967c160905ade...9c8cbe452ad7a2])
= 6e0abe02c46fd98fe8652e10cf2717b988cfdd12484cd2d150ccf7f34bbaf215
= TokenID: 4977339322122.....604644227019285

This results in a bug where the encodedLabelHash gets registered as-is instead of the intended name (which should fail valid() or rentPrice()):

[ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].eth resolves to me, even though brackets are invalid.

Note the 3rd name on my ENS profile.

3.) parseSearchTerms in the official dapp uses UTF-16 character length instead of code-point count. Fortunately, this doesn’t result in any bugs because both valid() and rentPrice() enforce the 3 code-point minimum and UTF-16 length is always >= code-point length.

4A.) Personally, I think ZW should be ignored (removed) outside of CONTEXTJ to match the standard. This would require many registered emoji names to have their namehash (and NFT) changed. I have no idea what you would do regarding collisions.

In this situation, there’s nothing wrong with using the fully-qualified or minimally-qualified or even a mix of emoji – as long as they normalize to the same value, they’re the same.

norm("RAFFY.ETH") === norm("raffy.eth")
norm("🧟‍♂.eth") === norm("🧟♂.eth")

4B.) Another possibility would be deriving a rule which lets ZW exist inside emoji context, which would leave all of the fully-qualified names untouched. However, you’d need to apply the reverse transformation to minimally-qualified emojis (to make them fully-qualified during normalization) and again deal with the collision issue.
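On point 3 above, the gap between UTF-16 length and code-point count is easy to check. This is a small Python sketch with hypothetical helper names; `utf16_length` mimics what a JavaScript `.length` would report.

```python
def utf16_length(s: str) -> int:
    # Number of UTF-16 code units, i.e. what JavaScript's .length counts.
    return len(s.encode("utf-16-le")) // 2


def codepoint_length(s: str) -> int:
    # Python strings are sequences of code points.
    return len(s)


poo = "\U0001F4A9"                # one code point, two UTF-16 code units
genie = "\U0001F9DE\u200d\u200d"  # the label from point 1

for label in (poo, genie, "abc"):
    print(codepoint_length(label), utf16_length(label))
```

The genie label from point 1 has three code points, so it meets the 3 code-point minimum even though it renders as a single emoji.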

nick.eth · November 24, 2021, 2:48am

You don’t have 💩.eth - you have [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].eth. The frontend definitely shouldn’t let you register that, though.

It’s not clear to me how you got from 💩.eth to registering this in the UI, though, without manually copying-and-pasting a labelhash. Can you elaborate?

I think this would be ideal, as it’d avoid breaking existing perfectly reasonable emoji names, as well as preventing deceptive uses of ZWJ.

sirverik.eth · November 24, 2021, 3:03am

What about disallowing names with ZWJs whenever a ZWJ is adjacent to an alphanumeric or hyphen character?

EDIT: this helps with emojis but will not work with languages that mix Latin letters with special characters.
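A minimal Python sketch of this adjacency rule, assuming "alphanumeric" means ASCII letters and digits (the function name is hypothetical):

```python
ZWJ = "\u200d"


def zwj_touches_ascii(label: str) -> bool:
    """True if any ZWJ sits next to an ASCII letter, digit, or hyphen."""
    for i, ch in enumerate(label):
        if ch != ZWJ:
            continue
        # Collect the character before and after the ZWJ, where they exist.
        neighbours = label[max(i - 1, 0):i] + label[i + 1:i + 2]
        for n in neighbours:
            if n == "-" or (n.isascii() and n.isalnum()):
                return True
    return False


print(zwj_touches_ascii("a\u200db"))                # ZWJ between letters
print(zwj_touches_ascii("\U0001F9DF\u200d\u2642"))  # ZWJ inside an emoji
print(zwj_touches_ascii("\U0001F9DE\u200d\u200d"))  # trailing ZWJs after emoji
```

Note that the last case passes this rule, so on its own it would not catch trailing ZWJs appended to an emoji.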

raffy · November 24, 2021, 3:09am

Correct, my wording was bad. I do not own 💩.eth. I saw that the dapp code supports encodedNameHash, so I computed labelhash(💩), and then registered [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].eth, because I was curious what it would do if it didn’t know the actual name. I was extra curious when I didn’t pay a large fee (but then I read the contracts and saw how that works.)

However, it shows up as “💩.eth” for me, because the dapp memoizes previously attempted labels. decodedNameHash then knows that [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2] corresponds to 💩.

The token I have is 6e0abe02c46fd98fe8652e10cf2717b988cfdd12484cd2d150ccf7f34bbaf215 which is the labelhash of [ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2].
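For reference, a small Python sketch of the two conversions involved here: recognising a bracketed 64-hex-digit label, and turning a labelhash into the decimal token ID that marketplaces display. The helper names are hypothetical, and the lowercase-hex pattern is an assumption about the bracket format based on the examples in this thread.

```python
import re

# Assumed format: "[" + 64 lowercase hex digits + "]".
ENCODED_LABEL = re.compile(r"^\[([0-9a-f]{64})\]$")


def decode_encoded_label(label: str):
    """Return the hex labelhash inside a bracketed label, or None."""
    match = ENCODED_LABEL.match(label)
    return match.group(1) if match else None


def token_id(labelhash_hex: str) -> int:
    """The token ID is the labelhash read as an unsigned 256-bit integer."""
    return int(labelhash_hex, 16)


bracketed = "[ba967c160905ade030f84952644a963994eeaed3881a6b8a4e9c8cbe452ad7a2]"
owned = "6e0abe02c46fd98fe8652e10cf2717b988cfdd12484cd2d150ccf7f34bbaf215"

print(decode_encoded_label(bracketed))  # the labelhash(💩) from this post
print(token_id(owned))                  # decimal token ID of the owned name
```

A frontend check along these lines could refuse to register a label when `decode_encoded_label` returns a value.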

raffy · November 24, 2021, 12:04pm

I added support for Unicode v14.0.0 emoji that contain ZWJ to @adraffy/ens-normalize.js (1.0.6). These changes are reflected in my ENS Resolver app.

My library currently applies IDNA 2008 rules with CONTEXTJ but also retains any emoji from the recommended set and upgrades any combinations that were entered minimally-qualified. This effectively preserves existing namehashes by injecting missing ZWJ during normalization. ZWJ are ignored outside these contexts.

ens_normalize("🧟‍♀️") == "🧟‍♀️" == ens_normalize("🧟♀")
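To make the idea concrete, here is a toy Python sketch of "keep ZWJ only inside known emoji, and upgrade minimally-qualified input". It is not the library's actual algorithm: `FULLY_QUALIFIED` is a two-entry illustrative subset of the recommended sequences, `lower()` stands in for real case mapping, and CONTEXTJ handling for Arabic/Indic scripts is left out entirely.

```python
ZWJ, ZWNJ, VS16 = "\u200d", "\u200c", "\ufe0f"

# Illustrative subset of fully-qualified emoji ZWJ sequences.
FULLY_QUALIFIED = [
    "\U0001F9DF\u200d\u2640\ufe0f",                # woman zombie
    "\U0001F468\u200d\U0001F469\u200d\U0001F466",  # family: man, woman, boy
]


def skeleton(text: str) -> str:
    """Strip joiners and the emoji variation selector."""
    return "".join(ch for ch in text if ch not in (ZWJ, ZWNJ, VS16))


# Map each stripped form back to its fully-qualified sequence.
BY_SKELETON = {skeleton(seq): seq for seq in FULLY_QUALIFIED}
KEYS = sorted(BY_SKELETON, key=len, reverse=True)  # longest match first


def toy_normalize(label: str) -> str:
    bare = skeleton(label.lower())
    out, i = [], 0
    while i < len(bare):
        for key in KEYS:
            if bare.startswith(key, i):
                out.append(BY_SKELETON[key])  # re-inject the missing ZWJs
                i += len(key)
                break
        else:
            out.append(bare[i])  # ordinary character, joiners already dropped
            i += 1
    return "".join(out)


print(toy_normalize("\U0001F9DF\u2640") == FULLY_QUALIFIED[0])
print(toy_normalize("RAF\u200dFY.eth"))
```

In this sketch the joined and split spellings of an emoji normalize to the same fully-qualified string, while a ZWJ between ordinary letters is simply dropped.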

royalfork · November 26, 2021, 9:40pm

From an earlier draft of UTS-46, one reviewer brought up the emoji case:

If CheckJoiners, the label must satisfy the ContextJ rules from Appendix A, in The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) [IDNA2008] , except that if EmojiVersion≠0, ZWJ characters are allowed if they are within Emoji ZWJ Sequences specified for Unicode Emoji Version=EmojiVersion.

Not sure if there were any additional internal Unicode discussions around this, but it seems like it didn’t make it into a final draft.

How would it treat “👨‍👩‍👦.eth” vs “👨👩👦.eth”? Are these 2 distinct domains?

raffy · November 26, 2021, 9:59pm

ens_normalize(👨‍👩‍👦.eth) == ens_normalize(👨👩👦.eth)

Namehash: 6a6e9485869136355dfca02a926456f0f66316e92a53fe6b9ad732a9f55baa13

ENS Resolver: Joined | Split