Zero-width characters pose a security risk and existential threat to ENS

This is a very nice piece of work, and I think it could be the foundation for a better way of normalising names for ENS. There are a couple of things we'd need to make that so:

  1. Clear, explicit documentation describing the normalisation process, such that anyone else can implement it from scratch; it's not viable for people to rely on a single JS library everywhere. Preferably, pseudocode that starts from the primitive of a compliant UTS-46 implementation.
  2. Tests over all existing ENS names to see which names' resolution will be affected and how.

If you're prepared to handle #1, I can take care of #2.

I released an update that has an optional boolean which ignores (rather than throws on) disallowed characters. I also added another layer of compression and got the minified file down to ~25KB (17KB gzip).
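For anyone trying it, usage looks roughly like this (the name and position of the ignore flag here are my assumption; check the repo README for the exact signature):

```js
import {ens_normalize} from '@adraffy/ens-normalize';

const input = 'Nick.ETH'; // stand-in for arbitrary user input

// Default behavior: throw if the input contains a disallowed character.
console.log(ens_normalize(input)); // "nick.eth"

// Assumed flag: ignore (drop) disallowed characters instead of throwing.
console.log(ens_normalize(input, true));
```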

I've added a few comments and citations regarding the algorithm and sequence of operations.

I also included the start of a bunch of tests:

  • known.js tracks things that I'm specifically aware of, and test-known.js makes sure they match.
  • goofy-labels.txt is a complete list of non-trivial registered ENS names (thanks to @nick.eth), and check-goofy.js generates goofy.html for normalizations that don't match.
  • opensea.js pulls known (name, token, owner) tuples and can generate opensea-label-hash.json, from which check-opensea.js generates opensea.html for label hashes that don't match.
  • compare-ethers.js compares ens_normalize() to ethers nameprep() using known.js and generates compare-ethers.html.

Before we can deploy this, we'll need documentation that's comprehensive enough that someone can independently recreate the algorithm from scratch, and test vectors they can use to check their implementation. I'm happy to help with that.

It seems clear that a lot of these names were not normalised - and hence not resolvable - in the first place. Would it be possible to filter the list for names that are normalised according to the current Ethers implementation, and then only show those that have a different normalisation under yours?
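Concretely, I'm imagining a filter along these lines (a rough sketch, using nameprep from ethers v5's @ethersproject/strings for the current behaviour):

```js
import {ens_normalize} from '@adraffy/ens-normalize';
import {nameprep} from '@ethersproject/strings';

// Keep only names that normalise (and hence resolve) under ethers today,
// then report those whose normalisation differs under ens_normalize().
function changedNames(names) {
  const out = [];
  for (const name of names) {
    let before;
    try {
      before = nameprep(name); // throws => not resolvable today; skip it
    } catch {
      continue;
    }
    let after = null; // stays null if disallowed under the new rules
    try {
      after = ens_normalize(name);
    } catch {}
    if (after !== before) out.push({name, before, after});
  }
  return out;
}
```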

Not sure if you've seen this before, but Unicode provides some test cases for IDNA mapping: https://www.unicode.org/Public/idna/14.0.0/IdnaTestV2.txt

I have a test harness, but all of the code needs to be ported to JavaScript. I can give a quick summary of the results as of the latest version (1.0.7). This will eventually be in one of the automated reports in my repo; a sketch of the comparison loop follows the results.

  • 6235 examples in IdnaTestV2.txt
  • 453 valid match
  • 6235 error match (needs checking)
  • 9 output differences [1]
  • 8 error for adraffy, valid for idna [2]
  • 684 valid for adraffy, error for idna [3]
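The comparison loop is roughly the following (a simplified sketch; the real file also carries \u-escapes and separate toASCII columns, which are glossed over here):

```js
import {readFileSync} from 'fs';
import {ens_normalize} from '@adraffy/ens-normalize';

// Bucket each IdnaTestV2.txt case by whether ens_normalize() and the
// file's toUnicode column agree on the output (or on being an error).
const buckets = {validMatch: 0, errorMatch: 0, outputDiff: 0, adraffyError: 0, idnaError: 0};
for (let line of readFileSync('IdnaTestV2.txt', 'utf8').split('\n')) {
  line = line.split('#')[0].trim(); // strip comments
  if (!line) continue;
  const cols = line.split(';').map(s => s.trim());
  const src = cols[0];
  const expected = cols[1] || src;    // blank toUnicode column means "same as source"
  const idnaErr = /\[/.test(cols[2]); // non-empty status set means an error case
  let ours = null;
  try { ours = ens_normalize(src); } catch {}
  if (ours === null && idnaErr) buckets.errorMatch++;
  else if (ours === null) buckets.adraffyError++;
  else if (idnaErr) buckets.idnaError++;
  else if (ours === expected) buckets.validMatch++;
  else buckets.outputDiff++;
}
console.log(buckets);
```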

[1]. 5 of these are due to an off-by-one bug in my ContextJ code at the start of a string. That reduces the list to 4, all of which involve FFFD. Here is an example; I'm not sure if it makes sense:

  • Input: [120795, 65294, 63992]
  • adraffy: [51, 46, 31520]
  • idna: [51, 46, 65533]

63992 (F9F8) is mapped to 31520 (7B20). I'm not sure what makes that disallowed and replaced with FFFD. I'll look at this tomorrow.
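For anyone following along, the codepoints decode like this:

```js
// Render each codepoint in the example for inspection.
for (const cp of [120795, 65294, 63992]) {
  console.log('U+' + cp.toString(16).toUpperCase(), String.fromCodePoint(cp));
}
// U+1D7DB  mathematical double-struck digit three
// U+FF0E   fullwidth full stop
// U+F9F8   CJK compatibility ideograph (decomposes to U+7B20)
```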

[2]. These are all due to adraffy disallowing FFFD. I think this essentially means they're the same.

[3]. Half of these are differences in ContextJ handling. For adraffy, I strip out-of-context ZWs rather than erroring, since they're nearly impossible for the end user to edit. The major issue remaining in this group is the BIDI rules, which I haven't implemented yet.
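The joiner policy looks roughly like this (a sketch; contextJ() stands in for the real RFC 5892 ContextJ rules):

```js
const ZWNJ = 0x200C, ZWJ = 0x200D;

// Stub: a real implementation checks the surrounding codepoints,
// e.g. whether the joiner follows a virama (RFC 5892, Appendix A).
function contextJ(cps, i) {
  return false;
}

// Strip out-of-context joiners instead of treating them as errors.
function stripZW(cps) {
  return cps.filter((cp, i) => (cp !== ZWNJ && cp !== ZWJ) || contextJ(cps, i));
}
```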

This might not be the right thread, but what is the expected behavior for punycode (referred to here as puny) in labels? The Unicode spec gives advice that contradicts normalization with respect to equality. There is no mention of puny in EIP-137.

From what I understand, any label could be puny as-is. Puny is simply Unicode encoded into DNS-friendly ASCII, prefixed with "xn--". Puny is a mechanism to retrofit Unicode onto DNS names.
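For example, using the userland punycode package:

```js
import punycode from 'punycode/'; // userland package, not the deprecated Node builtin

console.log(punycode.toASCII('bücher'));          // "xn--bcher-kva"
console.log(punycode.toUnicode('xn--bcher-kva')); // "bücher"
```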

The issue is that if we want to bridge ENS to DNS or any other legacy system, there needs to be a unique representation.

For example, if a label is Unicode, you'd need to convert it to puny to fit into these systems; however, the equivalent puny could already be registered as a name, meaning resolution can't tell the difference between a Unicode-encoded label and a literal puny label.

The spec says that puny shouldn't be mapped, which means we reintroduce the ZW and complex emoji issues through puny.

To me, this means that puny should be expanded, then normalized. Any non-ASCII label can then be uniquely converted to puny to fit on DNS. This is how my library currently functions; maybe this is wrong.
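In sketch form (normalizeLabel is just an illustrative name):

```js
import punycode from 'punycode/';
import {ens_normalize} from '@adraffy/ens-normalize';

// Expand a puny label to Unicode first, then normalize, so a
// puny-encoded label and its Unicode form collapse to the same name.
function normalizeLabel(label) {
  if (/^xn--/i.test(label)) label = punycode.toUnicode(label);
  return ens_normalize(label);
}
```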

ENS doesn't use punycode - so punycode names are just funny-looking names starting with xn--. Compliant clients should not translate to or from punycode; they should just treat it as a normal text-based name like any other.

When it comes to DNS integration, such as with eth.link, clients need to puny-decode input names before resolving them on ENS.

Register "💩💩💩.eth" and "xn--ls8haa.eth" (puny of triple poop).

What is xn--ls8haa.eth.link?

The former - the DNS gateway will need to do puny-decoding before resolving, and so the punycoded name isn't resolvable via DNS.
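And indeed, the decoding checks out:

```js
import punycode from 'punycode/';

console.log(punycode.toUnicode('xn--ls8haa')); // "💩💩💩"
```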