Zero-width characters pose a security risk and existential threat to ENS

This is a very nice piece of work, and I think it could be the foundation for a better way of normalising names for ENS. There are a couple of things we'd need to make that so:

  1. Clear, explicit documentation describing the normalisation process, such that anyone else can implement it from scratch; it's not viable for people to rely on a single JS library everywhere. Preferably, pseudocode that starts from the primitive of a compliant UTS-46 implementation.
  2. Tests over all existing ENS names to see which names' resolution will be affected and how.

If you're prepared to handle #1, I can take care of #2.

I released an update that has an optional boolean which ignores (rather than throws on) disallowed characters. I also added another layer of compression and got the minified file down to ~25KB (17KB gzip).
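For anyone trying it, usage looks roughly like this (the name and position of the ignore flag here are my assumption; check the repo README for the exact signature):

```js
import {ens_normalize} from '@adraffy/ens-normalize';

const input = 'Nick.ETH'; // stand-in for arbitrary user input

// Default behavior: throw if the input contains a disallowed character.
console.log(ens_normalize(input)); // "nick.eth"

// Assumed flag: ignore (drop) disallowed characters instead of throwing.
console.log(ens_normalize(input, true));
```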

I've added a few comments and citations regarding the algorithm and sequence of operations.

I also included the start of a bunch of tests:

  • known.js tracks things that I'm specifically aware of, and test-known.js makes sure they match.
  • goofy-labels.txt is a complete list of non-trivial registered ENS names (thanks to @nick.eth), and check-goofy.js generates goofy.html for normalizations that don't match.
  • opensea.js pulls known (name, token, owner) tuples and can generate opensea-label-hash.json, from which check-opensea.js generates opensea.html for label hashes that don't match.
  • compare-ethers.js compares ens_normalize() to ethers nameprep() using known.js and generates compare-ethers.html.

Before we can deploy this, we'll need documentation that's comprehensive enough that someone can independently recreate the algorithm from scratch, and test vectors they can use to check their implementation. I'm happy to help with that.

It seems clear that a lot of these names were not normalised - and hence not resolvable - in the first place. Would it be possible to filter the list for names that are normalised according to the current Ethers implementation, and then only show those that have a different normalisation under yours?
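Concretely, I'm imagining a filter along these lines (a rough sketch, using nameprep from ethers v5's @ethersproject/strings for the current behaviour):

```js
import {ens_normalize} from '@adraffy/ens-normalize';
import {nameprep} from '@ethersproject/strings';

// Keep only names that normalise (and hence resolve) under ethers today,
// then report those whose normalisation differs under ens_normalize().
function changedNames(names) {
  const out = [];
  for (const name of names) {
    let before;
    try {
      before = nameprep(name); // throws => not resolvable today; skip it
    } catch {
      continue;
    }
    let after = null; // stays null if disallowed under the new rules
    try {
      after = ens_normalize(name);
    } catch {}
    if (after !== before) out.push({name, before, after});
  }
  return out;
}
```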

Not sure if you've seen this before, but Unicode provides some test cases for IDNA mapping: https://www.unicode.org/Public/idna/14.0.0/IdnaTestV2.txt

I have a test harness, but all of the code needs to be ported to JavaScript. I can give a quick summary of the results as of the latest version (1.0.7). This will eventually be in one of the automated reports in my repo; a sketch of the comparison loop follows the results.

  • 6235 examples in IdnaTestV2.txt
  • 453 valid match
  • 6235 error match (needs checking)
  • 9 output differences [1]
  • 8 error for adraffy, valid for idna [2]
  • 684 valid for adraffy, error for idna [3]
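The comparison loop is roughly the following (a simplified sketch; the real file also carries \u-escapes and separate toASCII columns, which are glossed over here):

```js
import {readFileSync} from 'fs';
import {ens_normalize} from '@adraffy/ens-normalize';

// Bucket each IdnaTestV2.txt case by whether ens_normalize() and the
// file's toUnicode column agree on the output (or on being an error).
const buckets = {validMatch: 0, errorMatch: 0, outputDiff: 0, adraffyError: 0, idnaError: 0};
for (let line of readFileSync('IdnaTestV2.txt', 'utf8').split('\n')) {
  line = line.split('#')[0].trim(); // strip comments
  if (!line) continue;
  const cols = line.split(';').map(s => s.trim());
  const src = cols[0];
  const expected = cols[1] || src;    // blank toUnicode column means "same as source"
  const idnaErr = /\[/.test(cols[2]); // non-empty status set means an error case
  let ours = null;
  try { ours = ens_normalize(src); } catch {}
  if (ours === null && idnaErr) buckets.errorMatch++;
  else if (ours === null) buckets.adraffyError++;
  else if (idnaErr) buckets.idnaError++;
  else if (ours === expected) buckets.validMatch++;
  else buckets.outputDiff++;
}
console.log(buckets);
```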

[1]. 5 of these are due to an off-by-one bug in my ContextJ code at the start of a string. That reduces the list to 4, all of which involve FFFD. Here is an example; I'm not sure if it makes sense:

  • Input: [120795, 65294, 63992]
  • adraffy: [51, 46, 31520]
  • idna: [51, 46, 65533]

63992 (F9F8) is mapped to 31520 (7B20). I'm not sure what makes that disallowed and replaced with FFFD. I'll look at this tomorrow.
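For anyone following along, the codepoints decode like this:

```js
// Render each codepoint in the example for inspection.
for (const cp of [120795, 65294, 63992]) {
  console.log('U+' + cp.toString(16).toUpperCase(), String.fromCodePoint(cp));
}
// U+1D7DB  mathematical double-struck digit three
// U+FF0E   fullwidth full stop
// U+F9F8   CJK compatibility ideograph (decomposes to U+7B20)
```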

[2]. These are all due to adraffy disallowing FFFD. I think this essentially means they're the same.

[3]. Half of these are differences in ContextJ handling. For adraffy, I strip out-of-context ZWs rather than erroring, since they're nearly impossible for the end user to edit. The major issue remaining in this group is the BIDI rules, which I haven't implemented yet.
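The joiner policy looks roughly like this (a sketch; contextJ() stands in for the real RFC 5892 ContextJ rules):

```js
const ZWNJ = 0x200C, ZWJ = 0x200D;

// Stub: a real implementation checks the surrounding codepoints,
// e.g. whether the joiner follows a virama (RFC 5892, Appendix A).
function contextJ(cps, i) {
  return false;
}

// Strip out-of-context joiners instead of treating them as errors.
function stripZW(cps) {
  return cps.filter((cp, i) => (cp !== ZWNJ && cp !== ZWJ) || contextJ(cps, i));
}
```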

This might not be the right thread, but what is the expected behavior for punycode (referred to here as puny) in labels? The Unicode spec gives advice that contradicts normalization with respect to equality. There is no mention of puny in EIP-137.

From what I understand, any label could be puny as-is. Puny is simply Unicode encoded into DNS-friendly ASCII, prefixed with "xn--". Puny is a mechanism to retrofit Unicode onto DNS names.
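For example, using the userland punycode package:

```js
import punycode from 'punycode/'; // userland package, not the deprecated Node builtin

console.log(punycode.toASCII('bücher'));          // "xn--bcher-kva"
console.log(punycode.toUnicode('xn--bcher-kva')); // "bücher"
```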

The issue is that if we want to bridge ENS to DNS or any other legacy system, there needs to be a unique representation.

For example, if a label is Unicode, you'd need to convert it to puny to fit into these systems; however, the equivalent puny could already be registered as a name, meaning resolution can't tell the difference between a Unicode-encoded label and a literal puny label.

The spec says that puny shouldn't be mapped, which means we reintroduce the ZW and complex emoji issues through puny.

To me, this means that puny should be expanded, then normalized. Any non-ASCII label can then be uniquely converted to puny to fit on DNS. This is how my library currently functions; maybe this is wrong.
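In sketch form (normalizeLabel is just an illustrative name):

```js
import punycode from 'punycode/';
import {ens_normalize} from '@adraffy/ens-normalize';

// Expand a puny label to Unicode first, then normalize, so a
// puny-encoded label and its Unicode form collapse to the same name.
function normalizeLabel(label) {
  if (/^xn--/i.test(label)) label = punycode.toUnicode(label);
  return ens_normalize(label);
}
```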

ENS doesn't use punycode - so punycode names are just funny-looking names starting with xn--. Compliant clients should not translate to or from punycode; they should just treat it as a normal text-based name like any other.

When it comes to DNS integration, such as with eth.link, clients need to puny-decode input names before resolving them on ENS.

Register "💩💩💩.eth" and "xn--ls8haa.eth" (puny of triple poop).

What is xn--ls8haa.eth.link?

The former - the DNS gateway will need to do puny-decoding before resolving, and so the punycoded name isn't resolvable via DNS.
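And indeed, the decoding checks out:

```js
import punycode from 'punycode/';

console.log(punycode.toUnicode('xn--ls8haa')); // "💩💩💩"
```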