Suggestion: Remove Punycode from ENS Name Normalization

I am developing a library to correctly normalize ENS names: @adraffy/ens-normalize.js. The ultimate goal is to have an on-chain contract that can do the normalization via eth_call by compressing the Unicode rules.

There have been a few recent threads regarding normalization:

EIP-137 says ENS names should use UTS-46 normalization and makes suggestions regarding length and hyphen placement.

UTS-46 says Punycode labels should be expanded. It also disallows starting hyphen, ending hyphen, and a double-hyphen near the start.

IMO, UTS-46 has already been abandoned, since names have been registered with capitalization, unexpanded Punycode, incorrect emoji handling (FE0F has been stripped), and incorrect zero-width handling (ZWNJ 200C, ZWJ 200D, and ContextJ).

I’d like to suggest the following: the normalization spec should place no restrictions on hyphens or length. Each label is simply a Unicode string normalized according to IDNA2008. This disconnects ENS names from Punycode rules and would make the following labels valid: -test, test-, te--st, xn--💩.

If you want an ENS name that maps onto DNS, it simply needs to be Punycode-able. For DNS decoding, any DNS label with the xn-- prefix would be decoded.

Additionally, the normalization spec should establish how to translate any ENS name to a restricted charset like ASCII.

  • One suggestion would be allowing {HEX}, like the ES6 Unicode escape without the \u, e.g. "A{200D}B.eth". Any bracket would work: [ ( {. Since brackets are disallowed characters, this translation would always be safe.
  • Another idea would be reserving a TLD like .node where namehash(normalize("<0x64-hex>.node")) = "<0x64-hex>".
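To make the bracketed-escape idea above concrete, here is a minimal sketch. The function names `escapeLabel`/`unescapeLabel` are illustrative, not part of any spec or of ens-normalize.js:

```javascript
// Sketch of the "{HEX}" escape idea: characters outside printable ASCII are
// replaced by their code point in uppercase hex inside curly braces. Since
// "{" and "}" are disallowed in normalized labels, decoding is unambiguous.

function escapeLabel(label) {
  let out = "";
  for (const ch of label) { // for..of iterates by code point, not UTF-16 unit
    const cp = ch.codePointAt(0);
    out += (cp >= 0x21 && cp <= 0x7e && ch !== "{" && ch !== "}")
      ? ch
      : "{" + cp.toString(16).toUpperCase() + "}";
  }
  return out;
}

function unescapeLabel(escaped) {
  return escaped.replace(/\{([0-9A-Fa-f]+)\}/g,
    (_, hex) => String.fromCodePoint(parseInt(hex, 16)));
}
```

For example, `escapeLabel("A\u200DB")` yields `"A{200D}B"`, and `unescapeLabel` reverses it.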

This is because the contract has no built-in restrictions on what names can be registered; those are imposed by clients.

Is there a practical benefit to this to offset the additional complexity of deviating from UTS-46?

I’m not saying they shouldn’t be registered – I’m saying they shouldn’t be reachable from typical user input. Or really, I’m trying to figure out what’s actually supposed to be reachable from user input.

Reduced complexity. By dropping Punycode, every ASCII name normalizes to an ASCII name. Seems like a great property to have.

As the owner of a hyphenated domain name, I would be concerned about scammers attempting to use extra hyphens to ensnare unsuspecting users via typographical input errors. For example, a-b.eth vs. a--b.eth.

Definitely - but names with capital letters definitely aren’t reachable!

There doesn’t seem to be any interest in this topic.

Let me phrase it slightly differently:

My suggestion is that xn--ls8haa.eth is ASCII so it should be valid. We shouldn’t care about punycode. We shouldn’t expand punycode. No user should type or interact with punycode. I don’t even know if we should care about hyphens. But I don’t really care what the answer is, just that we state what the answer is.

For example, if we’re following UTS-46, then we’re supposed to decode punycode. But this introduces a problem:

With either Transitional or Nontransitional Processing, sources already in Punycode are validated without mapping. In particular, Punycode containing Deviation characters, such as href="http://xn--fa-hia.de" (for faß.de), is not remapped. This provides a mechanism allowing explicit use of Deviation characters even during a transition period.

This means that punycoded labels can contain non-casefolded characters or emoji without FE0F. The best fix I’m aware of is to break from the spec and require that puny-decoded labels match their mapped/ContextJ/emoji-fixed replacements. Or simply pass the decoded label back through the normalizer, because normalization should be idempotent.
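The idempotency fix can be sketched as follows. `normalize` and `punycodeDecode` stand in for real implementations (e.g. ens-normalize and a Punycode library); the function name and shape are hypothetical:

```javascript
// Sketch: after Punycode-decoding a label, require that re-normalizing the
// decoded text is a no-op. If it is not, the Punycode smuggled in
// unnormalized characters (uppercase, deviation characters, stripped FE0F).

function validatePunyLabel(label, normalize, punycodeDecode) {
  if (!label.startsWith("xn--")) return label;
  const decoded = punycodeDecode(label.slice(4));
  if (normalize(decoded) !== decoded) {
    throw new Error(`puny-decoded label is not normalized: ${label}`);
  }
  return decoded;
}
```

The key property is that normalization is idempotent, so a decoded label that is already in normalized form passes unchanged, and anything else is rejected rather than silently remapped.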

My last suggestion would be to completely ignore Punycode but still respect the hyphen rules (/^-/, /-$/, and /^..--/). This also breaks from the spec but has no conflicting consequences. After consideration, I think I like this solution best.
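The three hyphen rules are small enough to state directly in code. A minimal sketch, with a hypothetical `checkHyphens` name:

```javascript
// Sketch of the UTS-46 hyphen restrictions, applied without any Punycode
// handling: no leading hyphen, no trailing hyphen, and no "--" in the
// third and fourth positions (which would otherwise look like an ACE prefix).

function checkHyphens(label) {
  if (label.startsWith("-")) throw new Error("leading hyphen");
  if (label.endsWith("-")) throw new Error("trailing hyphen");
  if (label.length >= 4 && label[2] === "-" && label[3] === "-") {
    throw new Error("hyphens in positions 3-4");
  }
  return label;
}
```

Under these rules, -test, test-, te--st, and xn--a are all rejected, while plain hyphenated names like a-b pass.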

For a hypothetical scenario: one could say DNS maps literally to ENS, or one could say DNS maps through puny decoding.

This seems reasonable to me.

Didn’t ENS core try changing the front-end registration/management from Unicode to Punycode in October, and then revert everything?

On a day in October I searched a single emoji (🇺🇸.eth) and didn’t get the standard “3 characters required” error; instead, the ENS registration converted the emoji to Punycode, allowed me to register it, and gave me full access to management of 🇺🇸.eth after registration. I proceeded to register 3 more single emojis in good faith, but then noticed the NFTs were minting not as the emojis but as Punycode, and my NFTs had no properties (character count or expiration date).

I raised the issue in Discord, and by the end of the day the front-end change was rolled back. I lost access to management of the emoji domains, and I didn’t have access to the [punycode].eth names either; the Punycode names are otherwise available for registration. (I could be wrong, but I don’t think it was even possible to register actual Punycode before the change and reversion.)

I’ve always wondered what happened, as all the DNS registrars that allow emojis use Punycode in their front-end registration. Would it make sense to reconsider this change on the front end, or does it not ultimately address these normalization issues?

If we just respect hyphens and pass Punycode straight through unaffected, then I don’t see any issue; e.g. 💩💩💩.eth and its punycoded name xn--ls8haa.eth would be separate names. Invalid Punycode would pass through too.

The registration front end should probably detect when “xn--” is present, note that ENS supports Unicode and that the intended Unicode name should be used instead, offer a button to decode it, and warn that if you proceed, you are registering that literal (which would be unreachable from DNS translation).

If you type xn--ls8haa.eth into any other ENS-accepting field, it should lookup that literal (no puny decode).

For any DNS to ENS bridge, the implementer can choose what xn--ls8haa.eth corresponds to, where the recommended translation is 💩💩💩.eth (puny decode before resolution). This decision happens outside of the normalization scope.

However, a curious user might want to know what it actually is. Cloudflare sets the TXT record of a .link to the account "a=0x..." (which appears to be the IPFS DNSLink standard), but it doesn’t reveal the namehash. .limo doesn’t provide TXT records, but the headers contain X-IPFS-Path, which references an IPNS. Possibly there should be a tool that expects DNS input and confirms that the redirected contenthash matches the expected ENS contenthash. I will create a prototype.

I think the remaining point is: what is the suggested/recommended method for encoding an ENS name into [a-z0-9] for sharing, and should the normalizer understand this format?

  • My original suggestion was {61}.eth == a.eth (note: "a" = 0x61), but this is essentially a perversion of the ES6 Unicode escape and thus a new construct.
  • URI escaping would work, as % is disallowed: %61.eth == a.eth. It has the consequence that most Unicode would span multiple escapes: {200D}.eth == %E2%80%8D.eth. This is nice because a browser address field will also apply this translation.
  • Since % is escaped by URI escape but _ is both disallowed and unescaped, _XX could replace %XX like _E2_80_8D.eth or _X_ could replace {X} like _200D_.eth
  • Optionally, a label extension could be used, like ee-- to indicate that the following label has escaped Unicode, eg. %61.eth errors on disallowed % but ee--%61.eth auto-converts to a.eth via decodeURIComponent(input.slice(4))

Traditional URI escaping seems the most universal but requires % to be an allowable character in the contexts where this encoding would be useful (the same would apply to the bracketed suggestion: one of [ ( { would need to be allowable).
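The _XX variant from the list above can be built directly on top of standard URI escaping. A minimal sketch; the function names are illustrative and no spec defines this encoding:

```javascript
// Sketch of the "_XX" idea: percent-escape the label, then swap "%" for "_".
// "_" is disallowed in normalized labels (so decoding is unambiguous) and,
// unlike "%", survives URI contexts without being re-escaped.

function toUnderscoreEscape(label) {
  return encodeURIComponent(label).replace(/%/g, "_");
}

function fromUnderscoreEscape(escaped) {
  return decodeURIComponent(escaped.replace(/_/g, "%"));
}
```

For example, the ZWJ label from the earlier post becomes `_E2_80_8D` (matching `%E2%80%8D`), and ASCII passes through untouched.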


No; this was a bug that was fixed as soon as we were aware of it.

If we have to do this, Punycode is as good as any solution - but where would this be necessary?


Any digital communication channel that doesn’t support Unicode.

I think you want the experience where you can copy any Unicode or ASCII-encoded name, paste it into an ENS name field, and get the expected result.

If you use Punycode, how does the user get from the punycoded name back to an ENS name? How do you know it’s not literally “xn--a.eth”?

If you paste punycode into the normalizer, then we’re back to the problem above, where the normalizer needs to decode punycode but break from the UTS-46 standard and apply the remapping rules (+ContextJ and the ENS emoji rules) so the transform is idempotent. Which again, is completely fine, but different behavior than UTS-46.

I think we need a more specific and descriptive definition of “UTS-46 normalization/validation”, as used in EIP-137. Because the output of UTS-46 normalization is punycode, and the output of ENS normalization is unicode, it’s not accurate to say that ENS uses UTS-46 for normalization: it uses a subset of UTS-46. Can we just say that ENS uses IDNA2008 mapping for normalization? Or does that not encapsulate the complete ENS normalization process?

I checked the go-ens implementation, and it does:

var p = idna.New(idna.MapForLookup(), idna.StrictDomainName(true), idna.Transitional(false))
output, err = p.ToUnicode(input)

Does this implement the complete ENS normalization process?

Can we just use the \u2665 convention? Not sure if the normalizer should be required to parse this format, it feels out of scope. But clients may wish to decode \u prefixed literals before passing to the normalizer.

There are very few of those today, though!

How would you get that with any of the alternative proposals?

Looks like it applies no ContextJ, and the ENS emoji rules aren’t in the spec, so no. It also allows hyphens but expands Punycode.

I like \uXXXX, except that some emoji span two UTF-16 units, so \uXXXX\uYYYY is required for a single character; and \ itself is tricky because it is an escape in many contexts (so you get into backslash wars when it passes through other steps for safe serialization, e.g. \\\u). ES6 has \u{X}, which is nicer but still has the \u, so I suggested just dropping the \u and using {X}.
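The surrogate-pair problem above is easy to demonstrate in JavaScript itself:

```javascript
// One emoji, several representations: \uXXXX escapes address UTF-16 code
// units, so a supplementary-plane character needs two of them, while the
// ES6 code-point escape (and the "{X}" notation proposed here) needs one.

const poop = "💩";                    // U+1F4A9
console.log(poop.length);             // 2 — two UTF-16 code units
console.log(poop === "\uD83D\uDCA9"); // true — surrogate pair
console.log(poop === "\u{1F4A9}");    // true — single code-point escape
console.log([...poop].length);        // 1 — one code point
```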

The normalizer should handle it because it’s invalid input and easy to identify. If a normalization process doesn’t support this format, inputs like {61}.eth or %61.eth will error 100% of the time, causing no issue.

If it does, it should normalize the fixed point of the unescape process (repeatedly unescaping the entire string until it doesn’t change).
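The fixed-point step is a short loop. A minimal sketch, where `unescapeOnce` is any single-pass decoder (e.g. the {HEX} decoder discussed earlier); the names are illustrative:

```javascript
// Sketch of "normalize the fixed point of the unescape process": keep
// applying a single-pass unescape until the string stops changing, then
// hand the result to the normalizer. This catches doubly-escaped input
// like "{7B}61{7D}", which unescapes to "{61}" and then to "a".

function unescapeFully(input, unescapeOnce) {
  let prev, cur = input;
  do {
    prev = cur;
    cur = unescapeOnce(prev);
  } while (cur !== prev);
  return cur;
}
```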

If %61.eth is used, most browsers will translate it.

We’d have the same result if we forbid -- in a name, no?


Can you explain more? I’m not sure I follow.

If we forbid --, then it’ll be disallowed and will error, the same as using URL-encoding or curly braces.