UTS-46 Non-compliant Emoji

For resolution, it doesn’t matter if emojis are in the “perfect” or “corrupt” form. It’s like arguing if upper or lower-case is more correct. As long as the namehash is always done on the consistent representation, the internal representation is irrelevant, since you can still use either of the names (perfect or corrupt), since they reduce to the same thing.

However, in the reverse direction, the choice matters, along with any other valid variation, like capitalization. As long as your chosen name normalizes to the same thing, there should be no requirement that it be the canonical form.

I haven’t looked at the reverse-mechanism yet, so I’m unsure how it works exactly. For example, can my reverse be rAfFy.eTh?

it doesn’t matter if emojis are in the “perfect” or “corrupt” form. It’s like arguing if upper or lower-case is more correct

With all due respect, this is nothing like that. More accurately, it’s like arguing whether upper case or some glitched out, un-typable character, is more correct. And there should be no argument; the difference is explicitly and unambiguously called out in the Unicode specifications.

the internal representation is irrelevant, since you can still use either of the names (perfect or corrupt), since they reduce to the same thing.

This is only true if we assume that the UTS-46 standard doesn’t change. Unfortunately, UTS-46 changes fairly regularly. If ENS wishes to adopt those changes, “old” normalizations which produce “corrupt” internal representations could be superseded by “new” normalizations which produce “correct” internal representations. In these cases, “previously valid names would become invalid”. This scenario isn’t merely theoretical, but perhaps likely: UTS-46 is progressively coming into compliance with UTS-51 with each successive release.
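To make that scenario concrete, here is a minimal sketch with two hypothetical normalizers. `normalize_old` and `normalize_new` are illustrative stand-ins, not real UTS-46 versions; the only assumption is that the old rule drops U+FE0F and a future rule keeps it.

```python
VS16 = "\ufe0f"  # VARIATION SELECTOR-16

def normalize_old(name: str) -> str:
    # Hypothetical "old" rule: U+FE0F is ignored (mapped to nothing).
    return name.replace(VS16, "").lower()

def normalize_new(name: str) -> str:
    # Hypothetical "new" rule: U+FE0F is preserved.
    return name.lower()

# Toy registry keyed by the normalized string (real ENS keys are namehashes).
registry = {}

typed = "\u263a\ufe0f.eth"  # WHITE SMILING FACE + VS16, as a user would type it
registered_key = normalize_old(typed)
registry[registered_key] = "original-owner"

# After a hypothetical upgrade, the same typed name produces a different key.
lookup_key = normalize_new(typed)
print(lookup_key == registered_key)  # False
print(registry.get(lookup_key))      # None: the old registration is unreachable
```

Under these assumptions the old record still exists, but nothing a user types through the new normalizer reaches it.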

Less importantly, because normalization is “lossy”, it’s possible that 2 or more different names could normalize to the same thing. The registration of 1 name would prohibit the registration of other names. While this doesn’t necessarily “break” anything, it’s a potentially undesirable property of “corrupt” internal representations.

There’s a further issue of what the “canonical display” should be. Should clients display the “correct” version which the user has input, and perform the messy normalization under the hood? Or should they display the corrupted version after normalization, following the same process as all other ENS domains?

Not sure what the best way forward is, but there are certainly pitfalls to maintaining “corrupt” internal representations in the ENS registrar.

can my reverse be rAfFy.eTh?

Yes. Reverse resolution can return anything you want.

It’s exactly like that. Let’s say you prefer the name “aA”. Normalize(“aA”) is “aa”. namehash(“aa”) is registered. Any input name that normalizes to “aa” matches, eg. “Aa”, “AA”, “aA”.
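A minimal sketch of that argument, assuming a toy `normalize` that only lowercases. Real ENS normalization follows UTS-46 and the key is a namehash; the `registry` dict, the helper names, and the owner string are hypothetical.

```python
def normalize(name: str) -> str:
    # Stand-in for UTS-46 normalization: lowercase only.
    return name.lower()

# Toy registry keyed by the normalized string instead of a namehash.
registry = {}

def register(name: str, owner: str) -> None:
    registry[normalize(name)] = owner

def resolve(name: str):
    return registry.get(normalize(name))

register("aA", "owner-of-aa")

# Every spelling that normalizes to "aa" reaches the same record.
for spelling in ("aa", "Aa", "AA", "aA"):
    print(spelling, "->", resolve(spelling))
```

Only one key is ever stored, so the spelling used at registration time has no effect on lookups.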

How does the choice of perfect/corrupt/upper/lower matter here?

As I mentioned, the display of your name should be your choice as long as it normalizes to the same hash. There’s no such thing as a canonical display IMO.

I agree that the version of UTS-46 should certainly be specified and fixed; however, upgrading/changing the algorithm is a different can of worms. Changes that result in no registered collisions seem fine, but who pays for that state update? Do you allow a grace period where owners can specify their preferred names under the new spec before the hash is recomputed? Seems messy.


IMO, the problem that needs fixed is different/buggy/bad implementations of the normalization producing different results. This happens outside of ENS.

I suggested in Discord that a contract that performs the normalization operation should exist on-chain. I wrote a compressed JavaScript library that does EIP-137/UTS-46 (IDNA, Punycode decoding, and validation) for 14.0.0 and have ported it to Solidity. However, my current implementation does not do NFC and requires two eth_calls. I am working on a solution. I am also looking for a good test set of names.

Alternatives could include functions written in various languages stored on-chain (or on IPFS) that get interpreted at runtime (and can be locally cached), e.g. new Function(await fetch(...).then(x => x.text()))("rAfFy.eTh").

How does the choice of perfect/corrupt/upper/lower matter here?

The important distinction is that normalize(“aA”) produces output which is both equivalent to the input and valid Unicode. On the other hand, normalize(“:policewoman:t2:”) produces output which is not equivalent to the input, and is invalid Unicode. This matters for 2 reasons:

  1. As you mention, as long as the corruption is applied consistently, resolution doesn’t currently break. But, the system is not future-proof, and big problems will arise if UTS-46/IDNA ever improves support for emoji.

  2. Many ENS apps (app.ens.domains, Metamask, and even DNS queries in your browser) follow a “normalize”->“resolve”->“show user normalized input and result” pattern. A normalization procedure which outputs invalid Unicode breaks this.

IMO, the problem that needs fixed is different/buggy/bad implementations of the normalization producing different results.

Even a perfect UTS-46 Version 14.0 implementation wouldn’t fix these issues (notwithstanding, I applaud your efforts to build an on-chain implementation. Thank you!). I think the original sin was to apply strict UTS-46 normalization to emoji characters. The concept of “normalizing” emoji doesn’t make much sense to me because emoji are unique and should just “normalize” to themselves.

Still digging into this (and other) ENS/emoji issues, but I hope we can find a solution which:

  • Allows the full suite of emoji to be confidently registered in ENS
  • Provides a seamless, bug-free experience to all ENS end users
  • Protects the rights and expectations of those who have already legitimately registered emoji-containing ENS domains.

I started this thread because I think we’re currently 0/3. But there’s a lot of really smart people around here, so “hope springs eternal” :blush:(<-- hope this renders ok)

Why would you ever show the normalized result?

To prevent spoofing. This is one of the main motivations behind normalization in the first place. It’s helpful that tools show “googIe.eth” → “googie.eth”, so I can manually check the normalized result before blindly sending my payment.

But that’s completely implementable as a client-side defense mechanism. Normalization is for determining equality.

Examples:

  • this name contains mixed capitalization
  • this name contains non-ascii
  • this name contains emoji
  • this name doesn’t match its reverse
  • this name is not the account’s primary name
  • etc.
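A rough sketch of a few of these client-side checks. The function name, the warning strings, and the lowercase-only comparison are all assumptions for illustration; emoji detection is left out because it needs Unicode emoji data that the Python standard library does not ship.

```python
def name_warnings(name: str, reverse_name=None):
    """Return a list of warnings a client could show before using `name`."""
    warnings = []
    has_upper = any(c.isupper() for c in name)
    has_lower = any(c.islower() for c in name)
    if has_upper and has_lower:
        warnings.append("mixed capitalization")
    if not name.isascii():
        warnings.append("non-ASCII characters")
    # Compare with the reverse record using a toy lowercase-only normalization.
    if reverse_name is not None and reverse_name.lower() != name.lower():
        warnings.append("does not match its reverse record")
    return warnings

print(name_warnings("googIe.eth"))
print(name_warnings("raffy.eth", reverse_name="rAfFy.eTh"))
print(name_warnings("6\u20e3.eth", reverse_name="six.eth"))
```

None of these checks change which record is resolved; they only decide what to flag to the user.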

You asked why a normalized name would ever be shown to the user. Spoofing prevention is a possible reason. The engineers/designers behind app.ens.domains and Metamask who have opted to show normalized names may have other good reasons as well.

In the case of “keycap” characters, is the “normalized” version actually invalid?

For example there is a “Digit Six” emoji: :six:
This is constructed with 6 and U+FE0F, so I would assume this would abide by all the same rules enumerated here. I guess technically :six: is the “fully-qualified” version of 6.

But then, there is a completely separate “Keycap Digit Six” emoji: :six:
This one is constructed with the Digit Six emoji above, plus the U+20E3 “Combining Enclosing Keycap” character. So the full sequence for :six: is 6 + U+FE0F + U+20E3.

When this normalization process is applied to the Keycap Digit Six emoji, the FE0F is stripped out from the middle of the sequence, not the end. So now what you’re left with is 6 + U+20E3. Is this a valid sequence at this point?
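The stripping described above can be reproduced in a few lines. This sketch assumes the only thing normalization does to this sequence is drop U+FE0F; the helper `code_points` is just for display.

```python
VS16 = "\ufe0f"    # VARIATION SELECTOR-16
KEYCAP = "\u20e3"  # COMBINING ENCLOSING KEYCAP

def code_points(s: str) -> str:
    return " ".join(f"U+{ord(c):04X}" for c in s)

full = "6" + VS16 + KEYCAP         # Keycap Digit Six as described above
stripped = full.replace(VS16, "")  # what is left once U+FE0F is ignored

print(code_points(full))      # U+0036 U+FE0F U+20E3
print(code_points(stripped))  # U+0036 U+20E3
```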

For example, if you have the following HTML:

<p>6&#x20E3;</p>
<p>6&#xFE0F;&#x20E3;</p>

The first one will not render as :six:, only the second one.
[Image: keycap differences]

And this matches up with what the manager UI currently shows:

Ultimately it doesn’t change the fact that as long as everyone follows the same normalization rules, it shouldn’t break resolution. If you enter :six::nine:.eth and register via the ENS manager app, you will actually register 6⃣9⃣.eth. And then if you enter :six::nine:.eth in Metamask, you will actually be sending to 6⃣9⃣.eth. Okay, all is well I suppose.

But I guess the nuance here is that in order to abide by these normalization rules, you just need to come to terms with the (apparent) fact that the domain you registered is not actually the real Keycap Digit Sequence. And if you do manually register the correct full sequence against the smart contract, sorry, but the ENS community has decided by social contract (by using the same normalization rules) that your domain will never actually be reachable by resolving the actual Keycap characters.

So, I guess that sucks for anyone who manually registered :six::nine:.eth (or any other name with a keycap sequence) against the contract and someone else owns 6⃣9⃣.eth, because the full and correct sequence will effectively be useless in web3 client sites/dapps.

However, as Nick and others have stated, if we change the normalization rules, it could effectively screw over anyone holding 6⃣9⃣.eth, because :six::nine:.eth would no longer resolve to their domain whereas previously it would. And someone else paying attention to these threads could then come in and snipe that full correct version before the original owner (who thought they registered :six::nine:.eth) is aware anything changed.

FYI for the above, it looks like these forums automatically convert those keycap emojis. Here’s what I see in the edit pane before saving, just for reference; you can see the difference between the regular Digit Six and the Keycap Digit Six emojis:

The UTS-46 library MetaMask and Etherscan use for normalization does not support any characters past the Unicode 9.0 spec (released in 2016), so both of those platforms are several years behind. In comparison, the ENS app has support up to Unicode 13.0. Unicode 14.0 was released a few weeks ago, so I am sure the other platforms will slowly update. I would imagine emoji resolution is not high on the list of most devs.

In regards to the debate of ENS resolving fully-qualified emoji over minimally-qualified emoji, I think it is important to note that most emojis only have a fully-qualified form, without the use of any variation selector. In the cases where an emoji has differently qualified versions, the fully-qualified version is almost always qualified through the addition of Variation Selector-16 (U+FE0F). This codepoint is simply used to specify how a variant should be displayed. I would argue that the minimally-qualified and fully-qualified versions of a given emoji both have the same semantic meaning, and so they should both point to the same place. This would also be in line with the UTS-46 spec.

I started making a tool: ENS Resolver

This uses my new compressed library for UTS-46 using Unicode v14.0.0. Currently it requires window.ethereum and has an external dependency for keccak, but I’ll fix that soon.

I put my UTS-46 library on GitHub: @adraffy/ens-normalize.js

I made a test for the latest emoji: ENS Emoji Test

The 11 errors seem pretty minor.

Amazing work! Would you be prepared to productionise this: set up unit tests, docs, continuous build etc.? Preferably even rewrite it in TypeScript?

True Names would happily give you a grant for this, and we can start using it in place of the current library.

Sure, I can do that. I haven’t finished the Solidity contract port yet. I also need to translate the compressor from Mathematica to JS so it can be used to compress future Unicode updates.

Even without a Solidity version, this is incredibly helpful. And yes, making the ‘compressor’ part of the build process would be crucial!

Feel free to DM me so we can arrange some compensation for this valuable work.

My browser converted this to https://xn--og8hvo.eth.link/, which doesn’t resolve. Maybe this is an eth.link/cloudflare issue? Is this supposed to work or are you speaking hypothetically?

I generally agree, and just so we’re all clear, there are a few cases this comes up:

  1. Copyright. There is the non-emoji “Latin” Unicode character “©” 0x00A9 (simply the copyright code point), and the emoji character “©️” 0x00A9 0xFE0F, which reads “copyright, in emoji form”.
  2. Woman Superhero. “:woman_superhero:” 0x1F9B8 0x200D 0x2640 0xFE0F, which reads “superhero + (woman in emoji form)”. The woman symbol 0x2640 can be expressed in non-emoji form (:female_sign:), but as part of an emoji sequence, the “emoji” variation is required for proper display.
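The two cases above can be inspected code point by code point with Python's `unicodedata` module. The helper `describe` is hypothetical, and the character names printed depend on the Unicode version bundled with the interpreter, so unknown code points fall back to a placeholder.

```python
import unicodedata

def describe(s: str):
    # Pair each code point with its Unicode character name.
    return [(f"U+{ord(c):04X}", unicodedata.name(c, "<unnamed>")) for c in s]

copyright_text = "\u00a9"         # plain copyright sign
copyright_emoji = "\u00a9\ufe0f"  # copyright sign + VARIATION SELECTOR-16
woman_superhero = "\U0001F9B8\u200d\u2640\ufe0f"

for label, s in [
    ("copyright (text)", copyright_text),
    ("copyright (emoji)", copyright_emoji),
    ("woman superhero", woman_superhero),
]:
    print(label)
    for cp, name in describe(s):
        print("  ", cp, name)
```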

My only real concerns are:

  • If UTS-46 ever allows 0xFE0F, and ENS adopts the new standard, minimally qualified registrations would break.
  • Clients typically show normalized results to end-users. Displaying minimally qualified or unqualified emoji degrades the user experience.

Neither of these is a deal-breaker per se, but I think ENS should be aware of them. If end-users aren’t meant to see normalized forms of ENS domains, clients should be discouraged from showing them.

Super cool! Upgrading from UTS-46 13.0 to 14.0 will definitely be a big help to support the newer emoji. To clarify some of the issues found:

  • England/Scotland/Wales flags are disallowed because they contain “tags”, disallowed in this part of the mapping section:
    E0020..E007F ; disallowed # 3.1 TAG SPACE..CANCEL TAG
    There are also “unofficial” US state flags which use these tags, so they’d also be disallowed. Not sure if ENS would ever want to include these as allowable in domains.
  • Japanese characters (and other miscellaneous characters) get mapped to different characters. Let’s take :u7533: https://emojipedia.org/japanese-application-button/
    1F238 ; mapped ; 7533 # 6.0 SQUARED CJK UNIFIED IDEOGRAPH-7533
    On your app, this is normalizing to “盄”, which appears to be a Chinese character, but I’m not 100% sure. Is this an accurate normalization for that character?
  • Ⓜ️Ⓜ️Ⓜ️.eth normalized to uuu.eth, which is odd. The circled M emoji is weird in that it has a lowercase form that isn’t an emoji, so it’s a bit of an edge case, but I wouldn’t expect it to normalize to “u”.
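For reference, the two mapping-table lines quoted above can be read mechanically. This is a small sketch, not the parser any of the libraries in this thread use; `parse_mapping_line` is a hypothetical helper that assumes the `range ; status ; mapping # comment` layout shown in those lines.

```python
def parse_mapping_line(line: str):
    """Parse one 'range ; status ; mapping # comment' line into a tuple."""
    data = line.split("#", 1)[0]  # drop the trailing comment
    fields = [f.strip() for f in data.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start
    status = fields[1]
    mapping = ""
    if len(fields) > 2 and fields[2]:
        mapping = "".join(chr(int(cp, 16)) for cp in fields[2].split())
    return start, end, status, mapping

TAG_LINE = "E0020..E007F ; disallowed # 3.1 TAG SPACE..CANCEL TAG"
CJK_LINE = "1F238 ; mapped ; 7533 # 6.0 SQUARED CJK UNIFIED IDEOGRAPH-7533"

print(parse_mapping_line(TAG_LINE))
print(parse_mapping_line(CJK_LINE))

# The England flag is a black flag followed by tag characters "gbeng" and a
# cancel tag, so six of its seven code points fall inside the disallowed range.
england = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F"
start, end, status, _ = parse_mapping_line(TAG_LINE)
tag_hits = [c for c in england if start <= ord(c) <= end]
print(status, len(tag_hits))
```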

Really great tool though, thanks for building it!

Cloudflare should puny-decode before resolving with ENS. I’ll talk to them about this.

We will definitely need to carefully evaluate new versions of the standard before rolling them out.

I see the following:

  • 🏳️‍🌈.eth == 1f3f3 fe0f 200d 1f308 2e 65 74 68
  • xn--og8hvo.eth == 78 6e 2d 2d 6f 67 38 68 76 6f 2e 65 74 68 == 🏳🌈.eth
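The second line can be checked with Python's built-in `punycode` codec. This covers only the Punycode step and applies no UTS-46 mapping; removing U+FE0F and U+200D by hand below is an assumption about what the browser did before encoding.

```python
ace_label = "xn--og8hvo"

# Strip the ACE prefix and decode the remainder as Punycode.
decoded = ace_label[len("xn--"):].encode("ascii").decode("punycode")
print([f"U+{ord(c):04X}" for c in decoded])  # ['U+1F3F3', 'U+1F308']

# The fully-qualified rainbow flag, with U+FE0F and U+200D removed by hand.
full_flag = "\U0001F3F3\ufe0f\u200d\U0001F308"
stripped = full_flag.replace("\ufe0f", "").replace("\u200d", "")
print(stripped == decoded)  # True
```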

This is the IDNA step, but then there’s NFC. The wrong mappings, though, were an off-by-one bug in my decompression code. Thanks!

Ⓜ️Ⓜ️Ⓜ️.eth now goes to mmm.eth
1f238 now correctly maps to 7533
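Both corrected results can be cross-checked with the standard library. The sketch below uses NFKC plus lowercasing as a rough approximation of the UTS-46 mapping step; that approximation is an assumption and not how the library works internally. The circled M is used without a trailing U+FE0F.

```python
import unicodedata

def approx_map(s: str) -> str:
    # Rough stand-in for the UTS-46 mapping step: NFKC, then lowercase.
    return unicodedata.normalize("NFKC", s).lower()

circled_m = "\u24c2"         # CIRCLED LATIN CAPITAL LETTER M, no U+FE0F
squared_cjk = "\U0001F238"   # SQUARED CJK UNIFIED IDEOGRAPH-7533

print(approx_map(circled_m * 3 + ".eth"))       # mmm.eth
print(f"U+{ord(approx_map(squared_cjk)):04X}")  # U+7533
```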

Have you entered your flag variants into my resolver demo?

These all resolve to the same thing.

Due to how many conflicts this causes with previously registered names, this behavior has been changed in my library and emoji are no longer upgraded. Essentially, ZWJ is optional, which results in multiple names for many emoji.

I think the example Nick showed me was:

  1. 😵💫😵💫😵💫.eth 0x372973309f827B5c3864115cE121c96ef9cB1658
  2. 😵💫😵💫😵💫.eth 0x0033AAdE458d0b39Ce0B1Ba2581859F2D5855555

These render identically for me on Windows+Firefox but look separate on my Mac:

Additionally, if these two examples represent AAA and BBB, then also AAB, ABA, ABB, BAA, BAB, BBA are valid, eg. 😵‍💫😵‍💫😵‍💫.eth
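The combinations can be enumerated directly. In this sketch `A` is the emoji written with a ZWJ and `B` is the same two code points without it; which of the two registered names corresponds to which is an assumption.

```python
from itertools import product

ZWJ = "\u200d"                          # ZERO WIDTH JOINER
A = "\U0001F635" + ZWJ + "\U0001F4AB"   # joined with ZWJ
B = "\U0001F635" + "\U0001F4AB"         # same code points, no ZWJ

# Every way of choosing A or B for each of the three positions.
variants = ["".join(parts) + ".eth" for parts in product((A, B), repeat=3)]
print(len(variants))  # 8 distinct strings: AAA, AAB, ABA, ..., BBB

# If a normalizer treated ZWJ as ignorable, all eight would collapse to one.
collapsed = {v.replace(ZWJ, "") for v in variants}
print(len(collapsed))  # 1
```

With ZWJ optional per emoji, an n-emoji name of this kind has 2^n spellings, each of which is a separate registration unless normalization collapses them.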