UTS-46 Non-compliant Emoji

In the past week, thousands of “emoji-only” domains have been registered on ENS, and the second-hand market for valuable/unique/rare “emoji-only” domains is quickly growing. Because emoji are complicated, Unicode is complicated, and ENS is complicated, several edge cases have emerged that may warrant additional consideration from the ENS development team. This thread describes one such edge case.

Because UTS-51-compliant emoji characters were never designed to be compatible with the UTS-46 normalization procedures ENS has adopted, there are some “official” emoji that include Unicode characters ignored or disallowed by UTS-46. When these emoji are input into app.ens.domains, UTS-46 normalization strips the problem characters, transforming the emoji from a “fully-qualified” emoji into a “minimally-qualified” or “unqualified” one (see http://www.unicode.org/reports/tr51/#def_fully_qualified_emoji for complete definitions of these terms, and https://unicode.org/Public/emoji/14.0/emoji-test.txt for a list of each classification). Depending on the user’s platform and system fonts, this normalization may or may not be visually apparent to the user. If the normalized domain is registered, the “minimally-qualified” or “unqualified” representation is stored within the ENS registrar, and subsequent resolution will also return the “minimally-qualified” or “unqualified” emoji.
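For concreteness, here is a minimal JavaScript sketch of the codepoint-level effect (deliberately not a full UTS-46 implementation; stripping U+FE0F is only one of several mapping steps UTS-46 performs):

// Minimal sketch: removing U+FE0F (which UTS-46 treats as "ignored") turns a
// fully-qualified emoji into its unqualified form at the codepoint level.
const hex = (s) => [...s].map((c) => c.codePointAt(0).toString(16)).join(' ');

const fullyQualified = '\u261D\uFE0F';                  // INDEX POINTING UP + VS16
const stripped = fullyQualified.replace(/\uFE0F/g, ''); // what gets stored after normalization

console.log(hex(fullyQualified)); // "261d fe0f" (fully-qualified)
console.log(hex(stripped));       // "261d" (unqualified)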

This screenshot was taken on Chrome/Mac, where “minimally-qualified” emoji are supported:


This screenshot was taken on Firefox/Linux, where “minimally-qualified” emoji are not supported:

And this was taken after Metamask resolution on Chrome/Mac:

What does the ENS team think of this? Specifically, should ENS continue to store and resolve “minimally-qualified” or “unqualified” emoji in its registrar? Should these types of emoji be encouraged or discouraged for use within ENS domains? Would ENS ever consider emoji-specific exceptions to UTS-46 normalization? If the Unicode Consortium ever brought UTS-51 into full compliance with UTS-46, would ENS adopt the new mapping tables?

I don’t think we should compromise the integrity of ENS to accommodate this narrow “emoji” use-case, but I also think “emoji” ENS domains are an interesting, exciting, and promising way for the ENS platform to grow and expand to new users. I’m curious to hear other opinions on this.

Good analysis, thank you!

I think we need to be very careful about any changes that could change the normalisation of existing names, or make previously valid names invalid. Allowing previously forbidden characters is much easier, of course.

Changing or disallowing existing names could be doable with a compelling enough reason, but I think we’d want to be very cautious about doing this.


From Section 1.3.1 of UTS-46:

Mapping typically involves mapping uppercase characters to their lowercase pairs, but it also involves other types of mappings between equivalent characters, such as mapping halfwidth katakana characters to normal katakana characters in Japanese.

This describes a process by which “imperfect” data is transformed into a “perfect” or “canonical” form. Emojis present a peculiar case whereby normalization actually transforms the “perfect” form into a “corrupt” form. I know Unicode is vast, but are there any other Unicode categories that have this “corruption on normalization” property?
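For comparison, the halfwidth-katakana mapping from the quote can be demonstrated with plain NFKC in JavaScript (UTS-46 uses its own mapping table, but it agrees with NFKC for these characters), and the output is the canonical form rather than a corrupted one:

// Halfwidth katakana maps to the standard katakana form; the canonical form is unchanged.
const halfwidth = '\uFF76';               // HALFWIDTH KATAKANA LETTER KA
console.log(halfwidth.normalize('NFKC')); // U+30AB, the standard katakana "KA"
console.log('\u30AB'.normalize('NFKC'));  // already canonical, returned as-is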

I completely agree that caution should be exercised in making new policy, and sympathize with the “fixing the plane while it’s flying” nature of this problem. These are very hard decisions!

This has actually been brought up (though not as extensively described as above) in the Discord over the past couple of days. Great post, btw.

To oversimplify, it seems like a situation where ENS is stuck between a politically correct solution that is “technically broken”, and a technically sound solution that could break some hearts.

It’s still early, and there is time to rearrange things and do damage control if that were the consensus; in any case, in terms of time, there is likely a point of no return on the above.

For resolution, it doesn’t matter if emojis are in the “perfect” or “corrupt” form. It’s like arguing whether upper or lower case is more correct. As long as the namehash is always done on the consistent representation, the internal representation is irrelevant: you can still use either of the names (perfect or corrupt), since they reduce to the same thing.

However, in the reverse direction, the choice matters, along with any other valid variation like capitalization. As long as your chosen name normalizes to the same thing, there should be no requirement that it be the canonical form.

I haven’t looked at the reverse-mechanism yet, so I’m unsure how it works exactly. For example, can my reverse be rAfFy.eTh?

it doesn’t matter if emojis are in the “perfect” or “corrupt” form. It’s like arguing whether upper or lower case is more correct

With all due respect, this is nothing like that. More accurately, it’s like arguing whether upper case or some glitched-out, un-typable character is more correct. And there should be no argument; the difference is explicitly and unambiguously called out in the Unicode specifications.

the internal representation is irrelevant: you can still use either of the names (perfect or corrupt), since they reduce to the same thing

This is only true if we assume that the UTS-46 standard doesn’t change. Unfortunately, UTS-46 changes fairly regularly. If ENS wishes to adopt those changes, “old” normalizations which produce “corrupt” internal representations could be superseded by “new” normalizations which produce “correct” internal representations. In these cases, “previously valid names would become invalid”. This scenario isn’t merely theoretical, but perhaps likely: UTS-46 is progressively coming into compliance with UTS-51 with each successive release.

Less importantly, because normalization is “lossy”, it’s possible that two or more different names could normalize to the same thing. The registration of one name would then prohibit the registration of the others. While this doesn’t necessarily “break” anything, it’s a potentially undesirable property of “corrupt” internal representations.
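A quick sketch of that collision, with FE0F-stripping standing in for the relevant part of the UTS-46 mapping:

// Two distinct input names collapsing to one normalized form (sketch only).
const strip = (s) => s.replace(/\uFE0F/g, '');

const a = '\u261D\uFE0F.eth'; // fully-qualified "index pointing up" + .eth
const b = '\u261D.eth';       // unqualified form of the same emoji + .eth
console.log(strip(a) === strip(b)); // true: whoever registers first blocks the other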

There’s a further issue of what the “canonical display” should be. Should clients display the “correct” version which the user has input, and perform the messy normalization under the hood? Or should they display the corrupted version after normalization, following the same process as all other ENS domains?

Not sure what the best way forward is, but there are certainly pitfalls to maintaining “corrupt” internal representations in the ENS registrar.

can my reverse be rAfFy.eTh?

Yes. Reverse resolution can return anything you want.

It’s exactly like that. Let’s say you prefer the name “aA”. Normalize(“aA”) is “aa”. namehash(“aa”) is registered. Any input name that normalizes to “aa” matches, e.g. “Aa”, “AA”, “aA”.
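For illustration, here is a sketch of EIP-137 namehash (using the js-sha3 package for keccak), with toLowerCase() standing in for full UTS-46 normalization; any input that normalizes the same way yields the same node:

// Sketch of EIP-137 namehash: node = keccak256(node ++ keccak256(label)), right to left.
const { keccak_256 } = require('js-sha3');

function namehash(name) {
  let node = new Uint8Array(32); // root node = 32 zero bytes
  if (name) {
    for (const label of name.split('.').reverse()) {
      const labelHash = keccak_256.array(label);
      node = Uint8Array.from(keccak_256.array([...node, ...labelHash]));
    }
  }
  return '0x' + [...node].map((b) => b.toString(16).padStart(2, '0')).join('');
}

const normalize = (s) => s.toLowerCase(); // placeholder for real UTS-46 normalization

console.log(namehash(normalize('aA.eth')) === namehash(normalize('Aa.eth'))); // true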

How does the choice of perfect/corrupt/upper/lower matter here?

As I mentioned, the display of your name should be your choice as long as it normalizes to the same hash. There’s no such thing as a canonical display IMO.

I agree that the version of UTS-46 should certainly be specified and fixed; however, upgrading/changing the algorithm is a different can of worms. Changes that result in no registered collisions seem fine, but who pays for that state update? Do you allow a grace period where owners can specify their preferred names under the new spec before the hash is recomputed? Seems messy.


IMO, the problem that needs fixing is different/buggy/bad implementations of the normalization producing different results. This happens outside of ENS.

I suggested in Discord that a contract performing the normalization operation should exist on-chain. I wrote a compressed JavaScript library that does EIP-137/UTS-46 (IDNA, Punycode decode, and validation) for 14.0.0 and have ported it to Solidity. However, my current implementation does not do NFC and requires two eth_calls. I am working on a solution. I am also looking for a good test set of names.

Alternatives could include functions written in various languages stored on-chain (or on IPFS) that get runtime-interpreted (and can be locally cached), e.g. new Function(await fetch(...).then(x => x.text()))("rAfFy.eTh").

How does the choice of perfect/corrupt/upper/lower matter here?

The important distinction is that normalize(“aA”) produces output which is both equivalent to the input and valid Unicode. On the other hand, normalize(“:policewoman:t2:”) produces output which is neither equivalent to the input nor valid Unicode. This is an important distinction for two reasons (see the codepoint sketch after the list):

  1. As you mention, as long as the corruption is applied consistently, resolution doesn’t currently break. But the system is not future-proof, and big problems will arise if UTS-46/IDNA ever improves support for emoji.

  2. Many ENS apps (app.ens.domains, Metamask, and even DNS queries in your browser) follow a “normalize”->“resolve”->“show user normalized input and result” pattern. A normalization procedure which outputs invalid Unicode breaks this.
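Here is the codepoint sketch mentioned above, for the “woman police officer: light skin tone” emoji (illustrative only; exact behaviour depends on the UTS-46 library and options):

// Codepoints before and after the FE0F-stripping step.
const hex = (s) => [...s].map((c) => c.codePointAt(0).toString(16)).join(' ');

const fully = '\u{1F46E}\u{1F3FB}\u200D\u2640\uFE0F'; // fully-qualified ZWJ sequence
const stripped = fully.replace(/\uFE0F/g, '');

console.log(hex(fully));    // "1f46e 1f3fb 200d 2640 fe0f" (fully-qualified)
console.log(hex(stripped)); // "1f46e 1f3fb 200d 2640" (minimally-qualified)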

IMO, the problem that needs fixing is different/buggy/bad implementations of the normalization producing different results.

Even a perfect UTS-46 Version 14.0 implementation wouldn’t fix these issues (that said, I applaud your efforts to build an on-chain implementation. Thank you!). I think the original sin was to apply strict UTS-46 normalization to emoji characters. The concept of “normalizing” emoji doesn’t make much sense to me, because emoji are unique and should just “normalize” to themselves.

Still digging into this (and other) ENS/emoji issues, but I hope we can find a solution which:

  • Allows the full suite of emoji to be confidently registered in ENS
  • Provides a seamless, bug-free experience to all ENS end users
  • Protects the rights and expectations of those who have already legitimately registered emoji-containing ENS domains.

I started this thread because I think we’re currently 0/3. But there are a lot of really smart people around here, so “hope springs eternal” :blush: (<-- hope this renders ok)


Why would you ever show the normalized result?

To prevent spoofing. This is one of the main motivations behind normalization in the first place. It’s helpful that tools show “googIe.eth” → “googie.eth”, so I can manually check the normalized result before blindly sending my payment.
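As a toy illustration (toLowerCase() standing in for the case-mapping step of UTS-46):

// The capital "I" in a spoofed name collapses to "i" after normalization,
// so a user who checks the normalized result can spot the fake.
const normalize = (s) => s.toLowerCase();
console.log(normalize('googIe.eth'));                  // "googie.eth"
console.log(normalize('googIe.eth') === 'google.eth'); // false: not the name you expected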


But that’s completely implementable as a client-side defense mechanism (see the sketch after the examples below). Normalization is for determining equality.

Examples:

  • this name contains mixed capitalization
  • this name contains non-ascii
  • this name contains emoji
  • this name doesn’t match its reverse
  • this name is not the account’s primary
  • etc.
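A sketch of what such client-side checks could look like (the labels are illustrative, and the reverse/primary checks would need on-chain lookups):

// Simple string-level warnings a client could show before resolving a name.
function warningsFor(name) {
  const warnings = [];
  if (/[A-Z]/.test(name)) warnings.push('contains mixed capitalization');
  if (/[^\x00-\x7F]/.test(name)) warnings.push('contains non-ascii');
  if (/\p{Extended_Pictographic}/u.test(name)) warnings.push('contains emoji');
  return warnings;
}

console.log(warningsFor('rAfFy.eTh')); // ["contains mixed capitalization"]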

You asked why a normalized name would ever be shown to the user. Spoofing prevention is a possible reason. The engineers/designers behind app.ens.domains and Metamask who have opted to show normalized names may have other good reasons as well.

Very happy to see continuing discussion on emoji registrations past, present and future.

In the last Twitter Spaces, one speaker mentioned some of these issues and Brantly suggested a possible Spaces dedicated to emojis. I’d like to follow up on that; Spaces feel like board of directors’ meetings for the DAO, and this is an issue that comes up repeatedly in the community with many different points of view.

From a practical perspective, I think everyone agrees there is at least a potential for sellers on the secondary market to exploit new buyers and harm ENS adoption.

Immediate solutions could be as simple as explanations/education on how users can verify an emoji name “is what it purports to be”; for example, take the single-character emoji domain :rainbow_flag:.eth.

Step 1: Go to the ENS manager and type the domain name you think you are registering (“:rainbow_flag:”), then confirm the registration and the registrant’s address.

Step 2: Confirm the token ID on Etherscan. Searching “:rainbow_flag:.eth” results in “unregistered” in the Etherscan ENS name lookup, but :rainbow_flag:.eth can be found on Etherscan by looking up the registrant’s address (obtained in Step 1). See the token-ID sketch after Step 4.

Step 3: Cross-reference the :rainbow_flag:.eth token ID (Step 2) against any listing on marketplaces like OS.

Step 4: Verify the name “works” on your preferred services. To everyone’s point, whether a given domain works/how it renders on any service is really a matter for the developer. Ex: :rainbow_flag:.eth is a fully functional and recognized address on MetaMask and MetaMask Mobile, but I can’t import :rainbow_flag:.eth or send to :rainbow_flag:.eth on Rainbow wallet.
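For Steps 2 and 3, here is a sketch of how the token ID can be derived locally. The .eth base registrar uses uint256(keccak256(label)) as the ERC-721 token ID, where “label” is the exact codepoint sequence that was registered (i.e. the normalized form, without the “.eth” suffix):

// Sketch: derive the decimal token id for a .eth 2LD from its registered label.
const { keccak_256 } = require('js-sha3');

function tokenIdForLabel(label) {
  return BigInt('0x' + keccak_256(label)).toString(); // decimal id, as shown on Etherscan/OpenSea
}

// Hypothetical example label; substitute the actual registered codepoint sequence:
console.log(tokenIdForLabel('\u{1F3F3}\u200D\u{1F308}'));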

One last thing I did was build a simple landing page deployed on IPFS and connected to :rainbow_flag:.eth (add https:// and/or .limo accordingly), where I embedded the OS widget and links to :rainbow_flag:.eth on both Etherscan and the ENS manager. Not sure how helpful that is, but maybe it can be used as an idea for verification that others can build on.

In the case of “keycap” characters, is the “normalized” version actually invalid?

For example, there is a “Digit Six” emoji: :six:
This is constructed with 6 and U+FE0F, so I would assume it abides by all the same rules enumerated here. I guess technically :six: is the “fully-qualified” version of 6.

But then, there is a completely separate “Keycap Digit Six” emoji: :six:
This one is constructed with the Digit Six emoji above, plus the U+20E3 “Combining Enclosing Keycap” character. So the full sequence for :six: is 6 + U+FE0F + U+20E3.

When this normalization process is applied to the Keycap Digit Six emoji, the FE0F is stripped out from the middle of the sequence, not the end. So now what you’re left with is 6 + U+20E3. Is this a valid sequence at this point?
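For reference, here is that sequence at the codepoint level (for what it’s worth, emoji-test.txt lists 36 20E3 as “unqualified” rather than invalid):

// Keycap Digit Six: stripping FE0F from the middle leaves 6 + U+20E3,
// which many renderers will not show as a keycap.
const hex = (s) => [...s].map((c) => c.codePointAt(0).toString(16)).join(' ');

const fullKeycap = '6\uFE0F\u20E3';                   // fully-qualified keycap-6
const afterStrip = fullKeycap.replace(/\uFE0F/g, ''); // 6 + combining enclosing keycap

console.log(hex(fullKeycap)); // "36 fe0f 20e3"
console.log(hex(afterStrip)); // "36 20e3"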

For example, if you have the following HTML:

<p>6&#x20E3;</p>
<p>6&#xFE0F;&#x20E3;</p>

The first one will not render as :six:, only the second one.

And this matches up with what the manager UI currently shows:

Ultimately it doesn’t change the fact that, as long as everyone follows the same normalization rules, it shouldn’t break resolution. If you enter :six::nine:.eth and register it via the ENS manager app, you will actually register 6⃣9⃣.eth. And then if you enter :six::nine:.eth in MetaMask, you will actually be sending to 6⃣9⃣.eth. Okay, all is well, I suppose.

But I guess the nuance here is that, in order to abide by these normalization rules, you just need to come to terms with the (apparent) fact that the domain you registered is not actually the real Keycap Digit Sequence. And if you do manually register the correct full sequence against the smart contract, sorry, but the ENS community has decided by social contract (by using the same normalization rules) that your domain will never actually be resolved from the actual keycap characters.

So, I guess that sucks for anyone who manually registered :six::nine:.eth (or any other name with a keycap sequence) against the contract and someone else owns 6⃣9⃣.eth, because the full and correct sequence will effectively be useless in web3 client sites/dapps.

However, like Nick and others have stated, if we change the normalization rules, it could effectively screw over anyone holding 6⃣9⃣.eth, because :six::nine:.eth would no longer resolve to their domain, whereas previously it would. And someone else paying attention to these threads could then come in and snipe the full, correct version before the original owner (who thought they registered :six::nine:.eth) is aware anything changed.

FYI for the above: it looks like these forums automatically convert those keycap emojis. Here’s what I see in the edit pane before saving, just for reference; you can see the difference between the regular Digit Six and the Keycap Digit Six emojis:


Doubling down on what was said above:
“Step 4: Verify the name “works” on your preferred services. To everyone’s point, whether a given domain works/how it renders on any service is really a matter for the developer. Ex: :rainbow_flag:.eth is a fully functional and recognized address on MetaMask and MetaMask Mobile, but I can’t import :rainbow_flag:.eth or send to :rainbow_flag:.eth on Rainbow wallet.”

UPDATE: Every emoji in this list (https://unicode.org/Public/emoji/14.0/emoji-test.txt) above E4.0 is currently not working on Etherscan or MetaMask. This seems to affect a humongous number of emojis.

Is this on MetaMask’s and Etherscan’s end, and how could it be addressed? If there is anything we can do on our end, we are happy to help.

The UTS-46 library MetaMask and Etherscan use for normalization does not support any characters past the Unicode 9.0 spec (released in 2016), so both of those platforms are several years behind. In comparison, the ENS app has support up to Unicode 13.0. Unicode 14.0 was released a few weeks ago, so I am sure the other platforms will slowly update. I would imagine emoji resolution is not high on the list of most devs.

In regards to the debate about ENS resolving fully-qualified emoji over minimally-qualified emoji, I think it is important to note that most emojis only have a fully-qualified form, without the use of any variation selector. In the cases where an emoji might have versions of varying quality, the fully-qualified version is almost always qualified through the addition of Variation Selector-16 (U+FE0F). This codepoint is simply used to specify how a variant should be displayed. I would argue that the minimally-qualified and fully-qualified versions of a given emoji both have the same semantic meaning, and so they should both point to the same place. This would also be in line with the UTS-46 spec.


I started making a tool: ENS Resolver

This uses my new compressed library for UTS-46 using Unicode v14.0.0. Currently it requires window.ethereum and has an external dependency for keccak, but I’ll fix that soon.


I put my UTS-46 library on Github: @adraffy/ens-normalize.js

I made a test for the latest emoji: ENS Emoji Test

The 11 errors seem pretty minor.

Amazing work! Would you be prepared to productionise this: set up unit tests, docs, continuous builds, etc.? Preferably even rewrite it in TypeScript?

True Names would happily give you a grant for this, and we can start using it in place of the current library.

Sure, I can do that. I haven’t finished the Solidity contract port yet. I also need to translate the compressor from Mathematica to JS so it can be used to compress future Unicode updates.
