UTS-46 Non-compliant Emoji

In the past week, thousands of “emoji-only” domains have been registered on ENS, and the second-hand market for valuable/unique/rare “emoji-only” domains is quickly growing. Because emoji are complicated, Unicode is complicated, and ENS is complicated, several edge cases have emerged that may warrant additional consideration from the ENS development team. This thread describes one such edge case.

Because UTS-51-compliant emoji characters were never designed to be compatible with the UTS-46 normalization procedures ENS has adopted, there are some “official” emoji that include Unicode characters ignored or disallowed by UTS-46. When these emoji are input into app.ens.domains, UTS-46 normalization strips the problem characters, transforming the emoji from a “fully-qualified” emoji into a “minimally-qualified” or “unqualified” one (see http://www.unicode.org/reports/tr51/#def_fully_qualified_emoji for complete definitions of these terms, and https://unicode.org/Public/emoji/14.0/emoji-test.txt for a list of each classification). Depending on the user’s platform and system fonts, this normalization may or may not be visually apparent to the user. If the normalized domain is registered, the “minimally-qualified” or “unqualified” representation is stored within the ENS registrar, and subsequent resolution will also return the “minimally-qualified” or “unqualified” emoji.
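For concreteness, here is a minimal JavaScript sketch of the codepoint-level effect (deliberately not a full UTS-46 implementation; stripping U+FE0F is only one of several mapping steps UTS-46 performs):

// Minimal sketch: removing U+FE0F (which UTS-46 treats as "ignored") turns a
// fully-qualified emoji into its unqualified form at the codepoint level.
const hex = (s) => [...s].map((c) => c.codePointAt(0).toString(16)).join(' ');

const fullyQualified = '\u261D\uFE0F';                  // INDEX POINTING UP + VS16
const stripped = fullyQualified.replace(/\uFE0F/g, ''); // what gets stored after normalization

console.log(hex(fullyQualified)); // "261d fe0f" (fully-qualified)
console.log(hex(stripped));       // "261d" (unqualified)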

This screenshot was taken on Chrome/Mac, where “minimally-qualified” emoji are supported:


This screenshot was taken on Firefox/Linux, where “minimally-qualified” emoji are not supported:

And this was taken after Metamask resolution on Chrome/Mac:

What does the ENS team think of this? Specifically, should ENS continue to store and resolve “minimally-qualified” or “unqualified” emoji in its registrar? Should these types of emoji be encouraged or discouraged for use within ENS domains? Would ENS ever consider emoji-specific exceptions to UTS-46 normalization? If the Unicode Consortium ever brought UTS-51 into full compliance with UTS-46, would ENS adopt the new mapping tables?

I don’t think we should compromise the integrity of ENS to accommodate this narrow “emoji” use-case, but I also think “emoji” ENS domains are an interesting, exciting, and promising way for the ENS platform to grow and expand to new users. I’m curious to hear other opinions on this.

Good analysis, thank you!

I think we need to be very careful about any changes that could change the normalisation of existing names, or make previously valid names invalid. Allowing previously forbidden characters is much easier, of course.

Changing or disallowing existing names could be doable with a compelling enough reason, but I think we’d want to be very cautious about doing this.


From Section 1.3.1 of UTS-46:

Mapping typically involves mapping uppercase characters to their lowercase pairs, but it also involves other types of mappings between equivalent characters, such as mapping halfwidth katakana characters to normal katakana characters in Japanese.

This describes a process by which “imperfect” data is transformed into a “perfect” or “canonical” form. Emojis present a peculiar case whereby normalization actually transforms the “perfect” form into a “corrupt” form. I know Unicode is vast, but are there any other Unicode categories that have this “corruption on normalization” property?
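For comparison, the halfwidth-katakana mapping from the quote can be demonstrated with plain NFKC in JavaScript (UTS-46 uses its own mapping table, but it agrees with NFKC for these characters), and the output is the canonical form rather than a corrupted one:

// Halfwidth katakana maps to the standard katakana form; the canonical form is unchanged.
const halfwidth = '\uFF76';               // HALFWIDTH KATAKANA LETTER KA
console.log(halfwidth.normalize('NFKC')); // U+30AB, the standard katakana "KA"
console.log('\u30AB'.normalize('NFKC'));  // already canonical, returned as-is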

I completely agree that caution should be exercised in making new policy, and sympathize with the “fixing the plane while it’s flying” nature of this problem. These are very hard decisions!

This has actually been brought up (though not as extensively described as above) in the Discord over the past couple of days. Great post, btw.

To oversimplify, it seems like a situation where ENS is stuck between a politically correct solution that is “technically broken”, and a technically sound solution that could break some hearts.

It’s still early, and there is time to rearrange things and do damage control if that were the consensus; in any case, in terms of time, there is likely a point of no return on the above.

For resolution, it doesn’t matter if emojis are in the “perfect” or “corrupt” form. It’s like arguing whether upper or lower case is more correct. As long as the namehash is always done on the consistent representation, the internal representation is irrelevant: you can still use either of the names (perfect or corrupt), since they reduce to the same thing.

However, in the reverse direction, the choice matters, along with any other valid variation like capitalization. As long as your chosen name normalizes to the same thing, there should be no requirement that it be the canonical form.

I haven’t looked at the reverse-mechanism yet, so I’m unsure how it works exactly. For example, can my reverse be rAfFy.eTh?

it doesn’t matter if emojis are in the “perfect” or “corrupt” form. It’s like arguing whether upper or lower case is more correct

With all due respect, this is nothing like that. More accurately, it’s like arguing whether upper case or some glitched-out, un-typable character is more correct. And there should be no argument; the difference is explicitly and unambiguously called out in the Unicode specifications.

the internal representation is irrelevant: you can still use either of the names (perfect or corrupt), since they reduce to the same thing

This is only true if we assume that the UTS-46 standard doesn’t change. Unfortunately, UTS-46 changes fairly regularly. If ENS wishes to adopt those changes, “old” normalizations which produce “corrupt” internal representations could be superseded by “new” normalizations which produce “correct” internal representations. In these cases, “previously valid names would become invalid”. This scenario isn’t merely theoretical, but perhaps likely: UTS-46 is progressively coming into compliance with UTS-51 with each successive release.

Less importantly, because normalization is “lossy”, it’s possible that two or more different names could normalize to the same thing. The registration of one name would then prohibit the registration of the others. While this doesn’t necessarily “break” anything, it’s a potentially undesirable property of “corrupt” internal representations.
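A quick sketch of that collision, with FE0F-stripping standing in for the relevant part of the UTS-46 mapping:

// Two distinct input names collapsing to one normalized form (sketch only).
const strip = (s) => s.replace(/\uFE0F/g, '');

const a = '\u261D\uFE0F.eth'; // fully-qualified "index pointing up" + .eth
const b = '\u261D.eth';       // unqualified form of the same emoji + .eth
console.log(strip(a) === strip(b)); // true: whoever registers first blocks the other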

There’s a further issue of what the “canonical display” should be. Should clients display the “correct” version which the user has input, and perform the messy normalization under the hood? Or should they display the corrupted version after normalization, following the same process as all other ENS domains?

Not sure what the best way forward is, but there are certainly pitfalls to maintaining “corrupt” internal representations in the ENS registrar.

can my reverse be rAfFy.eTh?

Yes. Reverse resolution can return anything you want.

It’s exactly like that. Let’s say you prefer the name “aA”. Normalize(“aA”) is “aa”. namehash(“aa”) is registered. Any input name that normalizes to “aa” matches, e.g. “Aa”, “AA”, “aA”.
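For illustration, here is a sketch of EIP-137 namehash (using the js-sha3 package for keccak), with toLowerCase() standing in for full UTS-46 normalization; any input that normalizes the same way yields the same node:

// Sketch of EIP-137 namehash: node = keccak256(node ++ keccak256(label)), right to left.
const { keccak_256 } = require('js-sha3');

function namehash(name) {
  let node = new Uint8Array(32); // root node = 32 zero bytes
  if (name) {
    for (const label of name.split('.').reverse()) {
      const labelHash = keccak_256.array(label);
      node = Uint8Array.from(keccak_256.array([...node, ...labelHash]));
    }
  }
  return '0x' + [...node].map((b) => b.toString(16).padStart(2, '0')).join('');
}

const normalize = (s) => s.toLowerCase(); // placeholder for real UTS-46 normalization

console.log(namehash(normalize('aA.eth')) === namehash(normalize('Aa.eth'))); // true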

How does the choice of perfect/corrupt/upper/lower matter here?

As I mentioned, the display of your name should be your choice as long as it normalizes to the same hash. There’s no such thing as a canonical display IMO.

I agree that the version of UTS-46 should certainly be specified and fixed; however, upgrading/changing the algorithm is a different can of worms. Changes that result in no registered collisions seem fine, but who pays for that state update? Do you allow a grace period where owners can specify their preferred names under the new spec before the hash is recomputed? Seems messy.


IMO, the problem that needs fixing is different/buggy/bad implementations of the normalization producing different results. This happens outside of ENS.

I suggested in Discord that a contract performing the normalization operation should exist on-chain. I wrote a compressed JavaScript library that does EIP-137/UTS-46 (IDNA, Punycode decode, and validation) for 14.0.0 and have ported it to Solidity. However, my current implementation does not do NFC and requires two eth_calls. I am working on a solution. I am also looking for a good test set of names.

Alternatives could include functions written in various languages stored on-chain (or on IPFS) that get runtime-interpreted (and can be locally cached), e.g. new Function(await fetch(...).then(x => x.text()))("rAfFy.eTh").

How does the choice of perfect/corrupt/upper/lower matter here?

The important distinction is that normalize(“aA”) produces output which is both equivalent to the input and valid Unicode. On the other hand, normalize(“:policewoman:t2:”) produces output which is neither equivalent to the input nor valid Unicode. This is an important distinction for two reasons (see the codepoint sketch after the list):

  1. As you mention, as long as the corruption is applied consistently, resolution doesn’t currently break. But the system is not future-proof, and big problems will arise if UTS-46/IDNA ever improves support for emoji.

  2. Many ENS apps (app.ens.domains, Metamask, and even DNS queries in your browser) follow a “normalize”->“resolve”->“show user normalized input and result” pattern. A normalization procedure which outputs invalid Unicode breaks this.
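Here is the codepoint sketch mentioned above, for the “woman police officer: light skin tone” emoji (illustrative only; exact behaviour depends on the UTS-46 library and options):

// Codepoints before and after the FE0F-stripping step.
const hex = (s) => [...s].map((c) => c.codePointAt(0).toString(16)).join(' ');

const fully = '\u{1F46E}\u{1F3FB}\u200D\u2640\uFE0F'; // fully-qualified ZWJ sequence
const stripped = fully.replace(/\uFE0F/g, '');

console.log(hex(fully));    // "1f46e 1f3fb 200d 2640 fe0f" (fully-qualified)
console.log(hex(stripped)); // "1f46e 1f3fb 200d 2640" (minimally-qualified)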

IMO, the problem that needs fixing is different/buggy/bad implementations of the normalization producing different results.

Even a perfect UTS-46 Version 14.0 implementation wouldn’t fix these issues (that said, I applaud your efforts to build an on-chain implementation. Thank you!). I think the original sin was to apply strict UTS-46 normalization to emoji characters. The concept of “normalizing” emoji doesn’t make much sense to me, because emoji are unique and should just “normalize” to themselves.

Still digging into this (and other) ENS/emoji issues, but I hope we can find a solution which:

  • Allows the full suite of emoji to be confidently registered in ENS
  • Provides a seamless, bug-free experience to all ENS end users
  • Protects the rights and expectations of those who have already legitimately registered emoji-containing ENS domains.

I started this thread because I think we’re currently 0/3. But there are a lot of really smart people around here, so “hope springs eternal” :blush: (<-- hope this renders ok)


Why would you ever show the normalized result?

To prevent spoofing. This is one of the main motivations behind normalization in the first place. It’s helpful that tools show “googIe.eth” → “googie.eth”, so I can manually check the normalized result before blindly sending my payment.
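As a toy illustration (toLowerCase() standing in for the case-mapping step of UTS-46):

// The capital "I" in a spoofed name collapses to "i" after normalization,
// so a user who checks the normalized result can spot the fake.
const normalize = (s) => s.toLowerCase();
console.log(normalize('googIe.eth'));                  // "googie.eth"
console.log(normalize('googIe.eth') === 'google.eth'); // false: not the name you expected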


But that’s completely implementable as a client-side defense mechanism (see the sketch after the examples below). Normalization is for determining equality.

Examples:

  • this name contains mixed capitalization
  • this name contains non-ascii
  • this name contains emoji
  • this name doesn’t match its reverse
  • this name is not the account’s primary
  • etc.
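A sketch of what such client-side checks could look like (the labels are illustrative, and the reverse/primary checks would need on-chain lookups):

// Simple string-level warnings a client could show before resolving a name.
function warningsFor(name) {
  const warnings = [];
  if (/[A-Z]/.test(name)) warnings.push('contains mixed capitalization');
  if (/[^\x00-\x7F]/.test(name)) warnings.push('contains non-ascii');
  if (/\p{Extended_Pictographic}/u.test(name)) warnings.push('contains emoji');
  return warnings;
}

console.log(warningsFor('rAfFy.eTh')); // ["contains mixed capitalization"]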

You asked why a normalized name would ever be shown to the user. Spoofing prevention is a possible reason. The engineers/designers behind app.ens.domains and Metamask who have opted to show normalized names may have other good reasons as well.

Very happy to see continuing discussion on emoji registrations past, present and future.

In the last Twitter Spaces, one speaker mentioned some of these issues and Brantly suggested a possible Spaces dedicated to emojis. I’d like to follow up on that; Spaces feel like board of directors’ meetings for the DAO, and this is an issue that comes up repeatedly in the community with many different points of view.

From a practical perspective, I think everyone agrees there is at least a potential for sellers on the secondary market to exploit new buyers and harm ENS adoption.

Immediate solutions could be as simple as explanations/education on how users can verify an emoji name “is what it purports to be”; for example, take the single-character emoji domain :rainbow_flag:.eth.

Step 1: Go to the ENS manager and type the domain name you think you are registering (“:rainbow_flag:”), then confirm the registration and the registrant’s address.

Step 2: Confirm the token ID on Etherscan. Searching “:rainbow_flag:.eth” results in “unregistered” in the Etherscan ENS name lookup, but :rainbow_flag:.eth can be found on Etherscan by looking up the registrant’s address (obtained in Step 1). See the token-ID sketch after Step 4.

Step 3: Cross-reference the :rainbow_flag:.eth token ID (Step 2) against any listing on marketplaces like OS.

Step 4: Verify the name “works” on your preferred services. To everyone’s point, whether a given domain works/how it renders on any service is really a matter for the developer. Ex: :rainbow_flag:.eth is a fully functional and recognized address on MetaMask and MetaMask Mobile, but I can’t import :rainbow_flag:.eth or send to :rainbow_flag:.eth on Rainbow wallet.
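For Steps 2 and 3, here is a sketch of how the token ID can be derived locally. The .eth base registrar uses uint256(keccak256(label)) as the ERC-721 token ID, where “label” is the exact codepoint sequence that was registered (i.e. the normalized form, without the “.eth” suffix):

// Sketch: derive the decimal token id for a .eth 2LD from its registered label.
const { keccak_256 } = require('js-sha3');

function tokenIdForLabel(label) {
  return BigInt('0x' + keccak_256(label)).toString(); // decimal id, as shown on Etherscan/OpenSea
}

// Hypothetical example label; substitute the actual registered codepoint sequence:
console.log(tokenIdForLabel('\u{1F3F3}\u200D\u{1F308}'));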

One last thing I did was build a simple landing page deployed on IPFS and connected to :rainbow_flag:.eth (add https:// and/or .limo accordingly), where I embedded the OS widget and links to :rainbow_flag:.eth on both Etherscan and the ENS manager. Not sure how helpful that is, but maybe it can be used as an idea for verification that others can build on.

In the case of “keycap” characters, is the “normalized” version actually invalid?

For example, there is a “Digit Six” emoji: :six:
This is constructed with 6 and U+FE0F, so I would assume it abides by all the same rules enumerated here. I guess technically :six: is the “fully-qualified” version of 6.

But then, there is a completely separate “Keycap Digit Six” emoji: :six:
This one is constructed with the Digit Six emoji above, plus the U+20E3 “Combining Enclosing Keycap” character. So the full sequence for :six: is 6 + U+FE0F + U+20E3.

When this normalization process is applied to the Keycap Digit Six emoji, the FE0F is stripped out from the middle of the sequence, not the end. So now what you’re left with is 6 + U+20E3. Is this a valid sequence at this point?
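For reference, here is that sequence at the codepoint level (for what it’s worth, emoji-test.txt lists 36 20E3 as “unqualified” rather than invalid):

// Keycap Digit Six: stripping FE0F from the middle leaves 6 + U+20E3,
// which many renderers will not show as a keycap.
const hex = (s) => [...s].map((c) => c.codePointAt(0).toString(16)).join(' ');

const fullKeycap = '6\uFE0F\u20E3';                   // fully-qualified keycap-6
const afterStrip = fullKeycap.replace(/\uFE0F/g, ''); // 6 + combining enclosing keycap

console.log(hex(fullKeycap)); // "36 fe0f 20e3"
console.log(hex(afterStrip)); // "36 20e3"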

For example, if you have the following HTML:

<p>6&#x20E3;</p>
<p>6&#xFE0F;&#x20E3;</p>

The first one will not render as :six:, only the second one.

And this matches up with what the manager UI currently shows:

Ultimately it doesn’t change the fact that, as long as everyone follows the same normalization rules, it shouldn’t break resolution. If you enter :six::nine:.eth and register it via the ENS manager app, you will actually register 6⃣9⃣.eth. And then if you enter :six::nine:.eth in MetaMask, you will actually be sending to 6⃣9⃣.eth. Okay, all is well, I suppose.

But I guess the nuance here is that, in order to abide by these normalization rules, you just need to come to terms with the (apparent) fact that the domain you registered is not actually the real Keycap Digit Sequence. And if you do manually register the correct full sequence against the smart contract, sorry, but the ENS community has decided by social contract (by using the same normalization rules) that your domain will never actually be resolved from the actual keycap characters.

So, I guess that sucks for anyone who manually registered :six::nine:.eth (or any other name with a keycap sequence) against the contract and someone else owns 6⃣9⃣.eth, because the full and correct sequence will effectively be useless in web3 client sites/dapps.

However, like Nick and others have stated, if we change the normalization rules, it could effectively screw over anyone holding 6⃣9⃣.eth, because :six::nine:.eth would no longer resolve to their domain, whereas previously it would. And someone else paying attention to these threads could then come in and snipe the full, correct version before the original owner (who thought they registered :six::nine:.eth) is aware anything changed.

FYI for the above: it looks like these forums automatically convert those keycap emojis. Here’s what I see in the edit pane before saving, just for reference; you can see the difference between the regular Digit Six and the Keycap Digit Six emojis:


Doubling down on what was said above:
“Step 4: Verify the name “works” on your preferred services. To everyone’s point, whether a given domain works/how it renders on any service is really a matter for the developer. Ex: :rainbow_flag:.eth is a fully functional and recognized address on MetaMask and MetaMask Mobile, but I can’t import :rainbow_flag:.eth or send to :rainbow_flag:.eth on Rainbow wallet.”

UPDATE: Every emoji in this list (https://unicode.org/Public/emoji/14.0/emoji-test.txt) above E4.0 is currently not working on Etherscan or MetaMask. This seems to affect a humongous number of emojis.

Is this on MetaMask’s and Etherscan’s end, and how could it be addressed? If there is anything we can do on our end, we are happy to help.

The UTS-46 library MetaMask and Etherscan use for normalization does not support any characters past the Unicode 9.0 spec (released in 2016), so both of those platforms are several years behind. In comparison, the ENS app has support up to Unicode 13.0. Unicode 14.0 was released a few weeks ago, so I am sure the other platforms will slowly update. I would imagine emoji resolution is not high on the list of most devs.

In regards to the debate about ENS resolving fully-qualified emoji over minimally-qualified emoji, I think it is important to note that most emojis only have a fully-qualified form, without the use of any variation selector. In the cases where an emoji might have versions of varying quality, the fully-qualified version is almost always qualified through the addition of Variation Selector-16 (U+FE0F). This codepoint is simply used to specify how a variant should be displayed. I would argue that the minimally-qualified and fully-qualified versions of a given emoji both have the same semantic meaning, and so they should both point to the same place. This would also be in line with the UTS-46 spec.


I started making a tool: ENS Resolver

This uses my new compressed library for UTS-46 using Unicode v14.0.0. Currently it requires window.ethereum and has an external dependency for keccak, but I’ll fix that soon.


I put my UTS-46 library on Github: @adraffy/ens-normalize.js

I made a test for the latest emoji: ENS Emoji Test

The 11 errors seem pretty minor.

Amazing work! Would you be prepared to productionise this: set up unit tests, docs, continuous builds, etc.? Preferably even rewrite it in TypeScript?

True Names would happily give you a grant for this, and we can start using it in place of the current library.

Sure, I can do that. I haven’t finished the Solidity contract port yet. I also need to translate the compressor from Mathematica to JS so it can be used to compress future Unicode updates.
