ENS Name Normalization

raffy · December 26, 2021, 11:22pm

Well, I was just about to post that I had it all figured out, and then I was curious what would happen if I only gave my library UTS-51 rules and no UTS-46 rules (this should be a perfect emoji parser), and I ran into an issue with ZWJ when testing it:

💩💩 => valid
💩{200D}{200D}💩 => invalid
💩{200D}💩 => valid, but different than above
// invisible poo joiner!

// note: these are separate and already registered:
😵‍💫 = {1F635}{200D}{1F4AB}
😵💫 = {1F635}{1F4AB} 

// and even weirder, since "1" is an emoji
// this is also valid:
1{200D}1

I don’t see anyway of differentiating emoji_zwj_sequence (legal ZWJ sequences) from whats called the RGI_Emoji_ZWJ_Sequence (recommended for general interchange) without pinning the allowable sequences to a whitelisted set of permutations: https://adraffy.github.io/ens-normalize.js/build/unicode-raw/emoji-zwj-sequences.txt

Additionally, how should we treat deliberately text-styled emoji like 💩︎ = 💩{FE0E}?

// note: FE0E and FE0F are ignored

// (A) these should be the same
// (text styling without a joiner)
// each is 2 glyphs
😵︎💫 = {1F635}{FE0E}{1F4AB}
😵︎💫️ = {1F635}{FE0E}{1F4AB}{FE0F}
😵💫︎ = {1F635}{1F4AB}{FE0E}
😵︎💫︎ = {1F635}{FE0E}{1F4AB}{FE0E}

// (B) these should be the same (ZWJ sequence)
// but different from above:
// each is 1 glyph
😵‍💫 = {1F635}{200D}{1F4AB}
😵️‍💫 = {1F635}{FE0F}{200D}{1F4AB}

// but with deliberate text styling,
// the ZWJ sequence is supposed to terminate!
// (how these render varies per platform)
(1) 😵︎‍💫 = {1F635}{FE0E}{200D}{1F4AB} => technically invalid
(2) 😵︎‍💫️ = {1F635}{FE0E}{200D}{1F4AB}{FE0F} => technically invalid
// this one is okay, no joiner
(3) 😵︎💫︎ = {1F635}{FE0E}{1F4AB}{FE0F} => valid 

// depending on the Unicode implementation,
// the (1) edits as either:
(Edit1) [text-styled 1F635] + [200D] + [1F4AB] // 3 glyphs
(Edit2) [text-styled 1F635 + 200D + 1F4AB]     // 1 glyph
// (Edit1) and (Edit2) should correspond to (A) or (B) above

If we use the RGI_Emoji_ZWJ_Sequence whitelist, this problem is solved because no text-styled emoji appear in those sequences. Which would imply that text-styled emoji terminate, and are equivalent to their unstyled or emoji-styled versions: 💩︎ == 💩️{FE0F} == 💩(default style).

Edit: I believe my latest version gets everything correct w/ the exception of not knowing which emoji/characters should be enabled/disabled. It correctly handles the examples above.