I’m still uncertain whether catching confusables only in the output/normalized form is sufficient, but it greatly simplifies the problem. All you need is a mapping from normalized characters to scripts (to determine whether a label is single-script), a set of additional disallowed single characters (which are just extra IDNA rules), and a small list (~400 entries) of character sequences of length 2-4 that are also disallowed anywhere in the output label.
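The three checks above compose into a simple label validator. This is a minimal sketch with toy placeholder data tables (SCRIPTS, DISALLOWED_CHARS, DISALLOWED_SEQS are assumptions for illustration, not the real ~400-entry lists):

```python
# Hypothetical post-normalization confusable check, per the description above.
# All three tables are toy stand-ins for the real data.
SCRIPTS = {ord('a'): 'Latin', ord('\u0431'): 'Cyrillic'}  # normalized char -> script
DISALLOWED_CHARS = {0x0138}                               # extra single-char rules
DISALLOWED_SEQS = ['rn']                                  # disallowed 2-4 char runs

def check_label(label: str) -> bool:
    """Return True if the already-normalized label passes all three checks."""
    # 1) no individually disallowed characters
    if any(ord(ch) in DISALLOWED_CHARS for ch in label):
        return False
    # 2) single-script: every character with a known script shares one script
    scripts = {SCRIPTS.get(ord(ch)) for ch in label} - {None}
    if len(scripts) > 1:
        return False
    # 3) no disallowed sequence appears anywhere in the label
    if any(seq in label for seq in DISALLOWED_SEQS):
        return False
    return True
```

The point is that each check runs on the output form only, so no input-side confusable analysis is needed.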
Here is an ornate example I found where you have confusability on the input (which isn’t covered by UTS #39 because it’s an emoji+text situation) but the output is clearly different (using the proposed normalization, with no additional logic):
Text-styled regional indicators can render as small-caps ASCII:
// default emoji-styled
1F1FA 1F1F8 = "🇺🇸" // 2x regional (valid flag sequence)
1F1FA 1F1F9 = "🇺🇹" // 2x regional (invalid flag sequence)
// explicit emoji styling
1F1FA FE0F 1F1F8 FE0F = "🇺🇸" // same
// explicit text styling
1F1FA FE0E 1F1F8 FE0E = "🇺︎🇸︎" // small-caps ASCII (no flag)
// confusing example
62 1F1F4 FE0E 1F1F8 FE0E 73 = "b🇴︎🇸︎s" vs "boss" // before normalization
= "b🇴🇸s" // after normalization (FE0E is dropped)
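The confusing example can be reproduced directly from the code points. This is a minimal sketch assuming the proposed normalization simply drops FE0E (the real normalization does more than this):

```python
# Regional indicators O and S, and the text-style variation selector.
RI_O, RI_S = '\U0001F1F4', '\U0001F1F8'
FE0E = '\uFE0E'

# Input: may render with small-caps ASCII glyphs, visually close to "boss".
raw = 'b' + RI_O + FE0E + RI_S + FE0E + 's'

# After normalization: FE0E is dropped, leaving bare regional indicators,
# which tend to render as a flag-like pair, clearly distinct from "boss".
normalized = raw.replace(FE0E, '')

print([hex(ord(c)) for c in normalized])
```

So even though the input is confusable with "boss", the normalized output is not, which is why catching confusables on the output form covers this case.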