ENS Name Normalization

raffy · August 10, 2022, 9:33pm

Combining marks stack/grow in different directions. Some build towers, some appear at the same spot, etc. Some marks obscure the underlying character. Some marks are very tiny.

i [69] vs ı̇ [131 307] (dotless i + dot)

Marks of the same combining class aren’t reordered by normalization, so each ordering forms a distinct name:

1̉̇ [31 309 307] vs 1̇̉ [31 307 309]
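To make the reordering point concrete, here is a small sketch using Python's standard `unicodedata` module. The helper name `same_after` is just for illustration. It checks that U+0307 and U+0309 share a combining class, that swapping them survives NFC and NFD, and, for contrast, that marks of different classes do get put into one canonical order.

```python
import unicodedata

a = "1\u0309\u0307"  # 1 + COMBINING HOOK ABOVE + COMBINING DOT ABOVE
b = "1\u0307\u0309"  # 1 + COMBINING DOT ABOVE + COMBINING HOOK ABOVE

def same_after(form, x, y):
    """True if x and y are identical after the given normalization form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)

# Both marks report the same canonical combining class (above).
classes = [unicodedata.combining(c) for c in "\u0307\u0309"]

# Same class: the order is preserved, so the two names stay distinct.
same_class_nfc = same_after("NFC", a, b)
same_class_nfd = same_after("NFD", a, b)

# Different classes (DOT BELOW vs DOT ABOVE): canonical ordering
# sorts them, so both spellings normalize to the same string.
diff_class_nfd = same_after("NFD", "1\u0307\u0323", "1\u0323\u0307")

# Dotless i + dot above never normalizes to a plain "i".
dotless_plus_dot = unicodedata.normalize("NFC", "\u0131\u0307")
```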

If you convert every valid registered name to NFD and then count the runs of adjacent combining marks, only 41K names have 1 mark, 400 names have 2, and 400 have 3+.
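A rough sketch of how such a count could be produced, not the script actually used for the numbers above. It assumes "mark" means any code point with general category M*, and that a name is bucketed by its longest run of adjacent marks; `longest_mark_run` and `histogram` are hypothetical names.

```python
import unicodedata
from collections import Counter

def is_mark(ch):
    # Assumption: Mn, Mc and Me all count as combining marks here.
    return unicodedata.category(ch).startswith("M")

def longest_mark_run(name):
    """Length of the longest run of adjacent marks in the NFD form of name."""
    longest = run = 0
    for ch in unicodedata.normalize("NFD", name):
        run = run + 1 if is_mark(ch) else 0
        longest = max(longest, run)
    return longest

def histogram(names):
    """Map run length -> number of names whose longest run has that length."""
    return Counter(longest_mark_run(n) for n in names)
```

For example, `histogram(["abc", "\u00e9", "1\u0309\u0307"])` puts one name in each of the buckets 0, 1 and 2.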

UAX 39 suggests disallowing duplicate marks and only allowing 4 per character. I mentioned this issue a few times in this thread, but ultimately I think enforcement should happen at the validation level instead of encoding this logic into the normalization spec.
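As a validation-level check, the UAX 39 suggestion might look roughly like the sketch below. This is only an illustration: `MAX_MARKS`, `mark_runs` and `passes_mark_limits` are made-up names, only Mn/Me are counted as marks, and "duplicate" is read strictly as "the same mark appears twice anywhere in one run".

```python
import unicodedata

MAX_MARKS = 4  # per-character limit mentioned above

def mark_runs(text):
    """Yield each run of adjacent Mn/Me marks in the NFD form of text."""
    run = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) in ("Mn", "Me"):
            run.append(ch)
        elif run:
            yield run
            run = []
    if run:
        yield run

def passes_mark_limits(label):
    """False if any run is too long or repeats a mark (strict reading)."""
    for run in mark_runs(label):
        if len(run) > MAX_MARKS:
            return False
        if len(set(run)) != len(run):
            return False
    return True
```

With this, `"e\u0301"` passes, `"e\u0301\u0301"` fails on the duplicate, and a base character with five distinct marks fails on the length limit.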

For Latin, 1 combining mark (CM) is probably sufficient (and many shouldn’t validate, e.g. a̲ [332] COMBINING LOW LINE), unless we want some very complicated logic that rejects duplicates and non-stacking/overlapping variations and blacklists some combinations. I’m not exactly sure what marks are needed in other scripts, but we should try to keep it as minimal as possible.

IDNA says labels shouldn’t start with a CM. Additionally, I think a CM following an emoji is invalid too.
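Both placement rules can be sketched as a single scan. Big caveat: the `is_emoji` test below is a crude stand-in covering one code point range, not real emoji detection. U+FE0F and U+20E3 are themselves marks by general category, so an actual implementation would have to tokenize emoji sequences first. `misplaced_marks` is a hypothetical name.

```python
import unicodedata

def is_mark(ch):
    return unicodedata.category(ch).startswith("M")

def is_emoji(ch):
    # Placeholder assumption: treat U+1F300..U+1FAFF as "emoji".
    # Real code would consult the Unicode emoji data instead.
    return 0x1F300 <= ord(ch) <= 0x1FAFF

def misplaced_marks(label):
    """Indices of marks that start the label or directly follow an emoji."""
    bad = []
    for i, ch in enumerate(label):
        if not is_mark(ch):
            continue
        if i == 0 or is_emoji(label[i - 1]):
            bad.append(i)
    return bad
```

A label would be rejected whenever the returned list is non-empty, e.g. for `"\u0301abc"` (leading mark) or `"\U0001F600\u0301"` (mark after an emoji).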