I investigated the NormalizationTest issue a bit more. I compared my library to Python (unicodedata), Mathematica (CharacterNormalize), and various JS engines (Node, Chrome, Safari, Brave, Firefox) and realized it's a total shitshow. My library and the latest version of Firefox are the only ones that pass the test. Pinning NFC to a specific Unicode version appears to be the right choice.
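For anyone who wants to reproduce this, here is a minimal sketch of how an engine's built-in NFC can be checked against Unicode's conformance file. This is my assumption of a reasonable harness, not the actual test code from my library, and it only checks the first invariant (c2 == toNFC(c1)) rather than the full conformance suite. It assumes NormalizationTest.txt is in the working directory:

```js
// Check String.prototype.normalize('NFC') against NormalizationTest.txt.
// File format: "c1;c2;c3;c4;c5;" where each column is space-separated hex
// code points; c2 is the expected NFC form of c1.
import {readFileSync} from 'node:fs';

// Parse a column of hex code points into a string.
const hex2str = s => String.fromCodePoint(...s.trim().split(/\s+/).map(h => parseInt(h, 16)));

let failures = 0;
for (let line of readFileSync('NormalizationTest.txt', 'utf8').split('\n')) {
	line = line.split('#')[0].trim();            // strip comments
	if (!line || line.startsWith('@')) continue; // skip blanks and @Part headers
	const [c1, c2] = line.split(';').map(hex2str);
	if (c1.normalize('NFC') !== c2) failures++;
}
console.log(failures ? `${failures} failures` : 'PASS');
```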
I made a simple report that compares the allowed characters (valid or mapped, minus emoji) between ENS0 (current) and IDNA 2008. I overlaid it with my current-but-incomplete whitelist (green = whitelisted as-is, purple = whitelisted after mapping). These are the characters that need review, and any input would be very helpful: IDNA: ENS0 vs 2008

Edit: I added the number of times each character shows up in a registered name in brackets, e.g. § [5] means 5 registered names use this character.
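Conceptually, the report data is just a set diff annotated with usage counts. The sketch below shows one way that could be assembled; `ens0Allowed`, `idna2008Allowed`, and `registeredLabels` are placeholder inputs, not the report's actual data sources:

```js
// Diff the allowed code points of two IDNA tables and annotate each
// ENS0-only character with its frequency among registered names.
function diffAllowed(ens0Allowed, idna2008Allowed, registeredLabels) {
	// Count how often each code point appears across registered labels.
	const counts = new Map();
	for (const label of registeredLabels) {
		for (const ch of label) { // string iteration yields whole code points
			const cp = ch.codePointAt(0);
			counts.set(cp, (counts.get(cp) ?? 0) + 1);
		}
	}
	// Characters allowed by ENS0 but not by IDNA 2008 need review.
	return [...ens0Allowed]
		.filter(cp => !idna2008Allowed.has(cp))
		.map(cp => ({cp, count: counts.get(cp) ?? 0}));
}
```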
There is also a much larger list (the full disallowed list minus this list) that potentially contains characters that should have been enabled in the first place; however, that list is too large for an HTML report. For example, underscore (_).
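To be explicit about what that larger list is: just the set difference between every disallowed code point and the review list above. A hedged illustration (placeholder Sets, not the report's actual code):

```js
// Everything disallowed that is not already flagged for review.
const setDifference = (a, b) => new Set([...a].filter(x => !b.has(x)));

const disallowed = new Set([0x5F /* _ */, 0x24 /* $ */, 0x40 /* @ */]);
const reviewed = new Set([0x24]);
console.log(setDifference(disallowed, reviewed)); // contains 0x5F (_) and 0x40 (@)
```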
I added another 25K registered labels to the comparison reports. Also, this service displays the last 1024 registered labels with normalization applied: Recent ENS Names
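The display step of that service amounts to running each recent label through the normalizer and showing either the result or the error. A minimal sketch, assuming the `ens_normalize` export of @adraffy/ens-normalize (which throws on labels it can't normalize) and leaving the fetching of recent registrations out of scope:

```js
import {ens_normalize} from '@adraffy/ens-normalize';

// For each label, return its normalized form, or the error if it fails.
function renderRecent(labels) {
	return labels.map(label => {
		try {
			return {label, normalized: ens_normalize(label)};
		} catch (err) {
			return {label, error: err.message};
		}
	});
}

console.log(renderRecent(['RAFFY', 'A\u0300bc', 'te$st']));
```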