I split the Different Norms report into categories based on the type of mapping required (Arabic, Hyphen, etc.). If a name required more than one technique, I put it in the “everything else” bucket. (For example, a name could have both an incorrect hyphen and an incorrect apostrophe – fortunately, most have only a single issue.) I removed the trivial differences (like ASCII casing).
Quick overview:
- 42% are Arabic digits
- 26% are improperly normalized emoji
- 12% are wrong hyphen
- 11% are wrong apostrophe
- 3% are circled/squared ASCII
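
To make the bucketing concrete, here is a minimal sketch of how a name could be assigned to a single category, falling back to “everything else” when it needs more than one technique. The character sets and the names `CATEGORY_TESTS` and `categorize` are assumptions made for illustration; they are not taken from the report or from chars-mapped.js.

```javascript
// Hypothetical character sets, chosen only for illustration.
// The real report may define each category differently.
const CATEGORY_TESTS = {
  // Arabic-Indic digits and Extended Arabic-Indic digits
  arabic: /[\u0660-\u0669\u06F0-\u06F9]/u,
  // Hyphen and dash look-alikes (U+2010..U+2015) plus the minus sign
  hyphen: /[\u2010-\u2015\u2212]/u,
  // Curly quotes and the modifier-letter apostrophe
  apostrophe: /[\u2018\u2019\u02BC]/u,
};

// Returns the single matching category, or 'everything-else' when the
// name matches zero categories or more than one.
function categorize(name) {
  const hits = Object.keys(CATEGORY_TESTS).filter((key) =>
    CATEGORY_TESTS[key].test(name)
  );
  return hits.length === 1 ? hits[0] : 'everything-else';
}
```

With these assumed sets, a name containing only a wrong hyphen lands in `hyphen`, while a name with both a wrong hyphen and a wrong apostrophe lands in `everything-else`.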
I’m not particularly fond of the circled/double-circled/negative-circled mappings (I’ve mentioned this before), but I’m unsure what should be done. IDNA only maps the circled letters (Ⓐ) and leaves the rest alone. I decided to map all of the remaining types except for the Negative Squared digits (which overlap with emoji). Maybe they should just be disallowed instead? Having two sets of negative circled digits is clearly wrong. Negative Circled and Squared letters are very similar.
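
For anyone who wants to see what “mapping” an enclosed letter means in practice, here is a rough sketch that folds a few enclosed-letter blocks to lowercase ASCII. It is not the contents of chars-mapped.js: the range table and the names `ENCLOSED_LETTER_RANGES` and `mapEnclosedLetters` are hypothetical, and it covers only letters, not digits.

```javascript
// Hypothetical range table: [first code point, last code point].
// Each block holds 26 letters in A..Z order, so the offset into the
// block gives the matching ASCII letter.
const ENCLOSED_LETTER_RANGES = [
  [0x24b6, 0x24cf], // circled capital letters (the set IDNA maps)
  [0x24d0, 0x24e9], // circled small letters
  [0x1f130, 0x1f149], // squared capital letters
  [0x1f150, 0x1f169], // negative circled capital letters
];

const ASCII_LOWER_A = 0x61;

// Replaces each enclosed letter with its lowercase ASCII equivalent and
// leaves every other character untouched.
function mapEnclosedLetters(name) {
  let out = '';
  // for...of iterates by code point, so astral characters stay whole.
  for (const ch of name) {
    const cp = ch.codePointAt(0);
    const range = ENCLOSED_LETTER_RANGES.find(
      ([first, last]) => cp >= first && cp <= last
    );
    out += range ? String.fromCodePoint(ASCII_LOWER_A + (cp - range[0])) : ch;
  }
  return out;
}
```

Disallowing instead of mapping would be the same lookup, except that a hit raises an error rather than substituting a letter.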
The actual mapping rules can be found here: chars-mapped.js