I made some improvements to the Confusables. I also added a visual breakdown of the Scripts, with emoji removed. I also included the name if you hover over a character.
I am currently thinking about a Script-based approach to address homograph attacks, as my attempts to whitelist have not been very successful with so many characters. First, each script can be independently reduced to a non-confusable set. Then, each script can specify which scripts it can mix with. By reducing the problem from all characters, to just scripts that should be combined together, the script-script confusable surface is many orders of magnitude smaller.
A few of the script groupings are too sloppy. So I suggest creating a few artificial categories, like Emoji, Digits (that span all scripts), Symbols (split from Common), etc. and merge a few that get used frequently together (Han/Katakana/Hiragana). Common/Latin/Greek/Cyrillic scripts should get collapsed using an extremely aggressive version of the confusables, so there’s absolutely 0 confusables with ASCII-like characters.
Then, based on the labels registered so far, determine what kind of script-base rule permits the most names. eg “Latin|Emoji|Digits|Symbols” is a valid recipe.
I’ve computed some tallies that show what types of scripts show up in labels using the following process: map each character of a label to a script, aaαa
→ {Latin,Latin,Greek,Latin} then collect:
For the sorted case, you can see most labels aren’t that diverse:
Since 371K labels solely use the Latin script, I think starting with Emoji, Digits (0-9) and Latin (A-Z) and building up, using the confusable mapping to allow more characters, until all of those names are accepted is a good starting point. And then grow from there.